Deploying Python ML Models with Bodywork

Posted on Tue 01 December 2020 in machine-learning-engineering • Tagged with python, machine-learning, mlops, kubernetes, bodywork


Solutions to Machine Learning (ML) tasks are often developed within Jupyter notebooks. Once a candidate solution is found, you are then faced with an altogether different problem - how to engineer the solution into your product and how to maintain the performance of the solution as new instances of data are …

Continue reading

Best Practices for PySpark ETL Projects

Posted on Sun 28 July 2019 in data-engineering • Tagged with data-engineering, data-processing, apache-spark, python


I have often leant heavily on Apache Spark and the SparkSQL APIs for operationalising any type of batch data-processing ‘job’ within a production environment, where handling fluctuating volumes of data reliably and consistently is an on-going business concern. These batch data-processing jobs may involve nothing more than joining data sources and …

Continue reading

Stochastic Process Calibration using Bayesian Inference & Probabilistic Programs

Posted on Fri 18 January 2019 in data-science • Tagged with probabilistic-programming, python, pymc3, quant-finance, stochastic-processes


Stochastic processes are used extensively throughout quantitative finance - for example, to simulate asset prices in risk models that aim to estimate key risk metrics such as Value-at-Risk (VaR), Expected Shortfall (ES) and Potential Future Exposure (PFE). Estimating the parameters of a stochastic process - referred to as ‘calibration’ in the parlance …

Continue reading

Deploying Python ML Models with Flask, Docker and Kubernetes

Posted on Thu 10 January 2019 in machine-learning-engineering • Tagged with python, machine-learning, machine-learning-operations, kubernetes


  • 17th August 2019 - updated to reflect changes in the Kubernetes API and Seldon Core.
  • 14th December 2020 - the work in this post forms the basis of the Bodywork MLOps framework - read about it here.

A common pattern for deploying Machine Learning (ML) models into production environments - e.g. ML models …

Continue reading

Bayesian Regression in PYMC3 using MCMC & Variational Inference

Posted on Wed 07 November 2018 in data-science • Tagged with machine-learning, probabilistic-programming, python, pymc3


Conducting a Bayesian data analysis - e.g. estimating a Bayesian linear regression model - will usually require some form of Probabilistic Programming Language (PPL), unless analytical approaches (e.g. based on conjugate prior models) are appropriate for the task at hand. More often than not, PPLs implement Markov Chain Monte Carlo …

Continue reading

Machine Learning Pipelines for R

Posted on Mon 08 May 2017 in r • Tagged with machine-learning, data-processing


Building machine learning and statistical models often requires pre- and post-transformation of the input and/or response variables, prior to training (or fitting) the models. For example, a model may require training on the logarithm of the response and input variables. As a consequence, fitting and then generating predictions from …

Continue reading

elasticsearchr - a Lightweight Elasticsearch Client for R

Posted on Mon 28 November 2016 in r • Tagged with data-processing, data-stores


Elasticsearch is a distributed NoSQL document store search-engine and column-oriented database, whose fast (near real-time) reads and powerful aggregation engine make it an excellent choice as an ‘analytics database’ for R&D, production-use or both. Installation is simple and it ships with default settings that allow it to work effectively out-of-the-box …

Continue reading

Asynchronous and Distributed Programming in R with the Future Package

Posted on Wed 02 November 2016 in r • Tagged with data-processing, high-performance-computing


Every now and again someone comes along and writes an R package that I consider to be a ‘game changer’ for the language and its application to Data Science. For example, I consider dplyr one such package as it has made data munging/manipulation that much more intuitive and more …

Continue reading

An R Function for Generating Authenticated URLs to Private Web Sites Hosted on AWS S3

Posted on Mon 19 September 2016 in r • Tagged with AWS


Quite often I want to share simple (static) web pages with other colleagues or clients. For example, I may have written a report using R Markdown and rendered it to HTML. AWS S3 can easily host such a simple web page (e.g. see here), but it cannot offer …

Continue reading

Building a Data Science Platform for R&D, Part 4 - Apache Zeppelin & Scala Notebooks

Posted on Mon 29 August 2016 in data-science • Tagged with AWS, data-processing


Parts one, two and three of this series of posts have taken us from creating an account on AWS to loading and interacting with data in Spark via R and R Studio. My vision of a Data Science platform for R&D is nearly complete - the only outstanding component is …

Continue reading

Building a Data Science Platform for R&D, Part 3 - R, R Studio Server, SparkR & Sparklyr

Posted on Mon 22 August 2016 in data-science • Tagged with AWS, data-processing, apache-spark


Part 1 and Part 2 of this series dealt with setting up AWS, loading data into S3, deploying a Spark cluster and using it to access our data. In this part we will deploy R and R Studio Server to our Spark cluster’s master node and use it to …

Continue reading

Building a Data Science Platform for R&D, Part 2 - Deploying Spark on AWS using Flintrock

Posted on Thu 18 August 2016 in data-science • Tagged with AWS, data-processing, apache-spark


Part 1 in this series of blog posts describes how to set up AWS with some basic security and then load data into S3. This post walks through the process of setting up a Spark cluster on AWS and accessing our S3 data from within Spark.

A key part of my vision …

Continue reading

Building a Data Science Platform for R&D, Part 1 - Setting-Up AWS

Posted on Tue 16 August 2016 in data-science • Tagged with AWS, data-processing


Here’s my vision: I get into the office and switch on my laptop; then I start up my Spark cluster; I interact with it via RStudio to explore a new dataset a client uploaded overnight; after getting a handle on what I want to do with it, I prototype an ETL …

Continue reading