Best Practices for Engineering ML Pipelines - Part 2

Posted on Mon 07 November 2022 in machine-learning-engineering • Tagged with python, machine-learning, mlops, kubernetes, bodywork


This is the second part in a series of articles demonstrating best practices for engineering ML pipelines and deploying them to production. In the first part we focused on project setup - everything from codebase structure to configuring a CI/CD pipeline and making an initial deployment of a skeleton pipeline …


Continue reading

Best Practices for Engineering ML Pipelines - Part 1

Posted on Wed 03 March 2021 in machine-learning-engineering • Tagged with python, machine-learning, mlops, kubernetes, bodywork


This is the first in a series of articles demonstrating how to engineer a machine learning pipeline and deploy it to a production environment. We’re going to assume that a solution to an ML problem already exists within a Jupyter notebook, and that our task is to engineer this …


Continue reading

Deploying ML Models with Bodywork

Posted on Tue 01 December 2020 in machine-learning-engineering • Tagged with python, machine-learning, mlops, kubernetes, bodywork


Solutions to ML problems are usually first developed in Jupyter notebooks. We are then faced with an altogether different problem - how to engineer these notebook solutions into our products and systems, and how to maintain their performance through time, after new data is generated …


Continue reading

Best Practices for PySpark ETL Projects

Posted on Sun 28 July 2019 in data-engineering • Tagged with data-engineering, data-processing, apache-spark, python


I have often leaned heavily on Apache Spark and the SparkSQL APIs for operationalising any type of batch data-processing ‘job’, within a production environment where handling fluctuating volumes of data reliably and consistently is an ongoing business concern. These batch data-processing jobs may involve nothing more than joining data sources and …


Continue reading

Stochastic Process Calibration using Bayesian Inference & Probabilistic Programs

Posted on Fri 18 January 2019 in data-science • Tagged with probabilistic-programming, python, pymc3, quant-finance, stochastic-processes


Stochastic processes are used extensively throughout quantitative finance - for example, to simulate asset prices in risk models that aim to estimate key risk metrics such as Value-at-Risk (VaR), Expected Shortfall (ES) and Potential Future Exposure (PFE). Estimating the parameters of a stochastic process - referred to as ‘calibration’ in the parlance …


Continue reading

Deploying Python ML Models with Flask, Docker and Kubernetes

Posted on Thu 10 January 2019 in machine-learning-engineering • Tagged with python, machine-learning, machine-learning-operations, kubernetes


  • 17th August 2019 - updated to reflect changes in the Kubernetes API and Seldon Core.
  • 14th December 2020 - the work in this post forms the basis of the Bodywork MLOps tool - read about it here.

A common pattern for deploying Machine Learning (ML) models into production environments - e.g. ML models …


Continue reading

Bayesian Regression in PYMC3 using MCMC & Variational Inference

Posted on Wed 07 November 2018 in data-science • Tagged with machine-learning, probabilistic-programming, python, pymc3


Conducting a Bayesian data analysis - e.g. estimating a Bayesian linear regression model - will usually require some form of Probabilistic Programming Language (PPL), unless analytical approaches (e.g. based on conjugate prior models) are appropriate for the task at hand. More often than not, PPLs implement Markov Chain Monte Carlo …


Continue reading

Machine Learning Pipelines for R

Posted on Mon 08 May 2017 in r • Tagged with machine-learning, data-processing


Building machine learning and statistical models often requires pre- and post-transformation of the input and/or response variables, prior to training (or fitting) the models. For example, a model may need to be trained on the logarithm of the response and input variables. As a consequence, fitting and then generating predictions from …


Continue reading

elasticsearchr - a Lightweight Elasticsearch Client for R

Posted on Mon 28 November 2016 in r • Tagged with data-processing, data-stores


Elasticsearch is a distributed NoSQL document store, search engine and column-oriented database, whose fast (near real-time) reads and powerful aggregation engine make it an excellent choice as an ‘analytics database’ for R&D, production use or both. Installation is simple, and it ships with default settings that allow it to work effectively out-of-the-box …


Continue reading

Asynchronous and Distributed Programming in R with the Future Package

Posted on Wed 02 November 2016 in r • Tagged with data-processing, high-performance-computing


Every now and again someone comes along and writes an R package that I consider to be a ‘game changer’ for the language and its application to Data Science. For example, I consider dplyr one such package, as it has made data munging/manipulation that much more intuitive and more …


Continue reading

An R Function for Generating Authenticated URLs to Private Web Sites Hosted on AWS S3

Posted on Mon 19 September 2016 in r • Tagged with AWS


Quite often I want to share simple (static) web pages with other colleagues or clients. For example, I may have written a report using R Markdown and rendered it to HTML. AWS S3 can easily host such a simple web page (e.g. see here), but it cannot, however, offer …


Continue reading

Building a Data Science Platform for R&D, Part 4 - Apache Zeppelin & Scala Notebooks

Posted on Mon 29 August 2016 in data-science • Tagged with AWS, data-processing


Parts one, two and three of this series of posts have taken us from creating an account on AWS to loading and interacting with data in Spark via R and R Studio. My vision of a Data Science platform for R&D is nearly complete - the only outstanding component is …


Continue reading

Building a Data Science Platform for R&D, Part 3 - R, R Studio Server, SparkR & Sparklyr

Posted on Mon 22 August 2016 in data-science • Tagged with AWS, data-processing, apache-spark


Part 1 and Part 2 of this series dealt with setting up AWS, loading data into S3, deploying a Spark cluster and using it to access our data. In this part we will deploy R and R Studio Server to our Spark cluster’s master node and use it to …


Continue reading

Building a Data Science Platform for R&D, Part 2 - Deploying Spark on AWS using Flintrock

Posted on Thu 18 August 2016 in data-science • Tagged with AWS, data-processing, apache-spark


Part 1 in this series of blog posts describes how to set up AWS with some basic security and then load data into S3. This post walks through the process of setting up a Spark cluster on AWS and accessing our S3 data from within Spark.

A key part of my vision …


Continue reading

Building a Data Science Platform for R&D, Part 1 - Setting-Up AWS

Posted on Tue 16 August 2016 in data-science • Tagged with AWS, data-processing


Here’s my vision: I get into the office and switch on my laptop; then I start up my Spark cluster; I interact with it via RStudio to explore a new dataset a client uploaded overnight; after getting a handle on what I want to do with it, I prototype an ETL …


Continue reading