It Recently Occurred to Me…

Posted on Sun 09 September 2018 in data-science • Tagged with musings

… I should get back into writing/blogging things that come to mind. For example, that type annotations and type-checking with mypy are great tools for data scientists as well as day-to-day developers - e.g.,

def circle_area(radius: float) -> float:
    pi = 3.14159
    return pi * (radius ** 2)

Makes it so much …


Continue reading

Machine Learning Pipelines for R

Posted on Mon 08 May 2017 in r • Tagged with machine-learning, data-processing


Building machine learning and statistical models often requires pre- and post-transformation of the input and/or response variables, prior to training (or fitting) the models. For example, a model may require training on the logarithm of the response and input variables. As a consequence, fitting and then generating predictions from …
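The post describes this pattern for R; as a language-neutral illustration, here is a minimal Python sketch (with made-up, exactly log-linear data) of the core idea - fit on a log-transformed response, then back-transform predictions to the original scale:

```python
import math

# Hypothetical data generated from y = exp(1.0 + 0.5 * x), so a linear
# model fits the log of the response exactly.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [math.exp(1.0 + 0.5 * x) for x in xs]

# Pre-transformation: take logs of the response before fitting.
log_ys = [math.log(y) for y in ys]

# Ordinary least-squares fit of log(y) on x (closed form).
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(log_ys) / n
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, log_ys))
         / sum((x - x_bar) ** 2 for x in xs))
intercept = y_bar - slope * x_bar

# Post-transformation: map predictions back to the original scale.
def predict(x: float) -> float:
    return math.exp(intercept + slope * x)

print(predict(6.0))  # recovers approximately exp(1.0 + 0.5 * 6.0)
```

The point of a pipeline abstraction is that the exponentiation in `predict` is easy to forget when the two transformations live far apart in a script.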


Continue reading

elasticsearchr - a Lightweight Elasticsearch Client for R

Posted on Mon 28 November 2016 in r • Tagged with data-processing, data-stores


Elasticsearch is a distributed NoSQL document store, search engine, and column-oriented database, whose fast (near real-time) reads and powerful aggregation engine make it an excellent choice as an ‘analytics database’ for R&D, production use, or both. Installation is simple, and it ships with default settings that allow it to work effectively out-of-the-box …
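For a flavour of that aggregation engine, this is what a typical aggregation request body looks like in Elasticsearch's query DSL (the index field name here is hypothetical):

```json
{
  "size": 0,
  "aggs": {
    "avg_order_value": {
      "avg": { "field": "order_value" }
    }
  }
}
```

Setting `"size": 0` suppresses the matching documents themselves, so the response contains only the computed aggregate.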


Continue reading

Asynchronous and Distributed Programming in R with the Future Package

Posted on Wed 02 November 2016 in r • Tagged with data-processing, high-performance-computing


Every now and again someone comes along and writes an R package that I consider to be a ‘game changer’ for the language and its application to Data Science. For example, I consider dplyr one such package, as it has made data munging/manipulation that much more intuitive and more …
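The future package brings a futures abstraction to R; readers coming from Python may recognise the same idea from the standard library's `concurrent.futures`, sketched here for comparison:

```python
from concurrent.futures import ThreadPoolExecutor

def slow_square(x: int) -> int:
    # stand-in for an expensive computation
    return x * x

with ThreadPoolExecutor(max_workers=4) as pool:
    # submit() returns immediately with a Future object;
    # result() blocks until the computation has finished.
    futures = [pool.submit(slow_square, x) for x in range(5)]
    results = [f.result() for f in futures]

print(results)  # [0, 1, 4, 9, 16]
```

In both languages the appeal is the same: the call site decides *what* to compute, while the executor (or future "plan") decides *where and how* - sequentially, in threads, or across machines.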


Continue reading

An R Function for Generating Authenticated URLs to Private Web Sites Hosted on AWS S3

Posted on Mon 19 September 2016 in r • Tagged with AWS


Quite often I want to share simple (static) web pages with colleagues or clients. For example, I may have written a report using R Markdown and rendered it to HTML. AWS S3 can easily host such a simple web page (e.g. see here), but it cannot offer …
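As a rough sketch of the idea behind such a function - not the post's R implementation - here is the legacy S3 query-string authentication scheme (Signature Version 2) in pure-stdlib Python; the bucket, key, and credentials are placeholders:

```python
import base64
import hashlib
import hmac
import time
from urllib.parse import quote

def presigned_s3_url(bucket: str, key: str, access_key: str,
                     secret_key: str, expires_in: int = 3600) -> str:
    """Build a time-limited GET URL using the legacy SigV2 scheme."""
    expires = int(time.time()) + expires_in
    # Canonical string-to-sign for a simple GET of this object.
    string_to_sign = f"GET\n\n\n{expires}\n/{bucket}/{key}"
    digest = hmac.new(secret_key.encode(), string_to_sign.encode(),
                      hashlib.sha1).digest()
    signature = base64.b64encode(digest).decode()
    return (f"https://{bucket}.s3.amazonaws.com/{key}"
            f"?AWSAccessKeyId={access_key}"
            f"&Expires={expires}"
            f"&Signature={quote(signature, safe='')}")

url = presigned_s3_url("my-bucket", "report.html",
                       "AKIDEXAMPLE", "secretEXAMPLE")
```

Anyone holding the URL can fetch the object until the `Expires` timestamp passes, after which S3 rejects the request - which is exactly the property you want for sharing a private report.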


Continue reading

Building a Data Science Platform for R&D, Part 4 - Apache Zeppelin & Scala Notebooks

Posted on Mon 29 August 2016 in data-science • Tagged with AWS, data-processing


Parts one, two and three of this series of posts have taken us from creating an account on AWS to loading and interacting with data in Spark via R and R Studio. My vision of a Data Science platform for R&D is nearly complete - the only outstanding component is …


Continue reading

Building a Data Science Platform for R&D, Part 3 - R, R Studio Server, SparkR & Sparklyr

Posted on Mon 22 August 2016 in data-science • Tagged with AWS, data-processing, apache-spark


Part 1 and Part 2 of this series dealt with setting up AWS, loading data into S3, deploying a Spark cluster and using it to access our data. In this part we will deploy R and R Studio Server to our Spark cluster’s master node and use it to …


Continue reading

Building a Data Science Platform for R&D, Part 2 - Deploying Spark on AWS using Flintrock

Posted on Thu 18 August 2016 in data-science • Tagged with AWS, data-processing, apache-spark


Part 1 in this series of blog posts describes how to set up AWS with some basic security and then load data into S3. This post walks through the process of setting up a Spark cluster on AWS and accessing our S3 data from within Spark.

A key part of my vision …


Continue reading

Building a Data Science Platform for R&D, Part 1 - Setting-Up AWS

Posted on Tue 16 August 2016 in data-science • Tagged with AWS, data-processing


Here’s my vision: I get into the office and switch on my laptop; then I start up my Spark cluster; I interact with it via RStudio to explore a new dataset a client uploaded overnight; after getting a handle on what I want to do with it, I prototype an ETL …


Continue reading