Building machine learning and statistical models often requires transforming the input and/or response variables before training (or fitting), and inverse-transforming model output afterwards. For example, a model may need to be trained on the logarithm of the response and input variables. Consequently, fitting these models and generating predictions from them requires repeated application of transformation and inverse-transformation functions – to get from the domain of the original input variables to the domain of the original output variables (via the model). This is usually a laborious and repetitive process that leads to messy code and notebooks.
The pipeliner package aims to provide an elegant solution to these issues by implementing a common interface and workflow with which it is possible to:
- define transformation and inverse-transformation functions;
- fit a model on training data; and then,
- generate a prediction (or model-scoring) function that automatically applies the entire pipeline of transformations and inverse-transformations to the inputs of the inner model and to its predicted values (or scores).
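To make the pain-point concrete, here is a hand-rolled sketch of the kind of workflow pipeliner is designed to replace – the log transformations and toy data below are purely illustrative, not taken from the package itself:

```r
# toy training data
train <- data.frame(x = c(1, 2, 4, 8, 16),
                    y = c(2.1, 3.9, 8.3, 15.8, 32.5))

# transform input and response variables before fitting
train_t <- data.frame(x = log(train$x), y = log(train$y))
model <- lm(y ~ x, data = train_t)

# every scoring call must repeat the transform and then invert it
score <- function(new_data) {
  new_t <- data.frame(x = log(new_data$x))  # transform inputs
  exp(predict(model, newdata = new_t))      # inverse-transform predictions
}

score(data.frame(x = c(3, 5)))
```

pipeliner wraps this pattern in a single pipeline object, so the transformation and inverse-transformation steps are applied automatically whenever predictions are generated.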
Elasticsearch is a distributed NoSQL document store, search engine and column-oriented database, whose fast (near real-time) reads and powerful aggregation engine make it an excellent choice as an ‘analytics database’ for R&D, production use, or both. Installation is simple, it ships with sensible default settings that allow it to work effectively out-of-the-box, and all interaction is made via a set of intuitive and extremely well documented RESTful APIs. I’ve been using it for two years now and I am evangelical.
The elasticsearchr package implements a simple Domain-Specific Language (DSL) for indexing, deleting, querying, sorting and aggregating data in Elasticsearch, from within R. The main purpose of this package is to remove the labour involved with assembling HTTP requests to Elasticsearch’s REST APIs and parsing the responses. Instead, users of this package need only send data frames to, and receive data frames from, Elasticsearch resources. Users needing richer functionality are encouraged to investigate the excellent elastic package from the good people at rOpenSci.
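As a flavour of the DSL – a minimal sketch that assumes a local Elasticsearch instance listening on localhost:9200 (the index and document-type names are illustrative):

```r
library(elasticsearchr)

# reference an Elasticsearch resource: cluster URL, index and document type
es <- elastic("http://localhost:9200", "iris", "data")

# index a data frame
es %index% iris

# queries are defined as JSON and executed with %search%,
# returning the results as a data frame
for_everything <- query('{"match_all": {}}')
es %search% for_everything
```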
Every now and again someone comes along and writes an R package that I consider to be a ‘game changer’ for the language and its application to Data Science. For example, I consider dplyr one such package, as it has made data munging/manipulation far more intuitive and productive than it was before. Although I first read about it only at the beginning of this week, my instinct tells me that in Henrik Bengtsson’s future package we might have another such game-changing R package.
The future package provides an API for futures (or promises) in R. To quote Wikipedia, a future or promise is,
… a proxy for a result that is initially unknown, usually because the computation of its value is yet incomplete.
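In R terms, a minimal sketch using future’s assignment operator – the multisession plan and the toy computation are illustrative:

```r
library(future)
plan(multisession)  # resolve futures asynchronously in background R sessions

# %<-% creates a future: evaluation of the block starts straight away,
# but control returns to the console immediately
x %<-% {
  Sys.sleep(5)  # stand-in for a long-running computation
  42
}

# first access blocks until the future is resolved, then yields the value
x
```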
Quite often I want to share simple (static) web pages with colleagues or clients. For example, I may have written a report using R Markdown and rendered it to HTML. AWS S3 can easily host such a simple web page (e.g. see here), but it cannot offer any authentication to prevent anyone from accessing potentially sensitive information.
Yegor Bugayenko has created an external service, S3Auth.com, that sits in front of any S3-hosted web site, but this is a little too much for my needs. All I want to achieve is to limit access to specific S3 resources that will be largely transient in nature. A viable and simple solution is to use ‘query string request authentication’, which is described in detail here. I must confess that I did not really understand what was going on until I had dug around on the web to see what others had been up to.
This blog post describes a simple R function for generating authenticated and ephemeral URLs to private S3 resources (including web pages) that only the holders of the URL can access.
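The essence of such a function is sketched below – a signature version 2 ‘query string’ authorisation, where the function name and credential arguments are placeholders of mine rather than anything prescribed by AWS:

```r
library(digest)   # hmac()
library(openssl)  # base64_encode()

# generate an ephemeral, authenticated URL to a private S3 object
s3_presigned_url <- function(bucket, key, access_key, secret_key,
                             ttl_seconds = 3600) {
  # URL expires this many seconds from now (as a Unix timestamp)
  expires <- as.integer(Sys.time()) + ttl_seconds

  # the canonical string that AWS expects to be signed
  string_to_sign <- paste0("GET\n\n\n", expires, "\n/", bucket, "/", key)

  # sign it with HMAC-SHA1 using the secret key, then base64-encode
  signature <- openssl::base64_encode(
    digest::hmac(secret_key, string_to_sign, algo = "sha1", raw = TRUE)
  )

  paste0(
    "https://", bucket, ".s3.amazonaws.com/", key,
    "?AWSAccessKeyId=", access_key,
    "&Expires=", expires,
    "&Signature=", utils::URLencode(signature, reserved = TRUE)
  )
}
```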
Parts one, two and three of this series of posts have taken us from creating an account on AWS to loading and interacting with data in Spark via R and RStudio. My vision of a Data Science platform for R&D is nearly complete – the only outstanding component is the ability to interact (REPL-style) with Spark using code written in Scala, and to run this on some sort of scheduled basis. So, for this last part I am going to focus on getting Apache Zeppelin up-and-running.
Zeppelin is a notebook server in a similar vein to Jupyter or Beaker notebooks (and very similar to those available on Databricks). Code is submitted and executed in ‘chunks’, with interim output (e.g. charts and tables) displayed after it has been computed. Where Zeppelin differs from the others is in its first-class support for Spark and its ability to run notebooks (and thereby ETL processes) on a schedule (in essence, it uses cron for scheduling and execution).
Part 1 and Part 2 of this series dealt with setting up AWS, loading data into S3, deploying a Spark cluster and using it to access our data. In this part we will deploy R and RStudio Server to our Spark cluster’s master node, which will serve my favorite R IDE: RStudio.
We will then install and configure both the sparklyr and SparkR packages for connecting to and interacting with Spark and our data. After this, we will be on our way to interacting with and computing on large-scale data as if it were sitting on our laptops.
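As a taste of what this enables, here is a minimal sparklyr sketch – the master URL is a placeholder of mine, and it assumes Spark is already installed and discoverable on the node:

```r
library(sparklyr)
library(dplyr)

# connect to the cluster from RStudio (master URL is a placeholder)
sc <- spark_connect(master = "spark://master-node:7077")

# copy a local data frame into Spark and manipulate it with dplyr verbs;
# the heavy lifting happens on the cluster, collect() brings results back
mtcars_tbl <- copy_to(sc, mtcars, "mtcars")

mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  collect()

spark_disconnect(sc)
```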
Part 1 in this series of blog posts describes how to set up AWS with some basic security and then load data into S3. This post walks through the process of setting up a Spark cluster on AWS and accessing our S3 data from within Spark.
A key part of my vision for a Spark-based R&D platform is being able to launch, stop, start and then connect to a cluster from my laptop. By this I mean that I don’t want to have to interact directly with AWS every time I want to switch my cluster on or off. Versions of Spark prior to v2 shipped with a folder in the Spark home directory, ec2/, containing scripts for doing exactly this from the terminal. I was perturbed to find this folder missing in Spark 2.0 and ‘Amazon EC2’ missing from the ‘Deploying’ menu of the official Spark documentation. It appears that these scripts have not been actively maintained, and as such they’ve been moved to a separate GitHub repo for the foreseeable future. I spent a little time trying to get them to work, but as yet they do not support v2 of Spark. They also don’t give you the flexibility of choosing which version of Hadoop to install alongside Spark, and this can cause headaches when it comes to accessing data on S3 (a bit more on this later).
I’m very keen on using Spark 2.0, so I needed an alternative solution. Manually firing-up VMs on EC2 and installing Spark and Hadoop on each node was out of the question, as was an ascent of the AWS DevOps learning-curve required to automate such a process. This sort of thing is not part of my day-job and I don’t have the time otherwise. So I turned to Google and was very happy to stumble upon the Flintrock project on GitHub. It’s still in its infancy, but using it I managed to achieve everything I could do with the old Spark ec2 scripts, with far greater flexibility and speed. It is really rather good and I will be using it for Spark cluster management.
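For reference, the day-to-day Flintrock workflow looks roughly like the following – the cluster name and options are illustrative, and flags may vary between versions:

```bash
# launch a named cluster on EC2, then pause/resume it as needed
flintrock launch my-spark-cluster --num-slaves 2
flintrock stop my-spark-cluster
flintrock start my-spark-cluster

# open a shell on the master node, and tear everything down when finished
flintrock login my-spark-cluster
flintrock destroy my-spark-cluster
```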
Here’s my vision: I get into the office and switch on my laptop; then I start up my Spark cluster; I interact with it via RStudio to explore a new dataset a client uploaded overnight; after getting a handle on what I want to do with it, I prototype an ETL and/or model-building process in Scala using Zeppelin, and I might even ask it to run every hour to see how it fares.
In all likelihood this is going to be more than one day’s work, but you get the idea – I want a workspace that lets me use production-scale technologies to test ideas and processes that are a small step away from being handed over to someone who can put them into production.