Best Practices for PySpark ETL Projects

Posted on Sun 28 July 2019 in data-engineering • Tagged with data-engineering, data-processing, apache-spark, python


I have often leant heavily on Apache Spark and the SparkSQL APIs for operationalising any type of batch data-processing ‘job’ within a production environment, where handling fluctuating volumes of data reliably and consistently is an ongoing business concern. These batch data-processing jobs may involve nothing more than joining data sources and …
