Linting notebooks with GitHub Actions

Over the past year, GitHub Actions has become my go-to CI/CD tool for software projects, data projects, the source of this blog, and even my CV! I’ve added workflows...

Filling missing timestamps using Amazon Athena

Working with moderate numbers of sparse time series is pretty straightforward using standard tools like pandas, but increasingly I have to work with large numbers of sparse time series in...
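
The pandas route mentioned here is worth sketching, with hypothetical data and an assumed hourly frequency: reindexing a sparse series against a complete DatetimeIndex fills in the missing timestamps.

    import pandas as pd

    # A sparse series with observations for only some hours (hypothetical data)
    sparse = pd.Series(
        [1.0, 3.0],
        index=pd.to_datetime(["2020-03-01 00:00", "2020-03-01 03:00"]),
    )

    # Build the full hourly index over the observed range, then reindex;
    # timestamps with no observation receive the fill value
    full_index = pd.date_range(sparse.index.min(), sparse.index.max(), freq="H")
    filled = sparse.reindex(full_index, fill_value=0.0)
    print(filled)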

Accessing Hadoop configuration from PySpark

Normally, you don’t need to access the underlying Hadoop configuration when you’re using PySpark, but just in case you do, you can access it like this: from pyspark.sql import SparkSession...
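
As a rough sketch of the usual pattern (the configuration keys below are only examples), the Hadoop Configuration object is reachable through the JVM-backed SparkContext that PySpark wraps:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hadoop-conf-example").getOrCreate()

    # _jsc is the underlying JavaSparkContext; hadoopConfiguration() returns the
    # org.apache.hadoop.conf.Configuration used by this session
    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()

    # Read and write values (keys shown here are just examples)
    print(hadoop_conf.get("fs.defaultFS"))
    hadoop_conf.set("fs.s3a.connection.maximum", "100")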

Using custom Spark clusters (and interactive PySpark!) with SageMaker

Amazon EMR seems like the natural choice for running production Spark clusters on AWS, but it’s less well suited to development because it doesn’t support interactive PySpark sessions (at least...
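
A minimal sketch of what an interactive session against a separately managed cluster might look like, assuming a Spark standalone cluster; the master URL and app name below are placeholders:

    from pyspark.sql import SparkSession

    # Point the notebook's PySpark session at the cluster's master URL
    # (placeholder address); work is distributed to the cluster's executors
    spark = (
        SparkSession.builder
        .master("spark://spark-master.example.com:7077")
        .appName("interactive-session")
        .getOrCreate()
    )

    # Run a trivial job interactively to confirm the connection works
    spark.range(10).show()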

Using custom Python versions with SageMaker

Currently (as of March 2020), SageMaker supports Python 3.6 kernels only, which means you can’t run any 3.7 or 3.8 code out of the box. Fortunately, there’s an easy (though...