Slides and code from my RebelCon talk
For anyone who is interested, I've just posted slides for my talk here. You can also check out the source code for the pipeline I demonstrated on GitHub.…
Working with moderate numbers of sparse time series is pretty straightforward using standard tools like pandas, but increasingly I have to work with large numbers of sparse time series in AWS Athena / PrestoDB, where those handy pandas functions aren't always available.…
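For reference, this is the kind of gap-filling that's a one-liner in pandas (a minimal sketch; the dates, values and daily frequency are just for illustration):

```python
import pandas as pd

# A sparse series: observations exist for only a few dates
ts = pd.Series(
    [10, 25, 40],
    index=pd.to_datetime(["2020-01-01", "2020-01-04", "2020-01-07"]),
)

# Reindex onto a complete daily range, filling the gaps with zeros
dense = ts.asfreq("D", fill_value=0)
print(dense)
```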
Normally, you don't need to access the underlying Hadoop configuration when you're using PySpark but, just in case you do, you can access it like this:

```python
from pyspark.sql import SparkSession

...

# Extract the configuration
spark = SparkSession.builder.getOrCreate()
hadoop_config = spark._jsc.hadoopConfiguration()

# Set a new config value
hadoop_config.set(…
```
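The excerpt cuts off mid-call; to show the shape of a complete call, here's a purely illustrative key/value pair (not taken from the post):

```python
# Illustrative only: substitute whatever Hadoop property you actually need
hadoop_config.set("fs.s3a.connection.maximum", "100")
```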
Amazon EMR seems like the natural choice for running production Spark clusters on AWS, but it's not so suited for development because it doesn't support interactive PySpark sessions.…
Currently, SageMaker supports Python 3.6 kernels only, which means you can't run any 3.7 or 3.8 code out of the box. Fortunately, there's an easy fix.…
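I won't guess at the exact fix the post describes, but one common workaround is to create a fresh conda environment with the Python version you need and register it as a Jupyter kernel from the notebook instance's terminal (a sketch, assuming conda and ipykernel are available; the environment name and Python 3.8 version are just examples):

```bash
# Create a new conda environment with a newer Python (3.8 here, purely illustrative)
conda create -y -n py38 python=3.8 ipykernel

# Register it as a Jupyter kernel so it appears in SageMaker's kernel list
conda run -n py38 python -m ipykernel install --user --name py38 --display-name "Python 3.8"
```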
SageMaker uses conda for package management, which complicates things if you manage packages for your project with pipenv. One quick workaround is to use pipenv to generate a requirements.txt file and feed it to pip, which then modifies the active conda environment.…
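A sketch of that workaround (note: older pipenv releases exported the lock file with `pipenv lock -r`; newer ones use `pipenv requirements`):

```bash
# Export the locked dependencies to a requirements file
pipenv requirements > requirements.txt   # older pipenv: pipenv lock -r > requirements.txt

# Install them into whichever conda environment is currently active
pip install -r requirements.txt
```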
Docker has a nice build context feature that lets you specify a git repository URL as the location to build an image from. So, for instance, you can roll your own Redis image like this:

```bash
docker build -t myredis:5 https://github.com/docker-library/redis.git#master:5.0
```

But…
Here's a nice shortcut for bash that exports all environment variables from a dotenv (.env) file into the current terminal session.…
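I'm not sure which variant the post settles on, but two common ways to do it (assuming the .env file contains plain KEY=value lines) are:

```bash
# Option 1: export every non-comment KEY=value line (no spaces or quoting in values)
export $(grep -v '^#' .env | xargs)

# Option 2: let the shell parse the file; set -a marks every assignment for export
set -a; source .env; set +a
```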
The Apache Airflow UI is nice to look at, but it's a pretty clunky way to manage your pipeline configuration. One alternative is to store your DAG configuration in YAML and use it to set the default configuration in the Airflow database when the DAG is first run.…
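I can't be sure exactly how the post wires this up, but a minimal sketch of the idea, seeding Airflow Variables from a hypothetical config.yaml that lives next to the DAG file, might look like this:

```python
import os

import yaml
from airflow.models import Variable

# Hypothetical YAML file sitting alongside the DAG definition
CONFIG_PATH = os.path.join(os.path.dirname(__file__), "config.yaml")


def seed_default_config():
    """Copy defaults from YAML into Airflow Variables, without overwriting existing values."""
    with open(CONFIG_PATH) as f:
        defaults = yaml.safe_load(f)
    for key, value in defaults.items():
        # Only write the default if the Variable hasn't been set in the database yet
        if Variable.get(key, default_var=None) is None:
            Variable.set(key, value)
```

You would then call something like seed_default_config() from the first task in the DAG (or a dedicated setup task) so the defaults land in the database on the first run.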
The Cork Open Data Dashboard is a visualisation and analysis tool for several of the datasets on Cork City's open data portal, built with open source tools including Docker, InfluxDB and Grafana. The dashboard is essentially an expanded and refined version of the parking dashboard I developed last year, and…