Accessing Hadoop configuration from PySpark

Normally, you don't need to access the underlying Hadoop configuration when you're using PySpark, but just in case you do, you can access it like this:

from pyspark.sql import SparkSession

...

# Extract the configuration
spark = SparkSession.builder.getOrCreate()
hadoop_config = spark._jsc.hadoopConfiguration()

# Set a new config value
hadoop_config.set(…
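For a fuller picture, here's a minimal, self-contained sketch of the same pattern; the fs.s3a.connection.maximum key is just an illustrative placeholder for whatever Hadoop property you actually need to read or set:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hadoopConfiguration() is reached through the underlying Java SparkContext
hadoop_config = spark._jsc.hadoopConfiguration()

# Read an existing value (returns None if the key isn't set)
print(hadoop_config.get("fs.defaultFS"))

# Set a value -- an S3A connection-pool property, purely for illustration
hadoop_config.set("fs.s3a.connection.maximum", "100")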

Using pipenv with SageMaker

SageMaker uses conda for package management, which complicates things if you manage your project's packages with pipenv. One quick workaround is to use pipenv to generate a requirements.txt file and pipe the output to pip, which then modifies the active conda environment.…
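A rough sketch of that workaround as it might be run from inside a SageMaker notebook's active conda environment (the exact command depends on your pipenv version: older releases use pipenv lock -r, newer ones pipenv requirements):

import subprocess

# From the project directory containing the Pipfile, export the locked
# dependencies in requirements.txt format
# (use "pipenv requirements" on newer pipenv releases)
subprocess.run("pipenv lock -r > requirements.txt", shell=True, check=True)

# Install them with pip into whichever conda environment is currently active
subprocess.run(["pip", "install", "-r", "requirements.txt"], check=True)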

Building Docker images from private GitHub repos

Docker has a nice build context feature that lets you specify a git repository URL as the location to build an image from. So, for instance, you can roll your own Redis image like this:

docker build -t myredis:5 https://github.com/docker-library/redis.git#master:5.0

But…

Configuring Airflow DAGs with YAML

The Apache Airflow UI is nice to look at, but it's a pretty clunky way to manage your pipeline configuration. One alternative is to store your DAG configuration in YAML and use it to set the default configuration in the Airflow database when the DAG is first run.…
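One way to wire that up, sketched below under some assumptions (a config.yml sitting next to the DAG file, and an Airflow Variable named my_pipeline_config holding the defaults; both names are made up for illustration):

import os

import yaml
from airflow.models import Variable

# Load the YAML configuration that lives alongside this DAG definition
config_path = os.path.join(os.path.dirname(__file__), "config.yml")
with open(config_path) as f:
    dag_config = yaml.safe_load(f)

# Seed the Airflow metadata database the first time the DAG file is parsed,
# without overwriting a value that has since been edited in the UI
Variable.setdefault("my_pipeline_config", dag_config, deserialize_json=True)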