GoDataDriven Open Source Contribution for Q4 2019

Fokko Driesprong
07 February, 2020
In the last quarter of 2019, the GoDataDriven team again contributed to many open source projects. At GoDataDriven we love open source. Open source enables us to fix bugs on our own, and to go through the code if we can't find anything in the documentation (looking at you, Airflow). This removes blockers for our customers and enables them to continue their quest to become more data driven. At the same time, we try to make the world a better place by patching bugs, fixing security issues and implementing features.

Takeoff

Takeoff is a fully open source deployment tool, initially developed at Schiphol Airport. It helps you automate and create fully reproducible deployments on your favorite cloud. For more information, please refer to the excellent blog by the authors.

Daniel Heres, who recently joined us, took a swing at Takeoff, bumped into some typos in the documentation, and took the time to fix them:

Daniel van der Ende also took the time to improve the docs:

Finally, he also found the time to squash a couple of bugs and add some functionality:

Data Build Tool (DBT)

While getting familiar with the codebase of DBT, I encountered some minor code smells, and decided to add some annotations to make the code more readable:

Furthermore, as Apache Spark lovers, we took the time to extend the support for it:

Apache Airflow

Many of our clients use Airflow to orchestrate their data pipelines. Recently, there has been an effort to enable storing state at the task level. The first idea was to store this in the xcom table. XCom stands for cross-communication, and is used for sharing inter-task state. With some minor tweaks, we could also use it for intra-task state sharing. The main issue with state is that some executions become non-idempotent by definition. This is something that we're still working on, and details can be found in the AIP.
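To make the idempotency concern concrete, here is a minimal sketch in plain Python. It does not use Airflow's API: the `StateStore` dictionary, the `stateless_task` and `stateful_task` functions, and the `example_dag`/`load` names are all hypothetical stand-ins for a task-level state table. It only illustrates why a task that reads and advances its own stored state gives a different result when it is rerun.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory stand-in for a task-level state table.
# This is NOT Airflow's XCom API, only an illustration of the idea.
StateStore = Dict[Tuple[str, str], int]

ROWS: List[int] = [1, 2, 3, 4, 5]


def stateless_task(window: Tuple[int, int]) -> List[int]:
    """Idempotent: the output depends only on the window passed in."""
    start, end = window
    return ROWS[start:end]


def stateful_task(store: StateStore, dag_id: str, task_id: str, batch_size: int) -> List[int]:
    """Non-idempotent: reads a stored offset, processes the next batch,
    and then advances the offset, so a rerun sees different state."""
    key = (dag_id, task_id)
    offset = store.get(key, 0)
    batch = ROWS[offset:offset + batch_size]
    store[key] = offset + len(batch)
    return batch


store: StateStore = {}

# Rerunning the stateless task with the same input gives the same output.
first_stateless = stateless_task((0, 2))
rerun_stateless = stateless_task((0, 2))

# Rerunning the stateful task gives a different output, because the
# first run already moved the stored offset forward.
first_stateful = stateful_task(store, "example_dag", "load", 2)
rerun_stateful = stateful_task(store, "example_dag", "load", 2)
```

In this sketch, a rerun of the stateful task silently skips the rows that the first run consumed, which is the kind of behavior that makes retries and backfills hard to reason about once tasks carry their own state.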

Apache Parquet

Parquet is the de facto file format for OLAP workloads on data lakes. We're working on getting ready for Java 11, so some dependencies were updated:

Apache Avro

In Avro, we discovered a regression that was uncovered by the integration tests of Apache Iceberg. It was introduced by the big refactor of the schema resolution logic. The bug took me around three days to find, and the fix was a single line:

After fixing that bug, we started the release process for Avro 1.9.2. This includes updating some of the dependencies to the latest version, to make sure that we're up to date:

Fix the CI:

Apache Spark

For Apache Spark, we've done some security patches:

And GoDataDriven is now mentioned on the Powered By section of the Spark website:

Apache Iceberg (Incubating)

While playing around with Iceberg and getting familiar with it, I noticed that the docs were incomplete. This naturally resulted in a PR:

In addition, I fixed assorted issues reported by the code smell detector, in order to get more familiar with the codebase:

Other

And various other fixes:
