In the last quarter of 2019, the GoDataDriven team again contributed to many open source projects. At GoDataDriven we love open source. It enables us to fix bugs ourselves, and to read the code when we can't find anything in the documentation (looking at you, Airflow). This removes blockers for our customers and lets them continue their quest to become more data driven. At the same time, we try to make the world a better place by patching bugs, fixing security issues, and implementing features.
Takeoff is a deployment tool, initially developed at Schiphol Airport and now fully open source. It helps you automate and create fully reproducible deployments on your favorite cloud. For more information, please refer to the excellent blog by the authors.
Daniel Heres, who recently joined us, took a swing at Takeoff, ran into some typos in the documentation, and took the time to fix them:
- Rename Runway -> Takeoff
- Fix link in documentation
- Some renaming, changes in schema
- azure_service_principal -> service_principal
- Fix typo
- Fix for example code in takeoff plugins documentation
Daniel van der Ende also took the time to improve the docs:
Finally, he also found the time to squash a couple of bugs and add some functionality:
- Fix bug in latest docker tag
- Bugfix: use correct client for azure provider
- Fix base64 encoding of k8s templated values
- Add ability to pass custom values to k8s jinja
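To sketch why the base64 fix above matters: Kubernetes Secret manifests expect their `data` values to be base64 encoded, so anything templated into a Secret has to be encoded first. Below is a minimal, hypothetical illustration using Python's stdlib `string.Template` as a stand-in for the Jinja templating Takeoff uses; the manifest and variable names are made up:

```python
import base64
from string import Template

# A simplified Kubernetes Secret manifest; real Takeoff templates use Jinja.
secret_template = Template(
    "apiVersion: v1\n"
    "kind: Secret\n"
    "metadata:\n"
    "  name: $name\n"
    "data:\n"
    "  password: $password_b64\n"
)

def render_secret(name: str, password: str) -> str:
    # Secret values must be base64 encoded, not inserted verbatim.
    encoded = base64.b64encode(password.encode("utf-8")).decode("ascii")
    return secret_template.substitute(name=name, password_b64=encoded)

manifest = render_secret("db-credentials", "s3cr3t")
```

If the encoding step is skipped, `kubectl apply` rejects the Secret (or worse, stores a garbled value), which is exactly the class of bug the fix addresses.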
Data Build Tool (DBT)
While getting familiar with the DBT codebase, I encountered some minor code smells and decided to add some annotations to make the code more readable:
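To illustrate the kind of change (this is not actual DBT code; the function below is hypothetical), adding type annotations makes the expected inputs and outputs explicit without changing behavior:

```python
from typing import Dict, List

# Without annotations it is unclear what `models` contains or what is returned.
def select_models(models, tag):
    return [m["name"] for m in models if tag in m.get("tags", [])]

# Annotated version: the signature now documents itself.
def select_models_annotated(models: List[Dict[str, object]], tag: str) -> List[str]:
    return [str(m["name"]) for m in models if tag in m.get("tags", [])]
```

Annotations like these also let tools such as mypy catch mistakes before they reach a user.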
Furthermore, as Apache Spark lovers, we took the time to extend DBT's support for it:
Many of our clients use Airflow to orchestrate their data pipelines. Recently, there has been an effort to enable storing state at the task level. The first idea was to store this in the xcom table. XCom stands for cross-communication and is used for sharing inter-task state. With some minor tweaks, we could also use it for intra-task state sharing. The main issue with state is that some executions become non-idempotent by definition. This is something we're still working on; details can be found in the AIP.
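To make the inter-task versus intra-task distinction concrete, here is a minimal, self-contained sketch of an XCom-like table keyed on DAG, task, execution date, and key. This is an illustration of the idea only, not Airflow's actual implementation; all names are made up:

```python
from typing import Any, Dict, Tuple

# Key mirrors the shape of Airflow's xcom table:
# (dag_id, task_id, execution_date, key).
XComKey = Tuple[str, str, str, str]

class XComTable:
    def __init__(self) -> None:
        self._rows: Dict[XComKey, Any] = {}

    def push(self, dag_id: str, task_id: str, execution_date: str,
             key: str, value: Any) -> None:
        self._rows[(dag_id, task_id, execution_date, key)] = value

    def pull(self, dag_id: str, task_id: str, execution_date: str,
             key: str = "return_value") -> Any:
        return self._rows.get((dag_id, task_id, execution_date, key))

xcom = XComTable()
# Inter-task: task "extract" shares a value that task "load" can pull.
xcom.push("etl", "extract", "2019-12-01", "return_value", "s3://bucket/file")
# Intra-task: the same task stores state for itself under a separate key.
xcom.push("etl", "extract", "2019-12-01", "last_offset", 42)
```

The idempotency concern follows directly: once a task's behavior depends on state it wrote in an earlier run, re-running that task no longer necessarily produces the same result.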
Parquet is the de-facto file format for OLAP workloads on data lakes. We're working on getting it ready for Java 11, so some dependencies were updated:
In Avro, we discovered a regression bug that was uncovered by the integration tests of Apache Iceberg. It was introduced after the big refactor of the schema-resolution logic. The bug took me around three days to find, and it was fixed in a single line:
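For context, schema resolution is the logic that reads data written with one schema (the writer's) through another (the reader's), matching fields by name and falling back to defaults for fields only the reader knows about. A minimal sketch of that idea in Python; this is not Avro's actual Java implementation, and the field names are made up:

```python
from typing import Any, Dict, List

def resolve_record(writer_record: Dict[str, Any],
                   reader_fields: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Project a record written with the writer schema onto the reader schema."""
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in writer_record:
            # Field present in both schemas: take the writer's value.
            resolved[name] = writer_record[name]
        elif "default" in field:
            # Field only in the reader schema: fall back to its default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"No value or default for field {name!r}")
    return resolved

reader_fields = [
    {"name": "id", "type": "long"},
    {"name": "email", "type": "string", "default": ""},
]
# Writer-only fields (here "legacy") are silently dropped.
record = resolve_record({"id": 1, "legacy": True}, reader_fields)
```

Subtle bugs in exactly this kind of matching logic are easy to introduce during a refactor and hard to spot without downstream integration tests, which is how Iceberg's test suite caught it.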
After fixing that bug, we started the release process for Avro 1.9.2, which includes updating some dependencies to their latest versions:
- AVRO-2586: Bump spotless-maven-plugin from 1.24.1 to 1.25.1
- AVRO-2585: Bump jetty.version from 9.4.20.v20190813 to 9.4.21.v20190926
- AVRO-2584: Bump netty-codec-http2 from 4.1.39.Final to 4.1.42.Final
- AVRO-2582: Bump protobuf-java from 3.9.1 to 3.10.0
- AVRO-2583: Bump grpc.version from 1.23.0 to 1.24.0
Fix the CI:
For Apache Spark, we've contributed some security patches:
[SPARK-27506][SQL] Allow deserialization of Avro data using compatible schemas. We picked up a stale PR to add support for reading Avro files with a custom schema.
And GoDataDriven is now mentioned on the Powered By section of the Spark website:
Apache Iceberg (Incubating)
While playing around with Iceberg and getting familiar with it, I noticed that the docs were incomplete. This naturally resulted in a PR:
In addition, to get more familiar with the code-base, I fixed various issues reported by the code-smell detector:
And various other fixes:
- Bump spark and java versions
- Our Kris Geuzebroek took the time to fix an issue in his docker-kafka image.
- Fixed a link on the Databricks containers repository
- Add Fokko as committer
- Fix the CI by updating the db2 test
- Presto: Bump Apache Avro to 1.9.1, and consolidate and bump the snappy-java version