Looking back at 2021, it's hard to call it an ordinary year. The data space, however, is where we as a company are able to make some predictions for the year to come. In this blog post, I share 10 data tools to watch in 2022.
Data Build Tool (DBT)
I discovered DBT around 2019. I really liked the clear vision the project had. By layering SQL views/tables on top of each other, and automatically generating docs/lineage based on those layers, we could remove a lot of complexity from our transformation pipelines, while at the same time providing the business with a clear description of the data layers we were producing.
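To make the layering idea concrete, here is a minimal sketch in plain Python with SQLite. It is not how DBT is implemented and it does not use DBT's API. The model names (`stg_orders`, `customer_revenue`), the `raw_orders` table, and the helper functions are all hypothetical, and the lineage is derived with a naive token match rather than real SQL parsing.

```python
import sqlite3

# Hypothetical "models": each one is a SELECT that may build on an earlier layer.
MODELS = {
    "stg_orders": "SELECT id, customer, amount FROM raw_orders WHERE amount IS NOT NULL",
    "customer_revenue": "SELECT customer, SUM(amount) AS revenue FROM stg_orders GROUP BY customer",
}


def lineage(models, sources):
    """Map each model to the known relations its SQL mentions (naive token match)."""
    known = set(models) | set(sources)
    return {
        name: sorted(rel for rel in known if rel != name and rel in sql.split())
        for name, sql in models.items()
    }


def build(conn, models):
    """Create one view per model; upstream layers must be listed first."""
    for name, sql in models.items():
        conn.execute(f"CREATE VIEW {name} AS {sql}")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "acme", 10.0), (2, "acme", 5.0), (3, "globex", None), (4, "globex", 7.5)],
)

build(conn, MODELS)
graph = lineage(MODELS, ["raw_orders"])
revenue = conn.execute(
    "SELECT customer, revenue FROM customer_revenue ORDER BY customer"
).fetchall()

print(graph)
print(revenue)
```

The point of the sketch is that once every layer is just a named SELECT on top of another named relation, both the build order and the lineage graph fall out of the definitions themselves.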
The combination of SQL + Spark seemed particularly promising, although it needed some love. We started contributing some fixes to the dbt-spark connector, which we needed to actually be able to use it in client projects. Back then, our very own Fokko Driesprong did most of the work. More recently, Cor Zuurmond started to contribute to the dbt-sqlserver connector, allowing us to use dbt in combination with MS SQL Server. And Daniel Heres initiated the
DBT reached version 1.0 a couple of weeks ago. This marks the beginning of the next phase of the project, and with that I can only assume much broader adoption of the tool, more features in DBT Cloud (their SaaS offering), and better integration with other tools.
Go to https://www.getdbt.com/ for more information.
Next up is Soda. We actually started talking to Soda around the time we discovered DBT. I remember talking to Maarten Masschelein and thinking: how can we start using this tool? Soda had a new approach to data quality, which, in contrast to some of the other open-source data quality tooling, had a clear focus on presenting data quality metrics to business users and allowing them to define alerts etc. That was something I felt was lacking in the other options. Luckily, the open-source component (soda-core) was still very developer friendly. This "two-faced approach" is probably what I like best about Soda.
We (officially) partnered with Soda a couple of weeks ago, and have co-developed soda-spark, which allows you to integrate Soda into your Spark environment. Our very own Cor Zuurmond developed this package together with Vijay Kiran (from Soda).
Go to https://www.soda.io/ for more information.
I guess by now Databricks doesn't need an introduction. We use Databricks at customers a lot. It's a very nice managed Spark solution. So why is Databricks on this list? For 2022 they will introduce two very exciting new products.
First is Delta Live Tables, which introduces a new method to define data transformation pipelines. I particularly like the SQL approach (which feels somewhat similar to DBT), but I would only recommend it for streaming use cases (which DBT doesn't support). Streaming pipelines are currently still complicated to implement/maintain, and this is where Delta Live Tables shines with the continuous pipelines concept, wherein tables are updated as soon as the input data changes.
The second product to watch is Unity Catalog. Internally, we're experimenting with the Delta Sharing server, Kris Geusebroek in particular. Delta Sharing holds a lot of promise, and I feel it will replace the Hive Metastore we all know. But ACLs are missing in the open-source sharing server. That's what Unity Catalog promises to add. Being able to define ACLs on tables/views etc. in a single place would really help deployments in bigger corporate environments.
In the open-source world, integration with legacy source systems has always been a difficult topic. In the past, we typically developed custom hooks/operators for Airflow, but we would like to move away from the costs associated with them, as custom integrations prove difficult to maintain in the long run. Commercial offerings such as Fivetran work really well, but have limited customization possibilities. Airbyte is aiming to solve this by bringing the best of both worlds: being open source, but still providing a SaaS offering.
The open-source component is under rapid development, with new sources and destinations being released almost daily. The tool itself still needs some work; authentication, for instance, is still missing. However, as they raised $150m in a Series B on the 20th of December, I'm confident that will soon change.
Go to https://airbyte.io/ for more information.
Data catalogs are a bit of a difficult one. There are quite a few open-source offerings available, but they often have huge dependencies on Kafka, HBase, Neo4j or others. I'm still looking for a lightweight solution, and for now I have settled on Marquez. I would use Marquez in situations where the data docs of DBT are not sufficient anymore. Marquez is actively contributing to the OpenLineage project, which aims to introduce a standard for tooling to export/import data lineage. OpenLineage has integrations for Spark, DBT, and Airflow, but creating a new integration doesn't seem too complicated.
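To give a feel for what a custom integration would have to produce, here is a rough sketch that builds an OpenLineage-style run event using only the standard library. The field names follow my reading of the OpenLineage run event and should be checked against the spec version you target (newer versions may require more fields). The `run_event` helper, the `example` namespace, the job and dataset names, and the producer URL are all made up for illustration.

```python
import json
import uuid
from datetime import datetime, timezone


def run_event(event_type, job_name, inputs, outputs, namespace="example"):
    """Build a dict shaped like an OpenLineage run event (assumed field names)."""
    return {
        "eventType": event_type,  # e.g. "START" or "COMPLETE"
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [{"namespace": namespace, "name": name} for name in inputs],
        "outputs": [{"namespace": namespace, "name": name} for name in outputs],
        # Hypothetical identifier of the tool that emitted the event.
        "producer": "https://example.com/my-integration",
    }


event = run_event(
    "COMPLETE",
    job_name="daily_orders",
    inputs=["raw_orders"],
    outputs=["customer_revenue"],
)

# An integration would send this JSON to a lineage backend such as Marquez.
print(json.dumps(event, indent=2))
```

In essence, an integration reports which job ran and which datasets it read and wrote; the backend stitches those events into a lineage graph.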
Go to https://marquezproject.ai/ for more information.
In the open-source reporting space, I like Metabase best. It allows users to define questions, which are basically a method to define business KPIs. Each question can have different visualisations, and hence can be reused between reports.
Metabase itself has a query editor, allowing SQL novices to also create queries, and allows for embedding into internal/external portals. Also, it looks great, which is (I feel) very important for a BI/Dashboarding tool.
Go to https://www.metabase.com/ for more information.
Monitoring is an important feature in any data platform. I'm typically not that happy with the cloud-native options for monitoring. Usually those feel difficult to set up, and moreover costly to run. As an example, Azure Log Analytics costs €2.60 per GB vs €0.17 if I store a GB in an Azure Blob store. That's roughly 15 times as much, a hefty premium.
In Azure, Datadog reads metrics directly from services without the need for Log Analytics. Also, setting it up takes less than 5 minutes. It's dirt easy. Then, once the data is available in Datadog, you can create dashboards yourself or choose one of the pre-made dashboards, which are also a real timesaver.
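For a sense of scale, the comparison above can be worked out in a few lines. The two per-GB prices are the ones quoted in this post; the 100 GB volume is a made-up figure purely for illustration.

```python
# Prices per GB as quoted in this post (EUR).
LOG_ANALYTICS_EUR_PER_GB = 2.60
BLOB_STORAGE_EUR_PER_GB = 0.17


def cost(gigabytes, eur_per_gb):
    """Cost in EUR for a given volume at a flat per-GB price."""
    return gigabytes * eur_per_gb


# How many times more expensive Log Analytics is per GB.
factor = LOG_ANALYTICS_EUR_PER_GB / BLOB_STORAGE_EUR_PER_GB

# Hypothetical volume, only to make the gap tangible.
volume_gb = 100
log_analytics_cost = cost(volume_gb, LOG_ANALYTICS_EUR_PER_GB)
blob_cost = cost(volume_gb, BLOB_STORAGE_EUR_PER_GB)

print(f"premium factor: {factor:.1f}x")
print(f"{volume_gb} GB: EUR {log_analytics_cost:.2f} vs EUR {blob_cost:.2f}")
```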
Go to https://www.datadoghq.com/ for more information.
For our advanced scheduling needs, we as a company often choose Airflow. This is still true, because if I compare Airflow to Prefect or Dagster, the hooks and operators give it a clear advantage. However, with Airbyte developing nicely, I think you will only need one operator (the AirbyteOperator) in the future. Hence, my prediction for 2022 is Prefect, not Airflow.
Looking at Prefect, I think the "Verified by Prefect" partner integrations are really nice. Moreover, the development of Orion (Prefect 2.0) seems to further improve the usability and the developer experience while developing your DAGs.
Go to https://www.prefect.io/ for more information.
I'll end this blog post with two outsiders I'm rooting for. First up is DataFusion. I started working with Spark in 2015; it's now 2021 (almost 2022) and I'm still using Spark. Although I don't have a lot of complaints, it would be nice to have a successor to Spark, and see what a different approach could bring us (performance-wise). If I were to make a bet right now, I would choose DataFusion. It's a query engine, written in Rust, built around Apache Arrow. I think that checks most of the boxes. Our colleague Daniel Heres is actively contributing to it and the project itself is very active.
Go to https://arrow.apache.org/datafusion/ for more information.
The second outsider is Polars. It has many of the same ingredients as DataFusion (Rust, Apache Arrow), but aims to replace Pandas. I get the same vibe as with DataFusion: lots of commits/activity in the repos. And it seems incredibly fast. Can't wait to see where this one will go.
Be sure to check out Vincent Warmerdam's blog post on Polars.
Go to https://www.pola.rs/ for more information.