Ever since Zhamak Dehghani's initial blogpost, the idea of creating a decentralized data platform instead of a single central one has gained a lot of traction. The paradigm shift she introduced was triggered by the observation that, while domain-driven design has heavily influenced the way we design operational systems, data platforms kept being developed as centralized monoliths. In this blogpost I'll review the two blogposts Zhamak wrote and share my own thoughts on her idea.
Architectural failure modes of a centralized data platform
Zhamak introduces the "why" of Data Mesh by describing the three main architectural failure modes of a monolithic data platform. The first failure mode is "centralized and monolithic": a single platform which needs to ingest data from all corners of the organization, convert that data into something trustworthy, and finally serve it to a diverse set of consumers. Smaller organizations might make this work, but in larger organizations the sheer number of datasets and the need for rapid experimentation will put huge pressure on the centralized data platform.
A centralized and monolithic data platform
The next failure mode is "coupled pipeline decomposition", wherein a single pipeline is split over multiple teams in order to make development more scalable. While this allows for some scale, it also slows the delivery of new data, as the stages are highly coupled and can only be worked on in a waterfall-like fashion.
The final failure mode is "siloed and hyper-specialized ownership", which describes teams of data platform engineers attempting to create usable data while having very little understanding of the source systems it is based on. In practice, this results in data platform teams being overstretched while consumers fight for the "top spot" on the backlog.
The Great Data Divide
In her second blogpost Zhamak additionally stresses the differences between operational and analytical data. The former is data stored in the databases backing operational systems (e.g. microservices). The latter is data used to give insight into the performance of the company over time. The gap between the two types of data is bridged by ETL/ELT pipelines.
The differences between operational and analytical data also result in differences in access patterns, use cases, personas of data users, and the technology used to manage each type of data. However, this difference in technology should not lead to a separation of the organization, teams, and people who work on them, as is the case with a centralized data platform.
The paradigm shift
By applying principles from Eric Evans's book Domain-Driven Design, Zhamak introduced the Data Mesh concept in her first blogpost. A Data Mesh is a decentralized data platform which keeps the ownership of data in the domains.
Zhamak describes this concept as the reverse of a centralized data platform: it's all about data locality and ownership. Data doesn't flow out of a domain into a centralized platform, but is hosted and served by the domains themselves.
A successful Data Mesh
In order to achieve any scale, Zhamak states that a Data Mesh implementation should follow four underpinning principles: 1) domain-oriented decentralized data ownership and architecture, 2) data as a product, 3) self-serve data infrastructure as a platform, and 4) federated computational governance. These principles should be considered collectively necessary and sufficient.
As introduced before, Data Mesh is founded on the belief that domain ownership is crucial to support continuous change and scalability. Data Mesh achieves this by moving the responsibility to the people who are closest to the data. By following the seams of the organization, Data Mesh localizes the impact of changes to the domain.
In practice this will result in domains not only providing an operational API, but also an analytical one. Zhamak gives an example of a podcasts domain which provides an API to "create a new podcast episode" (an operational API), but also an API which provides the number of listeners to a podcast over time (an analytical API). Both are maintained by the same domain/team.
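To make the idea of one team owning both interfaces a bit more concrete, here is a minimal Python sketch. The class `PodcastsDomain` and its method names are my own hypothetical choices, not taken from Zhamak's posts, and in-memory collections stand in for the real operational database and analytical storage.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import date


@dataclass
class PodcastsDomain:
    """Hypothetical podcasts domain that owns both kinds of API."""

    # episode_id -> title (stand-in for the operational database)
    episodes: dict[str, str] = field(default_factory=dict)
    # play events as (episode_id, listener_id, day)
    plays: list[tuple[str, str, date]] = field(default_factory=list)

    # Operational API: changes the current state of the system.
    def create_episode(self, episode_id: str, title: str) -> None:
        self.episodes[episode_id] = title

    def record_play(self, episode_id: str, listener_id: str, day: date) -> None:
        if episode_id not in self.episodes:
            raise KeyError(f"unknown episode: {episode_id}")
        self.plays.append((episode_id, listener_id, day))

    # Analytical API: a read-only view over the history of events.
    def daily_listeners(self, episode_id: str) -> dict[date, int]:
        """Number of distinct listeners per day for one episode."""
        unique = {(who, day) for ep, who, day in self.plays if ep == episode_id}
        counts = Counter(day for _, day in unique)
        return dict(sorted(counts.items()))
```

The point of the sketch is only that both methods live in the same unit of ownership: a change to how plays are recorded and the matching change to `daily_listeners` can be made by the same team.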
Domain Data as a Product
In order for data to be considered a product, Zhamak outlines a few basic qualities each data product should have: it should be discoverable, addressable, trustworthy, self-describing, interoperable, and secure. Consumers of that data should be treated as customers.
Companies should introduce a new role called "domain data product owner", which is responsible for objective measures (KPIs) describing the performance of a data product, e.g. the happiness of its customers. Using these KPIs, the domain data teams should strive to make their products the best they can be, with clear APIs, understandable documentation, and close tracking of quality and adoption.
Architecturally, a data product is similar to an architectural quantum, i.e. the smallest unit which can be independently deployed. To achieve this, a data product is the combination of code, data and metadata, and infrastructure. For a data product to be a single deployable unit, a Data Mesh should give teams the capabilities to deploy it as one.
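One way to picture a data product as a single deployable unit is a descriptor that bundles its parts. The sketch below is an illustration under my own assumptions: the `DataProduct` class, its field names, and the example values are hypothetical and do not come from Zhamak's posts.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataProduct:
    """Hypothetical descriptor bundling the parts of one data product."""

    name: str
    owner_domain: str
    code: str = ""          # e.g. location of the pipeline or query
    data_address: str = ""  # where consumers can read the output
    metadata: dict[str, str] = field(default_factory=dict)        # schema, docs
    infrastructure: dict[str, str] = field(default_factory=dict)  # storage, compute

    def missing_parts(self) -> list[str]:
        """Names of the parts that are still empty."""
        parts = {
            "code": self.code,
            "data": self.data_address,
            "metadata": self.metadata,
            "infrastructure": self.infrastructure,
        }
        return [part for part, value in parts.items() if not value]

    def is_deployable(self) -> bool:
        """Deployable as one unit only when no part is missing."""
        return not self.missing_parts()


draft = DataProduct(name="daily_listeners", owner_domain="podcasts",
                    code="queries/daily_listeners.sql",
                    data_address="podcasts/daily_listeners")
```

Here `draft.missing_parts()` reports the metadata and infrastructure that still have to be added before the product could ship as one unit.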
Self-serve data platform
This leads to the self-serve data platform. A fair bit of infrastructure is required to run a data product, and the knowledge required to build and maintain this infrastructure would be difficult to replicate across all domains. Zhamak addresses this with a principle called the self-serve data platform.
This self-serve data platform implements one or more platform planes, which end-users interact with in order to deploy their data product. Zhamak describes three planes: the data infrastructure provisioning plane, the data product developer experience plane, and the data mesh supervision plane. The infrastructure plane allows users to provision new infrastructure. The data product plane implements standard solutions to run data products, e.g. allowing users to deploy a SQL query as a data product. The data mesh supervision plane provides global services, e.g. tools which allow users to discover new data products or which implement governance, data quality monitoring, etc.
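The three planes can be sketched very roughly in Python. Everything here is an assumption made for illustration: the `SelfServePlatform` class and its methods are hypothetical, an in-memory SQLite database stands in for real infrastructure, and a SQL view plays the part of "a SQL query deployed as a data product".

```python
import sqlite3


class SelfServePlatform:
    """Hypothetical, heavily simplified platform exposing three planes."""

    def __init__(self) -> None:
        self._db = sqlite3.connect(":memory:")
        self._catalog: dict[str, str] = {}  # product name -> description

    # Infrastructure provisioning plane: create storage and load data.
    def provision_table(self, ddl: str) -> None:
        self._db.execute(ddl)

    def insert_rows(self, insert_sql: str, rows: list[tuple]) -> None:
        self._db.executemany(insert_sql, rows)

    # Developer experience plane: deploy a SQL query as a data product.
    def deploy_sql_product(self, name: str, query: str, description: str) -> None:
        if not name.isidentifier():
            raise ValueError(f"invalid product name: {name!r}")
        self._db.execute(f'CREATE VIEW "{name}" AS {query}')
        self._catalog[name] = description

    def read(self, name: str) -> list[tuple]:
        if name not in self._catalog:
            raise KeyError(f"unknown data product: {name}")
        return self._db.execute(f'SELECT * FROM "{name}"').fetchall()

    # Supervision plane: a global service to discover data products.
    def discover(self, keyword: str) -> list[str]:
        needle = keyword.lower()
        return sorted(
            name for name, description in self._catalog.items()
            if needle in name.lower() or needle in description.lower()
        )
```

A domain team would then only call `deploy_sql_product` with its query and a description, while the platform team owns how views, storage, and the catalog are actually implemented.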
Federated computational governance
The last principle is that of governance. Here, Zhamak describes what needs to be standardized globally and enforced across the mesh, and what should be left to the domains to decide. The group responsible for governance has a difficult job, as it needs to strike a balance between centralization and decentralization. In her second blogpost she includes a table comparing centralized and data mesh governance.
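The "computational" part of this principle suggests that global rules are checked automatically rather than by committee. A minimal sketch of that idea, under my own assumptions, is shown below: the policy names, the descriptor keys, and the `mesh://` address scheme are all hypothetical and only serve as an example of a small set of global rules applied to every data product.

```python
from typing import Callable

Descriptor = dict[str, object]
Policy = Callable[[Descriptor], bool]

# Global rules (hypothetical). Anything not listed here, such as the
# processing engine, is left to the owning domain to decide.
GLOBAL_POLICIES: dict[str, Policy] = {
    "has an owner": lambda d: bool(d.get("owner")),
    "is addressable": lambda d: str(d.get("address", "")).startswith("mesh://"),
    "documents its schema": lambda d: bool(d.get("schema")),
}


def violations(descriptor: Descriptor) -> list[str]:
    """Names of the global policies this data product does not satisfy."""
    return [name for name, check in GLOBAL_POLICIES.items() if not check(descriptor)]


product = {
    "owner": "podcasts",
    "address": "mesh://podcasts/daily_listeners",
    "schema": {"day": "date", "listeners": "int"},
    "engine": "spark",  # local choice, not governed globally in this sketch
}
```

In this sketch `violations(product)` is empty, and changing `engine` never affects the result, which mirrors the split between what is enforced globally and what is left to the domains.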
The concept as introduced by Zhamak is a powerful one. Highlights for me are:
- The central data platform team(s) are attempting to solve an unsolvable problem: providing trustworthy data to consumers while having little or no domain knowledge themselves.
- Ownership of data is kept in the domains, localizing the impact of changes.
- A self-service data platform enables teams to independently deploy data products, changing the role of the central data platform team to one which develops a product instead of providing data.
- Governance is maintained by standardizing inputs and outputs, allowing for uniform access to decentralized data.
However, it also leaves me with some additional questions:
- How small should a data product be? Could it be as small as a single SQL transformation?
- In the supervision plane, features like data quality monitoring, cataloging, and access management should be implemented. But who should do this?
- Schema changes. How to deal with those? Will a breaking schema change lead to a new data product?
- Should alternative implementations of a data product (engine) be allowed? Or more generally, what should you govern and decide centrally, and what should you leave to the domains?