As a team lead, I am a big fan of using workflow management methodologies like Kanban or Agile to plan and oversee the work that my team and I have to do.
However, in the data domain, it is not always obvious which tasks to create or how to estimate the effort required, because of the uncertainty and complexity surrounding the data sets that you need to work with.
This blog will give you concrete tips on how to structure your data engineering projects when using the Agile methodology. In particular, I hope to help you manage your data engineering workflow better by outlining a standard approach for creating new data pipelines that works no matter the data set or technologies used.
The standard engineering approach that we will describe in this blog is as follows:
- Start with a so-called ‘spike’ to create a detailed design of the new data pipeline.
- Split the implementation of the pipeline into three stories that have clear deliverables.
- Estimate the amount of work required for each story using story points.
Sticking to the proposed workflow should help you deliver data pipelines reliably without surprises for you and your stakeholders.
The problem: unknowns around the source data
When building a new data pipeline for your stakeholders, you want to give the stakeholders an idea of how long it will take and what resources are required to make it happen. With data, this is not always easy to estimate upfront due to two main issues: uncertainty and complexity.
Uncertainty arises because, when integrating a new data source, you often don’t know upfront what the data looks like. Additionally, you might have no experience yet with the source system that you need to fetch the data from. That means you probably don’t know yet whether default connectors are available or whether you need to build something custom.
Aside from technical uncertainties, there might also be confusion on the stakeholders’ end. Are the requirements from your stakeholders crystal clear to you? Do you know which columns they want? How frequently should you fetch data, and do they require an initial load? We often find that it takes quite some communication to fully understand what would fulfill their needs.
Complexity arises because your source data can have all kinds of different formats (CSV, JSON, XML, …). If you are unlucky, you might even find multiple arrays nested in each other. Assuming your goal as an engineer is to create a “clean” layer of data, you will need to carefully think about how to tidy up your data such that it will be easily consumable by your data analysts or data scientists in the clean data layer (who should then thank you on your knees for doing this properly!).
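To make the nested-array problem concrete, here is a minimal sketch of how such a record could be tidied into flat rows using only the Python standard library. The `flatten` helper, the underscore naming of columns, and the sample `order` record are all assumptions made for illustration; they are not part of any particular tool.

```python
from typing import Any


def flatten(record: dict[str, Any], prefix: str = "") -> list[dict[str, Any]]:
    """Return one flat row per nested array element (hypothetical helper)."""
    rows: list[dict[str, Any]] = [{}]
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Nested object: prefix its columns with the parent key.
            children = flatten(value, prefix=f"{name}_")
        elif isinstance(value, list):
            # Array: every element becomes its own row.
            children = []
            for item in value:
                if isinstance(item, dict):
                    children.extend(flatten(item, prefix=f"{name}_"))
                else:
                    children.append({name: item})
            if not children:
                # Empty array: keep the parent row, add no columns.
                children = [{}]
        else:
            children = [{name: value}]
        # Combine the rows built so far with the rows from this key.
        rows = [{**row, **child} for row in rows for child in children]
    return rows


# Made-up sample record with an array nested inside another array.
order = {
    "order_id": 1,
    "customer": {"name": "Ada"},
    "lines": [
        {"sku": "A", "tags": ["x", "y"]},
        {"sku": "B", "tags": []},
    ],
}
rows = flatten(order)
```

For this sample, `rows` holds three flat rows: two for SKU A (one per tag) and one for SKU B. Note that two sibling arrays at the same level would be combined as a cross product, which is rarely what you want; in that case it is usually better to write each array to its own table.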
Uncertainty and complexity make it difficult to envision what your solution should look like. Therefore, it is nearly impossible to estimate upfront how much work is required to do the job.
Start with a spike to clear the confusion
Our solution for dealing with these two issues is to start each new data pipeline with a so-called spike. The goal of the spike is to document a pipeline design that answers the questions in the following table:
|Question|Deliverable|
|---|---|
|Which data sets exactly do we require from the source system?|Overview of the exact tables to fetch.|
|What is the data format of these data sets and how will we parse it?|Mapping of how to convert the source format to the desired cleaned data format. For example, how to go from nested JSON to unnested Delta tables.|
|How will we connect with the source system?|Plan on whether to use an API, middleware or, for example, an SFTP server to retrieve the data.|
|What frequency are we going to fetch data with?|For example: weekly, daily, hourly, real-time.|
|Do we need an initial load? If so, how many years of historical data?|Known size of the initial load and how to get it.|
|Are we dealing with privacy- or security-sensitive data (for example, PII)?|An overview of sensitive data attributes and how to treat them, approved by the legal and security councils.|
The answers to the above questions should clear up most of the uncertainty around the new data pipeline that you want to implement. Having the design then allows you to clearly identify the user stories and subtasks and to estimate their effort.
We often use one sprint (of 2 weeks) for the spike, alongside other tasks. We do not estimate the effort required for the spike itself, because its purpose is to resolve the unknowns described above.
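One way to keep track of the spike is to record the design as a small structure in which every unanswered question stays visibly open. The sketch below is only an illustration: the `PipelineDesign` class and its field names are hypothetical and simply mirror the questions of the spike.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class PipelineDesign:
    """Answers a spike should produce; None means the question is still open."""

    source_tables: Optional[list[str]] = None         # exact tables to fetch
    source_format: Optional[str] = None               # e.g. "nested JSON"
    connection_method: Optional[str] = None           # e.g. "API", "SFTP"
    fetch_frequency: Optional[str] = None             # e.g. "daily"
    initial_load_years: Optional[int] = None          # 0 means no initial load
    sensitive_attributes: Optional[list[str]] = None  # [] means none found

    def open_questions(self) -> list[str]:
        """Names of the design questions that have not been answered yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]

    def is_complete(self) -> bool:
        """True once every question of the spike has an answer."""
        return not self.open_questions()


# Halfway through a spike: two answers in, four still open.
design = PipelineDesign(source_tables=["orders"], fetch_frequency="daily")
remaining = design.open_questions()
```

In this example `remaining` lists the four fields that still need an answer, which gives you a simple agenda for the next stakeholder meeting.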
Before the spike starts, it is best to notify your stakeholders that you will need input from them. This way you can avoid long lead times wherein you and your team have to wait for inputs or access rights from various stakeholders like the business users, the source owner and/or the enterprise architect. Plan meetings in advance to keep the momentum going during the sprint.
Break down the pipeline implementation
Once you know how you want to build your pipeline, you can refine the user stories for the next sprint, in which you will actually build the pipeline.
The difficulty is that you don’t want to create stories that are too big, as they will not get finished in the sprint. Dragging unfinished stories to the next sprint is annoying for both your team and your stakeholders as you are not delivering value.
On the other hand, you also don’t want your stories to be too small, as the administrative overhead of tracking them will slow you down.
A story should reflect a piece of work that will deliver a particular value. Hence, we always break up our pipeline implementation into the following 3 stories:
- Implement incremental load data pipeline.
- Implement full load data pipeline.
- Implement transformation from raw to clean data.
Each of these stories has a clear deliverable. For example, when the incremental load pipeline is finished, you will be ingesting new and updated data records at the desired frequency (daily, hourly, real-time). With the full load data pipeline, in contrast, you will be able to fetch the entire data dump from the source system. This is useful for getting historical data into your destination storage. Upon completing the third story, you will have a pipeline that transforms incoming raw data and saves it in a tidy format in your “clean” data layer.
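The difference between the first two stories can be sketched in a few lines. The sketch assumes that every source record carries an `updated_at` timestamp and that the pipeline remembers the latest timestamp it has seen (a so-called watermark); both are assumptions for illustration, and your source system may require a different change-detection mechanism.

```python
from datetime import datetime
from typing import Iterable, Optional

Record = dict  # assumed shape: {"id": int, "updated_at": datetime, ...}


def full_load(source: Iterable[Record]) -> list[Record]:
    """Fetch the entire data dump, for example for the initial load."""
    return list(source)


def incremental_load(
    source: Iterable[Record], watermark: Optional[datetime]
) -> tuple[list[Record], Optional[datetime]]:
    """Fetch only the records changed after the previous run."""
    changed = [
        r for r in source
        if watermark is None or r["updated_at"] > watermark
    ]
    # Advance the watermark only if something new came in.
    new_watermark = max((r["updated_at"] for r in changed), default=watermark)
    return changed, new_watermark


# Made-up source data for illustration.
source = [
    {"id": 1, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "updated_at": datetime(2024, 1, 2)},
    {"id": 3, "updated_at": datetime(2024, 1, 3)},
]
everything = full_load(source)
changed, watermark = incremental_load(source, datetime(2024, 1, 1))
```

Here `everything` contains all three records, while `changed` contains only records 2 and 3. Running the incremental load again with the returned `watermark` yields nothing new until the source changes.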
Your definition of done determines when a story is completely finished. In our case, it means that the code for the pipeline is version controlled, peer reviewed, documented, and tested. If the story is approved in the Development environment, it will then be rolled out to Acceptance and Production through a CI/CD process.
If we feel like a story is still pretty big, we sometimes create subtasks to break it down. The subtasks don’t need the same level of detailed documentation as the story, and serve mainly for your own convenience as a developer; i.e. to keep track of what you are working on. For example, if an integration requires connections to multiple API endpoints, you might want to create a subtask for each of the endpoints.
Estimating the work
To plan your stories in a future sprint, the final thing you need to do is estimate the difficulty of each story. In agile project management, story points provide an abstract measure of the time it will take to complete a story. Using story points makes it easier for you as a team to compare stories with each other, and over time you will get an idea of how many points your team can handle in a single sprint.
Together with your team, for each story you decide on a number from the Fibonacci series (1, 2, 3, 5, 8, 13, 21…) as the estimated difficulty. If a story is deemed 13 points or more, it is too big for a 2-week sprint and you should try breaking the story up.
We see two main variables that influence the number of story points for a story: the complexity of the pipeline to be implemented and simply the amount of time required to do the job. For example, it will take less time to ingest a source system with just a few structured tables than one with multiple unstructured data formats.
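The splitting rule is simple enough to check automatically, for example when reviewing a sprint backlog. Below is a minimal sketch; the function name and the sample estimates are made up for illustration, and the allowed values are limited to the ones listed above.

```python
ALLOWED_POINTS = (1, 2, 3, 5, 8, 13, 21)
SPLIT_THRESHOLD = 13  # 13 points or more is too big for a 2-week sprint


def stories_to_split(estimates: dict[str, int]) -> list[str]:
    """Return the stories whose estimate says they should be broken up."""
    for story, points in estimates.items():
        if points not in ALLOWED_POINTS:
            raise ValueError(f"{story}: {points} is not an allowed estimate")
    return [s for s, p in estimates.items() if p >= SPLIT_THRESHOLD]


# Hypothetical estimates for the three standard stories.
estimates = {
    "Implement incremental load data pipeline": 5,
    "Implement full load data pipeline": 3,
    "Implement transformation from raw to clean data": 13,
}
too_big = stories_to_split(estimates)
```

With these made-up numbers, only the transformation story ends up in `too_big` and would need to be broken up before the sprint starts.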
We consequently use the following table to determine our story points:
|Complexity||Time Required||Story Points|
We noticed that when we stick to the approach outlined in this blog, our stories are at most 8 points and we are able to deliver them consistently and reliably.
This blog outlines an approach to plan, refine and estimate data engineering work focused on adding new data pipelines.
We propose to always start with a spike in which you clear up the uncertainties that originate from having no experience with either the source system or the data set and its format. The goal of this spike is to come up with a pipeline design that answers critical questions like how to connect with the source, how often you want to pull data, and whether an initial load is required.
Armed with this information, you can break up the pipeline implementation into the suggested stories, which are not too big and deliver clear value. Sticking to this workflow will help you reliably deliver data pipelines without surprises for you and your stakeholders.
If you want us to help you speed up your Data Engineering practice, feel free to reach out. We can help with technical hands-on work, as well as capability building, and training of your Data Engineering professionals.