We often get asked by clients what vendor solution(s) we propose for their data science needs. In this blog post, I try to summarize why the answer is (almost always) some open source tool.
We make no secret of the fact that GoDataDriven loves open source. But why is that?
Open source software doesn't do it all
This needs some explanation. When people develop and release an open source product or project, they probably think that what they built is interesting for the community at large. Their goal is achieved when people start using their software. But, especially in data science, they don't pretend to do everything for you.
Maybe they are good at data wrangling. Maybe they're good at scheduling. Maybe they are good at large scale data processing.
Since they are not trying to do it all, they are (or should be) extremely flexible when it comes to interfacing with other parts of your infrastructure, pipelines, or tools.
That means that using a particular tool does not force you to change all the other tools you are using.
This is also important when it comes to the tools you use in your workflow.
Take git, for example: I firmly believe that git, or any other distributed version control system, is fundamental when it comes to developing and productionizing data science solutions.
Be it going back in time, be it integrating with your build system, be it working together on the same feature at once.
On the other hand, many commercial vendors[1] try to be your one-stop shop for all your data science needs. They handle your code, they handle your data management, they handle your data wrangling, they handle your models, and they handle how you put your models into production. And they do not offer, or even integrate with, any mature version control system.
You might argue that, even leaving price aside, this approach is more attractive because you don't have to think about many moving parts. However, it takes away a lot of freedom: if a new system comes out that is better at data wrangling in some way, most of the time you cannot replace just that part, because it is in the vendor's best interest to keep you[2] in their platform.
That means there is no (straightforward) way to mix and match with one stop shops.
Open source relies on mature software engineering principles
When I first mention this, people are puzzled, and wonder what the heck I'm talking about. I usually end up explaining why software engineering is important for data science.
It boils down to data science needing software engineering discipline.
Since open source projects rely on software engineering principles, it is obvious that they are architected so that, when you make use of them, you can also follow these principles.
Let me give an example, just in case your head is spinning. When I'm using a tool like Airflow, I can use git for all my import pipelines. Airflow's goal is to orchestrate the pipelines, not to take over my workflow. And as Airflow itself is developed using git, it makes sense for its developers to allow their users to do the same.
Another example is Spark. When you write Spark code, you should really write the accompanying unit tests with it. Spark provides you with the machinery to write these tests, because, guess what, Spark is also extensively tested.[3]
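To make the unit testing point concrete, here is a minimal sketch of what such a test can look like. It deliberately does not import Spark, so that it runs anywhere: the function `drop_incomplete_rows` and its test case are hypothetical names made up for this illustration, and plain lists of dictionaries stand in for a DataFrame. The assumption is that the same pattern carries over once the transformation takes and returns a DataFrame.

```python
import unittest


def drop_incomplete_rows(rows):
    """Keep only the rows in which every value is present (not None).

    Hypothetical data wrangling step, used only for illustration.
    """
    return [row for row in rows if all(value is not None for value in row.values())]


class DropIncompleteRowsTest(unittest.TestCase):
    def test_removes_rows_with_missing_values(self):
        rows = [
            {"customer": "A", "amount": 10},
            {"customer": "B", "amount": None},
        ]
        self.assertEqual(
            drop_incomplete_rows(rows),
            [{"customer": "A", "amount": 10}],
        )

    def test_keeps_complete_rows_untouched(self):
        # A zero amount is a real value and must not be dropped.
        rows = [{"customer": "C", "amount": 0}]
        self.assertEqual(drop_incomplete_rows(rows), rows)


def run_tests():
    """Run the test case programmatically and return True if all tests pass."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(DropIncompleteRowsTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()


if __name__ == "__main__":
    run_tests()
```

Because the tests live next to the code in plain files, they can be versioned with git and run by a build system like any other software.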
People used to open source tools might be baffled that some vendors do not integrate with your version control system or do not allow you to write unit tests. But such products are out there. Mathematica, for example, added a unit testing framework only in version 10. SAS and SPSS, to the best of my knowledge, do not offer any unit testing framework.
Open source gives you more power
Hence more responsibility! Except in cases where companies provide support and premium features on top of an open source project, such as the Spark-Databricks, Kafka-Confluent, and Cassandra-Datastax combos, there is no support for open source projects. This means that if there is a bug you would like to see fixed or a feature you would like to see implemented, you are at the mercy of the maintainers of the project. This is the responsibility part.
But... is it bad? Do vendors implement new features or fix bugs at your request and when you need them? How much money do you have to pay your database vendor to give you what you want?
If the project is open source, the people working for you can fix it and then you can decide if you want to contribute it back to the community.
It might seem scary at first: modifying software! However, if you think about it, your entire company runs on software, and some, if not most, of that software has been written in house. It could be your reporting software, your website, the SQL query that your DBAs wrote, etc. Eventually it's all just code!
By now you are probably starting to understand the multitude of posts about our open source contributions. We find new things that we want from the tools that we use, we implement them, and we give back to the community. So the famous Spider-Man quote should really be the other way around when it comes to open source software: with great responsibility comes great power.
Now comes the part where the objectors will say that they do not have the human resources available to fix bugs or implement new features. What they often fail to see, however, is that the money you are saving by not going vendor-driven[4] can be much better invested in good people and in giving them more time to do this sort of thing. Don't assume your employees or colleagues are not interested in doing this: they probably chose the open source tool in the first place, and I bet they'd love to give back to the community.
All it takes is to overcome the fear that accompanies us every time we do something for the first time.
Besides, people are almost certainly the most important asset your company has[5]: would you rather empower them or empower your vendor?
Open source accelerates innovation
No vendor likes to talk about this one. Before knowing the viability of what you want to accomplish (a new algorithm, a new data storage option, etc.) you need to start talking to the "sales" department of the potential vendor: how much is it going to cost you before you can start. When everything has been arranged, the fun can begin. The algorithm starts to take form, the results are validated, you are ready to take it to production.
Wait! The license you have is not really suited for production. You need another license. Ugh!
Or let's say that you get a new exciting data source. You can't wait to connect it to the rest of the data. A new database is created but... you cannot create new databases as the limit for the current subscription has been reached. Contact sales to upgrade!
I have heard this over and over at many different companies. People can't innovate because everything has been set up so rigidly: every piece of the infrastructure is in a different kingdom. You can get through, but there'll be no energy left to innovate by the time you're done.
Open source is transparent
By transparent I mean that you can see what is going on and act accordingly. This is relevant for two things:
- Documentation.
- Quality assurance.
Documentation seems silly, but ask programmers what they document. Usually documentation explains how to use an API, but it doesn't say much about the (software) architecture in which that API should be used.
A recent example from Heap Analytics illustrates this problem best: the company was ingesting thousands of events per day to enable real-time analytics on customer data. When inserting new records in the database, the process would look like this: one record for customer A, then one for customer B, then C, then D, then A again, etc. At random.
The database solution Heap uses, PostgreSQL, caches a part of the data when you create indices for these records.[6] However, this random pattern of data ingestion (customer A-B-C-D-A-???) was causing the cache for a particular customer to be evicted practically immediately.
This random access pattern does not matter at most scales. The behavior was not reported in the documentation, as the database was doing exactly what you would expect from a good database: writing the data in the requested order and, on top of that, caching to speed things up.
For Heap, however, this was an issue: it meant that many customers had to wait up to an hour to see their data in their dashboard as the caching mechanism wasn't really working.
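The effect of such an access pattern on a cache can be illustrated with a toy simulation. The sketch below is not how PostgreSQL manages its caches: it assumes a simple least-recently-used (LRU) cache with one entry per customer, and the class `LRUCache`, the helper `hit_rate`, and all the numbers are made up for the illustration. The interleaved sequence is round-robin rather than truly random to keep the result deterministic, and the grouped sequence is there only for comparison, not as a description of Heap's actual fix.

```python
from collections import OrderedDict
from itertools import cycle, islice


class LRUCache:
    """A tiny least-recently-used cache that only tracks keys and counts hits."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def touch(self, key):
        if key in self.entries:
            self.hits += 1
            self.entries.move_to_end(key)  # mark as most recently used
        else:
            self.misses += 1
            self.entries[key] = True
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict the least recently used key


def hit_rate(access_sequence, capacity):
    """Replay a sequence of keys against a fresh cache and return the hit ratio."""
    cache = LRUCache(capacity)
    for key in access_sequence:
        cache.touch(key)
    return cache.hits / (cache.hits + cache.misses)


customers = [f"customer_{i}" for i in range(100)]
events_per_customer = 50

# Interleaved ingestion: A, B, C, ..., A, B, C, ...
interleaved = list(islice(cycle(customers), len(customers) * events_per_customer))
# Grouped ingestion: A, A, A, ..., B, B, B, ...
grouped = [c for c in customers for _ in range(events_per_customer)]

interleaved_rate = hit_rate(interleaved, capacity=10)
grouped_rate = hit_rate(grouped, capacity=10)

print(f"interleaved hit rate: {interleaved_rate:.2f}")  # 0.00 with these numbers
print(f"grouped hit rate:     {grouped_rate:.2f}")      # 0.98 with these numbers
```

With 100 customers and room for only 10 in the cache, every interleaved access finds its customer already evicted, while the grouped sequence misses only on the first event of each customer.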
An engineer, therefore, decided to take a deeper look. He ended up reading the PostgreSQL source code and found what the issue was.
He then implemented a client-side fix (PostgreSQL was, at the end of the day, not the culprit) that saves Heap millions of dollars.[7]
Again, if you object to open source you will say: but we have support from our vendor.
Yes, this is true. But the more arcane the edge case is, the closer you have to get to the source code. And I can assure you that the first and second line support of your vendor have neither access to the source code of the product you are using nor the skills to understand it!
Once you get to third line support, they will probably need to understand your use case, and maybe they will even want to take a peek at your database (then NDAs need to be signed, the engineers will probably be in the US, and the list goes on). By the time a solution has been found, maybe 3 months have passed, and some of your customers have already left because your product was slow.
Quality assurance is another tricky one! This is biased towards scientific software, the Python scientific stack in general, and scikit-learn in particular.
Scikit-learn is a library of machine learning algorithms written in Python. The algorithms are written by scientific researchers all over the world. Every algorithm that gets into scikit-learn is thoroughly vetted by other researchers.
When you use the code, you can sleep safe, knowing this. But, if you suffer from insomnia or paranoia, you can open up a text editor, download scikit-learn, and take a look.
This is not only true for scikit-learn: each open source project can be inspected to see if it does what it claims it does.
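Taking a look does not even require leaving Python. The sketch below uses the standard library's `inspect` module to print the source of an installed function. The function inspected here, `statistics.mean`, is only a stand-in chosen so that the example runs without extra installs; the same calls should work on any pure-Python function or class of an installed package, scikit-learn included.

```python
import inspect
import statistics

# Fetch the source text of a library function, exactly as it is installed.
source = inspect.getsource(statistics.mean)
# Find the file on disk that the function was loaded from.
source_file = inspect.getsourcefile(statistics.mean)

print(f"statistics.mean is defined in {source_file}")
print(source)
```

Functions implemented in compiled extensions cannot be printed this way, but their source is still available in the project's repository.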
The next relevant question is: what is the advice of GoDataDriven? Well, the meme says it all: we apply an open source first, vendors second approach. If the vendor solution is much better than what an open source solution offers, then we choose the vendor. With two conditions though! There is no data lock-in and there's no code lock-in. This avoids a 'Hotel California'-type of situation wherein "You can check out anytime you like, but you can never leave".
Want to know more about what it means to work like that? We're hiring!
[1] I realize this is not very specific.
[2] You could say locking you.
[3] Here I really do mean using these tools for your data science needs, not contributing to
[4] Sorry, I couldn't resist the dad joke.
[5] To have is not the right verb here, but you see what I mean.
[6] This is not super accurate: you need to read the whole post to understand the fine details!
[7] Or so he claims.