IN-COMPANY TRAINING PROGRAMS
Contact Gert-Jan Steltenpool if you want to know more about custom data & AI training for your teams. He’ll be happy to help you!
Optimizing Apache Spark & Tuning Best Practices
Processing data efficiently becomes challenging as it scales up. Building on the experience we gained with the largest Apache Spark users in the world, we give you an in-depth overview of the do’s and don’ts of one of the most popular analytics engines out there.
What you'll learn
Fundamentals
- Spark execution model: Driver/Executors
- Spark resource managers (YARN, Mesos, K8s)
- Understanding RDDs/DataFrames APIs and bindings
- Difference between Actions and Transformations
- How to read the Query plan (Physical/Logical)
Spark internals
- Spark Memory model
- Understanding persistence (caching)
- Catalyst optimizer and Tungsten project
- Shuffle service and how the shuffle operation is executed
- Concept of fair scheduling and pools
- Java and Kryo serializers
- Step into the JVM world: what you need to know about GC when running Spark applications
Spark optimisation: main problems and issues
- The most common memory problems
- Benefits of early filtering
- Understanding partition and predicate filtering
- Join optimisation
- Combating data skew (preprocessing, broadcasting, salting)
- Understanding shuffle partitions: how to tackle memory/disk spill
- Downsides of using UDFs
- Executor idle timeout
- Data format examples
Moving to production
- Debugging / troubleshooting
- Productionizing your Spark application
- Dynamic allocation and dynamic partitioning
- Profiling your Spark application (Sparklint)
- JVM profiler
Data Engineering Learning Journey
This online course is perfect for
Data and Machine Learning Engineers who deal with transformation of large volumes of data and need production-quality code.
Expert Data Scientists can also participate: they will learn how to get the most performance out of Spark and how simple tweaks can improve performance dramatically.
What will you learn during Optimizing Apache Spark & Tuning Best Practices?
After this training, you will understand how Apache Spark works internally, know the best practices for writing performant code, and have the essential skills to debug and tune your Spark applications.

Oleksandra Bovkun
Data Engineer
Before joining GoDataDriven, Oleksandra worked as a Data modeler and Information/API modeler. Oleksandra combines strong knowledge of SQL, databases, and data modeling with a well-developed understanding of conceptual design principles. Oleksandra has solid experience with ETL processes and data preparation techniques, including data management and data architecture.
Oleksandra has worked extensively with different SQL and NoSQL databases (including Oracle, PostgreSQL, SQL Server, Azure Database, Neo4j, and Elasticsearch), both as a developer and as a DBA, as well as with programming languages (Python, Scala) and data modeling tools. Oleksandra has professional knowledge of Agile development processes and the Scrum methodology and is comfortable with gathering requirements and prioritizing backlogs.
Clients include: Knab