Optimizing Apache Spark & Tuning Best Practices
Processing data efficiently becomes challenging as it scales up. Drawing on experience gained at the largest Apache Spark users in the world, we give you an in-depth overview of the do’s and don’ts of one of the most popular analytics engines available.
What you'll learn
Spark execution model: Driver/Executors
Spark resource managers (YARN, Mesos, Kubernetes)
Understanding the RDD and DataFrame APIs and language bindings
Difference between Actions and Transformations
How to read the query plan (logical/physical)
Spark Memory model
Understanding persistence (caching)
Catalyst optimizer and Tungsten project
Shuffle service and how a shuffle operation is executed
Concept of fair scheduling and pools
Java and Kryo serializers
Step into the JVM world: what you need to know about garbage collection (GC) when running Spark applications
Spark optimization: main problems and issues
The most common memory problems
Benefits of early filtering
Understanding partition and predicate filtering
Combating data skew (preprocessing, broadcasting, salting)
Understanding shuffle partitions: how to tackle memory/disk spill
Downsides of using UDFs
Executor idle timeout
Data format examples
Moving to production
Debugging / troubleshooting
Productionizing your Spark application
Dynamic allocation and dynamic partitioning
Profiling your Spark application (Sparklint)
This online course is perfect for
Data and Machine Learning Engineers who deal with transformation of large volumes of data and need production-quality code.
Expert Data Scientists can also participate: they will learn how to get the most out of Spark and how simple tweaks can improve performance dramatically.
What will you learn during Optimizing Apache Spark & Tuning Best Practices?
After this training, you will understand how Apache Spark works internally, know the best practices for writing performant code, and have the essential skills to debug and tune your Spark applications.
The Right Format For Your Preferred Learning Style
At GoDataDriven we offer four distinct training modalities:
- In-Classroom & In-Company Training
- Online, Instructor-Led Training
- Hybrid and Blended Learning
- Self-Paced Training