Optimizing Apache Spark & Tuning Best Practices

English | 2-day
Xebia Academy

Processing data efficiently becomes harder as it scales up. Drawing on experience gained at some of the largest Apache Spark users in the world, we give you an in-depth overview of the do’s and don’ts of one of the most popular analytics engines available.

What you'll learn


  • Spark execution model: Driver/Executors

  • Spark resource managers (YARN, Mesos, Kubernetes)

  • Understanding the RDD and DataFrame APIs and their language bindings

  • Difference between Actions and Transformations

  • How to read the query plan (logical and physical)

Spark internals

  • Spark Memory model

  • Understanding persistence (caching)

  • Catalyst optimizer and Tungsten project

  • Shuffle service and how a shuffle operation is executed

  • Concept of fair scheduling and pools

  • Java and Kryo serializer

  • Step into JVM world: what you need to know about GC when running Spark applications

Spark optimisation: main problems and issues

  • The most common memory problems

  • Benefits of early filtering

  • Understanding partition and predicate filtering

  • Join optimisation

  • Combating data skew (preprocessing, broadcasting, salting)

  • Understanding shuffle partitions: how to tackle memory/disk spill

  • Downsides of using UDFs

  • Executor idle timeout

  • Data format examples

Moving to production

  • Debugging / troubleshooting

  • Productionizing your Spark application

  • Dynamic allocation and dynamic partitioning

  • Profiling your Spark application (Sparklint)

  • JVM profiler

This online course is perfect for

Data and Machine Learning Engineers who deal with transformation of large volumes of data and need production-quality code.

Expert Data Scientists can also participate: they will learn how to get the most performance out of Spark and how simple tweaks can dramatically improve performance.

What will you learn during Optimizing Apache Spark & Tuning Best Practices?

After this training, you will understand how Apache Spark works internally, know the best practices for writing performant code, and have the essential skills to debug and tune your Spark applications.

The Right Format For Your Preferred Learning Style

At GoDataDriven we offer four distinct training modalities:

  • In-Classroom & In-Company Training
  • Online, Instructor-Led Training
  • Hybrid and Blended Learning
  • Self-Paced Training

Clients we've helped

  • ING Bank
  • Ahold Delhaize
  • Quby