Training schedule

28 Oct - 29 Oct, 2021
Virtual / English
€1295

IN-COMPANY TRAINING PROGRAMS

Contact Giovanni Lanzani if you want to know more about custom data & AI training for your teams. He’ll be happy to help you!

Optimizing Apache Spark & Tuning Best Practices

Processing data efficiently becomes challenging as data volumes grow. Building on the experience we gained at the largest Apache Spark users in the world, we give you an in-depth overview of the do’s and don’ts of one of the most popular analytics engines.

Get a free introduction to this training from the trainer on Thursday, September 2nd.

Clients we've helped

  • DSM
  • DuPont
  • Booking.com
  • LEGO
  • Airbus
  • Merck
  • Ahold Delhaize
  • Credit Suisse
  • Shell
  • ING Bank
  • Danone
  • Nike
  • TomTom
  • Verizon

What you'll learn

Fundamentals

  • Spark execution model: Driver/Executors

  • Spark resource managers (YARN, Mesos, Kubernetes)

  • Understanding the RDD and DataFrame APIs and their language bindings

  • Difference between Actions and Transformations

  • How to read the query plan (logical/physical)

Spark internals

  • Spark Memory model

  • Understanding persistence (caching)

  • Catalyst optimizer and Tungsten project

  • Shuffle service and how the shuffle operation is executed

  • Concept of fair scheduling and pools

  • Java and Kryo serializer

  • Stepping into the JVM world: what you need to know about garbage collection (GC) when running Spark applications

Spark optimisation: main problems and issues

  • The most common memory problems

  • Benefits of filtering early

  • Understanding partition and predicate filtering

  • Join optimisation

  • Combating data skew (preprocessing, broadcasting, salting)

  • Understanding shuffle partitions: how to tackle memory/disk spill

  • Downsides of using UDFs

  • Executor idle timeout

  • Data format examples
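
As a taste of the data skew topic in the list above, here is a minimal sketch of the idea behind salting, written in plain Python so it runs without a Spark cluster. It is an illustration under stated assumptions, not the course material: the partition count, the number of salts, the byte-sum partitioner, and the round-robin salt are all hypothetical stand-ins (Spark uses its own hash partitioning, and salts are usually random).

```python
from collections import Counter

NUM_PARTITIONS = 8  # hypothetical partition count
NUM_SALTS = 4       # hypothetical number of salt values


def partition_of(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Toy partitioner: sum of the key's bytes modulo the partition count.

    This is a stand-in for a real hash partitioner, chosen only because it
    is deterministic and easy to follow.
    """
    return sum(key.encode("utf-8")) % num_partitions


def partition_sizes(keys):
    """Count how many records land in each partition."""
    return Counter(partition_of(key) for key in keys)


def salt_keys(keys, num_salts: int = NUM_SALTS):
    """Append a round-robin salt so a hot key is split into num_salts keys."""
    return [f"{key}#{i % num_salts}" for i, key in enumerate(keys)]


# Skewed input: a single hot key dominates the data set.
keys = ["hot"] * 1000 + ["a", "b", "c", "d"]

unsalted = partition_sizes(keys)
salted = partition_sizes(salt_keys(keys))

print("largest partition without salting:", max(unsalted.values()))
print("largest partition with salting:   ", max(salted.values()))
```

Without salting, every record for the hot key lands in the same partition, so one task does almost all the work. With salting, the hot key is spread over several partitions. In a real join, the other side has to be replicated once per salt value so that matching rows still meet, which is the price paid for the more even distribution.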

Moving to production

  • Debugging / troubleshooting

  • Productionizing your Spark application

  • Dynamic allocation and dynamic partitioning

  • Profiling your Spark application (Sparklint)

  • JVM profiler

This online course is perfect for

Data and Machine Learning Engineers who deal with transformation of large volumes of data and need production-quality code.

Expert Data Scientists can also participate: they will learn how to get the best performance out of Spark and how simple tweaks can improve it dramatically.

What will you learn during Optimizing Apache Spark & Tuning Best Practices?

After this training, you will understand how Apache Spark works internally, know the best practices for writing performant code, and have the essential skills to debug and tune your Spark applications.

Meet your trainer

Roman Ivanov

Machine Learning Engineer

Flexible delivery

The Right Format For Your Preferred Learning Style

  • In-Classroom & In-Company Training
  • Online, Instructor-Led Training
  • Hybrid and Blended Learning
  • Self-Paced Training

Get in touch with the experts

Have any questions?

Contact Giovanni Lanzani, our Managing Director of Learning and Development, if you want to know more. He’ll be happy to help you!

You can also reach him by phone at +31 6 51 20 6163.
