IN-COMPANY TRAINING PROGRAMS

Contact Gert-Jan Steltenpool if you want to know more about custom data & AI training for your teams. He’ll be happy to help you!

Optimizing Apache Spark & Tuning Best Practices

Processing data efficiently becomes more challenging as it scales up. Building on the experience we gained at the largest Apache Spark users in the world, we give you an in-depth overview of the do’s and don’ts of one of the most popular analytics engines out there.

Clients we've helped

  • DSM
  • DuPont
  • Booking.com
  • LEGO
  • Airbus
  • Merck
  • Ahold Delhaize
  • Credit Suisse
  • Shell
  • ING Bank
  • Danone
  • Nike
  • TomTom
  • Verizon

What you'll learn

Fundamentals

  • Spark execution model: Driver/Executors

  • Spark resource managers (YARN, Mesos, Kubernetes)

  • Understanding the RDD and DataFrame APIs and their language bindings

  • Difference between actions and transformations

  • How to read the query plan (logical/physical)

Spark internals

  • Spark memory model

  • Understanding persistence (caching)

  • Catalyst optimizer and Project Tungsten

  • Shuffle service and how a shuffle operation is executed

  • Concept of fair scheduling and pools

  • Java and Kryo serializers

  • Step into the JVM world: what you need to know about GC when running Spark applications

Spark optimisation: main problems and issues

  • The most common memory problems

  • Benefits of early filtering

  • Understanding partition and predicate filtering

  • Join optimisation

  • Combating data skew (preprocessing, broadcasting, salting)

  • Understanding shuffle partitions: how to tackle memory/disk spill

  • Downsides of using UDFs

  • Executor idle timeout

  • Data format examples

Moving to production

  • Debugging / troubleshooting

  • Productionizing your Spark application

  • Dynamic allocation and dynamic partitioning

  • Profiling your Spark application (Sparklint)

  • JVM profiler

Learning journey

Data Engineering Learning Journey

This online course is perfect for

Data and Machine Learning Engineers who transform large volumes of data and need production-quality code.

Expert Data Scientists can also participate: they will learn how to get the most performance out of Spark and how simple tweaks can improve performance dramatically.

What will you learn during Optimizing Apache Spark & Tuning Best Practices?

After this training, you will understand how Apache Spark works internally, know the best practices for writing performant code, and have the essential skills to debug and tune your Spark applications.

Meet your trainer

Oleksandra Bovkun

Data Engineer

Before joining GoDataDriven, Oleksandra worked as a data modeler and information/API modeler. Oleksandra combines strong knowledge of SQL, databases, and data modeling with a well-developed understanding of conceptual design principles. Oleksandra has solid experience with ETL processes and data preparation techniques, including data management and data architecture.

Oleksandra has worked extensively with different SQL and NoSQL databases (including Oracle, PostgreSQL, SQL Server, Azure Database, Neo4j, and Elasticsearch), both as a developer and as a DBA, as well as with programming languages (Python, Scala) and data modeling tools. Oleksandra has professional knowledge of Agile development processes and the Scrum methodology and is comfortable with gathering requirements and prioritizing backlogs.

Clients include: Knab

Flexible delivery

The Right Format For Your Preferred Learning Style

  • In-Classroom & In-Company Training
  • Online, Instructor-Led Training
  • Hybrid and Blended Learning
  • Self-Paced Training

Get in touch with the experts

Have any questions?

Contact Gert-Jan Steltenpool, the sales director of GoDataDriven Academy, if you want to know more. He’ll be happy to help you!

You can also reach him by phone at +31 6 4214 0783.
