Data Science with Spark Training

English | 3 days
Xebia Academy
Data Science - Senior

Apache Spark is a powerful open-source processing engine built around speed, ease of use, and advanced analytics. Through our experienced consultants, you can learn to unlock its full potential and master this challenging tool yourself.

“I liked every aspect of this training and would like to thank the trainers. They did an excellent job of explaining how to use Spark for data science. This is the fourth GoDataDriven training I’ve followed. All were great, but this was the best one so far.” —Data Scientist, Knab

What you'll learn

Spark basics

  • Spark execution
  • SparkSession
  • Transformations vs. actions
  • Laziness and lineage: how Spark optimizes code
  • How to use the Spark UI

Advanced Spark

  • How to apply partitioning and how Spark reads and writes data
  • Shuffling, narrow vs. wide operations, and their impact on performance
  • The Catalyst optimizer
  • About scheduling and job execution
  • About caching and persistence levels
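
The difference between transformations and actions, and the way laziness and lineage fit together, can be illustrated with a small pure-Python toy. This is only a sketch of the idea, not the Spark API: the class `LazyDataset` and its methods are hypothetical names invented for this illustration. Transformations only record a step; the action `collect` replays the recorded steps.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not Spark)."""

    def __init__(self, source, steps=()):
        self._source = list(source)
        self._steps = tuple(steps)  # recorded lineage: (name, function) pairs

    # Transformations: record a step and return a new dataset. Nothing runs yet.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def lineage(self):
        """Names of the recorded steps, in order."""
        return [name for name, _ in self._steps]

    # Action: replay the recorded steps over the source and return the rows.
    def collect(self):
        rows = iter(self._source)
        for name, fn in self._steps:
            rows = map(fn, rows) if name == "map" else filter(fn, rows)
        return list(rows)


calls = []  # records every input that double() actually processes


def double(x):
    calls.append(x)
    return x * 2


ds = LazyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # still 0: no transformation has executed
result = ds.collect()             # the action triggers the whole chain
print(ds.lineage(), calls_before_action, result)
```

In this toy, `double` is not called until `collect` runs. Spark itself goes further by optimizing the recorded plan before executing it, which this sketch does not attempt.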


DataFrames

  • The basic concepts
  • All about Spark DataFrames and pandas DataFrames
  • How to load and save DataFrames
  • The functions API
  • How to join data
  • User-defined functions and pandas’ user-defined functions (with performance implications)
  • Window operations

Machine Learning with Spark

  • Pre-processing data and feature engineering
  • Model selection
  • Pipeline API
  • Advanced topics

Spark structured streaming

  • Structured streaming
  • Machine Learning & streaming
  • Sources and sinks
  • Windows & aggregations
  • Checkpointing & watermarking
  • Fault tolerance & Kafka
  • Kafka as a source and as a sink


The program consists of both theory and hands-on exercises.

Day 1:

  • Spark basics
  • Advanced Spark
  • DataFrames

Day 2:

  • Window functions

Day 3:

  • Spark structured streaming
  • Integrating Apache Spark with Apache Kafka

Climbing a steep Python and Machine Learning curve in three days. This would have taken me months on my own.

FD Mediagroep Data Scientist

This training is perfect for

Anyone who works in an organization that uses Apache Spark and wants to get the most out of it. The training is not limited to Data Scientists who wish to scale their projects. Data Engineers, Data Analysts, Software Programmers, and Database Administrators who want to exploit Apache Spark will also benefit from this course. Prior experience with Python or software programming is required. Experience with SQL or pandas is helpful, but not required.

What will you learn during this training?

Gain the theoretical knowledge, hands-on experience, and best practices you need to get the most out of Apache Spark. After completing the training, you will be able to confidently use Apache Spark for data science at scale.

Data Science Skills

The Data Science Learning Journey

The training courses in this journey teach you new skills and empower you to stay ahead professionally. Whether you are a beginner or an advanced user, our practical Python courses build on solid fundamentals. We also offer courses on Spark, R, and Deep Learning.

The Right Format For Your Preferred Learning Style

At GoDataDriven, we offer four distinct training modalities:

  • In-Classroom & In-Company Training
  • Online, Instructor-Led Training
  • Hybrid and Blended Learning
  • Self-Paced Training

Clients we've helped

  • ING Bank
  • Ahold Delhaize
  • Quby

Pick a date and start learning

Online, instructor-led - English
December 2, 2020 - December 4, 2020
€1795

Online, instructor-led - English
March 15, 2021 - March 17, 2021
€1795

Online, instructor-led - English
June 2, 2021 - June 4, 2021
€1795