Data Science with Spark
Apache Spark is a powerful open-source processing engine built around speed, ease of use, and advanced analytics. Through our experienced consultants, you can learn to unlock its full potential and master this challenging tool yourself.
“I liked every aspect of this training and would like to thank the trainers. They did an excellent job of explaining how to use Spark for data science. This is the fourth GoDataDriven training I’ve followed. All were great, but this was the best one so far.” —Data Scientist, Knab
What you'll learn
- Spark execution
- Transformations vs. actions
- Laziness and lineage: how Spark optimizes code
- How to use the Spark UI
- Advanced Spark
- How to apply partitioning and how Spark reads and writes data
- Shuffling, narrow vs. wide operations, and their impact on performance
- The Catalyst optimizer
- Scheduling and job execution
- Caching and persistence levels
- The basic concepts
- All about Spark DataFrames and pandas DataFrames
- How to load and save DataFrames
- The functions API
- How to join data
- User-defined functions and pandas user-defined functions (with performance implications)
- Window operations
- Machine Learning with Spark
- Pre-processing data and feature engineering
- Model selection
- Pipeline API
- Advanced topics
- Spark structured streaming
- Structured streaming
- Machine Learning & streaming
- Sources and sinks
- Windows & aggregations
- Checkpointing & watermarking
- Fault tolerance & Kafka
- Kafka as a source and as a sink
The program consists of both theory and hands-on exercises.
- Spark basics
- Advanced Spark
- Window functions
- Spark structured streaming
- Integrating Apache Spark with Apache Kafka
“Climbing a steep Python and Machine Learning curve in three days. This would have taken me months on my own.”
This training is perfect for
Anyone working in an organization that uses Apache Spark and wants to get the most out of it. The training is not limited to Data Scientists who wish to scale their projects. Data Engineers, Data Analysts, Software Programmers, and Database Administrators who want to exploit Apache Spark will also benefit from this course. Prior experience with Python or software programming is required. Experience with SQL and pandas is helpful, but not required.
What will you learn during this training?
Gain the theoretical knowledge, hands-on experience, and best practices you need to get the most out of Apache Spark. After completing the training, you will be able to confidently use Apache Spark for data science at scale.
Data Science Skills
The Data Science Learning Journey
The training courses in this journey teach you new skills and empower you to stay ahead professionally. Whether you are a beginner or an advanced user, our Python courses combine solid fundamentals with practical application. We also offer courses on Spark, R, and Deep Learning.
The Right Format For Your Preferred Learning Style
At GoDataDriven, we offer four distinct training modalities:
- In-Classroom & In-Company Training
- Online, Instructor-Led Training
- Hybrid and Blended Learning
- Self-Paced Training