Introduction

What will you learn?

Nielsen Identity in numbers

Common data pipeline pattern - Airflow DAG

Spark clusters

What is EMR?

EMR pricing - example

Running Airflow-based Spark jobs on EMR

Basic Kubernetes terminology

Kubernetes auto-scale

Spark-On-Kubernetes overview

Spark-submit example - SparkPi

Spark-On-Kubernetes operator example - SparkPi

Airflow Spark Kubernetes integration

Common data pipeline pattern - revised

Connecting the dots... making it production-ready

Visibility

Robustness

Airflow integration current status

Description:

Learn how to migrate Apache Spark workloads from AWS EMR to Kubernetes in this 21-minute conference talk by Databricks. Explore the challenges of an existing Spark infrastructure and the motivation for moving to Kubernetes. Discover what running Spark natively on Kubernetes involves, including monitoring and logging, and pick up best practices for using Airflow as the orchestrator. Follow Nielsen Identity, which processes massive amounts of data with Apache Spark, and see how the team combined the GCP Spark-on-K8s operator with a native Airflow integration. Topics include Kubernetes auto-scaling, an overview of Spark on Kubernetes, and making the migration production-ready. The talk is aimed at data engineers and architects looking to optimize Spark workloads and reduce operational costs.

Migrating Airflow-Based Apache Spark Jobs to Kubernetes - The Native Way

Databricks

#Data Science #Big Data #Apache Spark #Computer Science #DevOps #Kubernetes #Data Engineering #Programming #Cloud Computing #Cloud Migration #Containerization