Intro

Data Challenges

Usual Data Lake

Getting the Data Right

Best Practices for Cluster Sizing & Selection

Selection of Instance Types

Selection of Node Size: Rule of Thumb

Observe Spark UI & tweak the workloads

Observe Ganglia Metrics & tweak the workloads

Performance Symptoms

Adaptive Query Execution

Data Governance with Delta Lake

Audit & Monitoring

Description:

Discover best practices for building robust data platforms using Apache Spark and Delta in this 27-minute talk from Databricks. Learn from real-world experiences to overcome technical challenges and create performant, scalable pipelines. Gain insights into operational tips for Apache Spark in production, optimal data pipeline design, and common misconfigurations to avoid. Explore strategies for optimizing costs, achieving performance at scale, and ensuring security compliance with GDPR and CCPA. Acquire valuable knowledge on cluster sizing, instance type selection, and workload optimization using Spark UI and Ganglia Metrics. Understand the benefits of Adaptive Query Execution and data governance with Delta Lake. Suitable for attendees with some experience in setting up Big Data pipelines and Apache Spark.

Best Practices for Building Robust Data Platforms with Apache Spark and Delta

Databricks

#Data Science #Big Data #Apache Spark #Data Engineering #Data Pipelines #Delta Lake