Building Robust Streaming Data Pipelines with Apache Spark

Explore the challenges and solutions involved in building robust streaming data pipelines with Apache Spark in this 42-minute conference talk by Zak Hassan from Red Hat. Learn how to integrate Apache Kafka, Apache Spark, and Apache Camel into a continuous data pipeline for Spark applications, addressing issues such as dirty data in ETL processes. Discover techniques for extracting, transforming, and loading data from a variety of systems into Apache Kafka, and learn how to leverage Spark's built-in Kafka connector. Gain insight into running these technologies inside Docker, and benefit from lessons learned in real-world implementations. The talk covers data preparation and a range of data types and formats, and includes demonstrations comparing Hive and Spark as well as practical examples using HDFS and Python code.
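To make the Kafka-to-Spark piece of the pipeline concrete, below is a minimal PySpark sketch of consuming a Kafka topic with Spark's built-in Kafka connector via Structured Streaming. It is not code from the talk: the broker address (localhost:9092), topic name (events), and checkpoint path are illustrative assumptions, and running it requires the spark-sql-kafka package on the Spark classpath.

```python
# A minimal sketch (not from the talk): reading a Kafka topic with
# Spark's built-in Kafka connector via Structured Streaming.
# Broker address, topic name, and paths are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (
    SparkSession.builder
    .appName("streaming-pipeline-sketch")
    .getOrCreate()
)

# Read a continuous stream of records from Kafka.
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker
    .option("subscribe", "events")                        # assumed topic
    .load()
)

# Kafka delivers keys and values as binary; cast them to strings
# before any transformation step.
events = raw.select(
    col("key").cast("string"),
    col("value").cast("string"),
)

# Write the stream out (to the console here, for demonstration);
# a real pipeline might instead write to HDFS, as the talk mentions.
query = (
    events.writeStream
    .format("console")
    .option("checkpointLocation", "/tmp/checkpoints/events")  # assumed path
    .start()
)
query.awaitTermination()
```

In a production pipeline, the console sink would be swapped for a durable one, and the checkpoint location lets Spark resume from the last committed Kafka offsets after a failure.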