Intro

What is a Streaming Live Table? - Based on Spark™ Structured Streaming

Development vs Production - Fast iteration or enterprise-grade reliability

Choosing pipeline boundaries - Break up pipelines at natural external divisions.

Pitfall: hard-coding sources & destinations - Problem: Hard-coding the source & destination makes it impossible to test changes outside of production, breaking CI/CD

Ensure correctness with Expectations - Expectations are tests that ensure data quality in production

Expectations using the power of SQL - Use SQL aggregates and joins to perform complex validations

Using Python - Write advanced DataFrame code and UDFs

Installing libraries with pip - pip is a package installer for Python

Best Practice: Integrate using the event log - Use the information in the event log with your existing operational tools.

DLT Automates Failure Recovery - Transient issues are handled by built-in retry logic

Modularize your code with configuration - Avoid hard-coding paths, topic names, and other constants in your code.

Workflow Orchestration For Triggered DLT Pipelines

Use Delta for infinite retention - Delta provides cheap, elastic and governable storage for transient sources

Description:

Explore declarative ETL pipelines with Delta Live Tables (DLT) in this 51-minute SQLBits conference talk. Learn modern software engineering and management techniques for ETL that let data analysts and engineers focus on extracting value from data rather than on tooling. Discover Streaming Live Tables, which are based on Spark Structured Streaming, and understand the differences between development and production environments. Gain insights on choosing pipeline boundaries, avoiding pitfalls such as hard-coding sources and destinations, and ensuring data quality through Expectations. Delve into using SQL for complex validations and Python for advanced DataFrame operations. Learn best practices for integrating with existing operational tools, automating failure recovery, and modularizing code with configuration. Explore workflow orchestration for triggered DLT pipelines and the use of Delta for infinite retention of data from transient sources. Speaker Vuong Nguyen covers Azure, Spark, Data Lake, AWS, GCP, and Databricks, with topics in Big Data & Data Engineering for developers and data professionals.

Declarative ETL Pipelines with Delta Live Tables - Modern Software Engineering for Data Analysts and Engineers

SQLBits

#Data Science #Big Data #Apache Spark #Delta Lake #Delta Live Tables #Programming #Programming Languages #Python #Domain-Specific Languages (DSL) #SQL #Business #Business Intelligence #Data Lakes #Data Engineering #Data Pipelines #ETL