Explore architecting for data quality in the lakehouse with Delta Lake and PySpark in this comprehensive tech talk. Learn how to combat data downtime by adopting DevOps and software engineering best practices. Discover techniques for identifying, resolving, and preventing data issues across the data lakehouse. Gain insights into optimizing data reliability across metadata, storage, and query engine tiers. Build your own data observability monitors using PySpark and understand the role of tools like Delta Lake in scaling this design. Dive into topics such as the Data Quality Cone of Anxiety, data observability principles, and the Data Reliability Lifecycle. Examine the differences between data lakes and warehouses, and explore practical examples of measuring update times, loading data, and feature engineering. Access accompanying exercises and Jupyter notebooks to apply your newfound knowledge in real-world scenarios.
Architecting for Data Quality in the Lakehouse with Delta Lake and PySpark
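As a taste of the kind of data observability monitor the exercises build, here is a minimal sketch of a freshness check over a Delta table using PySpark. It assumes delta-spark is installed and configured; `table_path` and the 24-hour `freshness_sla` are hypothetical values for illustration, not taken from the talk's notebooks.

```python
# Minimal sketch of a table-freshness monitor over a Delta table.
# `table_path` and the 24-hour SLA are illustrative assumptions.
from datetime import datetime, timedelta

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("freshness-monitor")
    # Standard delta-spark configuration so Delta SQL (DESCRIBE HISTORY) is available.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

table_path = "/tmp/delta/events"      # hypothetical table location
freshness_sla = timedelta(hours=24)   # hypothetical update-time SLA

# Delta's transaction log keeps one row per commit; the `timestamp`
# column records when each write landed, so its max is the last update.
history = spark.sql(f"DESCRIBE HISTORY delta.`{table_path}`")
last_update = (
    history.agg(F.max("timestamp").alias("last_update"))
    .collect()[0]["last_update"]
)

# Assumes the driver clock and the Spark session timezone agree.
lag = datetime.now() - last_update
if lag > freshness_sla:
    print(f"ALERT: no update to {table_path} for {lag}")
else:
    print(f"OK: {table_path} last updated {lag} ago")
```

The same pattern extends to the other dimensions discussed in the talk: swap the metric (row counts, null rates, schema changes) while keeping the measure-compare-alert loop the same.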