Explore architecting for data quality in the lakehouse with Delta Lake and PySpark in this comprehensive tech talk. Learn how to combat data downtime by adopting DevOps and software engineering best practices. Discover techniques for identifying, resolving, and preventing data issues across the data lakehouse. Gain insights into optimizing data reliability across metadata, storage, and query engine tiers. Build your own data observability monitors using PySpark and understand the role of tools like Delta Lake in scaling this design. Dive into topics such as the Data Quality Cone of Anxiety, data observability principles, and the Data Reliability Lifecycle. Examine the differences between data lakes and warehouses, and explore practical examples of measuring update times, loading data, and feature engineering. Access accompanying exercises and Jupyter notebooks to apply your newfound knowledge in real-world scenarios.
Architecting for Data Quality in the Lakehouse with Delta Lake and PySpark
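As a taste of the kind of data observability monitor the exercises build, here is a minimal sketch of a freshness check over a Delta table using PySpark. It assumes delta-spark is installed and configured; `table_path` and the 24-hour `freshness_sla` are hypothetical values for illustration, not taken from the talk's notebooks.

```python
# Minimal sketch of a table-freshness monitor over a Delta table.
# `table_path` and the 24-hour SLA are illustrative assumptions.
from datetime import datetime, timedelta

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("freshness-monitor")
    # Standard delta-spark configuration so Delta SQL (DESCRIBE HISTORY) is available.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

table_path = "/tmp/delta/events"      # hypothetical table location
freshness_sla = timedelta(hours=24)   # hypothetical update-time SLA

# Delta's transaction log keeps one row per commit; the `timestamp`
# column records when each write landed, so its max is the last update.
history = spark.sql(f"DESCRIBE HISTORY delta.`{table_path}`")
last_update = (
    history.agg(F.max("timestamp").alias("last_update"))
    .collect()[0]["last_update"]
)

# Assumes the driver clock and the Spark session timezone agree.
lag = datetime.now() - last_update
if lag > freshness_sla:
    print(f"ALERT: no update to {table_path} for {lag}")
else:
    print(f"OK: {table_path} last updated {lag} ago")
```

The same pattern extends to the other dimensions discussed in the talk: swap the metric (row counts, null rates, schema changes) while keeping the measure-compare-alert loop the same.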