Explore how unsupervised machine learning can scale data quality monitoring in Databricks in this 37-minute conference talk. Examine the limitations of traditional rules-and-metrics approaches, and discover a set of fully unsupervised machine learning algorithms designed to monitor data quality at scale. Learn how the algorithms work, their strengths and weaknesses, and how they are tested and calibrated. Walk through real-world examples using ticket sales data, and see how to set up monitoring in Anomalo. Review the resulting visualizations, including severity, explanation, distribution, and root cause analysis. Follow the process of encoding features automatically, building supervised models, and generating visualizations using SHAP values. Address challenges in implementation and testing, and learn how to get started with these techniques in Databricks.
Unsupervised Machine Learning for Scaling Data Quality Monitoring in Databricks