Explore how Zillow's data engineering team redesigned its data pipeline architecture with Apache Spark in this 27-minute conference talk. Learn about the challenge of balancing development speed with pipeline maintainability in a rapidly evolving organization. Discover how Zillow identified and paid down technical debt, strengthened data quality enforcement, consolidated shared pipeline functionality, and implemented complex business logic at scale. Gain insight into the design of a new end-to-end pipeline architecture that improves robustness, maintainability, and scalability while reducing code complexity. Understand the pain points of pipeline development, maintenance, and scaling, and weigh the pros and cons of common ETL patterns. Delve into Zillow's approach to building more scalable and robust data pipelines with Apache Spark, including the establishment of processing layers, the development of a Pipeler Library, config-driven orchestration, separation of data processing from business logic, and early data validation.
Designing the Next Generation of Data Pipelines with Apache Spark - Zillow's Approach
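To make the patterns mentioned in the description more concrete, here is a minimal PySpark sketch of a config-driven pipeline with early data validation and business logic kept separate from generic processing steps. It is an illustration only, not Zillow's Pipeler Library: the config keys, schema, paths, and helper names (PIPELINE_CONFIG, PROPERTY_SCHEMA, run_pipeline, STEP_REGISTRY) are all hypothetical.

```python
# Minimal sketch of a config-driven Spark pipeline with early validation.
# All names and paths below are hypothetical illustrations of the patterns
# described in the talk summary, not Zillow's actual implementation.
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical pipeline config: which steps run, in what order, and where data lives.
PIPELINE_CONFIG = {
    "source_path": "s3://example-bucket/raw/listings/",
    "output_path": "s3://example-bucket/curated/listings/",
    "steps": ["validate", "deduplicate", "enrich"],
}

# Expected input schema, applied at read time so malformed data surfaces early.
PROPERTY_SCHEMA = StructType([
    StructField("listing_id", StringType(), nullable=False),
    StructField("zip_code", StringType(), nullable=True),
    StructField("price", DoubleType(), nullable=True),
])


def validate(df: DataFrame) -> DataFrame:
    """Early data validation: drop rows missing the required key, fail fast on empty input."""
    cleaned = df.filter(F.col("listing_id").isNotNull())
    if cleaned.rdd.isEmpty():
        raise ValueError("Validation failed: no rows with a listing_id")
    return cleaned


def deduplicate(df: DataFrame) -> DataFrame:
    """Shared processing-layer step: keep one row per listing."""
    return df.dropDuplicates(["listing_id"])


def enrich(df: DataFrame) -> DataFrame:
    """Business-logic step, kept separate from the generic processing steps above."""
    return df.withColumn(
        "price_band",
        F.when(F.col("price") >= 1_000_000, "high").otherwise("standard"),
    )


# Registry lets the config, rather than hard-coded calls, decide what runs.
STEP_REGISTRY = {"validate": validate, "deduplicate": deduplicate, "enrich": enrich}


def run_pipeline(spark: SparkSession, config: dict) -> None:
    """Config-driven orchestration: read, apply configured steps in order, write."""
    df = spark.read.schema(PROPERTY_SCHEMA).parquet(config["source_path"])
    for step_name in config["steps"]:
        df = STEP_REGISTRY[step_name](df)
    df.write.mode("overwrite").parquet(config["output_path"])


if __name__ == "__main__":
    spark = SparkSession.builder.appName("config-driven-pipeline-sketch").getOrCreate()
    run_pipeline(spark, PIPELINE_CONFIG)
```

The key design idea this sketch tries to capture is that adding or reordering steps becomes a config change rather than a code change, while validation at the start of the pipeline keeps bad records from propagating into downstream business logic.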