Explore how Kubernetes integrates with on-premises big data clusters in this conference talk. Learn about the design and architecture of the HDFS CSI Plugin, which addresses the challenge of consuming HDFS data from Kubernetes. Discover best practices for running Spark workloads on Kubernetes with HDFS access through the CSI plugin. Examine performance comparisons between Spark on Kubernetes with HDFS and Spark on YARN with HDFS, measured with the TPC-DS benchmark suite. Gain insights into the history of big data, the benefits of containerization, Kubernetes architecture, CSI core services, volume lifecycle management, and the characteristics of Hadoop HDFS when exposed as persistent volumes. Understand the potential of Kubernetes as an alternative to Hadoop YARN for resource scheduling in on-premises big data environments.
HDFS CSI Plugin: Speeding Up Kubernetes in On-Premises Big Data Clusters