1. Intro
2. Experimental High Energy Physics is Data Intensive
3. Key Data Processing Challenge
4. Data Flow at LHC Experiments
5. R&D - Data Pipelines
6. Particle Classifiers Using Neural Networks
7. Deep Learning Pipeline for Physics Data
8. Analytics Platform at CERN
9. Hadoop and Spark Clusters at CERN
10. Step 1: Data Ingestion • Read input files: 4.5 TB in a custom (ROOT) format
11. Feature Engineering
12. Step 2: Feature Preparation • Features are converted to formats suitable for training
13. Performance and Lessons Learned • Data preparation is CPU bound
14. Neural Network Models
15. Hyper-Parameter Tuning - DNN • Hyper-parameter tuning of the DNN model
16. Deep Learning at Scale with Spark
17. Spark, Analytics Zoo and BigDL
18. BigDL Runs as Standard Spark Programs
19. BigDL Parameter Synchronization
20. Model Development - DNN for HLF • Model is instantiated using the Keras-compatible API provided by Analytics Zoo
21. Model Development - GRU + HLF • A more complex network topology, combining a GRU on Low-Level Features with a DNN on High-Level Features
22. Distributed Training
23. Performance and Scalability of Analytics Zoo/BigDL
24. Results - Model Performance
25. Workload Characterization
26. Training with TensorFlow 2.0 • Training and test data
27. Recap: Our Deep Learning Pipeline with Spark
28. Model Serving and Future Work
29. Summary • The use case developed addresses the need for higher efficiency in event filtering at LHC experiments • Spark, Python notebooks
30. Labeled Data for Training and Test • Simulated events • Software simulators are used to generate events
Description:
Explore a 39-minute conference talk detailing CERN's implementation of an Apache Spark-based data pipeline for deep learning research in High Energy Physics (HEP). Discover how CERN tackles the challenges of processing massive data volumes from Large Hadron Collider experiments, with particle collisions occurring every 25 nanoseconds. Learn about the novel event filtering system prototype using deep neural networks, and how it optimizes compute and storage resource usage. Dive into the data pipeline's architecture, which integrates PySpark, Spark SQL, and Python code via Jupyter notebooks for data preparation and feature engineering. Understand the key integrations enabling Apache Spark to ingest HEP data formats and interact with CERN's storage and compute systems. Examine the distributed training of neural network models using Keras API, BigDL, and Analytics Zoo on Spark clusters. Gain insights into the implementation details, results, and lessons learned from this cutting-edge application of big data technologies in particle physics research.
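
The description notes that the pipeline ingests the experiments' ROOT-format data directly into Spark and does data preparation with PySpark and Spark SQL. A minimal sketch of what such an ingestion step can look like, assuming the spark-root data source is available on the cluster; the package coordinates, format name, input path, and column names below are illustrative assumptions, not details taken from the talk:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a Spark session; the spark-root package must be on the classpath,
# e.g. via --packages org.diana-hep:spark-root_2.11:<version> (illustrative).
spark = (SparkSession.builder
         .appName("HEP data ingestion sketch")
         .getOrCreate())

# Read the custom ROOT format into a Spark DataFrame.
# The format class name and input path are assumptions for illustration;
# the exact format string depends on the spark-root version in use.
events = (spark.read
          .format("org.dianahep.sparkroot")
          .load("/eos/project/hep/events/*.root"))

# Spark SQL-style data preparation: filter events and select the columns
# used downstream for feature engineering. Column names are hypothetical.
prepared = (events
            .filter(F.col("nMuons") > 0)
            .select("eventId", "muons", "jets", "missingET"))

# Persist the prepared dataset in a columnar format for the next steps.
prepared.write.mode("overwrite").parquet("/eos/project/hep/prepared/")
```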
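The description also mentions that models are defined with the Keras-compatible API provided by Analytics Zoo and trained in a distributed fashion with BigDL on Spark clusters. The sketch below shows the general shape of such a model definition and fit; the feature count, class count, layer sizes, training parameters, and the synthetic data are illustrative assumptions, not the exact configuration presented in the talk:

```python
import numpy as np
from bigdl.util.common import Sample
from zoo.common.nncontext import init_nncontext
from zoo.pipeline.api.keras.models import Sequential
from zoo.pipeline.api.keras.layers import Dense

# Get a SparkContext initialized for Analytics Zoo/BigDL.
sc = init_nncontext("HLF classifier sketch")

# Synthetic training data purely for illustration: in the real pipeline the
# Samples would be built from the prepared high-level features (HLF).
def to_sample(_):
    features = np.random.rand(14).astype("float32")  # 14 HLF values (assumed)
    label = np.zeros(3, dtype="float32")             # 3 classes (assumed)
    label[np.random.randint(3)] = 1.0
    return Sample.from_ndarray(features, label)

train_rdd = sc.parallelize(range(1024)).map(to_sample)

# A small fully connected classifier on high-level features; the topology
# here is a stand-in, not the network described in the talk.
model = Sequential()
model.add(Dense(50, activation="relu", input_shape=(14,)))
model.add(Dense(20, activation="relu"))
model.add(Dense(10, activation="relu"))
model.add(Dense(3, activation="softmax"))

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Fitting on an RDD of Samples runs data-parallel training across the
# Spark executors, with BigDL handling parameter synchronization.
model.fit(train_rdd, batch_size=256, nb_epoch=5)
```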

Deep Learning Pipelines for High Energy Physics Using Apache Spark and Distributed Keras

Databricks