1. Intro
2. Experimental High Energy Physics is Data Intensive
3. Key Data Processing Challenge
4. Data Flow at LHC Experiments
5. R&D - Data Pipelines
6. Particle Classifiers Using Neural Networks
7. Deep Learning Pipeline for Physics Data
8. Analytics Platform at CERN
9. Hadoop and Spark Clusters at CERN
10. Step 1: Data Ingestion • Read input files: 4.5 TB in a custom (ROOT) format
11. Feature Engineering
12. Step 2: Feature Preparation • Features are converted to formats suitable for training
13. Performance and Lessons Learned • Data preparation is CPU bound
14. Neural Network Models
15. Hyper-Parameter Tuning - DNN • Hyper-parameter tuning of the DNN model
16. Deep Learning at Scale with Spark
17. Spark, Analytics Zoo and BigDL
18. BigDL Runs as Standard Spark Programs
19. BigDL Parameter Synchronization
20. Model Development - DNN for HLF • Model is instantiated using the Keras-compatible API provided by Analytics Zoo
21. Model Development - GRU + HLF • A more complex network topology, combining a GRU on Low-Level Features with a DNN on High-Level Features
22. Distributed Training
23. Performance and Scalability of Analytics Zoo/BigDL
24. Results - Model Performance
25. Workload Characterization
26. Training with TensorFlow 2.0 • Training and test data
27. Recap: Our Deep Learning Pipeline with Spark
28. Model Serving and Future Work
29. Summary • The use case developed addresses the need for higher efficiency in event filtering at LHC experiments • Spark, Python notebooks
30. Labeled Data for Training and Test • Simulated events • Software simulators are used to generate events
Description:
Explore a 39-minute conference talk detailing CERN's implementation of an Apache Spark-based data pipeline for deep learning research in High Energy Physics (HEP). Discover how CERN tackles the challenges of processing massive data volumes from Large Hadron Collider experiments, with particle collisions occurring every 25 nanoseconds. Learn about the novel event filtering system prototype using deep neural networks, and how it optimizes compute and storage resource usage. Dive into the data pipeline's architecture, which integrates PySpark, Spark SQL, and Python code via Jupyter notebooks for data preparation and feature engineering. Understand the key integrations enabling Apache Spark to ingest HEP data formats and interact with CERN's storage and compute systems. Examine the distributed training of neural network models using Keras API, BigDL, and Analytics Zoo on Spark clusters. Gain insights into the implementation details, results, and lessons learned from this cutting-edge application of big data technologies in particle physics research.
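
The description notes that the pipeline ingests the experiments' ROOT-format data directly into Spark and does data preparation with PySpark and Spark SQL. A minimal sketch of what such an ingestion step can look like, assuming the spark-root data source is available on the cluster; the package coordinates, format name, input path, and column names below are illustrative assumptions, not details taken from the talk:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a Spark session; the spark-root package must be on the classpath,
# e.g. via --packages org.diana-hep:spark-root_2.11:<version> (illustrative).
spark = (SparkSession.builder
         .appName("HEP data ingestion sketch")
         .getOrCreate())

# Read the custom ROOT format into a Spark DataFrame.
# The format class name and input path are assumptions for illustration;
# the exact format string depends on the spark-root version in use.
events = (spark.read
          .format("org.dianahep.sparkroot")
          .load("/eos/project/hep/events/*.root"))

# Spark SQL-style data preparation: filter events and select the columns
# used downstream for feature engineering. Column names are hypothetical.
prepared = (events
            .filter(F.col("nMuons") > 0)
            .select("eventId", "muons", "jets", "missingET"))

# Persist the prepared dataset in a columnar format for the next steps.
prepared.write.mode("overwrite").parquet("/eos/project/hep/prepared/")
```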
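The description also mentions that models are defined with the Keras-compatible API provided by Analytics Zoo and trained in a distributed fashion with BigDL on Spark clusters. The sketch below shows the general shape of such a model definition and fit; the feature count, class count, layer sizes, training parameters, and the synthetic data are illustrative assumptions, not the exact configuration presented in the talk:

```python
import numpy as np
from bigdl.util.common import Sample
from zoo.common.nncontext import init_nncontext
from zoo.pipeline.api.keras.models import Sequential
from zoo.pipeline.api.keras.layers import Dense

# Get a SparkContext initialized for Analytics Zoo/BigDL.
sc = init_nncontext("HLF classifier sketch")

# Synthetic training data purely for illustration: in the real pipeline the
# Samples would be built from the prepared high-level features (HLF).
def to_sample(_):
    features = np.random.rand(14).astype("float32")  # 14 HLF values (assumed)
    label = np.zeros(3, dtype="float32")             # 3 classes (assumed)
    label[np.random.randint(3)] = 1.0
    return Sample.from_ndarray(features, label)

train_rdd = sc.parallelize(range(1024)).map(to_sample)

# A small fully connected classifier on high-level features; the topology
# here is a stand-in, not the network described in the talk.
model = Sequential()
model.add(Dense(50, activation="relu", input_shape=(14,)))
model.add(Dense(20, activation="relu"))
model.add(Dense(10, activation="relu"))
model.add(Dense(3, activation="softmax"))

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Fitting on an RDD of Samples runs data-parallel training across the
# Spark executors, with BigDL handling parameter synchronization.
model.fit(train_rdd, batch_size=256, nb_epoch=5)
```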

Deep Learning Pipelines for High Energy Physics Using Apache Spark and Distributed Keras

Databricks