PySpark Best Practices
Open Data Science

Syllabus:
1. Cloudera
2. Spark Execution Model
3. PySpark Driver Program
4. How do we ship around Python functions?
5. Pickle! (see the first sketch below)
6. DataFrame is just another word for...
7. Use DataFrames
8. REPLs and Notebooks
9. Share your code
10. Standard Python Project
11. What is the shape of a PySpark job?
12. PySpark Structure?
13. Simple Main Method (see the second sketch below)
14. Write Testable Code
15. Write Serializable Code
16. Testing with SparkTestingBase
17. Testing Suggestions
18. Writing distributed code is the easy part...
19. Get Serious About Logs
20. Know your environment
21. Complex Dependencies
22. Many Python Environments
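
Syllabus items 4-5 hinge on how PySpark gets the functions you pass to transformations from the driver out to the executors: it pickles them. A minimal sketch of the idea, using the standalone cloudpickle package (a close relative of the pickler PySpark bundles) rather than any code from the talk itself:

```python
import pickle

import cloudpickle  # pip install cloudpickle


def make_predicate(threshold):
    # A closure like the ones routinely passed to rdd.filter(...).
    return lambda x: x > threshold


predicate = make_predicate(10)

# The standard-library pickler rejects lambdas and closures...
try:
    pickle.dumps(predicate)
except Exception as exc:
    print("stdlib pickle failed: %s" % exc)

# ...which is why PySpark serializes functions with a cloudpickle-style
# pickler before shipping them. The output is an ordinary pickle stream
# that plain pickle can load on the receiving side.
restored = pickle.loads(cloudpickle.dumps(predicate))
print(restored(11))  # True
```

The practical consequence, echoed in item 15, is that anything a transformation closes over must itself be picklable.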
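Items 13-15 advocate structuring a job as a small main method that calls plain, top-level functions, which keeps the logic both serializable and unit-testable, and echoes the talk's running example of simple statistics over time series data. A hedged sketch of that shape; the function names, the "key,value" CSV format, and the input path are illustrative assumptions, not the talk's own code:

```python
from pyspark import SparkContext


def parse_record(line):
    """Parse one 'key,value' line of time series data."""
    key, value = line.split(",")
    return key, float(value)


def mean_by_key(pairs):
    """Average values per key; pure RDD logic with no driver state."""
    sums = pairs.combineByKey(
        lambda v: (v, 1),
        lambda acc, v: (acc[0] + v, acc[1] + 1),
        lambda a, b: (a[0] + b[0], a[1] + b[1]),
    )
    return sums.mapValues(lambda acc: acc[0] / acc[1])


def main():
    # The main method only wires things together; all logic that runs
    # on executors lives in the top-level functions above.
    sc = SparkContext(appName="timeseries-means")
    try:
        records = sc.textFile("hdfs:///data/timeseries.csv").map(parse_record)
        for key, mean in mean_by_key(records).collect():
            print("%s\t%f" % (key, mean))
    finally:
        sc.stop()


if __name__ == "__main__":
    main()
```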
Description:
Learn best practices for using PySpark in real-world applications through this conference talk from ODSC West 2015. Discover how to manage dependencies on a cluster, avoid common pitfalls of Python's duck typing, and understand Spark's computational model for effective distributed code execution. Explore techniques for package management with virtualenv, testing PySpark applications, and structuring code for optimal performance. Gain insights into handling complex dependencies, implementing proper logging, and navigating multiple Python environments. Follow along with a practical example of a statistical analysis on time series data to reinforce key concepts and improve your PySpark development skills.
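
To make the testing advice concrete: functions factored out as in the sketch above can be exercised against a local SparkContext from an ordinary unittest; the spark-testing-base library named in the syllabus packages this setUp/tearDown boilerplate into reusable base classes. The timeseries_job module below is hypothetical, standing in for the job file from the previous sketch:

```python
import unittest

from pyspark import SparkContext

from timeseries_job import mean_by_key, parse_record


class TimeSeriesJobTest(unittest.TestCase):
    def setUp(self):
        # A local two-thread master is enough to exercise the path
        # that pickles functions out to Python workers.
        self.sc = SparkContext("local[2]", "timeseries-job-test")

    def tearDown(self):
        self.sc.stop()

    def test_parse_record(self):
        self.assertEqual(parse_record("a,1.5"), ("a", 1.5))

    def test_mean_by_key(self):
        pairs = self.sc.parallelize([("a", 1.0), ("a", 3.0), ("b", 2.0)])
        self.assertEqual(dict(mean_by_key(pairs).collect()),
                         {"a": 2.0, "b": 2.0})


if __name__ == "__main__":
    unittest.main()
```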
