PySpark Best Practices
Open Data Science

Syllabus:
1. Cloudera
2. Spark Execution Model
3. PySpark Driver Program
4. How do we ship around Python functions?
5. Pickle! (see the first sketch below)
6. DataFrame is just another word for...
7. Use DataFrames
8. REPLs and Notebooks
9. Share your code
10. Standard Python Project
11. What is the shape of a PySpark job?
12. PySpark Structure?
13. Simple Main Method (see the second sketch below)
14. Write Testable Code
15. Write Serializable Code
16. Testing with SparkTestingBase
17. Testing Suggestions
18. Writing distributed code is the easy part...
19. Get Serious About Logs
20. Know your environment
21. Complex Dependencies
22. Many Python Environments
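
Syllabus items 4-5 hinge on how PySpark gets the functions you pass to transformations from the driver out to the executors: it pickles them. A minimal sketch of the idea, using the standalone cloudpickle package (a close relative of the pickler PySpark bundles) rather than any code from the talk itself:

```python
import pickle

import cloudpickle  # pip install cloudpickle


def make_predicate(threshold):
    # A closure like the ones routinely passed to rdd.filter(...).
    return lambda x: x > threshold


predicate = make_predicate(10)

# The standard-library pickler rejects lambdas and closures...
try:
    pickle.dumps(predicate)
except Exception as exc:
    print("stdlib pickle failed: %s" % exc)

# ...which is why PySpark serializes functions with a cloudpickle-style
# pickler before shipping them. The output is an ordinary pickle stream
# that plain pickle can load on the receiving side.
restored = pickle.loads(cloudpickle.dumps(predicate))
print(restored(11))  # True
```

The practical consequence, echoed in item 15, is that anything a transformation closes over must itself be picklable.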
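Items 13-15 advocate structuring a job as a small main method that calls plain, top-level functions, which keeps the logic both serializable and unit-testable, and echoes the talk's running example of simple statistics over time series data. A hedged sketch of that shape; the function names, the "key,value" CSV format, and the input path are illustrative assumptions, not the talk's own code:

```python
from pyspark import SparkContext


def parse_record(line):
    """Parse one 'key,value' line of time series data."""
    key, value = line.split(",")
    return key, float(value)


def mean_by_key(pairs):
    """Average values per key; pure RDD logic with no driver state."""
    sums = pairs.combineByKey(
        lambda v: (v, 1),
        lambda acc, v: (acc[0] + v, acc[1] + 1),
        lambda a, b: (a[0] + b[0], a[1] + b[1]),
    )
    return sums.mapValues(lambda acc: acc[0] / acc[1])


def main():
    # The main method only wires things together; all logic that runs
    # on executors lives in the top-level functions above.
    sc = SparkContext(appName="timeseries-means")
    try:
        records = sc.textFile("hdfs:///data/timeseries.csv").map(parse_record)
        for key, mean in mean_by_key(records).collect():
            print("%s\t%f" % (key, mean))
    finally:
        sc.stop()


if __name__ == "__main__":
    main()
```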
Description:
Learn best practices for using PySpark in real-world applications through this conference talk from ODSC West 2015. Discover how to manage dependencies on a cluster, avoid common pitfalls of Python's duck typing, and understand Spark's computational model for effective distributed code execution. Explore techniques for package management with virtualenv, testing PySpark applications, and structuring code for optimal performance. Gain insights into handling complex dependencies, implementing proper logging, and navigating multiple Python environments. Follow along with a practical example of a statistical analysis on time series data to reinforce key concepts and improve your PySpark development skills.
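
To make the testing advice concrete: functions factored out as in the sketch above can be exercised against a local SparkContext from an ordinary unittest; the spark-testing-base library named in the syllabus packages this setUp/tearDown boilerplate into reusable base classes. The timeseries_job module below is hypothetical, standing in for the job file from the previous sketch:

```python
import unittest

from pyspark import SparkContext

from timeseries_job import mean_by_key, parse_record


class TimeSeriesJobTest(unittest.TestCase):
    def setUp(self):
        # A local two-thread master is enough to exercise the path
        # that pickles functions out to Python workers.
        self.sc = SparkContext("local[2]", "timeseries-job-test")

    def tearDown(self):
        self.sc.stop()

    def test_parse_record(self):
        self.assertEqual(parse_record("a,1.5"), ("a", 1.5))

    def test_mean_by_key(self):
        pairs = self.sc.parallelize([("a", 1.0), ("a", 3.0), ("b", 2.0)])
        self.assertEqual(dict(mean_by_key(pairs).collect()),
                         {"a": 2.0, "b": 2.0})


if __name__ == "__main__":
    unittest.main()
```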
