Introduction

Data

RDD vs DataFrame

What is RDD

Aggregate by Key

Describe

Replacing unknown fields

Dummy variables

For loops

SQL statements

Column selection statements

Putting it all together

Labeling

Vectors

Sparse Vector

Spark ML

Classifier

Transform

Evaluation

Logistic Regression

Comparing Results

Description:

Explore the power of PySpark for Big Data and Data Science in this 57-minute conference talk from PASS Data Community Summit. Dive into Big Data analytics with Spark and Python, learning how to perform essential analytical tasks such as creating RDDs and DataFrames, transforming columns, and generating aggregations. Discover the differences between RDDs and DataFrames, understand key concepts like aggregate by key, and learn how to replace unknown fields and create dummy variables. Gain insights into using SQL statements, column selection statements, and for loops in PySpark. Delve into machine learning with Spark ML, including labeling, classification, transformation, and evaluation. Learn how sparse vectors represent data efficiently, and compare the logistic regression results with those of the other classifier. Whether you're new to Big Data or looking to expand your skillset, this talk provides an introduction to PySpark's capabilities in data science and analytics.

Power of Electric Snakes! PySpark for Big Data and Data Science

PASS Data Community Summit

#Conference Talks #PASS Data Community Summit #Data Science #Programming #Programming Languages #Python #Domain-Specific Languages (DSL) #SQL #Data Processing #Data Transformation #Big Data Analytics #Computer Science #Machine Learning #Logistic Regression #Big Data #Apache Spark #RDDs
