Ayon Roy

PySpark - Combining Machine Learning & Big Data

How do you apply machine learning when your dataset is too big for a single machine? Discover PySpark's powerful, distributed ML pipelines.

#1 (about 3 minutes)

Combining big data and machine learning for business insights

The exponential growth of data necessitates combining big data processing with machine learning to personalize user experiences and drive revenue.

#2 (about 3 minutes)

An introduction to the Apache Spark analytics engine

Apache Spark is a unified analytics engine for large-scale data processing that provides high-level APIs and specialized libraries like Spark SQL and MLlib.
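
As a taste of what that looks like from code (a minimal sketch, not material from the talk; the app name and toy rows are invented), Spark SQL can be driven from a few lines of Python:

```python
# Minimal sketch: start a Spark session and run a SQL query over a tiny,
# invented DataFrame. The app name and data are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("intro-demo").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.createOrReplaceTempView("items")

# Spark SQL: query distributed data with ordinary SQL.
spark.sql("SELECT id FROM items WHERE label = 'a'").show()

spark.stop()
```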

#3 (about 4 minutes)

Understanding Spark's core data APIs and abstractions

Spark's data abstractions evolved from the low-level Resilient Distributed Dataset (RDD) to the more optimized and user-friendly DataFrame and Dataset APIs.
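
A small sketch of the contrast (toy numbers, invented names); note that the typed Dataset API exists only for Scala and Java, so from Python the choice is between RDDs and DataFrames:

```python
# Sketch: the same computation with the low-level RDD API and the
# optimizer-backed DataFrame API. Data is invented.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("abstractions-demo").getOrCreate()

# RDD: you spell out *how* to compute, element by element.
rdd = spark.sparkContext.parallelize([1, 2, 3, 4])
print(rdd.map(lambda x: x * x).collect())

# DataFrame: you declare *what* you want; the Catalyst optimizer plans it.
df = spark.createDataFrame([(x,) for x in [1, 2, 3, 4]], ["x"])
df.selectExpr("x * x AS x_squared").show()

spark.stop()
```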

#4 (about 11 minutes)

How the Spark cluster architecture enables parallel processing

Spark's driver program requests resources through a cluster manager and schedules tasks on worker nodes, where executors process the data in parallel.
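
To make those roles concrete, here is a hedged sketch of how a driver asks for executors when the session is built; the master URL, instance counts, and memory sizes are placeholder values, not recommendations from the talk:

```python
# Sketch: driver-side configuration of executors. "spark://host:7077" is a
# hypothetical standalone cluster-manager URL; the sizes are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cluster-demo")
    .master("spark://host:7077")               # cluster manager endpoint
    .config("spark.executor.instances", "4")   # executors on worker nodes
    .config("spark.executor.cores", "2")       # parallel tasks per executor
    .config("spark.executor.memory", "4g")
    .getOrCreate()
)

# The driver splits this job into tasks; the executors run them in parallel.
print(spark.sparkContext.parallelize(range(1000), numSlices=8).sum())

spark.stop()
```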

#5 (about 5 minutes)

Using Python with Spark through the PySpark library

PySpark provides a Python API for Spark, using the Py4J library to communicate between the Python process and Spark's core JVM environment.
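
You can actually see that bridge from a PySpark session; the underscore-prefixed attributes below are internal and version-dependent, so treat this as a peek at the machinery rather than a public API:

```python
# Sketch: peeking at the Py4J bridge. _jvm and _jsc are internal,
# underscore-prefixed attributes and may change between Spark versions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("py4j-demo").getOrCreate()
sc = spark.sparkContext

print(type(sc._jvm))      # Py4J gateway view into Spark's JVM
print(sc._jsc.version())  # the Java SparkContext, reached via Py4J

spark.stop()
```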

#6 (about 5 minutes)

Exploring the key features of the Spark MLlib library

Spark's MLlib offers a comprehensive toolkit for machine learning, including pre-built algorithms, featurization tools, pipelines for workflow management, and model persistence.
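
For instance, featurization and model persistence look roughly like this (a hedged sketch: the columns, labels, and /tmp path are invented for illustration):

```python
# Sketch: featurization plus model persistence. Columns, label values, and
# the /tmp path are invented for illustration.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

data = spark.createDataFrame(
    [(0.0, 1.0, 0.0), (1.0, 0.0, 1.0)], ["f1", "f2", "label"])

# Featurization: pack raw columns into the single vector column MLlib expects.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
train = assembler.transform(data)

# Pre-built algorithm, then persistence: save the fitted model and reload it.
model = LogisticRegression().fit(train)
model.write().overwrite().save("/tmp/lr_model")
reloaded = LogisticRegressionModel.load("/tmp/lr_model")

spark.stop()
```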

#7 (about 4 minutes)

The standard workflow for machine learning in PySpark

A typical machine learning workflow in Spark involves using DataFrames, applying Transformers for feature engineering, training a model with an Estimator, and orchestrating these steps with a Pipeline.
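
A minimal sketch of that exact sequence (the text snippets and labels are invented), using the common Tokenizer / HashingTF / LogisticRegression pattern:

```python
# Sketch: Transformers (Tokenizer, HashingTF) feed an Estimator
# (LogisticRegression), all chained by a Pipeline. Data is invented.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("pipeline-demo").getOrCreate()

training = spark.createDataFrame([
    (0, "spark handles big data", 1.0),
    (1, "cats sleep all day", 0.0),
], ["id", "text", "label"])

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)

pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])
model = pipeline.fit(training)  # fitting the Pipeline runs every stage in order
model.transform(training).select("id", "prediction").show()

spark.stop()
```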

#8 (about 3 minutes)

Pre-built algorithms and utilities available in Spark MLlib

MLlib includes a variety of common, pre-built algorithms for classification, regression, and clustering, such as logistic regression, SVM, and K-means clustering.
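
As one concrete example (toy points, an arbitrary seed), K-means from pyspark.ml takes only a few lines; for reference, the linear SVM in the DataFrame-based API goes by the name LinearSVC:

```python
# Sketch: K-means on four invented 2-D points; k and seed are arbitrary.
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("kmeans-demo").getOrCreate()

df = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
     (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)],
    ["features"])

model = KMeans(k=2, seed=42).fit(df)
print(model.clusterCenters())  # two centers, one per discovered cluster

spark.stop()
```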
