What types of data processing tasks will I be able to perform after completing the course?

You will be able to perform a variety of tasks, including data cleaning, transformation, aggregation, and analysis of large datasets using PySpark’s RDDs and DataFrames.

What technologies and frameworks are covered in the course?

You’ll learn PySpark in detail, including RDDs, DataFrames, and SQL-based data processing, along with how PySpark fits into the Hadoop ecosystem.

Is prior knowledge in data engineering required?

No, prior experience is not required; the course introduces PySpark basics before moving to advanced use cases.

Does the course cover workflow automation and ETL?

Yes, you’ll learn how to design ETL workflows and automate big data processing with PySpark.

Can I preview a course before enrolling?

Yes, you can preview the first video and view the syllabus before you enroll. You must purchase the course to access content not included in the preview.

When will I have access to the lectures and assignments?

If you decide to enroll in the course before the session start date, you will have access to all of the lecture videos and readings for the course. You’ll be able to submit assignments once the session starts.

What will I get when I enroll?

Once you enroll and your session begins, you will have access to all videos and other resources, including reading items and the course discussion forum. You’ll be able to view and submit practice assessments, and complete required graded assignments to earn a grade and a Course Certificate.

When will I receive my Course Certificate?

If you complete the course successfully, your electronic Course Certificate will be added to your Accomplishments page. From there, you can print your Course Certificate or add it to your LinkedIn profile.

Why can’t I audit this course?

This course is currently available only to learners who have paid or received financial aid, when available.

Is financial aid available?

Yes. In select learning programs, you can apply for financial aid or a scholarship if you can’t afford the enrollment fee. If financial aid or a scholarship is available for your learning program selection, you’ll find a link to apply on the description page.

PySpark in Action: Hands-On Data Processing

This course is part of PySpark for Data Science Specialization

Instructor: Edureka

5 modules

Gain insight into a topic and learn the fundamentals.

Intermediate level

Recommended experience

2 weeks to complete

at 10 hours a week

Flexible schedule

Learn at your own pace

What you'll learn

Explore the fundamental concepts of Big Data and the components of the Hadoop ecosystem.
Explain the architecture and key principles of Apache Spark and its role in big data processing.
Utilize RDD transformations and actions to effectively process large-scale datasets with PySpark.
Execute advanced DataFrame operations, including data manipulation and aggregation techniques.

Details to know

Shareable certificate

Add to your LinkedIn profile

Assessments

17 assignments

Taught in English

Build your subject-matter expertise

This course is part of the PySpark for Data Science Specialization

When you enroll in this course, you'll also be enrolled in this Specialization.

Learn new concepts from industry experts
Gain a foundational understanding of a subject or tool
Develop job-relevant skills with hands-on projects
Earn a shareable career certificate

There are 5 modules in this course

PySpark in Action: Hands-on Data Processing is a practical course that equips you to work confidently with large-scale data using PySpark and distributed data processing frameworks. You’ll discover the fundamentals of Big Data, Apache Hadoop, and Apache Spark, then build on this knowledge through real-world exercises where you’ll process and analyze massive datasets.

During the course, you’ll gain hands-on experience with:

- Foundational concepts of Big Data and components of the Hadoop ecosystem such as HDFS, enabling you to understand modern data storage and processing.
- Spark architecture and critical design principles for scalable, fault-tolerant data workflows.
- RDD transformations and actions, helping you handle large-scale datasets using PySpark’s distributed processing engine.
- Advanced DataFrame techniques: manage complex data types, perform aggregations, and solve business data challenges efficiently.
- PySpark SQL for applying advanced queries, optimizing processing workflows, and enabling rapid, reliable analysis at scale.

This course is ideal for those new to data engineering or distributed computing who want a hands-on introduction to PySpark for large-scale data tasks. If you have basic Python skills but no prior experience in data engineering, you’ll find accessible explanations and step-by-step projects throughout.

By course completion, you’ll be prepared to use PySpark in real-world projects, build and monitor data pipelines, automate processing, clean and integrate diverse datasets, and confidently tackle core challenges in distributed data analytics.

This module introduces you to the fundamental concepts of Big Data and Hadoop. You will explore the Hadoop ecosystem, its components, and the Hadoop Distributed File System (HDFS), setting the foundation for understanding big data processing and storage solutions.

What's included

15 videos, 5 readings, 4 assignments, 1 discussion prompt

15 videos (total 74 minutes)

Course Introduction (4 minutes)
What is Big Data? (4 minutes)
Applications of Big Data (4 minutes)
What is Hadoop? (5 minutes)
Hadoop Ecosystem (2 minutes)
Working of HDFS (5 minutes)
Introduction to Apache Spark (6 minutes)
Master-slave Architecture (6 minutes)
Spark Architecture (1 minute)
Data Processing with Apache Spark (5 minutes)
Directed Acyclic Graph (DAG) (5 minutes)
Introduction to Spark Ecosystem (5 minutes)
What is PySpark? (4 minutes)
Key Features of PySpark (6 minutes)
Basics of Python (5 minutes)

5 readings (total 50 minutes)

Welcome to PySpark in Action: Hands-On Data Processing (10 minutes)
What is Big Data? – A Beginner’s Guide to the World of Big Data (10 minutes)
Spark SQL (10 minutes)
Features of PySpark (10 minutes)
Module Summary: Big Data Processing with PySpark (10 minutes)

4 assignments (total 38 minutes)

Knowledge Check: Big Data Processing with PySpark (20 minutes)
Practice Quiz: Big Data Essentials (6 minutes)
Practice Quiz: Apache Spark Fundamentals (6 minutes)
Practice Quiz: PySpark (6 minutes)

1 discussion prompt (total 10 minutes)

Introduce Yourself (10 minutes)

Dive into the core of PySpark by learning about Resilient Distributed Datasets (RDDs). This module covers the fundamentals of RDDs, how they work, and their key transformations and actions, enabling efficient distributed data processing in PySpark.
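As a rough illustration of the ideas this module names (lazy transformations, actions that trigger the work, and key-based reduction), here is a minimal plain-Python sketch. It does not use PySpark itself: the built-in `map` and `filter` stand in for `rdd.map` and `rdd.filter`, `list(...)` stands in for `collect()`, and `reduce_by_key` is a hypothetical helper that mimics, on a single machine, what `reduceByKey` computes. The sales records are made up for the example.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical grocery sales records: (item, quantity)
sales = [("apple", 3), ("bread", 1), ("apple", 2), ("milk", 4), ("bread", 5)]

calls = []  # records each time the mapping function actually runs


def double_quantity(record):
    """Per-record function, as would be passed to rdd.map in PySpark."""
    item, qty = record
    calls.append(item)
    return (item, qty * 2)


# "Transformations": building the pipeline does no work yet (lazy iterators).
doubled = map(double_quantity, sales)          # stands in for rdd.map(...)
large = filter(lambda r: r[1] >= 4, doubled)   # stands in for rdd.filter(...)
calls_before_action = len(calls)               # still 0: nothing has run

# "Action": consuming the pipeline forces evaluation, like rdd.collect().
collected = list(large)


def reduce_by_key(pairs, func):
    """Hypothetical single-machine stand-in for reduceByKey."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}


totals = reduce_by_key(collected, lambda a, b: a + b)

print(calls_before_action)  # 0
print(collected)            # [('apple', 6), ('apple', 4), ('milk', 8), ('bread', 10)]
print(totals)               # {'apple': 10, 'milk': 8, 'bread': 10}
```

In real PySpark the same shape of pipeline would be distributed across partitions; this sketch only shows the order in which work happens and what the key-based reduction returns.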

What's included

25 videos, 4 readings, 4 assignments, 3 discussion prompts

25 videos (total 121 minutes)

Introduction to RDDs (6 minutes)
Working of RDDs (4 minutes)
Creating RDDs (6 minutes)
Essentials of RDD (6 minutes)
Key Concepts of RDD (6 minutes)
Understanding Lazy Evaluations (4 minutes)
Advantages of Lazy Evaluation (3 minutes)
Introduction to Transformations (5 minutes)
Narrow and Wide Transformations (5 minutes)
Transformations: Map (5 minutes)
Transformations: Filter, Reduce and groupByKey (4 minutes)
Transformations: Distinct, Sample and Join (5 minutes)
Transformations: Union and Subtract (3 minutes)
Introduction to Repartition (6 minutes)
Significance of Repartition (1 minute)
Introduction to Actions (5 minutes)
Actions: collect, reduce and reduceByKey (5 minutes)
Implementing Actions: collect, reduce and reduceByKey (2 minutes)
Actions: count, foreach and aggregate (6 minutes)
Implementing Actions: count, foreach and aggregate (2 minutes)
Actions: Coalesce, histogram and sortBy (4 minutes)
Implementing Actions: Coalesce, histogram and sortBy (3 minutes)
Working with RDD Transformations (6 minutes)
Applying Distinct, sample and join Transformations (2 minutes)
Grocery Store Data Analysis with PySpark RDDs (7 minutes)

4 readings (total 40 minutes)

PySpark RDDs in Organization (10 minutes)
Managing RDD Transformations in PySpark (10 minutes)
Optimizing RDD operations in PySpark (10 minutes)
Module Summary: Working with RDD (10 minutes)

4 assignments (total 38 minutes)

Knowledge Check: Working with RDD (20 minutes)
Introduction to RDD (6 minutes)
RDD Transformations (6 minutes)
RDD Actions (6 minutes)

3 discussion prompts (total 30 minutes)

Introduction to RDDs (10 minutes)
Transformations: Map (10 minutes)
Actions: Coalesce, histogram, and sortBy (10 minutes)

This module covers the creation and manipulation of DataFrames in PySpark. You will learn how to perform basic and advanced operations, including aggregation, grouping, and handling missing data, with a focus on optimizing large-scale data processing tasks.

What's included

22 videos, 4 readings, 4 assignments, 1 discussion prompt

22 videos (total 116 minutes)

Overview of DataFrames (7 minutes)
Introduction to DataFrames API (4 minutes)
Creating DataFrames from Different Sources (6 minutes)
DataFrames from RDD (6 minutes)
Basic DataFrame Operations (6 minutes)
Implementation of DataFrame Operations (4 minutes)
Performing Aggregations and Groupings - GroupBy and Window (5 minutes)
Performing Aggregations and Groupings - Cube and Rollup (4 minutes)
Handling Missing Data - Managing Null Values (7 minutes)
Demonstration for Handling Missing Data (3 minutes)
Working with Complex Data Types - Arrays and Structs (6 minutes)
Demonstration for Working with Complex Data Types (3 minutes)
Advanced DataFrame Transformations and Actions (6 minutes)
Demonstration: Working with DataFrames (6 minutes)
Introduction to Data Visualization and Key Aspects (4 minutes)
Introduction to Data Visualization - General Visuals (3 minutes)
Libraries for Data Visualization - Matplotlib and Seaborn (3 minutes)
Libraries for Data Visualization - Plotly (3 minutes)
Implementing Data Visualization (5 minutes)
Implementing Data Visualization - Plotting Charts (5 minutes)
Customizing the Visualizations (4 minutes)
Customizing Charts and Visuals (5 minutes)

4 readings (total 40 minutes)

Importance of PySpark DataFrames (10 minutes)
Window Functions in PySpark (10 minutes)
Data Visualization Libraries in PySpark (10 minutes)
Module Summary: PySpark DataFrames (10 minutes)

4 assignments (total 38 minutes)

Knowledge Check: PySpark DataFrames (20 minutes)
Introduction to PySpark DataFrames (6 minutes)
Advanced DataFrame Operations (6 minutes)
Data Visualizations with PySpark DataFrames (6 minutes)

1 discussion prompt (total 5 minutes)

PySpark DataFrames and Traditional Pandas DataFrames (5 minutes)

In this module, you will explore the SQL capabilities of PySpark. Learn how to perform CRUD operations, execute SQL commands, and merge and aggregate data using PySpark SQL. You'll also discover best practices for using SQL with PySpark to enhance data workflows.
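The CRUD, merge, and aggregation steps mentioned here can be tried with nothing more than Python's built-in `sqlite3` module. The sketch below is not PySpark code, and the `customers` and `orders` tables and their rows are invented for illustration. In PySpark, a similar `SELECT` could be run with `spark.sql(...)` once DataFrames are registered as temporary views via `createOrReplaceTempView`. SQLite and Spark SQL differ in dialect details, so treat this only as a picture of the query pattern.

```python
import sqlite3

# In-memory database so the example leaves nothing behind.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: create two hypothetical tables.
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
cur.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, "
    "customer_id INTEGER REFERENCES customers(id), "
    "amount REAL)"
)

# DML (create): insert made-up rows.
cur.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Asha"), (2, "Ben")])
cur.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1, 20.0), (2, 1, 35.5), (3, 2, 12.0), (4, 2, 99.0)],
)

# DML (update and delete): correct one order, remove another.
cur.execute("UPDATE orders SET amount = 15.0 WHERE id = 3")
cur.execute("DELETE FROM orders WHERE id = 4")
conn.commit()

# DQL (read): merge the tables with a join, then aggregate per customer.
rows = cur.execute(
    """
    SELECT c.name, COUNT(o.id) AS n_orders, SUM(o.amount) AS total
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
conn.close()

print(rows)  # [('Asha', 2, 55.5), ('Ben', 1, 15.0)]
```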

What's included

28 videos, 4 readings, 4 assignments, 2 discussion prompts

28 videos (total 135 minutes)

Structured Data vs. Unstructured Data (5 minutes)
Characteristic of Structured Data (4 minutes)
Relational Database and its Components (6 minutes)
SQL in Relation with Relational Database (6 minutes)
Normalization and its Types (5 minutes)
Exploring Different Types of Normalization (4 minutes)
Data Querying and Filtering Logic (6 minutes)
DDL Commands - Creating Tables (4 minutes)
DDL Commands - Altering and Truncating Tables (4 minutes)
DQL Commands - Select Statement and Where Clause (4 minutes)
DQL Commands - Practical Implementation (4 minutes)
DML Commands - Insert, Update, and Delete (3 minutes)
DML Commands - Lock (4 minutes)
DCL Commands (6 minutes)
TCL Commands (6 minutes)
Alter - Altering a Table and Constraints (5 minutes)
Alter - Altering Indexes and Views (2 minutes)
Performing CRUD Operations (6 minutes)
Operations on PySpark SQL DataFrames (3 minutes)
Performing Operations on PySpark SQL DataFrames (6 minutes)
Data Merging and Aggregation using PySpark SQL (4 minutes)
Implementing Data Merging and Aggregation using PySpark SQL (4 minutes)
SQL Best Practices (5 minutes)
Data Integrity and Error Handling with PySpark (2 minutes)
Problem Statement: Ecommerce Organization (3 minutes)
Data Analysis of an E-commerce Organization (4 minutes)
Demonstration: Spark SQL - Retail Organization (4 minutes)
Demonstration: Analyzing the Data (3 minutes)

4 readings (total 34 minutes)

Best Practices for Data Querying: Optimizing SQL Performance (8 minutes)
User-Defined Functions (UDFs) in PySpark (8 minutes)
Best Practices for Using SQL with PySpark (8 minutes)
Module Summary: PySpark SQL (10 minutes)

4 assignments (total 38 minutes)

Knowledge Check: PySpark SQL (20 minutes)
Introduction to SQL (6 minutes)
SQL Commands (6 minutes)
Working with PySpark SQL (6 minutes)

2 discussion prompts (total 10 minutes)

Why Normalization is Crucial for Database Design? (5 minutes)
Importance of Aggregate Functions (5 minutes)

This module is meant to test how well you understand the different ideas and lessons you've learned in this course. You will undertake a project based on these PySpark concepts and complete a comprehensive quiz that will assess your confidence and proficiency in Data Processing with PySpark.

What's included

1 video, 1 reading, 1 assignment, 1 discussion prompt

Earn a career certificate

Add this credential to your LinkedIn profile, resume, or CV. Share it on social media and in your performance review.

Instructor

Instructor ratings

3.2 (5 ratings)

Edureka

131 courses
121,534 learners

Offered by

Edureka

Frequently asked questions

You will need access to a computer with Python and Apache Spark installed. Detailed setup instructions will be provided at the beginning of the course.

This course is designed for individuals new to big data and PySpark, providing a solid foundation to start working with distributed data processing.

While prior SQL knowledge is beneficial, it is not mandatory. The course will introduce SQL concepts as they relate to PySpark and provide practice with SQL queries.