Can I download the work from my Project after I complete it?

Yes, you can download and keep any of your created files from the Project. To do so, please make sure you save any files and work to your device before exiting the product environment.

Is financial aid available?

Financial aid is not available for Projects.

Can I audit a Project?

Auditing is not available for Projects.

How much experience do I need to do this Project?

At the top of the page, you can view the experience level recommended for this Project.

Can I complete this Project through my web browser, instead of installing special software?

Yes, everything you need to complete your Project will be available in your browser.

Building Multimodal Data Pipelines

Instructor: Gilberto Hernandez

Project

Build in-demand job skills with step-by-step instructions

Intermediate level

Recommended experience

2 hours

Learn at your own pace

Hands-on learning

What you'll learn

Extract structured, queryable data from unstructured images, audio, and video using OCR, ASR, and Vision Language Models.
Build a VLM-backed pipeline that reasons across video frames to generate timestamped scene descriptions and track events over time.
Build a multimodal RAG app on real-world data—turning raw images, audio, and video into a queryable interface with grounded, cited answers.
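As a rough illustration of the timestamped scene descriptions mentioned above, the sketch below splits a video into fixed-length windows, samples a few evenly spaced frame timestamps per window, and passes them to a describer. It is a minimal sketch, not the course's implementation: `segment_windows`, `sample_timestamps`, `describe_video`, and the `fake_vlm` stand-in are all hypothetical names, and a real pipeline would decode the frames and call an actual Vision Language Model instead.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class SegmentDescription:
    start_s: float  # segment start, in seconds
    end_s: float    # segment end, in seconds
    text: str       # description returned by the describer


def segment_windows(duration_s: float, window_s: float) -> List[Tuple[float, float]]:
    """Split a video of duration_s seconds into consecutive windows."""
    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + window_s, duration_s)  # last window may be shorter
        windows.append((start, end))
        start = end
    return windows


def sample_timestamps(start_s: float, end_s: float, n_frames: int) -> List[float]:
    """Pick n_frames evenly spaced timestamps inside the window."""
    step = (end_s - start_s) / n_frames
    return [start_s + step * (i + 0.5) for i in range(n_frames)]


def describe_video(
    duration_s: float,
    window_s: float,
    n_frames: int,
    describe: Callable[[List[float]], str],
) -> List[SegmentDescription]:
    """Produce one timestamped description per window."""
    out = []
    for start, end in segment_windows(duration_s, window_s):
        stamps = sample_timestamps(start, end, n_frames)
        out.append(SegmentDescription(start, end, describe(stamps)))
    return out


def fake_vlm(stamps: List[float]) -> str:
    # Hypothetical stand-in: a real describer would send the frames at
    # these timestamps to a Vision Language Model with a prompt.
    return f"description based on {len(stamps)} sampled frames"


segments = describe_video(duration_s=25.0, window_s=10.0, n_frames=2, describe=fake_vlm)
for seg in segments:
    print(f"[{seg.start_s:.0f}-{seg.end_s:.0f}s] {seg.text}")
```

Keeping the start and end time on every description is what later makes time-based filtering and citation possible; the window length and frames-per-window values here are arbitrary.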

Details to know

Taught in English

Learn, practice, and apply job-ready skills in less than 2 hours

Receive training from industry experts
Gain hands-on experience solving real-world job tasks

About this project

Images, audio, and video make up a growing share of the data companies generate today, but most pipelines are still built for structured data alone. This course teaches you to build AI-powered pipelines that process multimodal data and turn it into LLM-ready text.
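One way to picture "LLM-ready text" is a single record shape that every modality is normalized into. The sketch below is an assumption-heavy illustration: `TextRecord`, `from_asr`, `from_ocr`, and the `{'start', 'end', 'text'}` segment shape are hypothetical and not taken from the course or from any specific ASR or OCR library.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


def fmt(seconds: float) -> str:
    """Format seconds as m:ss."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}:{secs:02d}"


@dataclass
class TextRecord:
    modality: str                    # "audio", "image", or "video"
    source: str                      # file the text came from
    text: str                        # cleaned, LLM-ready text
    start_s: Optional[float] = None  # time range, when the modality has one
    end_s: Optional[float] = None

    def citation(self) -> str:
        """Short label an answer could cite."""
        if self.start_s is None or self.end_s is None:
            return self.source
        return f"{self.source} @ {fmt(self.start_s)}-{fmt(self.end_s)}"


def from_asr(source: str, segments: List[Dict]) -> List[TextRecord]:
    """Turn transcript segments (assumed shape: start, end, text) into records."""
    return [
        TextRecord("audio", source, s["text"].strip(), s["start"], s["end"])
        for s in segments
        if s["text"].strip()  # drop empty or whitespace-only segments
    ]


def from_ocr(source: str, raw_text: str) -> TextRecord:
    """Collapse OCR whitespace noise into one clean line of text."""
    return TextRecord("image", source, " ".join(raw_text.split()))


records = from_asr("standup.mp3", [
    {"start": 12.0, "end": 19.5, "text": " We ship the pipeline on Friday. "},
    {"start": 19.5, "end": 21.0, "text": "   "},
])
records.append(from_ocr("slide_03.png", "Q3  Roadmap\n\n  Launch:   Friday"))

for r in records:
    print(f"{r.modality:5} | {r.citation()} | {r.text}")
```

With every modality reduced to the same fields, downstream steps such as embedding and retrieval do not need to know whether a piece of text came from a transcript or a slide.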

You’ll start with the foundations: using ASR to extract transcripts from audio and turning images into LLM-ready text descriptions. From there, you’ll see how Vision Language Models generate descriptions from video segments, capturing not just what’s visible in a single frame, but what unfolds across a scene over time. You’ll then apply these skills to implement a multimodal RAG pipeline that searches across slides, audio, and video from meetings to answer questions about their content. By combining all three modalities, you give LLMs the rich context they need to deliver detailed answers across complex, real-world content.

In detail, you’ll:

Survey the multimodal data landscape, the unique challenges each data type presents, and the techniques that transform unstructured content into searchable text.
Apply OCR and ASR to convert images and audio into structured text, then embed them into a unified vector space for cross-modal semantic search.
Prompt Vision Language Models effectively, and choose the right frame sampling and embedding strategy for video.
Run a Vision Language Model on meeting videos to generate timestamped segment descriptions, then embed them alongside audio and slides for unified semantic and time-based search.
Build a multimodal RAG system that retrieves across audio, slides, and video to generate grounded, cited answers from meeting recordings.

Every technique you’ll learn serves the same goal data engineers have always had: take messy, unstructured data and turn it into something you can query, analyze, and build on.
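The retrieval side of such a system, searching text from audio, slides, and video together, optionally restricting by time, and assembling a prompt with numbered sources, could be sketched as follows. This is a toy under stated assumptions: the `CHUNKS` corpus is made up, and `embed` uses bag-of-words counts as a stand-in, so matching here is lexical rather than semantic. A real pipeline would call a text embedding model and a vector store instead.

```python
import math
import re
from collections import Counter

# Made-up corpus: (modality, source, start_s, end_s, text)
CHUNKS = [
    ("audio", "meeting.mp3", 60.0, 75.0, "the launch date moves to friday"),
    ("slide", "deck.pdf#4", None, None, "launch checklist and budget overview"),
    ("video", "meeting.mp4", 70.0, 90.0, "presenter points at the launch timeline chart"),
    ("audio", "meeting.mp3", 300.0, 320.0, "lunch order for the team offsite"),
]


def embed(text):
    """Stand-in embedding: word counts (an embedding model would go here)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(query, top_k=2, time_range=None):
    """Rank chunks from all modalities; optionally keep only a time range."""
    q = embed(query)
    scored = []
    for chunk in CHUNKS:
        _, _, start, end, text = chunk
        if time_range is not None:
            # Untimed chunks (slides here) are skipped when filtering by time.
            if start is None or end < time_range[0] or start > time_range[1]:
                continue
        score = cosine(q, embed(text))
        if score > 0:
            scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


def build_prompt(question, hits):
    """Number each retrieved chunk so the answer can cite it as [n]."""
    lines = []
    for i, (modality, source, start, end, text) in enumerate(hits, 1):
        when = f" {start:.0f}-{end:.0f}s" if start is not None else ""
        lines.append(f"[{i}] ({modality}, {source}{when}) {text}")
    context = "\n".join(lines)
    return (
        "Answer using only the sources below and cite them as [n].\n\n"
        f"{context}\n\nQuestion: {question}"
    )


hits = search("when is the launch", top_k=3)
print(build_prompt("When is the launch?", hits))
```

Passing `time_range=(0, 120)` to `search` restricts results to chunks overlapping the first two minutes, which is one simple way to combine semantic and time-based search; the resulting prompt would then be sent to an LLM of your choice.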

Instructor

Gilberto Hernandez

DeepLearning.AI

1 Course · 112 learners

Offered by

DeepLearning.AI

Snowflake

How you'll learn

Hands-on, project-based learning
Practice new skills by completing job-related tasks with step-by-step instructions.
No downloads or installation required
Access the tools and resources you need in a cloud environment.
Available only on desktop
This project is designed for laptops or desktop computers with a reliable Internet connection, not mobile devices.

Frequently asked questions

In Projects, you'll complete an activity or scenario by following a set of instructions in an interactive hands-on environment. Projects are completed in a real cloud environment and within real instances of various products as opposed to a simulation or demo environment.

By purchasing a Project, you'll get everything you need to complete the Project including temporary access to any product required to complete the Project.

Even though Projects are technically available on mobile devices, we highly recommend that you complete Projects on a laptop or desktop only.
