When you enroll in this course, you'll also be enrolled in this Professional Certificate.
Learn new concepts from industry experts
Gain a foundational understanding of a subject or tool
Develop job-relevant skills with hands-on projects
Earn a shareable career certificate from IBM
There are 3 modules in this course
Ready to level up your GenAI skills? Step into the exciting world of multimodal AI, where language, images, and speech come together to build smarter, more interactive applications.
In this hands-on course, you’ll learn how to build systems that work across multiple modalities, from creating AI-powered storytellers and meeting assistants to developing image captioning tools and video generation apps.
You’ll gain experience with real-world tools like IBM’s Granite, OpenAI’s Whisper, Sora and DALL·E, Meta’s Llama, Mistral’s Mixtral, and Gradio. Plus, you'll explore multimodal search, question answering, and retrieval systems that combine text, speech, and visual data.
By the end of the course, you’ll be able to design and build full-stack multimodal AI solutions using Python and frameworks like Flask and Gradio.
If you’re looking to gain in-demand skills for building the next generation of AI applications, enroll today and power up your AI career!
This module provides an in-depth introduction to multimodal AI, focusing on how AI systems process and integrate multiple data types, including text, speech, and images. You will explore core concepts and some of the challenges you will face in multimodal AI, gaining foundational skills with text and speech processing techniques. Through hands-on labs, you will apply AI-powered storytelling, speech-to-text transcription, and text-to-speech synthesis to real-world applications, such as AI-generated audiobooks and automated meeting assistants.
RAG and Agentic AI Professional Certificate Overview•6 minutes
Introduction to Multimodal AI•8 minutes
Text-to-Speech Technologies•8 minutes
Speech-to-Text Technologies•7 minutes
2 readings•Total 5 minutes
Reading: Course Overview•3 minutes
Reading: Summary and Highlights•2 minutes
2 assignments•Total 36 minutes
Graded Quiz: Foundations of Multimodal AI•21 minutes
Practice Quiz: Introduction to Multimodal AI: Text and Speech Processing•15 minutes
2 app items•Total 75 minutes
Lab: Use Mistral and gTTS to Create Your Personal Storyteller•30 minutes
Lab: Build a Meeting Assistant with Whisper, LangChain, & Gradio•45 minutes
6 plugins•Total 32 minutes
Helpful Tips for Course Completion•3 minutes
Reading: What is Multimodal Generative AI and Why Does It Matter?•5 minutes
Reading: What is Computer Vision?•7 minutes
Reading: Text Processing, Speech Processing, and Text-to-Speech•7 minutes
Reading: Challenges in Multimodal AI Integration•5 minutes
Cheat Sheet: Foundations of Multimodal AI•5 minutes
Integrating Visual and Video Modalities
Module 2•2 hours to complete
This module explores how AI processes and generates visual data by integrating images and videos with text. You will examine text-to-image/image-to-text and text-to-video/video-to-text models, image captioning, and the fusion techniques necessary for effective multimodal AI systems. Through hands-on labs, you will apply state-of-the-art models like DALL·E and Sora to generate images and videos from text prompts. Additionally, you will implement an image captioning system using Meta’s Llama 4, gaining practical experience in combining vision and language models for real-world applications.
Understanding Image Captioning with Meta's Llama•7 minutes
Demo: Text-to-Video Generation with OpenAI's Sora•8 minutes
1 reading•Total 3 minutes
Reading: Summary and Highlights•3 minutes
2 assignments•Total 31 minutes
Graded Quiz: Integrating Visual and Video Modalities•21 minutes
Image Generation and Captioning•10 minutes
2 app items•Total 50 minutes
Lab: DALL·E Image Generation Guide for Beginners•20 minutes
Lab: Build an Image Captioning System with watsonx and IBM's Granite•30 minutes
3 plugins•Total 35 minutes
Reading: Introduction to Text-to-Video and Image-to-Video Technologies•12 minutes
Reading: Strengths, Limitations, and Practical Applications of Multimodal Vision Models in Real World Scenarios•8 minutes
Cheat Sheet: Integrating Visual and Video Modalities•15 minutes
Advanced Multimodal Applications
Module 3•2 hours to complete
The final module explores advanced multimodal AI applications, integrating image, text, and retrieval-based systems to build innovative solutions. You will dive into multimodal retrieval and search, multimodal Question Answering (QA), and chatbots, learning how cross-modal retrieval techniques enhance search engines and recommendation systems. Additionally, you will learn how integrating visual and textual data improves chatbot interactions. Through hands-on labs, you will build fully functional web applications with multimodal capabilities using Flask, applying state-of-the-art models and frameworks.
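To make the idea of cross-modal retrieval concrete, here is a minimal sketch, assuming text and images have already been mapped into one shared embedding space. The file names and three-dimensional vectors are hypothetical, hand-made values used only for illustration; in a real system a multimodal encoder would produce them. This is not taken from the course labs.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, items):
    """Rank item names by similarity to the query, best match first."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in items.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical image embeddings (assumed to come from an image encoder).
IMAGE_EMBEDDINGS = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "city.jpg": [0.1, 0.9, 0.1],
    "forest.jpg": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of a text query such as "a sunny shoreline"
# (assumed to come from a text encoder sharing the same space).
TEXT_QUERY_EMBEDDING = [0.8, 0.2, 0.1]

ranking = rank_by_similarity(TEXT_QUERY_EMBEDDING, IMAGE_EMBEDDINGS)
best_match = ranking[0][0]
print(best_match)
```

The same ranking function works in the other direction (an image embedding as the query over text embeddings), which is what makes the retrieval cross-modal.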
At IBM, we know how rapidly tech evolves and recognize the crucial need for businesses and professionals to build job-ready, hands-on skills quickly. As a market-leading tech innovator, we’re committed to helping you thrive in this dynamic landscape. Through IBM Skills Network, our expertly designed training programs in AI, software development, cybersecurity, data science, business management, and more, provide the essential skills you need to secure your first job, advance your career, or drive business success. Whether you’re upskilling yourself or your team, our courses, Specializations, and Professional Certificates build the technical expertise that ensures you, and your organization, excel in a competitive world.
What jobs can I get with multimodal generative AI skills?
Skills in multimodal generative AI, where systems integrate text, speech, images, and video, are in high demand for roles such as AI developer, machine learning engineer, multimodal AI researcher, and full-stack developer specializing in AI-powered user experiences.
Do I need machine learning experience to build multimodal generative AI apps?
Not necessarily. If you’re a Python developer, you can start building with generative AI using tools like IBM watsonx.ai, Flask, and Gradio—no advanced ML background required.
How is multimodal generative AI app development different from traditional app development?
Multimodal AI apps go beyond typical app development by incorporating multimodal large language models (MLLMs) and media-based inputs like speech, images, and video. You’ll still use familiar tools like Python, Flask, and Gradio, but you’ll also learn to integrate and orchestrate models for tasks like transcription, image generation, and AI-powered storytelling.
When will I have access to the lectures and assignments?
To access the course materials and assignments and to earn a Certificate, you will need to purchase the Certificate experience when you enroll in a course. You can try a Free Trial instead, or apply for Financial Aid. The course may offer 'Full Course, No Certificate' instead. This option lets you see all course materials, submit required assessments, and get a final grade, but it also means that you will not be able to purchase a Certificate experience.
What will I get if I subscribe to this Certificate?
When you enroll in the course, you get access to all of the courses in the Certificate, and you earn a certificate when you complete the work. Your electronic Certificate will be added to your Accomplishments page. From there, you can print your Certificate or add it to your LinkedIn profile.