OpenAI introduces “Multimodal” AI that understands images, audio, and video

The Rise of “Multimodal” AI: It Can See, Hear, and Speak

  • The latest generation of AI models, like OpenAI’s GPT-4o, are “multimodal”: a single model can accept any mix of text, image, audio, and video as input and respond with generated text, audio, or images.
  • This allows for real-time conversations where you can show the AI a drawing and ask it to write a story about it.
  • It’s a key step towards creating AI that understands the world more like humans do.

We’re moving beyond AI that only understands text. With multimodal AI, you can have a live video call with ChatGPT, show it the broken hinge on a cabinet, and ask it to walk you through fixing it. It can look at a graph you’ve photographed and explain the data trends.
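For developers, the same image-understanding capability is exposed through OpenAI’s API: a single chat message can mix text and image parts in one request. Here is a minimal sketch using the official Python SDK, assuming an `OPENAI_API_KEY` is set in the environment; the chart URL and prompt are hypothetical placeholders.

```python
# Minimal sketch: asking GPT-4o to explain a photographed chart.
# Assumes the `openai` Python package (v1+) and OPENAI_API_KEY in the
# environment; the image URL below is a hypothetical placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            # A single message can combine text and image parts.
            "content": [
                {"type": "text", "text": "What trends does this chart show?"},
                {
                    "type": "image_url",
                    # Hypothetical URL; a base64 data URL also works here.
                    "image_url": {"url": "https://example.com/sales-chart.png"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```

Swap the chart for a photo of that broken cabinet hinge and the same call becomes a step-by-step repair walkthrough.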

This blending of senses makes AI far more useful and intuitive. It’s the technology behind features like Google Lens and OpenAI’s new voice mode, and it’s fundamentally changing how we interact with machines.

Source: OpenAI Blog