Mira Murati’s Thinking Machines Lab Challenges OpenAI With Real-Time Response Model

Until September 2024, Mira Murati was best known as OpenAI’s CTO. After her unexpected departure and the launch of Thinking Machines Lab (TML)—notable early on for a billion-dollar deal with Nvidia—the stealthy startup has finally revealed its first research model. Dubbed TML-Interaction-Small, the system aims to reimagine how people and AI collaborate by processing audio, video, and text at the same time, in real time, without relying on external control components. TML frames the release as a research preview that demonstrates qualitatively new modes of human–machine interaction.

Most AI assistants today are turn-based: you speak or type, the model waits, processes, and then replies. During its reply, it ignores new signals; while you speak, it sits idle. TML argues this rigid rhythm creates an artificial bottleneck. “We believe we can solve this bandwidth bottleneck by making AI interactive in real time and across every modality,” the company says, contending that interfaces should adapt to people—not the other way around.

From turn-based to truly interactive

The core of TML-Interaction-Small is what the lab calls a Multi-Stream Micro-Turn design. Instead of waiting for full conversational rounds, the model continuously ingests and emits information in 200-millisecond blocks. Inputs (audio, video, text) and outputs (speech, text, actions) are treated as parallel data streams rather than a rigid sequence.

In practice, that means the model can listen while speaking, sense pauses, recognize interruptions, and react to visual cues—without requiring explicit commands. With a built-in sense of temporal context, it decides when to interject, continue, or wait, more closely mirroring natural human conversation.
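The turn-taking behavior described above can be sketched as a toy simulation. This is not TML's actual code; the policy, function names, and the idea of queuing a fixed-length reply are illustrative assumptions. The point is the structure: every 200 ms block, the model picks an action based on what it currently hears, rather than waiting for a full conversational round.

```python
# Conceptual sketch of a Multi-Stream Micro-Turn loop (illustrative only,
# not TML's implementation). Each tick represents one 200 ms block.

TICK_MS = 200  # block size reported by TML

def micro_turn_step(user_speaking: bool, has_pending_reply: bool) -> str:
    """Toy turn-taking policy for a single 200 ms block.

    Returns the model's action for this tick:
      - "listen": the user is talking, so hold the reply (barge-in aware)
      - "speak":  the channel is free and a reply is pending
      - "wait":   nothing to say and nothing to hear
    """
    if user_speaking:
        return "listen"
    if has_pending_reply:
        return "speak"
    return "wait"

def run_micro_turns(user_audio: list[bool], reply_len_ticks: int) -> list[str]:
    """Simulate a stream: user_audio[i] is True when the user speaks in tick i.

    The model queues a reply_len_ticks-long reply once the user pauses,
    and yields the floor again whenever the user interrupts.
    """
    actions, remaining = [], 0
    for speaking in user_audio:
        if not speaking and remaining == 0 and actions and actions[-1] == "listen":
            remaining = reply_len_ticks  # user just paused: queue a reply
        action = micro_turn_step(speaking, remaining > 0)
        if action == "speak":
            remaining -= 1
        actions.append(action)
    return actions
```

Even in this toy form, the interruption case falls out naturally: when the user starts speaking mid-reply, the model switches back to "listen" and resumes afterward, with no external conversation manager involved.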

Under the hood: early fusion and joint training

TML-Interaction-Small was trained from scratch with an encoder-free early fusion approach. Audio is represented as dMel signals and passed through a lightweight embedding layer. Video frames are split into 40×40-pixel patches and encoded via an hMLP module. Rather than training components separately and bolting them together, all modalities are trained jointly with a central Transformer.
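A minimal sketch of that early-fusion idea, using NumPy: video patches and audio frames each pass through a lightweight embedding and land in one shared token sequence for a single central Transformer. The hidden size, mel-bin count, and plain linear projections standing in for the hMLP and dMel front ends are assumptions for illustration, not TML's published configuration; only the 40×40 patch size comes from the article.

```python
import numpy as np

# Illustrative early-fusion embedding (shapes and projections are assumed,
# not TML's implementation). Key idea: no separate pretrained encoders;
# each modality gets a lightweight embedding into one shared token stream.

rng = np.random.default_rng(0)
D_MODEL = 64  # toy hidden size (assumption)

def patchify(frame: np.ndarray, patch: int = 40) -> np.ndarray:
    """Split an HxWxC frame into flattened patch vectors (N, patch*patch*C)."""
    h, w, c = frame.shape
    rows, cols = h // patch, w // patch
    return (frame[:rows * patch, :cols * patch]
            .reshape(rows, patch, cols, patch, c)
            .swapaxes(1, 2)
            .reshape(rows * cols, patch * patch * c))

# Plain linear maps standing in for the hMLP (video) and dMel (audio)
# front ends described in the article.
W_video = rng.normal(size=(40 * 40 * 3, D_MODEL)) * 0.01
W_audio = rng.normal(size=(80, D_MODEL)) * 0.01  # 80 mel bins (assumption)

frame = rng.normal(size=(120, 160, 3))  # one toy video frame
mel = rng.normal(size=(10, 80))         # 10 toy audio frames

video_tokens = patchify(frame) @ W_video  # (12, 64): a 3x4 grid of patches
audio_tokens = mel @ W_audio              # (10, 64)

# Early fusion: one combined sequence for the joint central Transformer.
fused = np.concatenate([audio_tokens, video_tokens], axis=0)
```

The contrast with the "bolted-together" alternative is that here the Transformer sees raw-ish tokens from every modality from the first training step, so cross-modal timing can be learned rather than stitched on afterward.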

The model uses a Mixture-of-Experts (MoE) architecture totaling 276 billion parameters, with 12 billion active per inference step. According to TML, larger variants are currently too slow for real-time use but are planned for release later this year.
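The total-versus-active split is the defining property of MoE models: a router sends each token to only a few experts, so most parameters sit idle on any given step. The tiny example below (8 experts, top-2 routing, toy sizes chosen for illustration) shows the mechanism; TML has not published its routing details, so the specifics here are generic MoE practice, not the lab's design.

```python
import numpy as np

# Toy Mixture-of-Experts layer showing why "276B total, 12B active" is
# possible: a router picks the top-k experts per token, so only a slice
# of the total parameters runs at each step. All sizes are illustrative.

rng = np.random.default_rng(0)
D, N_EXPERTS, TOP_K = 16, 8, 2

router_w = rng.normal(size=(D, N_EXPERTS))
experts = [rng.normal(size=(D, D)) * 0.1 for _ in range(N_EXPERTS)]

def moe_forward(x: np.ndarray):
    """Route one token through its top-k experts.

    Returns the mixed output and the fraction of expert parameters
    actually exercised for this token.
    """
    logits = x @ router_w
    top = np.argsort(logits)[-TOP_K:]                        # chosen experts
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over top-k
    out = sum(g * (x @ experts[i]) for g, i in zip(gates, top))
    return out, TOP_K / N_EXPERTS

out, active_fraction = moe_forward(rng.normal(size=D))
```

In this toy layer only 2 of 8 experts fire per token (25% of expert parameters); TML's reported ratio is far sparser, roughly 12B of 276B, or about 4%.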

Hand-off for deeper reasoning

When tasks demand heavier reasoning—such as complex problem-solving, web retrieval, or multi-step agent workflows—the interaction model delegates to an asynchronous background model. Crucially, the front-end interaction loop remains live, maintaining the conversation while seamlessly integrating results as they arrive. The goal is to keep responsiveness high without sacrificing depth on challenging queries.
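The hand-off pattern can be sketched with `asyncio`: the fast interaction loop keeps ticking while a slower background "reasoner" runs concurrently, and its answer is woven into the conversation when it arrives. All names, delays, and the tick structure below are invented for illustration; only the delegate-without-blocking idea comes from TML's description.

```python
import asyncio

# Sketch of the hand-off pattern (names and timings are assumptions):
# a fast interaction loop stays live while a slower background model
# reasons, and its result is integrated mid-conversation on arrival.

async def deep_reasoner(query: str) -> str:
    await asyncio.sleep(0.05)  # stands in for slow retrieval/reasoning
    return f"answer to {query!r}"

async def interaction_loop(query: str, ticks: int = 5) -> list[str]:
    task = asyncio.create_task(deep_reasoner(query))  # delegate, don't block
    events = []
    for _ in range(ticks):
        if task.done():
            events.append(task.result())  # weave in the result when ready
            break
        events.append("small-talk tick")  # front end stays responsive
        await asyncio.sleep(0.02)         # one fast interaction beat
    return events

events = asyncio.run(interaction_loop("plan my trip"))
```

The design point mirrors TML's claim: responsiveness and depth come from two cooperating loops rather than one model forced to do both at once.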

Benchmarks: sharper timing, faster responses

TML reports that Interaction-Small outperforms comparable systems from OpenAI and Google on interaction quality and response speed. Latency stands out: where GPT-Realtime-2.0 averages 1.18 seconds to respond to a user’s utterance, TML’s model clocks in at 0.40 seconds on average, roughly a threefold reduction.

The company also created new internal benchmarks to measure capabilities that typical chat models don’t cover—like reacting with precise timing to verbal or visual cues, or counting repetitions in a video without being prompted explicitly. On these tests, TML says, competing models either struggled or failed to respond at all.

Known constraints

TML is candid about limitations. Long-running sessions with continuous audio and video quickly balloon the context window, complicating memory and state management. Real-time performance also hinges on network quality; a stable connection is essential, and degraded bandwidth significantly harms the experience.

Why it matters

By collapsing the gap between sensing and responding, TML’s interaction model aspires to make AI feel more like a present conversational partner than a turn-based tool. If its results hold outside controlled conditions, the approach could reshape interfaces for live coaching, collaborative creation, accessibility, on-device assistants, and scenarios where timing and cues matter—like classrooms, meetings, or production studios.

Equally important is TML’s stance against “harnesses”—external speech recognizers or conversation managers that wrap a general model. The lab argues those components are less capable than the model itself and introduce new bottlenecks. Early fusion and joint training seek to give the core model native, time-aware multimodal competence rather than duct-taped skills.

What’s next

TML plans larger models as real-time performance permits and is launching a research fellowship to help the community develop evaluation standards for interaction-first AI. With real-time, multimodal, and model-native timing at the forefront, the company is betting that the next big breakthrough isn’t just about reasoning power—it’s about responsiveness, presence, and the seamless choreography of human–machine dialogue.
