Mira Murati’s Thinking Machines Lab Challenges OpenAI With Real-Time Response Model
Until September 2024, Mira Murati was best known as OpenAI’s CTO. After her unexpected departure she founded Thinking Machines Lab (TML), a stealthy startup notable early on for a billion-dollar deal with Nvidia, which has now revealed its first research model. Dubbed TML-Interaction-Small, the system aims to reimagine how people and AI collaborate by processing audio, video, and text simultaneously and in real time, without relying on external control components. TML frames the release as a research preview that demonstrates qualitatively new modes of human–machine interaction.
Most AI assistants today are turn-based: you speak or type, the model waits, processes, and then replies. During its reply, it ignores new signals; while you speak, it sits idle. TML argues this rigid rhythm creates an artificial bottleneck. “We believe we can solve this bandwidth bottleneck by making AI interactive in real time and across every modality,” the company says, contending that interfaces should adapt to people—not the other way around.
From turn-based to truly interactive
The core of TML-Interaction-Small is what the lab calls a Multi-Stream Micro-Turn design. Instead of waiting for full conversational rounds, the model continuously ingests and emits information in 200-millisecond blocks. Inputs (audio, video, text) and outputs (speech, text, actions) are treated as parallel data streams rather than a rigid sequence.
In practice, that means the model can listen while speaking, sense pauses, recognize interruptions, and react to visual cues—without requiring explicit commands. With a built-in sense of temporal context, it decides when to interject, continue, or wait, more closely mirroring natural human conversation.
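To make the timing concrete, here is a minimal sketch of such a micro-turn loop. It is purely illustrative: the `MicroTurnLoop` class, the `model.step` interface, and the stream objects are hypothetical stand-ins rather than TML's actual API; only the 200-millisecond cadence comes from the company's description.

```python
import time

BLOCK_MS = 200  # micro-turn length described by TML

class MicroTurnLoop:
    """Conceptual loop: every 200 ms, fuse whatever arrived on each input
    stream and decide whether to speak, act, or stay silent."""

    def __init__(self, model, streams, outputs):
        self.model = model          # hypothetical interaction model
        self.streams = streams      # e.g. {"audio": mic, "video": cam, "text": chat}
        self.outputs = outputs      # e.g. {"speech": tts, "text": ui}

    def run(self):
        state = self.model.initial_state()
        while True:
            t0 = time.monotonic()
            # 1. Pull the last 200 ms from every input stream (may be empty).
            block = {name: s.read_block(BLOCK_MS) for name, s in self.streams.items()}
            # 2. One forward step over the fused block; the model keeps temporal
            #    context in `state` and is free to emit nothing at all.
            state, actions = self.model.step(state, block)
            # 3. Route any emissions (speech, text, tool calls) to their channels.
            for out in actions:
                self.outputs[out.channel].write(out.payload)
            # 4. Sleep only the remainder of the block budget, then repeat.
            time.sleep(max(0.0, BLOCK_MS / 1000 - (time.monotonic() - t0)))
```

The key property is that ingesting, deciding, and emitting all happen inside the same fixed time budget, so the model can change course every 200 milliseconds instead of once per conversational turn.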
Under the hood: early fusion and joint training
TML-Interaction-Small was trained from scratch with an encoder-free early fusion approach. Audio is represented as dMel signals and passed through a lightweight embedding layer. Video frames are split into 40×40-pixel patches and encoded via an hMLP module. Rather than training components separately and bolting them together, all modalities are trained jointly with a central Transformer.
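A rough sketch of what encoder-free early fusion can look like, written with PyTorch: each modality gets only a thin embedding (a linear layer for dMel frames, a small MLP standing in for the hMLP stem over flattened 40×40 patches, a token embedding for text), and all resulting tokens are concatenated into one sequence for a shared Transformer. Dimensions, layer counts, and module names are illustrative assumptions, not TML's.

```python
import torch
import torch.nn as nn

D = 1024  # shared model width (illustrative, not TML's actual size)

class EarlyFusionEmbedder(nn.Module):
    """Sketch of encoder-free early fusion: each modality gets only a thin
    embedding layer, then all tokens share one jointly trained Transformer."""

    def __init__(self, n_mels=80, patch=40, vocab=50_000):
        super().__init__()
        # Audio: dMel frames pass through a lightweight linear embedding.
        self.audio_embed = nn.Linear(n_mels, D)
        # Video: 40x40 RGB patches are flattened and fed to a small MLP
        # (a simplification of the hMLP stem mentioned by TML).
        self.video_embed = nn.Sequential(
            nn.Linear(3 * patch * patch, D), nn.GELU(), nn.Linear(D, D)
        )
        # Text: ordinary token embeddings.
        self.text_embed = nn.Embedding(vocab, D)
        # One shared Transformer consumes the fused multimodal sequence.
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D, nhead=16, batch_first=True), num_layers=4
        )

    def forward(self, dmel, patches, tokens):
        fused = torch.cat(
            [self.audio_embed(dmel), self.video_embed(patches), self.text_embed(tokens)],
            dim=1,  # concatenate along the sequence axis
        )
        return self.backbone(fused)
```

The point of the design is that there is no separate pretrained audio or vision encoder to bolt on afterwards; the same backbone sees every modality from the first layer onward.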
The model uses a Mixture-of-Experts (MoE) architecture totaling 276 billion parameters, with 12 billion active per inference step. According to TML, larger variants are currently too slow for real-time use but are planned for release later this year.
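For intuition, 12 billion active out of 276 billion total means only about 4 percent of the weights run on any given step. The toy layer below shows the generic top-k routing mechanism behind that ratio; it is a standard MoE illustration with made-up sizes, not TML's architecture.

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    """Toy Mixture-of-Experts layer: every expert exists in memory, but each
    token is routed to only top_k of them, so the parameters active per step
    are a small fraction of the total (TML cites 12B active of 276B)."""

    def __init__(self, dim=512, n_experts=16, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(dim, n_experts)
        self.top_k = top_k

    def forward(self, x):                      # x: (tokens, dim)
        weights, idx = self.router(x).softmax(-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):            # only top_k experts run per token
            for e in idx[:, k].unique().tolist():
                mask = idx[:, k] == e
                out[mask] += weights[mask, k].unsqueeze(-1) * self.experts[e](x[mask])
        return out
```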
Hand-off for deeper reasoning
When tasks demand heavier reasoning—such as complex problem-solving, web retrieval, or multi-step agent workflows—the interaction model delegates to an asynchronous background model. Crucially, the front-end interaction loop remains live, maintaining the conversation while seamlessly integrating results as they arrive. The goal is to keep responsiveness high without sacrificing depth on challenging queries.
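One way to picture the hand-off is an asynchronous loop in which delegation never blocks the micro-turn cycle. The sketch below assumes hypothetical `model`, `heavy_model`, and `streams` objects; the pattern, not the API, is the point.

```python
import asyncio

async def interaction_loop(model, heavy_model, streams):
    """Sketch of the hand-off pattern: the micro-turn loop never blocks;
    heavy requests run as background tasks whose results are merged back in."""
    pending = set()
    state = model.initial_state()           # hypothetical interaction model
    while True:
        block = await streams.next_block()  # next 200 ms of fused input
        state, actions = model.step(state, block)

        for act in actions:
            if act.kind == "delegate":      # the model flags a task as needing depth
                pending.add(asyncio.create_task(heavy_model.solve(act.request)))
            else:
                await streams.emit(act)     # speak or type immediately

        # Fold in any background results that finished during this block.
        done = {t for t in pending if t.done()}
        for task in done:
            state = model.integrate(state, task.result())
        pending -= done
```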
Benchmarks: sharper timing, faster responses
TML reports that Interaction-Small outperforms comparable systems from OpenAI and Google on interaction quality and response speed. Latency stands out: where GPT-Realtime-2.0 averages 1.18 seconds to respond to a user’s utterance, TML’s model clocks in at 0.40 seconds on average, roughly a third of the time.
The company also created new internal benchmarks to measure capabilities that typical chat models don’t cover—like reacting with precise timing to verbal or visual cues, or counting repetitions in a video without being prompted explicitly. Competing models, TML says, struggled or failed to respond on these tests.
Known constraints
TML is candid about limitations. Long-running sessions with continuous audio and video quickly balloon the context window, complicating memory and state management. Real-time performance also hinges on network quality; a stable connection is essential, and degraded bandwidth significantly harms the experience.
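A quick back-of-the-envelope shows why the context balloons. TML has not published token rates, so the tokens-per-block figure below is an assumption used purely for illustration.

```python
# Back-of-the-envelope context growth for a continuous session; the
# tokens-per-block figure is an assumption, not a number TML has published.
BLOCK_MS = 200
TOKENS_PER_BLOCK = 50            # assumed: audio + video patches + text per 200 ms

blocks_per_min = 60_000 // BLOCK_MS           # 300 blocks per minute
tokens_per_min = blocks_per_min * TOKENS_PER_BLOCK
print(tokens_per_min)            # 15,000 tokens per minute -> ~900k after an hour
```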
Why it matters
By collapsing the gap between sensing and responding, TML’s interaction model aspires to make AI feel more like a present conversational partner than a turn-based tool. If its results hold outside controlled conditions, the approach could reshape interfaces for live coaching, collaborative creation, accessibility, on-device assistants, and scenarios where timing and cues matter—like classrooms, meetings, or production studios.
Equally important is TML’s stance against “harnesses”—external speech recognizers or conversation managers that wrap a general model. The lab argues those components are less capable than the model itself and introduce new bottlenecks. Early fusion and joint training seek to give the core model native, time-aware multimodal competence rather than duct-taped skills.
What’s next
TML plans larger models as real-time performance permits and is launching a research fellowship to help the community develop evaluation standards for interaction-first AI. With real-time, multimodal, and model-native timing at the forefront, the company is betting that the next big breakthrough isn’t just about reasoning power—it’s about responsiveness, presence, and the seamless choreography of human–machine dialogue.