
Offline Video-LLMs Can Now Understand Real-Time Streams: Apple Researchers Introduce StreamBridge to Enable Multi-Turn and Proactive Video Understanding


By Sajjad Ansari

May 12, 2025

Video-LLMs process whole pre-recorded videos at once, but applications like robotics and autonomous driving need causal, online perception and interpretation of visual information. This fundamental mismatch exposes a limitation of current Video-LLMs: they are not naturally designed to operate in streaming scenarios where timely understanding and responsiveness are paramount. The transition from offline to streaming video understanding presents two key challenges. First, multi-turn real-time understanding requires models to process the most recent video segment while maintaining historical visual and conversational context. Second, proactive response generation demands human-like behavior in which the model actively monitors the visual stream and provides timely outputs based on unfolding content, without explicit prompts.

Video-LLMs have gained significant attention for video understanding, combining visual encoders, modality projectors, and LLMs to generate contextual responses from video content. Several approaches have emerged to address the challenge of streaming video understanding. VideoLLMOnline and Flash-VStream introduced specialized online objectives and memory architectures for handling sequential inputs. MMDuet and ViSpeak developed dedicated components for proactive response generation. Multiple benchmark suites have been used to evaluate streaming capabilities, including StreamingBench, StreamBench, SVBench, OmniMMI, and OVO-Bench.

Researchers from Apple and Fudan University have proposed StreamBridge, a framework that transforms offline Video-LLMs into streaming-capable models. It addresses two fundamental challenges in adapting existing models to online scenarios: limited capability for multi-turn real-time understanding and the lack of proactive response mechanisms. StreamBridge combines a memory buffer with a round-decayed compression strategy, supporting long-context interactions. It also incorporates a decoupled, lightweight activation model that integrates seamlessly with existing Video-LLMs for proactive response generation. Further, the researchers introduced Stream-IT, a large-scale dataset designed for streaming video understanding, featuring interleaved video-text sequences and diverse instruction formats.
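The paper's exact buffer and compression design isn't reproduced in this summary, but the general idea of round-decayed compression can be sketched in a few lines: the newest rounds keep the most tokens, and each older round is compressed harder. The toy Python below is purely illustrative; the budget, decay rate, and subsampling rule are assumptions, not the authors' implementation.

def compress_rounds(rounds, newest_budget=512, decay=0.5, min_budget=32):
    """Toy round-decayed memory buffer: each older round gets a
    geometrically smaller token budget than the round after it.
    `rounds` is a list of per-round token lists, oldest first."""
    out = []
    for age, tokens in enumerate(reversed(rounds)):  # age 0 = newest round
        budget = max(int(newest_budget * decay ** age), min_budget)
        if len(tokens) > budget:
            # keep an evenly spaced subsample when a round exceeds its budget
            step = len(tokens) / budget
            tokens = [tokens[int(i * step)] for i in range(budget)]
        out.append(tokens)
    return list(reversed(out))  # restore chronological order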



The StreamBridge framework is evaluated using three mainstream offline Video-LLMs: LLaVA-OV-7B, Qwen2-VL-7B, and Oryx-1.5-7B. The Stream-IT dataset is supplemented with approximately 600K samples from established datasets, including LLaVA-178K, VCG-Plus, and ShareGPT4Video, to maintain general video understanding capabilities. OVO-Bench and StreamingBench are used for multi-turn real-time understanding, focusing on their real-time tasks. General video understanding is evaluated across seven benchmarks: three short-video datasets (MVBench, PerceptionTest, TempCompass) and four long-video benchmarks (EgoSchema, LongVideoBench, MLVU, VideoMME).

The evaluation results show that Qwen2-VL† improved, with average scores increasing from 55.98 to 63.35 on OVO-Bench and from 69.04 to 72.01 on StreamingBench. In contrast, LLaVA-OV† experiences slight performance decreases, dropping from 64.02 to 61.64 on OVO-Bench and from 71.12 to 68.39 on StreamingBench. Fine-tuning on the Stream-IT dataset yields substantial improvements across all models: Oryx-1.5† achieves gains of +11.92 on OVO-Bench and +4.2 on StreamingBench. Moreover, Qwen2-VL† reaches average scores of 71.30 on OVO-Bench and 77.04 on StreamingBench after Stream-IT fine-tuning, outperforming even proprietary models like GPT-4o and Gemini 1.5 Pro, showing the effectiveness of StreamBridge’s approach to enhancing streaming video understanding.

In conclusion, researchers introduced StreamBridge, a method to transform offline Video-LLMs into effective streaming-capable models. Its dual innovations, a memory buffer with round-decayed compression strategy and a decoupled lightweight activation model, address the core challenges of streaming video understanding without compromising general performance. Further, the Stream-IT dataset is introduced for streaming video understanding, with specialized interleaved video-text sequences. As streaming video understanding becomes increasingly essential in robotics and autonomous driving, StreamBridge offers a generalizable solution that transforms static Video-LLMs into dynamic, responsive systems capable of meaningful interaction in continuously evolving visual environments.






 


Meet LangGraph Multi-Agent Swarm: A Python Library for Creating Swarm-Style Multi-Agent Systems Using LangGraph


By Sana Hassan

May 15, 2025

LangGraph Multi-Agent Swarm is a Python library designed to orchestrate multiple AI agents as a cohesive “swarm.” It builds on LangGraph, a framework for constructing robust, stateful agent workflows, to enable a specialized form of multi-agent architecture. In a swarm, agents with different specializations dynamically hand off control to one another as tasks demand, rather than a single monolithic agent attempting everything. The system tracks which agent was last active so that when a user provides the next input, the conversation seamlessly resumes with that same agent. This approach addresses the problem of building cooperative AI workflows where the most qualified agent can handle each sub-task without losing context or continuity.

LangGraph Swarm aims to make such multi-agent coordination easier and more reliable for developers. It provides abstractions to link individual language model agents (each potentially with its own tools and prompts) into one integrated application. The library comes with out-of-the-box support for streaming responses, short-term and long-term memory integration, and even human-in-the-loop intervention, thanks to its foundation on LangGraph. By leveraging LangGraph (a lower-level orchestration framework) and fitting naturally into the broader LangChain ecosystem, LangGraph Swarm allows machine learning engineers and researchers to build complex AI agent systems while maintaining explicit control over the flow of information and decisions.

LangGraph Swarm Architecture and Key Features


At its core, LangGraph Swarm represents multiple agents as nodes in a directed state graph; edges define handoff pathways, and a shared state field tracks the ‘active_agent’. When an agent invokes a handoff, the library updates that field and transfers the necessary context so the next agent seamlessly continues the conversation. This setup supports collaborative specialization, letting each agent focus on a narrow domain while offering customizable handoff tools for flexible workflows. Built on LangGraph’s streaming and memory modules, Swarm preserves short-term conversational context and long-term knowledge, ensuring coherent, multi-turn interactions even as control shifts between agents.

Agent Coordination via Handoff Tools


LangGraph Swarm’s handoff tools let one agent transfer control to another by issuing a ‘Command’ that updates the shared state, switching the ‘active_agent’ and passing along context, such as relevant messages or a custom summary. While the default tool hands off the full conversation and inserts a notification, developers can implement custom tools to filter context, add instructions, or rename the action to influence the LLM’s behavior. Unlike autonomous AI-routing patterns, Swarm’s routing is explicitly defined: each handoff tool specifies which agent may take over, ensuring predictable flows. This mechanism supports collaboration patterns, such as a “Travel Planner” delegating medical questions to a “Medical Advisor” or a coordinator distributing technical and billing queries to specialized experts. It relies on an internal router to direct user messages to the current agent until another handoff occurs.
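To make the mechanism concrete, here is a sketch of a handoff tool written against langgraph-swarm's public building blocks (Command, InjectedState, InjectedToolCallId); the agent name 'medical_advisor' and the notification text are illustrative, not library defaults.

from typing import Annotated

from langchain_core.tools import InjectedToolCallId, tool
from langgraph.prebuilt import InjectedState
from langgraph.types import Command

@tool("transfer_to_medical_advisor")
def transfer_to_medical_advisor(
    state: Annotated[dict, InjectedState],
    tool_call_id: Annotated[str, InjectedToolCallId],
) -> Command:
    """Hand the conversation off to the medical_advisor agent."""
    # Record the tool call's result so the message history stays well-formed.
    tool_message = {
        "role": "tool",
        "content": "Transferred to medical_advisor",
        "name": "transfer_to_medical_advisor",
        "tool_call_id": tool_call_id,
    }
    return Command(
        goto="medical_advisor",  # which agent takes over
        graph=Command.PARENT,    # the jump happens in the parent swarm graph
        update={
            # Forward the history plus the acknowledgement; a custom tool could
            # instead summarize or filter what it passes along.
            "messages": state["messages"] + [tool_message],
            "active_agent": "medical_advisor",
        },
    )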

State Management and Memory


Managing state and memory is essential for preserving context as agents hand off tasks. By default, LangGraph Swarm maintains a shared state, containing the conversation history and an ‘active_agent’ marker, and uses a checkpointer (such as an in-memory saver or database store) to persist this state across turns. Also, it supports a memory store for long-term knowledge, allowing the system to log facts or past interactions for future sessions while keeping a window of recent messages for immediate context. Together, these mechanisms ensure the swarm never “forgets” which agent is active or what has been discussed, enabling seamless multi-turn dialogues and accumulating user preferences or critical data over time.
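Concretely, both layers of memory are attached when the swarm graph is compiled. A minimal sketch, assuming the 'workflow' object produced by the library's create_swarm helper (as in the sample implementation below):

from langgraph.checkpoint.memory import InMemorySaver
from langgraph.store.memory import InMemoryStore

checkpointer = InMemorySaver()  # short-term: per-thread conversation state
store = InMemoryStore()         # long-term: cross-session key-value memory

app = workflow.compile(checkpointer=checkpointer, store=store)

# Each thread_id keys its own checkpoint, so every conversation resumes
# with its own history and last-active agent.
config = {"configurable": {"thread_id": "user-42"}}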

When more granular control is needed, developers can define custom state schemas so each agent has its private message history. By wrapping agent calls to map the global state into agent-specific fields before invocation and merging updates afterward, teams can tailor the degree of context sharing. This approach supports workflows ranging from fully collaborative agents to isolated reasoning modules, all while leveraging LangGraph Swarm’s robust orchestration, memory, and state-management infrastructure.
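A schematic of that wrapping pattern, assuming a hypothetical agent 'alice' compiled with a custom state schema whose private channel is 'alice_messages' (both names are illustrative):

from typing import Annotated
from typing_extensions import TypedDict

from langgraph.graph import add_messages

class AliceState(TypedDict):
    # Alice's private history, kept apart from the swarm's shared `messages`.
    alice_messages: Annotated[list, add_messages]

def call_alice(state: dict) -> dict:
    # Map the shared history into Alice's private channel before invoking her...
    result = alice.invoke({"alice_messages": state["messages"]})
    # ...then merge only her newest reply back into the shared swarm state.
    return {"messages": result["alice_messages"][-1:]}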

Customization and Extensibility


LangGraph Swarm offers extensive flexibility for custom workflows. Developers can override the default handoff tool, which passes all messages and switches the active agent, to implement specialized logic, such as summarizing context or attaching additional metadata. Custom tools simply return a LangGraph Command to update state, and agents must be configured to handle those commands via the appropriate node types and state-schema keys. Beyond handoffs, one can redefine how agents share or isolate memory using LangGraph’s typed state schemas: mapping the global swarm state into per-agent fields before invocation and merging results afterward. This enables scenarios where an agent maintains a private conversation history or uses a different communication format without exposing its internal reasoning. For full control, it’s possible to bypass the high-level API and manually assemble a ‘StateGraph’: add each compiled agent as a node, define transition edges, and attach the active-agent router. While most use cases benefit from the simplicity of ‘create_swarm’ and ‘create_react_agent’, the ability to drop down to LangGraph primitives ensures that practitioners can inspect, adjust, or extend every aspect of multi-agent coordination.
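A sketch of that manual assembly, consistent with the library's documented low-level helpers and reusing the Alice and Bob agents from the sample implementation below:

from langgraph.checkpoint.memory import InMemorySaver
from langgraph.graph import StateGraph
from langgraph_swarm import SwarmState, add_active_agent_router

# Each compiled agent becomes a node; `destinations` declares the legal handoffs.
builder = StateGraph(SwarmState)
builder.add_node(alice, destinations=("Bob",))
builder.add_node(bob, destinations=("Alice",))

# The router reads `active_agent` from state and dispatches each user turn.
builder = add_active_agent_router(
    builder,
    route_to=["Alice", "Bob"],
    default_active_agent="Alice",
)
app = builder.compile(checkpointer=InMemorySaver())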

Ecosystem Integration and Dependencies


LangGraph Swarm integrates tightly with LangChain, leveraging components like LangSmith for evaluation, langchain_openai for model access, and LangGraph for orchestration features such as persistence and caching. Its model-agnostic design lets it coordinate agents across any LLM backend (OpenAI, Hugging Face, or others), and it’s available in both Python (‘pip install langgraph-swarm’) and JavaScript/TypeScript (‘@langchain/langgraph-swarm’), making it suitable for web or serverless environments. Distributed under the MIT license and with active development, it continues to benefit from community contributions and enhancements in the LangChain ecosystem.

Sample Implementation


Below is a minimal setup of a two-agent swarm, sketched against the library's documented quickstart API (the model choice is illustrative):
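from langchain_openai import ChatOpenAI
from langgraph.checkpoint.memory import InMemorySaver
from langgraph.prebuilt import create_react_agent
from langgraph_swarm import create_handoff_tool, create_swarm

model = ChatOpenAI(model="gpt-4o")  # illustrative model choice

def add(a: int, b: int) -> int:
    """Add two numbers."""
    return a + b

# Alice: an addition specialist who can hand the conversation to Bob.
alice = create_react_agent(
    model,
    [add, create_handoff_tool(agent_name="Bob")],
    prompt="You are Alice, an addition expert.",
    name="Alice",
)

# Bob: a playful persona who routes math questions back to Alice.
bob = create_react_agent(
    model,
    [create_handoff_tool(agent_name="Alice", description="Transfer to Alice, she can help with math")],
    prompt="You are Bob. You speak like a pirate.",
    name="Bob",
)

# InMemorySaver persists the shared state (including active_agent) per thread.
checkpointer = InMemorySaver()
workflow = create_swarm([alice, bob], default_active_agent="Alice")
app = workflow.compile(checkpointer=checkpointer)

config = {"configurable": {"thread_id": "1"}}
app.invoke({"messages": [{"role": "user", "content": "I'd like to speak to Bob"}]}, config)
app.invoke({"messages": [{"role": "user", "content": "What's 5 + 7?"}]}, config)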

Here, Alice handles additions and can hand off to Bob, while Bob responds playfully but routes math questions back to Alice. The InMemorySaver ensures conversational state persists across turns.

Use Cases and Applications


LangGraph Swarm unlocks advanced multi-agent collaboration by enabling a central coordinator to dynamically delegate sub-tasks to specialized agents, whether that’s triaging emergencies by handing off to medical, security, or disaster-response experts, routing travel bookings between flight, hotel, and car-rental agents, orchestrating a pair-programming workflow between a coding agent and a reviewer, or splitting research and report generation tasks among researcher, reporter, and fact-checker agents. Beyond these examples, the framework can power customer-support bots that route queries to departmental specialists, interactive storytelling with distinct character agents, scientific pipelines with stage-specific processors, or any scenario where dividing work among expert “swarm” members boosts reliability and clarity. At the same time, LangGraph Swarm handles the underlying message routing, state management, and smooth transitions.

In conclusion, LangGraph Swarm marks a leap toward truly modular, cooperative AI systems. Structuring multiple specialized agents into a directed graph solves tasks that a single model struggles with: each agent handles its own area of expertise, then hands off control seamlessly. This design keeps individual agents simple and interpretable while the swarm collectively manages complex workflows involving reasoning, tool use, and decision-making. Built on LangChain and LangGraph, the library taps into a mature ecosystem of LLMs, tools, memory stores, and debugging utilities. Developers retain explicit control over agent interactions and state sharing, ensuring reliability, yet still leverage LLM flexibility to decide when to invoke tools or delegate to another agent.





 


Rime Introduces Arcana and Rimecaster (Open Source): Practical Voice AI Tools Built on Real-World Speech


By Asif Razzaq

May 14, 2025

The field of Voice AI is evolving toward more representative and adaptable systems. While many existing models have been trained on carefully curated, studio-recorded audio, Rime is pursuing a different direction: building foundational voice models that reflect how people actually speak. Its two latest releases, Arcana and Rimecaster, are designed to offer practical tools for developers seeking greater realism, flexibility, and transparency in voice applications.

Arcana: A General-Purpose Text-to-Speech Model


Arcana is a spoken language text-to-speech (TTS) model optimized for extracting semantic, prosodic, and expressive features from speech. While Rimecaster focuses on identifying who is speaking, Arcana is oriented toward understanding how something is said—capturing delivery, rhythm, and emotional tone.

The model supports a variety of use cases, including:

  • Voice agents for businesses across IVR, support, outbound, and more
  • Expressive text-to-speech synthesis for creative applications
  • Dialogue systems that require speaker-aware interaction

Arcana is trained on a diverse range of conversational data collected in natural settings. This allows it to generalize across speaking styles, accents, and languages, and to perform reliably in complex audio environments, such as real-time interaction.

Arcana also captures speech elements that are typically overlooked—such as breathing, laughter, and speech disfluencies—helping systems to process voice input in a way that mirrors human understanding.

Rime also offers Mist v2, another TTS model optimized for high-volume, business-critical applications. It enables efficient deployment on edge devices at extremely low latency without sacrificing quality. Its design blends acoustic and linguistic features, resulting in embeddings that are both compact and expressive.

Rimecaster: Capturing Natural Speaker Representation


Rimecaster is an open source speaker representation model developed to help train voice AI models, like Arcana and Mist v2. It moves beyond performance-oriented datasets, such as audiobooks or scripted podcasts. Instead, it is trained on full-duplex, multilingual conversations featuring everyday speakers. This approach allows the model to account for the variability and nuances of unscripted speech—such as hesitations, accent shifts, and conversational overlap.

Technically, Rimecaster transforms a voice sample into a vector embedding that represents speaker-specific characteristics like tone, pitch, rhythm, and vocal style. These embeddings are useful in a range of applications, including speaker verification, voice adaptation, and expressive TTS.
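As a rough illustration of the embedding workflow, the sketch below extracts a speaker vector with NVIDIA NeMo's public Titanet checkpoint, the base architecture Rimecaster builds on; loading Rimecaster's own weights would follow Rime's release instructions, which aren't reproduced here.

import nemo.collections.asr as nemo_asr

# Load NeMo's public Titanet speaker model (the base architecture Rimecaster
# builds on; Rimecaster's own checkpoint would be loaded per Rime's release).
model = nemo_asr.models.EncDecSpeakerLabelModel.from_pretrained("titanet_large")

# One fixed-size vector per utterance, capturing speaker characteristics
# such as tone, pitch, rhythm, and vocal style.
embedding = model.get_embedding("speaker_sample.wav")
print(embedding.shape)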

Key design elements of Rimecaster include:

  • Training Data: The model is built on a large dataset of natural conversations across languages and speaking contexts, enabling improved generalization and robustness in noisy or overlapping speech environments.
  • Model Architecture: Based on NVIDIA’s Titanet, Rimecaster produces four times denser speaker embeddings, supporting fine-grained speaker identification and better downstream performance.
  • Open Integration: It is compatible with Hugging Face and NVIDIA NeMo, allowing researchers and engineers to integrate it into training and inference pipelines with minimal friction.
  • Licensing: Released under an open source CC BY 4.0 license, Rimecaster supports open research and collaborative development.

By training on speech that reflects real-world use, Rimecaster enables systems to distinguish among speakers more reliably and deliver voice outputs that are less constrained by performance-driven data assumptions.

Realism and Modularity as Design Priorities


Rime’s recent updates align with its core technical principles: model realism, diversity of data, and modular system design. Rather than pursuing monolithic voice solutions trained on narrow datasets, Rime is building a stack of components that can be adapted to a wide range of speech contexts and applications.

Integration and Practical Use in Production Systems


Arcana and Mist v2 are designed with real-time applications in mind. Both support:

  • Streaming and low-latency inference
  • Compatibility with conversational AI stacks and telephony systems

They improve the naturalness of synthesized speech and enable personalization in dialogue agents. Because of their modularity, these tools can be integrated without significant changes to existing infrastructure.

For example, Arcana can help synthesize speech that retains the tone and rhythm of the original speaker in a multilingual customer service setting.

Conclusion


Rime’s voice AI models represent an incremental yet important step toward systems that reflect the true complexity of human speech. Their grounding in real-world data and modular architecture make them suitable for developers and builders working across speech-related domains.

Rather than prioritizing uniform clarity at the expense of nuance, these models embrace the diversity inherent in natural language. In doing so, Rime is contributing tools that can support more accessible, realistic, and context-aware voice technologies.






Thanks to the Rime team for the thought leadership and resources behind this article. The Rime team sponsored this content.

 


xAI posts Grok’s behind-the-scenes prompts


The instructions tell Grok that it is ‘extremely skeptical.’

by Emma Roth

May 16, 2025, 12:34 PM EDT

Image: The Verge


xAI has published the system prompts for its AI chatbot Grok after an “unauthorized” change led to a slew of unprompted responses on X about white genocide. The company says it will publish Grok’s system prompts on GitHub from now on; the prompts provide some insight into the way xAI has instructed Grok to respond to users.

A system prompt is a set of instructions served to a chatbot ahead of a user’s messages that developers use to direct its responses. xAI and Anthropic are two of the only major AI companies we checked that have made their system prompts public. In the past, people have used prompt injection attacks to expose system prompts, like instructions Microsoft gave the Bing AI bot (now Copilot) to keep its internal alias “Sydney” a secret, and avoid replying with content that violates copyrights.
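As a generic illustration of where a system prompt sits, the snippet below uses OpenAI's Python client (any chat-style API works similarly); the instruction text is paraphrased from Grok's published prompt, not an exact excerpt of any vendor's deployment.

from openai import OpenAI

client = OpenAI()  # any chat-completions-style API works the same way

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        # The system prompt is injected by the developer, ahead of user input.
        {"role": "system", "content": "You are extremely skeptical. Stick to truth-seeking and neutrality."},
        # User messages follow; the model answers under the system instructions.
        {"role": "user", "content": "Explain this post."},
    ],
)
print(response.choices[0].message.content)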

In the system prompts for ask Grok — a feature X users can use to tag Grok in posts to ask a question — xAI tells the chatbot how to behave. “You are extremely skeptical,” the instructions say. “You do not blindly defer to mainstream authority or media. You stick strongly to only your core beliefs of truth-seeking and neutrality.” It adds that the results in the response “are NOT your beliefs.”

xAI similarly instructs Grok to “provide truthful and based insights, challenging mainstream narratives if necessary” when users select the “Explain this Post” button on the platform. Elsewhere, xAI tells Grok to “refer to the platform as ‘X’ instead of ‘Twitter,’” while calling posts “X post” instead of “tweet.”

Anthropic’s system prompt for its Claude AI chatbot, by contrast, appears to put an emphasis on safety. “Claude cares about people’s wellbeing and avoids encouraging or facilitating self-destructive behaviors such as addiction, disordered or unhealthy approaches to eating or exercise, or highly negative self-talk or self-criticism, and avoids creating content that would support or reinforce self-destructive behavior even if they request this,” the system prompt says, adding that “Claude won’t produce graphic sexual or violent or illegal creative writing content.”
 