bnew

Veteran
Joined
Nov 1, 2015
Messages
63,833
Reputation
9,783
Daps
174,078

LLMs Struggle with Real Conversations: Microsoft and Salesforce Researchers Reveal a 39% Performance Drop in Multi-Turn Underspecified Tasks​


By Nikhil

May 16, 2025

Conversational artificial intelligence is centered on enabling large language models (LLMs) to engage in dynamic interactions where user needs are revealed progressively. These systems are widely deployed in tools that assist with coding, writing, and research by interpreting and responding to natural language instructions. The aspiration is for these models to flexibly adjust to changing user inputs over multiple turns, adapting their understanding with each new piece of information. This contrasts with static, single-turn responses and highlights a major design goal: sustaining contextual coherence and delivering accurate outcomes in extended dialogues.

A persistent problem in conversational AI is the model’s inability to handle user instructions distributed across multiple conversation turns. Rather than receiving all necessary information simultaneously, LLMs must extract and integrate key details incrementally. However, when the task is not specified upfront, models tend to make early assumptions about what is being asked and attempt final solutions prematurely. This leads to errors that persist through the conversation, as the models often stick to their earlier interpretations. The result is that once an LLM makes a misstep in understanding, it struggles to recover, resulting in incomplete or misguided answers.

AD_4nXcUVipGgOjoqnMYmS4e0WYdz27UAQnIzHn_Xy7bo5ioMmoi1EKIdLNxCHfpFu8JibE4JoEYCPxDcsaACRZqZ8RxrUG4Q7cL6Ys0ou9rYZGfIjOkQzOmqDQQv4AiI86h6HL4_mQ


Most current tools evaluate LLMs using single-turn, fully-specified prompts, where all task requirements are presented in one go. Even in research claiming multi-turn analysis, the conversations are typically episodic, treated as isolated subtasks rather than an evolving flow. These evaluations fail to account for how models behave when the information is fragmented and context must be actively constructed from multiple exchanges. Consequently, evaluations often miss the core difficulty models face: integrating underspecified inputs over several conversational turns without explicit direction.

Researchers from Microsoft Research and Salesforce Research introduced a simulation setup that mimics how users reveal information in real conversations. Their “sharded simulation” method takes complete instructions from high-quality benchmarks and splits them into smaller, logically connected parts or “shards.” Each shard delivers a single element of the original instruction, which is then revealed sequentially over multiple turns. This simulates the progressive disclosure of information that happens in practice. The setup includes a simulated user powered by an LLM that decides which shard to reveal next and reformulates it naturally to fit the ongoing context. This setup also uses classification mechanisms to evaluate whether the assistant’s responses attempt a solution or require clarification, further refining the simulation of genuine interaction.

AD_4nXdT1ruLcTssCpOuhB38sOH-ZeZLzVdQTCaJnrr9TdKtezyQGoY6pJ4aI-ZSfMCEju6Kvo1WLrY0PEb09VIbs5uQEYjRv0P6hz6GHKZZjUCLmQ7D8itY_57UBQ291XfpzVEYNTT4NQ


The technology developed simulates five types of conversations, including single-turn full instructions and multiple multi-turn setups. In SHARDED simulations, LLMs received instructions one shard at a time, forcing them to wait before proposing a complete answer. This setup evaluated 15 LLMs across six generation tasks: coding, SQL queries, API actions, math problems, data-to-text descriptions, and document summaries. Each task drew from established datasets such as GSM8K, Spider, and ToTTo. For every LLM and instruction, 10 simulations were conducted, totaling over 200,000 simulations. Aptitude, unreliability, and average performance were computed using a percentile-based scoring system, allowing direct comparison of best and worst-case outcomes per model.

Across all tasks and models, a consistent decline in performance was observed in the SHARDED setting. On average, performance dropped from 90% in single-turn to 65% in multi-turn scenarios—a 25-point decline. The main cause was not reduced capability but a dramatic rise in unreliability. While aptitude dropped by 16%, unreliability increased by 112%, revealing that models varied wildly in how they performed when information was presented gradually. For example, even top-performing models like GPT-4.1 and Gemini 2.5 Pro exhibited 30-40% average degradations. Additional compute at generation time or lowering randomness (temperature settings) offered only minor improvements in consistency.

AD_4nXeCr-yyIPogtmJ7umQXn5H0d0jo7VBf8bMzrulhe4Cw-OhaWxCGIi-ubmwOLXrpHYVm-1nzkRbLKMb3gMycTWV-2Gq_vUwNa8Ob0NdT7g58v3vc_69gi7gYDavde8O3LUkcrzeVJA


This research clarifies that even state-of-the-art LLMs are not yet equipped to manage complex conversations where task requirements unfold gradually. The sharded simulation methodology effectively exposes how models falter in adapting to evolving instructions, highlighting the urgent need to improve reliability in multi-turn settings. Enhancing the ability of LLMs to process incomplete instructions over time is essential for real-world applications where conversations are naturally unstructured and incremental.




Check out the Paper and GitHub Page . All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 90k+ ML SubReddit .

 

bnew

Veteran
Joined
Nov 1, 2015
Messages
63,833
Reputation
9,783
Daps
174,078

Meta AI Open-Sources LlamaFirewall: A Security Guardrail Tool to Help Build Secure AI Agents​


By Asif Razzaq

May 8, 2025

As AI agents become more autonomous—capable of writing production code, managing workflows, and interacting with untrusted data sources—their exposure to security risks grows significantly. Addressing this evolving threat landscape, Meta AI has released LlamaFirewall , an open-source guardrail system designed to provide a system-level security layer for AI agents in production environments.

Addressing Security Gaps in AI Agent Deployments


Large language models (LLMs) embedded in AI agents are increasingly integrated into applications with elevated privileges. These agents can read emails, generate code, and issue API calls—raising the stakes for adversarial exploitation. Traditional safety mechanisms, such as chatbot moderation or hardcoded model constraints, are insufficient for agents with broader capabilities.

LlamaFirewall was developed in response to three specific challenges:

  1. Prompt Injection Attacks : Both direct and indirect manipulations of agent behavior via crafted inputs.
  2. Agent Misalignment : Deviations between an agent’s actions and the user’s stated goals.
  3. Insecure Code Generation : Emission of vulnerable or unsafe code by LLM-based coding assistants.

Core Components of LlamaFirewall


LlamaFirewall introduces a layered framework composed of three specialized guardrails, each targeting a distinct class of risks:

1. PromptGuard 2


PromptGuard 2 is a classifier built using BERT-based architectures to detect jailbreaks and prompt injection attempts. It operates in real time and supports multilingual input. The 86M parameter model offers strong performance, while a 22M lightweight variant provides low-latency deployment in constrained environments. It is designed to identify high-confidence jailbreak attempts with minimal false positives.

2. AlignmentCheck


AlignmentCheck is an experimental auditing tool that evaluates whether an agent’s actions remain semantically aligned with the user’s goals. It operates by analyzing the agent’s internal reasoning trace and is powered by large language models such as Llama 4 Maverick. This component is particularly effective in detecting indirect prompt injection and goal hijacking scenarios.

3. CodeShield


CodeShield is a static analysis engine that inspects LLM-generated code for insecure patterns. It supports syntax-aware analysis across multiple programming languages using Semgrep and regex rules. CodeShield enables developers to catch common coding vulnerabilities—such as SQL injection risks—before code is committed or executed.

Evaluation in Realistic Settings


Meta evaluated LlamaFirewall using AgentDojo , a benchmark suite simulating prompt injection attacks against AI agents across 97 task domains. The results show a clear performance improvement:

  • PromptGuard 2 (86M) alone reduced attack success rates (ASR) from 17.6% to 7.5% with minimal loss in task utility.
  • AlignmentCheck achieved a lower ASR of 2.9%, though with slightly higher computational cost.
  • Combined , the system achieved a 90% reduction in ASR, down to 1.75%, with a modest utility drop to 42.7%.

In parallel, CodeShield achieved 96% precision and 79% recall on a labeled dataset of insecure code completions, with average response times suitable for real-time usage in production systems.

Future Directions


Meta outlines several areas of active development:

  • Support for Multimodal Agents : Extending protection to agents that process image or audio inputs.
  • Efficiency Improvements : Reducing the latency of AlignmentCheck through techniques like model distillation.
  • Expanded Threat Coverage : Addressing malicious tool use and dynamic behavior manipulation.
  • Benchmark Development : Establishing more comprehensive agent security benchmarks to evaluate defense effectiveness in complex workflows.

Conclusion


LlamaFirewall represents a shift toward more comprehensive and modular defenses for AI agents. By combining pattern detection, semantic reasoning, and static code analysis, it offers a practical approach to mitigating key security risks introduced by autonomous LLM-based systems. As the industry moves toward greater agent autonomy, frameworks like LlamaFirewall will be increasingly necessary to ensure operational integrity and resilience.




Check out the Paper , Code and Project Page . Also, don’t forget to follow us on Twitter .

Here’s a brief overview of what we’re building at Marktechpost:


 

bnew

Veteran
Joined
Nov 1, 2015
Messages
63,833
Reputation
9,783
Daps
174,078

ByteDance Introduces Seed1.5-VL: A Vision-Language Foundation Model Designed to Advance General-Purpose Multimodal Understanding and Reasoning​


By Sana Hassan

May 15, 2025

VLMs have become central to building general-purpose AI systems capable of understanding and interacting in digital and real-world settings. By integrating visual and textual data, VLMs have driven advancements in multimodal reasoning, image editing, GUI agents, robotics, and more, influencing sectors like education and healthcare. Despite this progress, VLMs still lag behind human capabilities, particularly in tasks involving 3D reasoning, object counting, creative visual interpretation, and interactive gameplay. A challenge lies in the scarcity of rich, diverse multimodal datasets, unlike the abundant textual resources available to LLMs. Additionally, multimodal data complexity poses significant training and evaluation hurdles.

Researchers at ByteDance have developed Seed1.5-VL, a compact yet powerful vision-language foundation model featuring a 532 M-parameter vision encoder and a 20 B-parameter Mixture-of-Experts LLM. Despite its efficient architecture, Seed1.5-VL achieves top results on 38 out of 60 public VLM benchmarks, excelling in tasks like GUI control, video understanding, and visual reasoning. It is trained on trillions of multimodal tokens using advanced data synthesis and post-training techniques, including human feedback. Innovations in training, such as hybrid parallelism and vision token redistribution, optimize performance. The model’s efficiency and strong reasoning capabilities suit real-world interactive applications like chatbots.

The Seed1.5-VL architecture features a vision encoder, an MLP adapter, and an LLM. Its custom vision encoder, Seed-ViT, supports native-resolution image input using 2D RoPE and processes images through 14×14 patches, followed by average pooling and an MLP. Pretraining involves masked image modeling, contrastive learning, and omni-modal alignment using images, text, and video-audio-caption pairs. The model uses a Dynamic Frame-Resolution Sampling approach for video encoding that adapts frame rates and resolutions based on content complexity, balancing efficiency and detail. This method enables effective spatial-temporal understanding within a token budget, ensuring comprehensive video representation across varied lengths and complexities.

The pre-training of Seed1.5-VL involved curating 3 trillion high-quality tokens across diverse domains. Image-text pairs from the web were filtered using CLIP scores, size/aspect ratio checks, and deduplication to reduce noise. Using domain-based sampling and duplication strategies, rare visual concepts were overrepresented to address class imbalance. Specialized datasets were added for OCR using annotated and synthetic text-rich images, charts, and tables—object grounding and counting tasks utilized bounding boxes, points, and auto-labeled web data. Additional tasks included 3D spatial understanding using depth annotations, and video understanding through multi-frame captioning, QA, and temporal grounding to support dynamic content analysis.

The evaluation highlights Seed-ViT and Seed1.5-VL’s competitive performance across vision-language tasks. Seed-ViT, despite having significantly fewer parameters, matches or outperforms larger models like InternVL-C and EVA-CLIP on zero-shot image classification tasks, showing high accuracy and robustness on datasets such as ImageNet-A and ObjectNet. Seed1.5-VL demonstrates strong capabilities in multimodal reasoning, general VQA, document understanding, and grounding. It achieves state-of-the-art benchmarks, particularly in complex reasoning, counting, and chart interpretation tasks. The model’s “thinking” mode, which incorporates longer reasoning chains, further enhances performance, indicating its strong ability in detailed visual understanding and task generalization.

AD_4nXfzBlnyxnZmRU1qsm_ELlKP6InRzjmolkJgP2U1gNUgZ52aQOTYhTLg3-nEBqO2-1KA22lazKlqm48cSl-AEdJiY8fMhmltpxAaRZ2yIq1q82ux1i_3h4ObTmGvQmypF5CbghtV


In conclusion, Seed1.5-VL is a vision-language foundation model featuring a 532 M-parameter vision encoder and a 20 B-parameter Mixture-of-Experts language model. Despite its compact size, it achieves state-of-the-art results on 38 of 60 public benchmarks and excels in complex reasoning, OCR, diagram interpretation, 3D spatial understanding, and video analysis. It also performs well in agent-driven tasks like GUI control and gameplay, surpassing models like OpenAI CUA and Claude 3.7. The model shows strong generalization to tasks beyond its training scope. The study outlines its architecture, data pipeline, and training methods and identifies future directions, including enhancing tool-use and visual reasoning capabilities.




Check out the Paper and Project Page . All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 90k+ ML SubReddit .

 

bnew

Veteran
Joined
Nov 1, 2015
Messages
63,833
Reputation
9,783
Daps
174,078

NVIDIA Open-Sources Open Code Reasoning Models (32B, 14B, 7B)​


By Sana Hassan

May 8, 2025

NVIDIA continues to push the boundaries of open AI development by open-sourcing its Open Code Reasoning (OCR) model suite — a trio of high-performance large language models purpose-built for code reasoning and problem-solving. The 32B, 14B, and 7B variants, all released under the Apache 2.0 license .

Benchmarked to Beat the Best


The Open Code Reasoning (OCR) models come with notable benchmark achievements , outperforming OpenAI’s o3-Mini and o1 (low) models on the LiveCodeBench benchmark. LiveCodeBench is a comprehensive evaluation suite for code reasoning tasks such as debugging, code generation, and logic completion in real-world developer environments. In direct comparison, NVIDIA’s 32B OCR model tops the leaderboard in reasoning capability for open models.

This leap in performance is attributed not only to model architecture, but to NVIDIA’s custom “OCR dataset” — a high-quality, code-centric training corpus designed to emphasize instruction-following, reasoning, and multi-step code problem solving. According to NVIDIA, this results in a 30% improvement in token efficiency , allowing the models to produce accurate code and logical outputs with fewer tokens.

A Model Lineup for Every Use Case


The Open Code Reasoning suite comes in three parameter scales :

  • OpenCodeReasoning-Nemotron-32B
  • OpenCodeReasoning-Nemotron-14B
  • OpenCodeReasoning-Nemotron-7B

Each model balances scale with performance. The 32B variant delivers state-of-the-art results for high-performance inference and research; the 14B model provides strong reasoning capabilities with reduced compute requirements, and the 7B variant is ideal for resource-constrained environments while retaining competitive performance on benchmarks.

All models are trained using the Nemotron architecture , NVIDIA’s transformer-based backbone optimized for multilingual, multi-task learning. The model weights and configurations are available on Hugging Face:


Compatible with Open Inference Ecosystems


A key feature of these models is out-of-the-box compatibility with popular inference frameworks:

  • llama.cpp for lightweight CPU/GPU inference
  • vLLM for optimized GPU serving and speculative decoding
  • Transformers by Hugging Face for training and evaluation pipelines
  • TGI (Text Generation Inference) for scalable API deployment

This flexibility allows developers, researchers, and enterprises to plug these models into existing code AI infrastructure with minimal overhead.

A Step Forward for Open Code Intelligence


With this release, NVIDIA contributes significantly to the growing ecosystem of open code models. By targeting code reasoning — a domain historically dominated by proprietary models — and releasing under a fully open and permissive license, NVIDIA empowers the broader AI and developer community to build, fine-tune, and deploy advanced reasoning models in production.

The Open Code Reasoning suite adds to NVIDIA’s growing portfolio of open LLMs and strengthens its stance on accessible, transparent AI development. Whether you’re building developer copilots, automated code review agents, or code generation services, these models offer a high-performing, cost-effective, and community-friendly alternative to closed solutions.




Check out the 32B Model , 14B Model , 7B Model and 32B Instruction-Tuned Variant . Also, don’t forget to follow us on Twitter .

Here’s a brief overview of what we’re building at Marktechpost:


 

bnew

Veteran
Joined
Nov 1, 2015
Messages
63,833
Reputation
9,783
Daps
174,078

Offline Video-LLMs Can Now Understand Real-Time Streams: Apple Researchers Introduce StreamBridge to Enable Multi-Turn and Proactive Video Understanding​


By Sajjad Ansari

May 12, 2025

Video-LLMs process whole pre-recorded videos at once. However, applications like robotics and autonomous driving need causal perception and interpretation of visual information online. This fundamental mismatch shows a limitation of current Video-LLMs, as they are not naturally designed to operate in streaming scenarios where timely understanding and responsiveness are paramount. The transition from offline to streaming video understanding presents two key challenges. First, multi-turn real-time understanding requires models to process the most recent video segment while maintaining historical visual and conversational context. Second, proactive response generation demands human-like behavior where the model actively monitors the visual stream and provides timely outputs based on unfolding content without explicit prompts.

Video-LLMs have gained significant attention for video understanding, combining visual encoders, modality projectors, and LLMs to generate contextual responses from video content. Several approaches have emerged to address the challenge of streaming video understanding. VideoLLMOnline and Flash-VStream introduced specialized online objectives and memory architectures for handling sequential inputs. MMDuet and ViSpeak developed dedicated components for proactive response generation. Multiple benchmark suites have been used to evaluate streaming capabilities, including StreamingBench, StreamBench, SVBench, OmniMMI, and OVO-Bench.

Researchers from Apple and Fudan University have proposed StreamBridge, a framework to transform offline Video-LLMs into streaming-capable models. It addresses two fundamental challenges in adapting existing models into online scenarios: limited capability for multi-turn real-time understanding and lack of proactive response mechanisms. StreamBridge combines a memory buffer with a round-decayed compression strategy, supporting long-context interactions. It also incorporates a decoupled, lightweight activation model that integrates seamlessly with existing Video-LLMs for proactive response generation. Further, researchers introduced Stream-IT, a large-scale dataset designed for streaming video understanding, featuring mixed videotext sequences and diverse instruction formats.

AD_4nXd5pmAcTaM1OM110MY3irdJzY0-_ZFbE7NPrRssEdoA80I5ZSu1NAkWrc2cqcGm_ekG8KL1MSUPfXnXFWpazCFljf7QRIkjCAZGbJYbHF0eJwYGipJ-4ncpiDLN_MNw1T-NEIfY


StreamBridge framework is evaluated using mainstream offline Video-LLMs, LLaVA-OV-7B, Qwen2-VL-7B, and Oryx-1.5-7B. The Stream-IT dataset is added with approximately 600K samples from established datasets to maintain general video understanding capabilities, including LLaVA-178K, VCG-Plus, and ShareGPT4Video. OVO-Bench and StreamingBench are used for multi-turn real-time understanding, focusing on their real-time tasks. General video understanding is evaluated across seven benchmarks, including three short-video datasets (MVBench, PerceptionTest, TempCompass) and four long-video benchmarks (EgoSchema, LongVideoBench, MLVU, VideoMME).

The evaluation results show that Qwen2-VL † improved with average scores increasing from 55.98 to 63.35 on OVO-Bench and 69.04 to 72.01 on Streaming-Bench. In contrast, LLaVA-OV † experiences slight performance decreases, dropping from 64.02 to 61.64 on OVO-Bench and from 71.12 to 68.39 on Streaming-Bench. Fine-tuning on the Stream-IT dataset yields substantial improvements across all models. Oryx-1.5 † achieves gains of +11.92 on OVO-Bench and +4.2 on Streaming-Bench. Moreover, Qwen2-VL † reaches average scores of 71.30 on OVO-Bench and 77.04 on Streaming-Bench after Stream-IT fine-tuning, outperforming even proprietary models like GPT-4o and Gemini 1.5 Pro, showing the effectiveness of StreamBridge’s approach in enhancing streaming video understanding capabilities.

In conclusion, researchers introduced StreamBridge, a method to transform offline Video-LLMs into effective streaming-capable models. Its dual innovations, a memory buffer with round-decayed compression strategy and a decoupled lightweight activation model, address the core challenges of streaming video understanding without compromising general performance. Further, the Stream-IT dataset is introduced for streaming video understanding, with specialized interleaved video-text sequences. As streaming video understanding becomes increasingly essential in robotics and autonomous driving, StreamBridge offers a generalizable solution that transforms static Video-LLMs into dynamic, responsive systems capable of meaningful interaction in continuously evolving visual environments.




Check out the Paper . All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 90k+ ML SubReddit .

Here’s a brief overview of what we’re building at Marktechpost:


 

bnew

Veteran
Joined
Nov 1, 2015
Messages
63,833
Reputation
9,783
Daps
174,078

Rime Introduces Arcana and Rimecaster (Open Source): Practical Voice AI Tools Built on Real-World Speech​


By Asif Razzaq

May 14, 2025

The field of Voice AI is evolving toward more representative and adaptable systems. While many existing models have been trained on carefully curated, studio-recorded audio, Rime is pursuing a different direction: building foundational voice models that reflect how people actually speak. Its two latest releases, Arcana and Rimecaster , are designed to offer practical tools for developers seeking greater realism, flexibility, and transparency in voice applications.

Arcana: A General-Purpose Voice Embedding Model


Arcana is a spoken language text-to-speech (TTS) model optimized for extracting semantic, prosodic, and expressive features from speech. While Rimecaster focuses on identifying who is speaking, Arcana is oriented toward understanding how something is said—capturing delivery, rhythm, and emotional tone.

The model supports a variety of use cases, including:

  • Voice agents for businesses across IVR, support, outbound, and more
  • Expressive text-to-speech synthesis for creative applications
  • Dialogue systems that require speaker-aware interaction

Arcana is trained on a diverse range of conversational data collected in natural settings. This allows it to generalize across speaking styles, accents, and languages, and to perform reliably in complex audio environments, such as real-time interaction.

Arcana also captures speech elements that are typically overlooked—such as breathing, laughter, and speech disfluencies—helping systems to process voice input in a way that mirrors human understanding.

Rime also offers another TTS model optimized for high-volume, business-critical applications. Mist v2 enables efficient deployment on edge devices at extremely low latency without sacrificing quality. Its design blends acoustic and linguistic features , resulting in embeddings that are both compact and expressive.

Rimecaster: Capturing Natural Speaker Representation


Rimecaster is an open source speaker representation model developed to help train voice AI models, like Arcana and Mist v2. It moves beyond performance-oriented datasets, such as audiobooks or scripted podcasts. Instead, it is trained on full-duplex, multilingual conversations featuring everyday speakers. This approach allows the model to account for the variability and nuances of unscripted speech—such as hesitations, accent shifts, and conversational overlap.

Technically, Rimecaster transforms a voice sample into a vector embedding that represents speaker-specific characteristics like tone, pitch, rhythm, and vocal style. These embeddings are useful in a range of applications, including speaker verification, voice adaptation, and expressive TTS.

Key design elements of Rimecaster include:

  • Training Data : The model is built on a large dataset of natural conversations across languages and speaking contexts, enabling improved generalization and robustness in noisy or overlapping speech environments.
  • Model Architecture : Based on NVIDIA’s Titanet , Rimecaster produces four times denser speaker embeddings , supporting fine-grained speaker identification and better downstream performance.
  • Open Integration : It is compatible with Hugging Face and NVIDIA NeMo , allowing researchers and engineers to integrate it into training and inference pipelines with minimal friction.
  • Licensing : Released under an open source CC-by-4.0 license , Rimecaster supports open research and collaborative development.

By training on speech that reflects real-world use, Rimecaster enables systems to distinguish among speakers more reliably and deliver voice outputs that are less constrained by performance-driven data assumptions.

Realism and Modularity as Design Priorities


Rime’s recent updates align with its core technical principles: model realism , diversity of data , and modular system design . Rather than pursuing monolithic voice solutions trained on narrow datasets, Rime is building a stack of components that can be adapted to a wide range of speech contexts and applications.

Integration and Practical Use in Production Systems


Arcana and Mist v2 are designed with real-time applications in mind. Both support:

  • Streaming and low-latency inference
  • Compatibility with conversational AI stacks and telephony systems

They improve the naturalness of synthesized speech and enable personalization in dialogue agents. Because of their modularity, these tools can be integrated without significant changes to existing infrastructure.

For example, Arcana can help synthesize speech that retains the tone and rhythm of the original speaker in a multilingual customer service setting.

Conclusion


Rime’s voice AI models offer an incremental yet important step toward building voice AI systems that reflect the true complexity of human speech. Their grounding in real-world data and modular architecture make them suitable for developers and builders working across speech-related domains.

Rather than prioritizing uniform clarity at the expense of nuance, these models embrace the diversity inherent in natural language. In doing so, Rime is contributing tools that can support more accessible, realistic, and context-aware voice technologies.

Sources:





Thanks to the Rime team for the thought leadership/ Resources for this article. Rime team has sponsored us for this content/article.

 

bnew

Veteran
Joined
Nov 1, 2015
Messages
63,833
Reputation
9,783
Daps
174,078

xAI posts Grok’s behind-the-scenes prompts​


The instructions tell Grok that it is ‘extremely skeptical.’

by Emma Roth

May 16, 2025, 12:34 PM EDT

STK262_GROK_B_C


Image: The Verge

Emma Roth is a news writer who covers the streaming wars, consumer tech, crypto, social media, and much more. Previously, she was a writer and editor at MUO.

xAI has published the system prompts for its AI chatbot Grok after an “unauthorized” change led to a slew of unprompted responses on X about white genocide. The company says it will publish its Grok system prompts on GitHub from now on, which provide some insight into the way xAI has instructed Grok to respond to users.

A system prompt is a set of instructions served to a chatbot ahead of a user’s messages that developers use to direct its responses. xAI and Anthropic are two of the only major AI companies we checked that have made their system prompts public. In the past, people have used prompt injection attacks to expose system prompts, like instructions Microsoft gave the Bing AI bot (now Copilot) to keep its internal alias “Sydney” a secret, and avoid replying with content that violates copyrights.

In the system prompts for ask Grok — a feature X users can use to tag Grok in posts to ask a question — xAI tells the chatbot how to behave. “You are extremely skeptical,” the instructions say. “You do not blindly defer to mainstream authority or media. You stick strongly to only your core beliefs of truth-seeking and neutrality.” It adds the results in the response “are NOT your beliefs.”

Related​



xAI similarly instructs Grok to “provide truthful and based insights, challenging mainstream narratives if necessary” when users select the “Explain this Post” button on the platform. Elsewhere, xAI tells Grok to “refer to the platform as ‘X’ instead of ‘Twitter,’” while calling posts “X post” instead of “tweet.”

Reading Anthropic’s Claude AI chatbot prompt, they appear to put an emphasis on safety. “Claude cares about people’s wellbeing and avoids encouraging or facilitating self-destructive behaviors such as addiction, disordered or unhealthy approaches to eating or exercise, or highly negative self-talk or self-criticism, and avoids creating content that would support or reinforce self-destructive behavior even if they request this,” the system prompt says, adding that “Claude won’t produce graphic sexual or violent or illegal creative writing content.”
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
63,833
Reputation
9,783
Daps
174,078



Google to give app devs access to Gemini Nano for on-device AI​


New APIs for Google's ML Kit will let developers plug into the on-device AI model.

Ryan Whitwam – May 16, 2025 2:15 PM |

17

CANADA - 2025/02/03: In this photo illustration, the Google Gemini AI logo is seen displayed on a smartphone screen. (Photo Illustration by Thomas Fuller/SOPA Images/LightRocket via Getty Images)

Credit: Thomas Fuller/SOPA Images/LightRocket via Getty Images

The rapid expansion of generative AI has changed the way Google and other tech giants design products, but most of the AI features you've used are running on remote servers with a ton of processing power. Your phone has a lot less power, but Google appears poised to give developers some important new mobile AI tools. At I/O next week, Google will likely announce a new set of APIs to let developers leverage the capabilities of Gemini Nano for on-device AI.

Google has quietly published documentation on big new AI features for developers. According to Android Authority, an update to the ML Kit SDK will add API support for on-device generative AI features via Gemini Nano. It's built on AI Core, similar to the experimental Edge AI SDK, but it plugs into an existing model with a set of predefined features that should be easy for developers to implement.

Google says ML Kit’s GenAI APIs will enable apps to do summarization, proofreading, rewriting, and image description without sending data to the cloud. However, Gemini Nano doesn't have as much power as the cloud-based version, so expect some limitations. For example, Google notes that summaries can only have a maximum of three bullet points, and image descriptions will only be available in English. The quality of outputs could also vary based on the version of Gemini Nano on a phone. The standard version (Gemini Nano XS) is about 100MB in size, but Gemini Nano XXS as seen on the Pixel 9a is a quarter of the size. It's text-only and has a much smaller context window.



Not all versions of Gemini Nano are created equal. Credit: Ryan Whitwam

This move is good for Android in general because ML Kit works on devices outside Google's Pixel line. While Pixel devices use Gemini Nano extensively, several other phones are already designed to run this model, including the OnePlus 13, Samsung Galaxy S25, and Xiaomi 15. As more phones add support for Google's AI model, developers will be able to target those devices with generative AI features.

The documentation is available for developers to peruse now, but we expect Google to fling the API doors open at I/O. The company has already confirmed an I/O session called "Gemini Nano on Android: Building with on-device gen AI." The description promises new APIs to "summarize, proofread, and rewrite text, as well as to generate image descriptions," which sounds exactly like what the new ML Kit APIs can do.



An important piece of the AI puzzle​


App developers interested in adding on-device generative AI features on Android are currently in a tough spot. Google offers the AI Edge SDK that can provide access to the NPU hardware for running models, but these tools are experimental and only work on the Pixel 9 series currently. It's also limited to text. Both Qualcomm and MediaTek offer APIs for running AI workloads, but features and functionality vary by device, which makes it risky to rely on them for a long-term project. And running your own model requires intimate knowledge of generative AI systems. The new APIs should make implementing local AI comparatively quick and easy.

Despite the limited functionality of an on-device model, this is an important part of how AI could become more helpful. Most people would probably prefer not to send all their personal data to a remote server for AI processing, but an on-device model can parse that information in a more secure way. For example, Google's Pixel Screenshots sees all your screenshots, but all the processing happens on your phone. Similarly, Motorola summarizes notifications locally on the new Razr Ultra foldable. On the other hand, its less capable base model Razr sends notifications to a server for processing.

The release of APIs that plug into Gemini Nano could provide some much-needed consistency to mobile AI. However, it does rely on Google and OEMs to collaborate on support for Gemini Nano. Some companies might decide to go their own way, and there will be plenty of phones that don't have enough power to run AI locally.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
63,833
Reputation
9,783
Daps
174,078



OpenAI adds GPT-4.1 to ChatGPT amid complaints over confusing model lineup​


Two new AI models join 7 others, leaving some paid users wondering which one is best.

Benj Edwards – May 14, 2025 6:16 PM |

36



Tin robots dance in a stock photo.


Credit: Getty Images

On Wednesday, OpenAI announced that ChatGPT users now have access to GPT-4.1, an AI language model previously available only through the company's API since its launch one month ago. The update brings what OpenAI describes as improved coding and web development capabilities to paid ChatGPT subscribers, with wider enterprise rollout planned in the coming weeks.

Adding GPT-4.1 and 4.1 mini to ChatGPT adds to an already complex model selection that includes GPT-4o, various specialized GPT-4o versions, o1-pro, o3-mini, and o3-mini-high models. There are technically nine AI models available for ChatGPT Pro subscribers. Wharton professor Ethan Mollick recently publicly lampooned the awkward situation on social media.

As of May 14, 2025, ChatGPT Pro users have access to 8 different main AI models, plus Deep Research.


As of May 14, 2025, ChatGPT Pro users have access to eight main AI models, plus Deep Research. Credit: Benj Edwards

Deciding which AI model to use can be daunting for AI novices. Reddit users and OpenAI forum members alike commonly voice confusion about the available options. "I do not understand the reason behind having multiple models available for use," wrote one Reddit user in March. "Why would anyone use anything but the best one?" Another Redditor said they were "a bit lost" with the many ChatGPT models available after switching back from using Anthropic Claude.



Reportedly better at coding​


So, what is actually different about GPT-4.1? Notably, it features a very large 1 million token context window that allows processing roughly 3,000 pages of text in a single conversation. The API launch included three versions: GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano. So far, only the full and mini versions are available in ChatGPT.

The full GPT-4.1 model reportedly prioritizes instruction following and coding tasks, which the company positions as an alternative to its o3 and o4-mini simulated reasoning models for basic programming needs. For the smaller of the two models in ChatGPT, the company claims that GPT-4.1 mini performs better in instruction following, coding, and "overall intelligence" compared to GPT-4o mini.

OpenAI is replacing GPT-4o mini with GPT-4.1 mini across all ChatGPT tiers, including free accounts. Free users will automatically switch to GPT-4.1 mini after reaching usage limits for GPT-4o. ChatGPT subscribers using Plus, Pro, or Team plans can access GPT-4.1 through a "more models" dropdown menu in the platform's model picker.

The release comes just two weeks after OpenAI made GPT-4 unavailable in ChatGPT on April 30. That earlier model, which launched in March 2023, once sparked widespread hype about AI capabilities. Compared to that hyperbolic launch, GPT-4.1's rollout has been a fairly understated affair—probably because it's tricky to convey the subtle differences between all of the available OpenAI models.

As if 4.1's launch wasn't confusing enough, the release also roughly coincides with OpenAI's July 2025 deadline for retiring the GPT-4.5 Preview from the API, a model one AI expert called a "lemon." Developers must migrate to other options, OpenAI says, although GPT-4.5 will remain available in ChatGPT for now.



A confusing addition to OpenAI’s model lineup​


In February, OpenAI CEO Sam Altman acknowledged on X his company's confusing AI model naming practices, writing, "We realize how complicated our model and product offerings have gotten." He promised that a forthcoming "GPT-5" model would consolidate the o-series and GPT-series models into a unified branding structure. But the addition of GPT-4.1 to ChatGPT appears to contradict that simplification goal.

So, if you use ChatGPT, which model should you use? If you're a developer using the models through the API, the consideration is more of a trade-off between capability, speed, and cost. But in ChatGPT, your choice might be limited more by personal taste in behavioral style and what you'd like to accomplish. Some of the "more capable" models have lower usage limits as well because they cost more for OpenAI to run.

For now, OpenAI is keeping GPT-4o as the default ChatGPT model, likely due to its general versatility, balance between speed and capability, and personable style (conditioned using reinforcement learning and a specialized system prompt). The simulated reasoning models like 03 and 04-mini-high are slower to execute but can consider analytical-style problems more systematically and perform comprehensive web research that sometimes feels genuinely useful when it surfaces relevant (non-confabulated) web links. Compared to those, OpenAI is largely positioning GPT-4.1 as a speedier AI model for coding assistance.

Just remember that all of the AI models are prone to confabulations, meaning that they tend to make up authoritative-sounding information when they encounter gaps in their trained "knowledge." So you'll need to double-check all of the outputs with other sources of information if you're hoping to use these AI models to assist with an important task.
 
Top