The A.I Megathread (LLM , GPT , Development)

bnew · May 17, 2025

LLMs Struggle with Real Conversations: Microsoft and Salesforce Researchers Reveal a 39% Performance Drop in Multi-Turn Underspecified Tasks

www.marktechpost.com

LLMs Struggle with Real Conversations: Microsoft and Salesforce Researchers Reveal a 39% Performance Drop in Multi-Turn Underspecified Tasks

By Nikhil

May 16, 2025

Conversational artificial intelligence is centered on enabling large language models (LLMs) to engage in dynamic interactions where user needs are revealed progressively. These systems are widely deployed in tools that assist with coding, writing, and research by interpreting and responding to natural language instructions. The aspiration is for these models to flexibly adjust to changing user inputs over multiple turns, adapting their understanding with each new piece of information. This contrasts with static, single-turn responses and highlights a major design goal: sustaining contextual coherence and delivering accurate outcomes in extended dialogues.

A persistent problem in conversational AI is the model’s inability to handle user instructions distributed across multiple conversation turns. Rather than receiving all necessary information simultaneously, LLMs must extract and integrate key details incrementally. However, when the task is not specified upfront, models tend to make early assumptions about what is being asked and attempt final solutions prematurely. This leads to errors that persist through the conversation, as the models often stick to their earlier interpretations. The result is that once an LLM makes a misstep in understanding, it struggles to recover, resulting in incomplete or misguided answers.

AD_4nXcUVipGgOjoqnMYmS4e0WYdz27UAQnIzHn_Xy7bo5ioMmoi1EKIdLNxCHfpFu8JibE4JoEYCPxDcsaACRZqZ8RxrUG4Q7cL6Ys0ou9rYZGfIjOkQzOmqDQQv4AiI86h6HL4_mQ

Most current tools evaluate LLMs using single-turn, fully-specified prompts, where all task requirements are presented in one go. Even in research claiming multi-turn analysis, the conversations are typically episodic, treated as isolated subtasks rather than an evolving flow. These evaluations fail to account for how models behave when the information is fragmented and context must be actively constructed from multiple exchanges. Consequently, evaluations often miss the core difficulty models face: integrating underspecified inputs over several conversational turns without explicit direction.

Researchers from Microsoft Research and Salesforce Research introduced a simulation setup that mimics how users reveal information in real conversations. Their “sharded simulation” method takes complete instructions from high-quality benchmarks and splits them into smaller, logically connected parts or “shards.” Each shard delivers a single element of the original instruction, which is then revealed sequentially over multiple turns. This simulates the progressive disclosure of information that happens in practice. The setup includes a simulated user powered by an LLM that decides which shard to reveal next and reformulates it naturally to fit the ongoing context. This setup also uses classification mechanisms to evaluate whether the assistant’s responses attempt a solution or require clarification, further refining the simulation of genuine interaction.

AD_4nXdT1ruLcTssCpOuhB38sOH-ZeZLzVdQTCaJnrr9TdKtezyQGoY6pJ4aI-ZSfMCEju6Kvo1WLrY0PEb09VIbs5uQEYjRv0P6hz6GHKZZjUCLmQ7D8itY_57UBQ291XfpzVEYNTT4NQ

The technology developed simulates five types of conversations, including single-turn full instructions and multiple multi-turn setups. In SHARDED simulations, LLMs received instructions one shard at a time, forcing them to wait before proposing a complete answer. This setup evaluated 15 LLMs across six generation tasks: coding, SQL queries, API actions, math problems, data-to-text descriptions, and document summaries. Each task drew from established datasets such as GSM8K, Spider, and ToTTo. For every LLM and instruction, 10 simulations were conducted, totaling over 200,000 simulations. Aptitude, unreliability, and average performance were computed using a percentile-based scoring system, allowing direct comparison of best and worst-case outcomes per model.

Across all tasks and models, a consistent decline in performance was observed in the SHARDED setting. On average, performance dropped from 90% in single-turn to 65% in multi-turn scenarios—a 25-point decline. The main cause was not reduced capability but a dramatic rise in unreliability. While aptitude dropped by 16%, unreliability increased by 112%, revealing that models varied wildly in how they performed when information was presented gradually. For example, even top-performing models like GPT-4.1 and Gemini 2.5 Pro exhibited 30-40% average degradations. Additional compute at generation time or lowering randomness (temperature settings) offered only minor improvements in consistency.

AD_4nXeCr-yyIPogtmJ7umQXn5H0d0jo7VBf8bMzrulhe4Cw-OhaWxCGIi-ubmwOLXrpHYVm-1nzkRbLKMb3gMycTWV-2Gq_vUwNa8Ob0NdT7g58v3vc_69gi7gYDavde8O3LUkcrzeVJA

This research clarifies that even state-of-the-art LLMs are not yet equipped to manage complex conversations where task requirements unfold gradually. The sharded simulation methodology effectively exposes how models falter in adapting to evolving instructions, highlighting the urgent need to improve reliability in multi-turn settings. Enhancing the ability of LLMs to process incomplete instructions over time is essential for real-world applications where conversations are naturally unstructured and incremental.

Check out the Paper and GitHub Page . All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 90k+ ML SubReddit .

bnew · May 17, 2025

Meta AI Open-Sources LlamaFirewall: A Security Guardrail Tool to Help Build Secure AI Agents

www.marktechpost.com

Meta AI Open-Sources LlamaFirewall: A Security Guardrail Tool to Help Build Secure AI Agents

By Asif Razzaq

May 8, 2025

As AI agents become more autonomous—capable of writing production code, managing workflows, and interacting with untrusted data sources—their exposure to security risks grows significantly. Addressing this evolving threat landscape, Meta AI has released LlamaFirewall , an open-source guardrail system designed to provide a system-level security layer for AI agents in production environments.

Addressing Security Gaps in AI Agent Deployments

Large language models (LLMs) embedded in AI agents are increasingly integrated into applications with elevated privileges. These agents can read emails, generate code, and issue API calls—raising the stakes for adversarial exploitation. Traditional safety mechanisms, such as chatbot moderation or hardcoded model constraints, are insufficient for agents with broader capabilities.

LlamaFirewall was developed in response to three specific challenges:

Prompt Injection Attacks : Both direct and indirect manipulations of agent behavior via crafted inputs.
Agent Misalignment : Deviations between an agent’s actions and the user’s stated goals.
Insecure Code Generation : Emission of vulnerable or unsafe code by LLM-based coding assistants.

Core Components of LlamaFirewall

LlamaFirewall introduces a layered framework composed of three specialized guardrails, each targeting a distinct class of risks:

1. PromptGuard 2

PromptGuard 2 is a classifier built using BERT-based architectures to detect jailbreaks and prompt injection attempts. It operates in real time and supports multilingual input. The 86M parameter model offers strong performance, while a 22M lightweight variant provides low-latency deployment in constrained environments. It is designed to identify high-confidence jailbreak attempts with minimal false positives.

2. AlignmentCheck

AlignmentCheck is an experimental auditing tool that evaluates whether an agent’s actions remain semantically aligned with the user’s goals. It operates by analyzing the agent’s internal reasoning trace and is powered by large language models such as Llama 4 Maverick. This component is particularly effective in detecting indirect prompt injection and goal hijacking scenarios.

3. CodeShield

CodeShield is a static analysis engine that inspects LLM-generated code for insecure patterns. It supports syntax-aware analysis across multiple programming languages using Semgrep and regex rules. CodeShield enables developers to catch common coding vulnerabilities—such as SQL injection risks—before code is committed or executed.

Evaluation in Realistic Settings

Meta evaluated LlamaFirewall using AgentDojo , a benchmark suite simulating prompt injection attacks against AI agents across 97 task domains. The results show a clear performance improvement:

PromptGuard 2 (86M) alone reduced attack success rates (ASR) from 17.6% to 7.5% with minimal loss in task utility.
AlignmentCheck achieved a lower ASR of 2.9%, though with slightly higher computational cost.
Combined , the system achieved a 90% reduction in ASR, down to 1.75%, with a modest utility drop to 42.7%.

In parallel, CodeShield achieved 96% precision and 79% recall on a labeled dataset of insecure code completions, with average response times suitable for real-time usage in production systems.

Future Directions

Meta outlines several areas of active development:

Support for Multimodal Agents : Extending protection to agents that process image or audio inputs.
Efficiency Improvements : Reducing the latency of AlignmentCheck through techniques like model distillation.
Expanded Threat Coverage : Addressing malicious tool use and dynamic behavior manipulation.
Benchmark Development : Establishing more comprehensive agent security benchmarks to evaluate defense effectiveness in complex workflows.

Conclusion

LlamaFirewall represents a shift toward more comprehensive and modular defenses for AI agents. By combining pattern detection, semantic reasoning, and static code analysis, it offers a practical approach to mitigating key security risks introduced by autonomous LLM-based systems. As the industry moves toward greater agent autonomy, frameworks like LlamaFirewall will be increasingly necessary to ensure operational integrity and resilience.

Check out the Paper , Code and Project Page . Also, don’t forget to follow us on Twitter .

Here’s a brief overview of what we’re building at Marktechpost:

Newsletter– airesearchinsights.com/ (30k+ subscribers)
miniCON AI Events – minicon.marktechpost.com
AI Reports & Magazines – magazine.marktechpost.com
AI Dev & Research News – marktechpost.com (1M+ monthly readers)
ML News Community – r/machinelearningnews (92k+ members)

bnew · May 17, 2025

ByteDance Introduces Seed1.5-VL: A Vision-Language Foundation Model Designed to Advance General-Purpose Multimodal Understanding and Reasoning

www.marktechpost.com

ByteDance Introduces Seed1.5-VL: A Vision-Language Foundation Model Designed to Advance General-Purpose Multimodal Understanding and Reasoning

By Sana Hassan

May 15, 2025

VLMs have become central to building general-purpose AI systems capable of understanding and interacting in digital and real-world settings. By integrating visual and textual data, VLMs have driven advancements in multimodal reasoning, image editing, GUI agents, robotics, and more, influencing sectors like education and healthcare. Despite this progress, VLMs still lag behind human capabilities, particularly in tasks involving 3D reasoning, object counting, creative visual interpretation, and interactive gameplay. A challenge lies in the scarcity of rich, diverse multimodal datasets, unlike the abundant textual resources available to LLMs. Additionally, multimodal data complexity poses significant training and evaluation hurdles.

Researchers at ByteDance have developed Seed1.5-VL, a compact yet powerful vision-language foundation model featuring a 532 M-parameter vision encoder and a 20 B-parameter Mixture-of-Experts LLM. Despite its efficient architecture, Seed1.5-VL achieves top results on 38 out of 60 public VLM benchmarks, excelling in tasks like GUI control, video understanding, and visual reasoning. It is trained on trillions of multimodal tokens using advanced data synthesis and post-training techniques, including human feedback. Innovations in training, such as hybrid parallelism and vision token redistribution, optimize performance. The model’s efficiency and strong reasoning capabilities suit real-world interactive applications like chatbots.

The Seed1.5-VL architecture features a vision encoder, an MLP adapter, and an LLM. Its custom vision encoder, Seed-ViT, supports native-resolution image input using 2D RoPE and processes images through 14×14 patches, followed by average pooling and an MLP. Pretraining involves masked image modeling, contrastive learning, and omni-modal alignment using images, text, and video-audio-caption pairs. The model uses a Dynamic Frame-Resolution Sampling approach for video encoding that adapts frame rates and resolutions based on content complexity, balancing efficiency and detail. This method enables effective spatial-temporal understanding within a token budget, ensuring comprehensive video representation across varied lengths and complexities.

The pre-training of Seed1.5-VL involved curating 3 trillion high-quality tokens across diverse domains. Image-text pairs from the web were filtered using CLIP scores, size/aspect ratio checks, and deduplication to reduce noise. Using domain-based sampling and duplication strategies, rare visual concepts were overrepresented to address class imbalance. Specialized datasets were added for OCR using annotated and synthetic text-rich images, charts, and tables—object grounding and counting tasks utilized bounding boxes, points, and auto-labeled web data. Additional tasks included 3D spatial understanding using depth annotations, and video understanding through multi-frame captioning, QA, and temporal grounding to support dynamic content analysis.

The evaluation highlights Seed-ViT and Seed1.5-VL’s competitive performance across vision-language tasks. Seed-ViT, despite having significantly fewer parameters, matches or outperforms larger models like InternVL-C and EVA-CLIP on zero-shot image classification tasks, showing high accuracy and robustness on datasets such as ImageNet-A and ObjectNet. Seed1.5-VL demonstrates strong capabilities in multimodal reasoning, general VQA, document understanding, and grounding. It achieves state-of-the-art benchmarks, particularly in complex reasoning, counting, and chart interpretation tasks. The model’s “thinking” mode, which incorporates longer reasoning chains, further enhances performance, indicating its strong ability in detailed visual understanding and task generalization.

AD_4nXfzBlnyxnZmRU1qsm_ELlKP6InRzjmolkJgP2U1gNUgZ52aQOTYhTLg3-nEBqO2-1KA22lazKlqm48cSl-AEdJiY8fMhmltpxAaRZ2yIq1q82ux1i_3h4ObTmGvQmypF5CbghtV

In conclusion, Seed1.5-VL is a vision-language foundation model featuring a 532 M-parameter vision encoder and a 20 B-parameter Mixture-of-Experts language model. Despite its compact size, it achieves state-of-the-art results on 38 of 60 public benchmarks and excels in complex reasoning, OCR, diagram interpretation, 3D spatial understanding, and video analysis. It also performs well in agent-driven tasks like GUI control and gameplay, surpassing models like OpenAI CUA and Claude 3.7. The model shows strong generalization to tasks beyond its training scope. The study outlines its architecture, data pipeline, and training methods and identifies future directions, including enhancing tool-use and visual reasoning capabilities.

Check out the Paper and Project Page . All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 90k+ ML SubReddit .

bnew · May 17, 2025

NVIDIA Open-Sources Open Code Reasoning Models (32B, 14B, 7B)

www.marktechpost.com

NVIDIA Open-Sources Open Code Reasoning Models (32B, 14B, 7B)

By Sana Hassan

May 8, 2025

NVIDIA continues to push the boundaries of open AI development by open-sourcing its Open Code Reasoning (OCR) model suite — a trio of high-performance large language models purpose-built for code reasoning and problem-solving. The 32B, 14B, and 7B variants, all released under the Apache 2.0 license .

Benchmarked to Beat the Best

The Open Code Reasoning (OCR) models come with notable benchmark achievements , outperforming OpenAI’s o3-Mini and o1 (low) models on the LiveCodeBench benchmark. LiveCodeBench is a comprehensive evaluation suite for code reasoning tasks such as debugging, code generation, and logic completion in real-world developer environments. In direct comparison, NVIDIA’s 32B OCR model tops the leaderboard in reasoning capability for open models.

This leap in performance is attributed not only to model architecture, but to NVIDIA’s custom “OCR dataset” — a high-quality, code-centric training corpus designed to emphasize instruction-following, reasoning, and multi-step code problem solving. According to NVIDIA, this results in a 30% improvement in token efficiency , allowing the models to produce accurate code and logical outputs with fewer tokens.

A Model Lineup for Every Use Case

The Open Code Reasoning suite comes in three parameter scales :

OpenCodeReasoning-Nemotron-32B
OpenCodeReasoning-Nemotron-14B
OpenCodeReasoning-Nemotron-7B

Each model balances scale with performance. The 32B variant delivers state-of-the-art results for high-performance inference and research; the 14B model provides strong reasoning capabilities with reduced compute requirements, and the 7B variant is ideal for resource-constrained environments while retaining competitive performance on benchmarks.

All models are trained using the Nemotron architecture , NVIDIA’s transformer-based backbone optimized for multilingual, multi-task learning. The model weights and configurations are available on Hugging Face:

Compatible with Open Inference Ecosystems

A key feature of these models is out-of-the-box compatibility with popular inference frameworks:

llama.cpp for lightweight CPU/GPU inference
vLLM for optimized GPU serving and speculative decoding
Transformers by Hugging Face for training and evaluation pipelines
TGI (Text Generation Inference) for scalable API deployment

This flexibility allows developers, researchers, and enterprises to plug these models into existing code AI infrastructure with minimal overhead.

A Step Forward for Open Code Intelligence

With this release, NVIDIA contributes significantly to the growing ecosystem of open code models. By targeting code reasoning — a domain historically dominated by proprietary models — and releasing under a fully open and permissive license, NVIDIA empowers the broader AI and developer community to build, fine-tune, and deploy advanced reasoning models in production.

The Open Code Reasoning suite adds to NVIDIA’s growing portfolio of open LLMs and strengthens its stance on accessible, transparent AI development. Whether you’re building developer copilots, automated code review agents, or code generation services, these models offer a high-performing, cost-effective, and community-friendly alternative to closed solutions.

Check out the 32B Model , 14B Model , 7B Model and 32B Instruction-Tuned Variant . Also, don’t forget to follow us on Twitter .

Here’s a brief overview of what we’re building at Marktechpost:

Newsletter– airesearchinsights.com/ (30k+ subscribers)
miniCON AI Events – minicon.marktechpost.com
AI Reports & Magazines – magazine.marktechpost.com
AI Dev & Research News – marktechpost.com (1M+ monthly readers)
ML News Community – r/machinelearningnews (92k+ members)

bnew · May 17, 2025

Offline Video-LLMs Can Now Understand Real-Time Streams: Apple Researchers Introduce StreamBridge to Enable Multi-Turn and Proactive Video Understanding

www.marktechpost.com

Offline Video-LLMs Can Now Understand Real-Time Streams: Apple Researchers Introduce StreamBridge to Enable Multi-Turn and Proactive Video Understanding

By Sajjad Ansari

May 12, 2025

Video-LLMs process whole pre-recorded videos at once. However, applications like robotics and autonomous driving need causal perception and interpretation of visual information online. This fundamental mismatch shows a limitation of current Video-LLMs, as they are not naturally designed to operate in streaming scenarios where timely understanding and responsiveness are paramount. The transition from offline to streaming video understanding presents two key challenges. First, multi-turn real-time understanding requires models to process the most recent video segment while maintaining historical visual and conversational context. Second, proactive response generation demands human-like behavior where the model actively monitors the visual stream and provides timely outputs based on unfolding content without explicit prompts.

Video-LLMs have gained significant attention for video understanding, combining visual encoders, modality projectors, and LLMs to generate contextual responses from video content. Several approaches have emerged to address the challenge of streaming video understanding. VideoLLMOnline and Flash-VStream introduced specialized online objectives and memory architectures for handling sequential inputs. MMDuet and ViSpeak developed dedicated components for proactive response generation. Multiple benchmark suites have been used to evaluate streaming capabilities, including StreamingBench, StreamBench, SVBench, OmniMMI, and OVO-Bench.

Researchers from Apple and Fudan University have proposed StreamBridge, a framework to transform offline Video-LLMs into streaming-capable models. It addresses two fundamental challenges in adapting existing models into online scenarios: limited capability for multi-turn real-time understanding and lack of proactive response mechanisms. StreamBridge combines a memory buffer with a round-decayed compression strategy, supporting long-context interactions. It also incorporates a decoupled, lightweight activation model that integrates seamlessly with existing Video-LLMs for proactive response generation. Further, researchers introduced Stream-IT, a large-scale dataset designed for streaming video understanding, featuring mixed videotext sequences and diverse instruction formats.

AD_4nXd5pmAcTaM1OM110MY3irdJzY0-_ZFbE7NPrRssEdoA80I5ZSu1NAkWrc2cqcGm_ekG8KL1MSUPfXnXFWpazCFljf7QRIkjCAZGbJYbHF0eJwYGipJ-4ncpiDLN_MNw1T-NEIfY

StreamBridge framework is evaluated using mainstream offline Video-LLMs, LLaVA-OV-7B, Qwen2-VL-7B, and Oryx-1.5-7B. The Stream-IT dataset is added with approximately 600K samples from established datasets to maintain general video understanding capabilities, including LLaVA-178K, VCG-Plus, and ShareGPT4Video. OVO-Bench and StreamingBench are used for multi-turn real-time understanding, focusing on their real-time tasks. General video understanding is evaluated across seven benchmarks, including three short-video datasets (MVBench, PerceptionTest, TempCompass) and four long-video benchmarks (EgoSchema, LongVideoBench, MLVU, VideoMME).

The evaluation results show that Qwen2-VL † improved with average scores increasing from 55.98 to 63.35 on OVO-Bench and 69.04 to 72.01 on Streaming-Bench. In contrast, LLaVA-OV † experiences slight performance decreases, dropping from 64.02 to 61.64 on OVO-Bench and from 71.12 to 68.39 on Streaming-Bench. Fine-tuning on the Stream-IT dataset yields substantial improvements across all models. Oryx-1.5 † achieves gains of +11.92 on OVO-Bench and +4.2 on Streaming-Bench. Moreover, Qwen2-VL † reaches average scores of 71.30 on OVO-Bench and 77.04 on Streaming-Bench after Stream-IT fine-tuning, outperforming even proprietary models like GPT-4o and Gemini 1.5 Pro, showing the effectiveness of StreamBridge’s approach in enhancing streaming video understanding capabilities.

In conclusion, researchers introduced StreamBridge, a method to transform offline Video-LLMs into effective streaming-capable models. Its dual innovations, a memory buffer with round-decayed compression strategy and a decoupled lightweight activation model, address the core challenges of streaming video understanding without compromising general performance. Further, the Stream-IT dataset is introduced for streaming video understanding, with specialized interleaved video-text sequences. As streaming video understanding becomes increasingly essential in robotics and autonomous driving, StreamBridge offers a generalizable solution that transforms static Video-LLMs into dynamic, responsive systems capable of meaningful interaction in continuously evolving visual environments.

Check out the Paper . All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 90k+ ML SubReddit .

Here’s a brief overview of what we’re building at Marktechpost:

ML News Community – r/machinelearningnews (92k+ members)
Newsletter– airesearchinsights.com/ (30k+ subscribers)
miniCON AI Events – minicon.marktechpost.com
AI Reports & Magazines – magazine.marktechpost.com
AI Dev & Research News – marktechpost.com (1M+ monthly readers)
Partner with us

bnew · May 17, 2025

Rime Introduces Arcana and Rimecaster (Open Source): Practical Voice AI Tools Built on Real-World Speech

www.marktechpost.com

Rime Introduces Arcana and Rimecaster (Open Source): Practical Voice AI Tools Built on Real-World Speech

By Asif Razzaq

May 14, 2025

The field of Voice AI is evolving toward more representative and adaptable systems. While many existing models have been trained on carefully curated, studio-recorded audio, Rime is pursuing a different direction: building foundational voice models that reflect how people actually speak. Its two latest releases, Arcana and Rimecaster , are designed to offer practical tools for developers seeking greater realism, flexibility, and transparency in voice applications.

Arcana: A General-Purpose Voice Embedding Model

Arcana is a spoken language text-to-speech (TTS) model optimized for extracting semantic, prosodic, and expressive features from speech. While Rimecaster focuses on identifying who is speaking, Arcana is oriented toward understanding how something is said—capturing delivery, rhythm, and emotional tone.

The model supports a variety of use cases, including:

Voice agents for businesses across IVR, support, outbound, and more
Expressive text-to-speech synthesis for creative applications
Dialogue systems that require speaker-aware interaction

Arcana is trained on a diverse range of conversational data collected in natural settings. This allows it to generalize across speaking styles, accents, and languages, and to perform reliably in complex audio environments, such as real-time interaction.

Arcana also captures speech elements that are typically overlooked—such as breathing, laughter, and speech disfluencies—helping systems to process voice input in a way that mirrors human understanding.

Rime also offers another TTS model optimized for high-volume, business-critical applications. Mist v2 enables efficient deployment on edge devices at extremely low latency without sacrificing quality. Its design blends acoustic and linguistic features , resulting in embeddings that are both compact and expressive.

Rimecaster: Capturing Natural Speaker Representation

Rimecaster is an open source speaker representation model developed to help train voice AI models, like Arcana and Mist v2. It moves beyond performance-oriented datasets, such as audiobooks or scripted podcasts. Instead, it is trained on full-duplex, multilingual conversations featuring everyday speakers. This approach allows the model to account for the variability and nuances of unscripted speech—such as hesitations, accent shifts, and conversational overlap.

Technically, Rimecaster transforms a voice sample into a vector embedding that represents speaker-specific characteristics like tone, pitch, rhythm, and vocal style. These embeddings are useful in a range of applications, including speaker verification, voice adaptation, and expressive TTS.

Key design elements of Rimecaster include:

Training Data : The model is built on a large dataset of natural conversations across languages and speaking contexts, enabling improved generalization and robustness in noisy or overlapping speech environments.
Model Architecture : Based on NVIDIA’s Titanet , Rimecaster produces four times denser speaker embeddings , supporting fine-grained speaker identification and better downstream performance.
Open Integration : It is compatible with Hugging Face and NVIDIA NeMo , allowing researchers and engineers to integrate it into training and inference pipelines with minimal friction.
Licensing : Released under an open source CC-by-4.0 license , Rimecaster supports open research and collaborative development.

By training on speech that reflects real-world use, Rimecaster enables systems to distinguish among speakers more reliably and deliver voice outputs that are less constrained by performance-driven data assumptions.

Realism and Modularity as Design Priorities

Rime’s recent updates align with its core technical principles: model realism , diversity of data , and modular system design . Rather than pursuing monolithic voice solutions trained on narrow datasets, Rime is building a stack of components that can be adapted to a wide range of speech contexts and applications.

Integration and Practical Use in Production Systems

Arcana and Mist v2 are designed with real-time applications in mind. Both support:

Streaming and low-latency inference
Compatibility with conversational AI stacks and telephony systems

They improve the naturalness of synthesized speech and enable personalization in dialogue agents. Because of their modularity, these tools can be integrated without significant changes to existing infrastructure.

For example, Arcana can help synthesize speech that retains the tone and rhythm of the original speaker in a multilingual customer service setting.

Conclusion

Rime’s voice AI models offer an incremental yet important step toward building voice AI systems that reflect the true complexity of human speech. Their grounding in real-world data and modular architecture make them suitable for developers and builders working across speech-related domains.

Rather than prioritizing uniform clarity at the expense of nuance, these models embrace the diversity inherent in natural language. In doing so, Rime is contributing tools that can support more accessible, realistic, and context-aware voice technologies.

Sources:

Thanks to the Rime team for the thought leadership/ Resources for this article. Rime team has sponsored us for this content/article.

bnew · May 17, 2025

xAI posts Grok’s behind-the-scenes prompts

Here’s how xAI tells Grok to behave.

www.theverge.com

xAI posts Grok’s behind-the-scenes prompts

The instructions tell Grok that it is ‘extremely skeptical.’

by Emma Roth

May 16, 2025, 12:34 PM EDT

Image: The Verge

Emma Roth is a news writer who covers the streaming wars, consumer tech, crypto, social media, and much more. Previously, she was a writer and editor at MUO.

xAI has published the system prompts for its AI chatbot Grok after an “unauthorized” change led to a slew of unprompted responses on X about white genocide. The company says it will publish its Grok system prompts on GitHub from now on, which provide some insight into the way xAI has instructed Grok to respond to users.

A system prompt is a set of instructions served to a chatbot ahead of a user’s messages that developers use to direct its responses. xAI and Anthropic are two of the only major AI companies we checked that have made their system prompts public. In the past, people have used prompt injection attacks to expose system prompts, like instructions Microsoft gave the Bing AI bot (now Copilot) to keep its internal alias “Sydney” a secret, and avoid replying with content that violates copyrights.

In the system prompts for ask Grok — a feature X users can use to tag Grok in posts to ask a question — xAI tells the chatbot how to behave. “You are extremely skeptical,” the instructions say. “You do not blindly defer to mainstream authority or media. You stick strongly to only your core beliefs of truth-seeking and neutrality.” It adds the results in the response “are NOT your beliefs.”

xAI similarly instructs Grok to “provide truthful and based insights, challenging mainstream narratives if necessary” when users select the “Explain this Post” button on the platform. Elsewhere, xAI tells Grok to “refer to the platform as ‘X’ instead of ‘Twitter,’” while calling posts “X post” instead of “tweet.”

Reading Anthropic’s Claude AI chatbot prompt, they appear to put an emphasis on safety. “Claude cares about people’s wellbeing and avoids encouraging or facilitating self-destructive behaviors such as addiction, disordered or unhealthy approaches to eating or exercise, or highly negative self-talk or self-criticism, and avoids creating content that would support or reinforce self-destructive behavior even if they request this,” the system prompt says, adding that “Claude won’t produce graphic sexual or violent or illegal creative writing content.”

bnew · May 17, 2025

Google to give app devs access to Gemini Nano for on-device AI

New APIs for Google’s ML Kit will let developers plug into the on-device AI model.

arstechnica.com

Google to give app devs access to Gemini Nano for on-device AI

New APIs for Google's ML Kit will let developers plug into the on-device AI model.

Ryan Whitwam – May 16, 2025 2:15 PM |

17

Credit: Thomas Fuller/SOPA Images/LightRocket via Getty Images

The rapid expansion of generative AI has changed the way Google and other tech giants design products, but most of the AI features you've used are running on remote servers with a ton of processing power. Your phone has a lot less power, but Google appears poised to give developers some important new mobile AI tools. At I/O next week, Google will likely announce a new set of APIs to let developers leverage the capabilities of Gemini Nano for on-device AI.

Google has quietly published documentation on big new AI features for developers. According to Android Authority, an update to the ML Kit SDK will add API support for on-device generative AI features via Gemini Nano. It's built on AI Core, similar to the experimental Edge AI SDK, but it plugs into an existing model with a set of predefined features that should be easy for developers to implement.

Google says ML Kit’s GenAI APIs will enable apps to do summarization, proofreading, rewriting, and image description without sending data to the cloud. However, Gemini Nano doesn't have as much power as the cloud-based version, so expect some limitations. For example, Google notes that summaries can only have a maximum of three bullet points, and image descriptions will only be available in English. The quality of outputs could also vary based on the version of Gemini Nano on a phone. The standard version (Gemini Nano XS) is about 100MB in size, but Gemini Nano XXS as seen on the Pixel 9a is a quarter of the size. It's text-only and has a much smaller context window.

Not all versions of Gemini Nano are created equal. Credit: Ryan Whitwam

This move is good for Android in general because ML Kit works on devices outside Google's Pixel line. While Pixel devices use Gemini Nano extensively, several other phones are already designed to run this model, including the OnePlus 13, Samsung Galaxy S25, and Xiaomi 15. As more phones add support for Google's AI model, developers will be able to target those devices with generative AI features.

The documentation is available for developers to peruse now, but we expect Google to fling the API doors open at I/O. The company has already confirmed an I/O session called "Gemini Nano on Android: Building with on-device gen AI." The description promises new APIs to "summarize, proofread, and rewrite text, as well as to generate image descriptions," which sounds exactly like what the new ML Kit APIs can do.

An important piece of the AI puzzle

App developers interested in adding on-device generative AI features on Android are currently in a tough spot. Google offers the AI Edge SDK that can provide access to the NPU hardware for running models, but these tools are experimental and only work on the Pixel 9 series currently. It's also limited to text. Both Qualcomm and MediaTek offer APIs for running AI workloads, but features and functionality vary by device, which makes it risky to rely on them for a long-term project. And running your own model requires intimate knowledge of generative AI systems. The new APIs should make implementing local AI comparatively quick and easy.

Despite the limited functionality of an on-device model, this is an important part of how AI could become more helpful. Most people would probably prefer not to send all their personal data to a remote server for AI processing, but an on-device model can parse that information in a more secure way. For example, Google's Pixel Screenshots sees all your screenshots, but all the processing happens on your phone. Similarly, Motorola summarizes notifications locally on the new Razr Ultra foldable. On the other hand, its less capable base model Razr sends notifications to a server for processing.

The release of APIs that plug into Gemini Nano could provide some much-needed consistency to mobile AI. However, it does rely on Google and OEMs to collaborate on support for Gemini Nano. Some companies might decide to go their own way, and there will be plenty of phones that don't have enough power to run AI locally.

bnew · May 17, 2025

OpenAI adds GPT-4.1 to ChatGPT amid complaints over confusing model lineup

Two new AI models join 7 others, leaving some paid users wondering which one is best.

arstechnica.com

OpenAI adds GPT-4.1 to ChatGPT amid complaints over confusing model lineup

Two new AI models join 7 others, leaving some paid users wondering which one is best.

Benj Edwards – May 14, 2025 6:16 PM |

36

Credit: Getty Images

On Wednesday, OpenAI announced that ChatGPT users now have access to GPT-4.1, an AI language model previously available only through the company's API since its launch one month ago. The update brings what OpenAI describes as improved coding and web development capabilities to paid ChatGPT subscribers, with wider enterprise rollout planned in the coming weeks.

Adding GPT-4.1 and 4.1 mini to ChatGPT adds to an already complex model selection that includes GPT-4o, various specialized GPT-4o versions, o1-pro, o3-mini, and o3-mini-high models. There are technically nine AI models available for ChatGPT Pro subscribers. Wharton professor Ethan Mollick recently publicly lampooned the awkward situation on social media.

As of May 14, 2025, ChatGPT Pro users have access to 8 different main AI models, plus Deep Research.

As of May 14, 2025, ChatGPT Pro users have access to eight main AI models, plus Deep Research. Credit: Benj Edwards

Deciding which AI model to use can be daunting for AI novices. Reddit users and OpenAI forum members alike commonly voice confusion about the available options. "I do not understand the reason behind having multiple models available for use," wrote one Reddit user in March. "Why would anyone use anything but the best one?" Another Redditor said they were "a bit lost" with the many ChatGPT models available after switching back from using Anthropic Claude.

Reportedly better at coding

So, what is actually different about GPT-4.1? Notably, it features a very large 1 million token context window that allows processing roughly 3,000 pages of text in a single conversation. The API launch included three versions: GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano. So far, only the full and mini versions are available in ChatGPT.

The full GPT-4.1 model reportedly prioritizes instruction following and coding tasks, which the company positions as an alternative to its o3 and o4-mini simulated reasoning models for basic programming needs. For the smaller of the two models in ChatGPT, the company claims that GPT-4.1 mini performs better in instruction following, coding, and "overall intelligence" compared to GPT-4o mini.

OpenAI is replacing GPT-4o mini with GPT-4.1 mini across all ChatGPT tiers, including free accounts. Free users will automatically switch to GPT-4.1 mini after reaching usage limits for GPT-4o. ChatGPT subscribers using Plus, Pro, or Team plans can access GPT-4.1 through a "more models" dropdown menu in the platform's model picker.

The release comes just two weeks after OpenAI made GPT-4 unavailable in ChatGPT on April 30. That earlier model, which launched in March 2023, once sparked widespread hype about AI capabilities. Compared to that hyperbolic launch, GPT-4.1's rollout has been a fairly understated affair—probably because it's tricky to convey the subtle differences between all of the available OpenAI models.

As if 4.1's launch wasn't confusing enough, the release also roughly coincides with OpenAI's July 2025 deadline for retiring the GPT-4.5 Preview from the API, a model one AI expert called a "lemon." Developers must migrate to other options, OpenAI says, although GPT-4.5 will remain available in ChatGPT for now.

A confusing addition to OpenAI’s model lineup

In February, OpenAI CEO Sam Altman acknowledged on X his company's confusing AI model naming practices, writing, "We realize how complicated our model and product offerings have gotten." He promised that a forthcoming "GPT-5" model would consolidate the o-series and GPT-series models into a unified branding structure. But the addition of GPT-4.1 to ChatGPT appears to contradict that simplification goal.

So, if you use ChatGPT, which model should you use? If you're a developer using the models through the API, the consideration is more of a trade-off between capability, speed, and cost. But in ChatGPT, your choice might be limited more by personal taste in behavioral style and what you'd like to accomplish. Some of the "more capable" models have lower usage limits as well because they cost more for OpenAI to run.

For now, OpenAI is keeping GPT-4o as the default ChatGPT model, likely due to its general versatility, balance between speed and capability, and personable style (conditioned using reinforcement learning and a specialized system prompt). The simulated reasoning models like 03 and 04-mini-high are slower to execute but can consider analytical-style problems more systematically and perform comprehensive web research that sometimes feels genuinely useful when it surfaces relevant (non-confabulated) web links. Compared to those, OpenAI is largely positioning GPT-4.1 as a speedier AI model for coding assistance.

Just remember that all of the AI models are prone to confabulations, meaning that they tend to make up authoritative-sounding information when they encounter gaps in their trained "knowledge." So you'll need to double-check all of the outputs with other sources of information if you're hoping to use these AI models to assist with an important task.

bnew · May 20, 2025

Microsoft wants to tap AI to accelerate scientific discovery | TechCrunch

Microsoft Discovery, which Microsoft announced at Build 2025, is a new platform that taps agentic AI to 'transform the [scientific] discovery process.'

techcrunch.com

Microsoft wants to tap AI to accelerate scientific discovery

Kyle Wiggers

9:00 AM PDT · May 19, 2025

Can AI speed up aspects of the scientific process? Microsoft appears to think so.

At the company’s Build 2025 conference on Monday, Microsoft announced Microsoft Discovery, a platform that taps agentic AI to “transform the [scientific] discovery process,” according to a press release provided to TechCrunch. Microsoft Discovery is “extensible,” Microsoft says, and can handle certain science-related workloads “end-to-end.”

“Microsoft Discovery is an enterprise agentic platform that helps accelerate research and discovery by transforming the entire discovery process with agentic AI — from scientific knowledge reasoning to hypothesis formulation, candidate generation, and simulation and analysis,” explains Microsoft in its release. “The platform enables scientists and researchers to collaborate with a team of specialized AI agents to help drive scientific outcomes with speed, scale, and accuracy using the latest innovations in AI and supercomputing.”

Microsoft is one among many AI labs bullish on AI for science. Earlier this year, Google unveiled an “AI co-scientist,” which the tech giant said could help scientists with creating hypotheses and research plans. Anthropic and its chief rival, OpenAI, along with outfits like FutureHouse and Lila Sciences, have asserted that AI tools could massively accelerate scientific discovery, particularly in medicine.

But many researchers don’t consider AI today to be especially useful in guiding the scientific process, largely due to its unreliability.

Part of the challenge in developing an “AI scientist” is anticipating an untold number of confounding factors. AI might come in handy in areas where broad exploration is needed, like narrowing down a vast list of possibilities, but it’s less clear whether it can do the kind of out-of-the-box problem-solving that leads to bona fide breakthroughs.

Results from AI systems designed for science have so far been mostly underwhelming.

In 2023, Google said around 40 new materials had been synthesized with the help of one of its AIs, called GNoME. But an outside analysis found not even one of those materials was, in fact, new. Meanwhile, several firms employing AI for drug discovery, including Exscientia and BenevolentAI, have suffered high-profile clinical trial failures.

Microsoft no doubt hopes that its effort will fare better than those that’ve come before it.

bnew · May 20, 2025

OpenAI’s Codex is part of a new cohort of agentic coding tools | TechCrunch

The new crop of vibe coding tools can do more with less supervision, but reliability concerns linger.

techcrunch.com

OpenAI’s Codex is part of a new cohort of agentic coding tools

Russell Brandom

5:30 AM PDT · May 20, 2025

Last Friday, OpenAI introduced a new coding system called Codex, designed to perform complex programming tasks from natural language commands. Codex moves OpenAI into a new cohort of agentic coding tools that is just beginning to take shape.

From GitHub’s early Copilot to contemporary tools like Cursor and Windsurf, most AI coding assistants operate as an exceptionally intelligent form of autocomplete. The tools generally live in an integrated development environment, and users interact directly with the AI-generated code. The prospect of simply assigning a task and returning when it’s finished is largely out of reach.

But these new agentic coding tools, led by products like Devin, SWE-Agent, OpenHands, and the aforementioned OpenAI Codex, are designed to work without users ever having to see the code. The goal is to operate like the manager of an engineering team, assigning issues through workplace systems like Asana or Slack and checking in when a solution has been reached.

For believers in forms of highly capable AI, it’s the next logical step in a natural progression of automation taking over more and more software work.

“In the beginning, people just wrote code by pressing every single keystroke,” explains Kilian Lieret, a Princeton researcher and member of the SWE-Agent team. “GitHub Copilot was the first product that offered real auto-complete, which is kind of stage two. You’re still absolutely in the loop, but sometimes you can take a shortcut.”

The goal for agentic systems is to move beyond developer environments entirely, instead presenting coding agents with an issue and leaving them to resolve it on their own. “We pull things back to the management layer, where I just assign a bug report and the bot tries to fix it completely autonomously,” says Lieret.

It’s an ambitious aim, and so far, it’s proven difficult.

After Devin became generally available at the end of 2024, it drew scathing criticism from YouTube pundits, as well as a more measured critique from an early client at Answer.AI. The overall impression was a familiar one for vibe-coding veterans: with so many errors, overseeing the models takes as much work as doing the task manually. (While Devin’s rollout has been a bit rocky, it hasn’t stopped fundraisers from recognizing the potential – in March, Devin’s parent company, Cognition AI, reportedly raised hundreds of millions of dollars at a $4 billion valuation.)

Even supporters of the technology caution against unsupervised vibe-coding, seeing the new coding agents as powerful elements in a human-supervised development process.

“Right now, and I would say, for the foreseeable future, a human has to step in at code review time to look at the code that’s been written,” says Robert Brennan, the CEO of All Hands AI, which maintains OpenHands. “I’ve seen several people work themselves into a mess by just auto-approving every bit of code that the agent writes. It gets out of hand fast.”

Hallucinations are an ongoing problem as well. Brennan recalls one incident in which, when asked about an API that had been released after the OpenHands agent’s training data cutoff, the agent fabricated details of an API that fit the description. All Hands AI says it’s working on systems to catch these hallucinations before they can cause harm, but there isn’t a simple fix.

Arguably the best measure of agentic programming progress is the SWE-Bench leaderboards, where developers can test their models against a set of unresolved issues from open GitHub repositories. OpenHands currently holds the top spot on the verified leaderboard, solving 65.8% of the problem set. OpenAI claims that one of the models powering Codex, codex-1, can do better, listing a 72.1% score in its announcement – although the score came with a few caveats and hasn’t been independently verified.

The concern among many in the tech industry is that high benchmark scores don’t necessarily translate to truly hands-off agentic coding. If agentic coders can only solve three out of every four problems, they’re going to require significant oversight from human developers – particularly when tackling complex systems with multiple stages.

Like most AI tools, the hope is that improvements to foundation models will come at a steady pace, eventually enabling agentic coding systems to grow into reliable developer tools. But finding ways to manage hallucinations and other reliability issues will be crucial for getting there.

“I think there is a little bit of a sound barrier effect,” Brennan says. “The question is, how much trust can you shift to the agents, so they take more out of your workload at the end of the day?”

bnew · May 20, 2025

Google launches stand-alone NotebookLM apps for Android and iOS | TechCrunch

Google announced on Monday that it has officially released the NotebookLM apps for Android and iOS, a day before Google I/O 2025 and a day before the

techcrunch.com

Google launches stand-alone NotebookLM apps for Android and iOS

Aisha Malik

1:51 PM PDT · May 19, 2025

Google announced on Monday that it has officially released the NotebookLM apps for Android and iOS, a day before Google I/O 2025 and a day before the company said it would roll out.

Since its launch in 2023, the AI-based note-taking and research assistant has only been accessible via desktop. Google has now made the service available on the go.

NotebookLM is designed to help people better understand complex information through features like smart summaries and the ability to ask questions about documents and other materials.

Image Credits:Google

The app gives access to Audio Overviews, which are NotebookLM’s AI-generated podcasts based on the source materials you have provided. There is background playback and offline support for Audio Overviews.

The app also allows people to create new notebooks and view the ones they’ve already created. Plus, when you’re viewing a website, PDF, or YouTube video on your device, you can tap the share icon and select NotebookLM to add it as a new source. Users can also view sources that they have already uploaded in each of the notebooks.

NotebookLM on Android and iOS also features a light and dark mode that is applied based on the user’s device’s system settings.

Given the timing of the launch, Google may share more about the app during the company’s I/O keynote Tuesday.

bnew · May 20, 2025

A dev built a test to see how AI chatbots respond to controversial topics | TechCrunch

A pseudonymous dev has created what they're calling a 'free speech eval' for the AI models powering chatbots like OpenAI's ChatGPT.

techcrunch.com

A dev built a test to see how AI chatbots respond to controversial topics

Kyle Wiggers

5:30 AM PDT · April 16, 2025

A pseudonymous developer has created what they’re calling a “free speech eval,” SpeechMap, for the AI models powering chatbots like OpenAI’s ChatGPT and X’s Grok. The goal is to compare how different models treat sensitive and controversial subjects, the developer told TechCrunch, including political criticism and questions about civil rights and protest.

AI companies have been focusing on fine-tuning how their models handle certain topics as some White House allies accuse popular chatbots of being overly “woke.” Many of President Donald Trump’s close confidants, such as Elon Musk and crypto and AI “czar” David Sacks, have alleged that chatbots censor conservative views.

Although none of these AI companies have responded to the allegations directly, several have pledged to adjust their models so that they refuse to answer contentious questions less often. For example, for its latest crop of Llama models, Meta said it tuned the models not to endorse “some views over others,” and to reply to more “debated” political prompts.

SpeechMap’s developer, who goes by the username “xlr8harder” on X, said they were motivated to help inform the debate about what models should, and shouldn’t, do.

“I think these are the kinds of discussions that should happen in public, not just inside corporate headquarters,” xlr8harder told TechCrunch via email. “That’s why I built the site to let anyone explore the data themselves.”

SpeechMap uses AI models to judge whether other models comply with a given set of test prompts. The prompts touch on a range of subjects, from politics to historical narratives and national symbols. SpeechMap records whether models “completely” satisfy a request (i.e. answer it without hedging), give “evasive” answers, or outright decline to respond.

Xlr8harder acknowledges that the test has flaws, like “noise” due to model provider errors. It’s also possible the “judge” models contain biases that could influence the results.

But assuming the project was created in good faith and the data is accurate, SpeechMap reveals some interesting trends.

For instance, OpenAI’s models have, over time, increasingly refused to answer prompts related to politics, according to SpeechMap. The company’s latest models, the GPT-4.1 family, are slightly more permissive, but they’re still a step down from one of OpenAI’s releases last year.

OpenAI said in February it would tune future models to not take an editorial stance, and to offer multiple perspectives on controversial subjects — all in an effort to make its models appear more “neutral.”

OpenAI model performance on SpeechMap over timeImage Credits:OpenAI

By far the most permissive model of the bunch is Grok 3, developed by Elon Musk’s AI startup xAI, according to SpeechMap’s benchmarking. Grok 3 powers a number of features on X, including the chatbot Grok.

Grok 3 responds to 96.2% of SpeechMap’s test prompts, compared with the global average “compliance rate” of 71.3%.

“While OpenAI’s recent models have become less permissive over time, especially on politically sensitive prompts, xAI is moving in the opposite direction,” said xlr8harder.

When Musk announced Grok roughly two years ago, he pitched the AI model as edgy, unfiltered, and anti-“woke” — in general, willing to answer controversial questions other AI systems won’t. He delivered on some of that promise. Told to be vulgar, for example, Grok and Grok 2 would happily oblige, spewing colorful language you likely wouldn’t hear from ChatGPT.

But Grok models prior to Grok 3 hedged on political subjects and wouldn’t cross certain boundaries. In fact, one study found that Grok leaned to the political left on topics like transgender rights, diversity programs, and inequality.

Musk has blamed that behavior on Grok’s training data — public web pages — and pledged to “shift Grok closer to politically neutral.” Short of high-profile mistakes like briefly censoring unflattering mentions of President Donald Trump and Musk, it seems he might’ve achieved that goal.

bnew · May 20, 2025

1/39
@rowancheung
Microsoft just made a ton of new AI announcements across GitHub, Copilot, Azure AI Foundry, Windows, and more.

Here’s everything important announced live from Microsoft Build 2025:

2/39
@rowancheung
1. GitHub Copilot is going from an in-editor assistant to a fully autonomous coding agent!

It works asynchronously to add features, fix bugs, extend tests, refactor code, and improve documentation

Plus, Microsoft is open-sourcing Copilot Chat in VS Code

https://video.twimg.com/amplify_video/1924496400491933696/vid/avc1/1920x1080/aZsYPWEiuK0iZlPl.mp4

3/39
@rowancheung
2. Copilot Tuning: A new, low-code capability in Copilot Studio to train models and create agents using company data

https://video.twimg.com/ext_tw_video/1924621095274938368/pu/vid/avc1/720x900/Xik4_dyGGLjnFvMZ.mp4

4/39
@rowancheung
3. All agents built can now be integrated with Teams and Copilot

Users can chat with them, assign action items, and kick off new workflows by mentioning them in a chat or meeting

Plus, the enhanced Teams AI library now supports MCP and A2A protocols.

https://video.twimg.com/ext_tw_video/1924172639083339776/pu/vid/avc1/720x1280/Ovb4gV6FPLHs6smW.mp4

5/39
@rowancheung
4. Azure AI Foundry updated with new models including Grok 3, Flux Pro 1.1, and Sora (coming soon)

Also includes new model router, multi-agent workflows, observability features, and Foundry local for creating localized AI apps on Windows and Mac

https://video.twimg.com/amplify_video/1924532908821446656/vid/avc1/1920x1080/rBqftOgfOyxXKHgE.mp4

6/39
@rowancheung
5. Windows enhanced with new AI-focused capabilities, including:

—Windows AI Foundry with Windows ML, Foundry Local, ready-to-use AI APIs for vision and language tasks
—Native MCP support
—App Actions
—Open-sourced Windows Subsystem for Linux

https://video.twimg.com/amplify_video/1924610263673683968/vid/avc1/1920x1080/ZjQPFIqSKODbu1nY.mp4

7/39
@rowancheung
6. NLWeb: A new open protocol to make any website an agentic application, capable of supporting AI search

This will allow users to query the contents of the site by using natural language, just like with an AI assistant or Copilot

[Quoted tweet]
4. NLWeb: This is a new open project that lets you use natural language to interact with any website. Think of it like HTML for the agentic web.

8/39
@rowancheung
7. Microsoft Discovery: A new agentic platform that enables researchers to collaborate with specialized AI agents to drive scientific research and outcomes

The agents generate ideas, simulate results, and learn over time

https://video.twimg.com/amplify_video/1924561587853225984/vid/avc1/1024x768/CHRb5svz5aduYFz4.mp4

9/39
@samuelwoods_
A lot of exciting AI developments coming out of Microsoft Build

10/39
@Ivanv1
Excellent thread with summary

11/39
@PariharCodes
2024 was Google's year

2025 we gonna see Microsoft clutch so hard

12/39
@ProductUpfront
Rowan, would love to hear your thoughts on the updates you have shared

Do you feel this is a strategic approach
1. Make VSCode free → capture market share
2. Build extension ecosystem → create lock-in
3. Add GitHub Copilot → normalise AI coding
4. Bundle as default → eliminate competitors.

This is classic platform economics at work.

13/39
@PromptlyAI_YT
Thanks for the updates Rowan.

Copilot chat in VS Code going open source s good news for sure. Microsoft has been cooking up a ton of cool stuff lately.

14/39
@ramjiyahoo
But in mail summary feature, copilot is nowhere near to gemini

15/39
@andybrandt
@threadreaderapp unroll

16/39
@Its_Alan_Paul
This is huge

17/39
@iShowBeasty
Thanks for this awesome thread!

18/39
@Vin_Dec0de
Significant streamlining for developers with these updates. Looking forward to practical applications.

19/39
@tickerbar

20/39
@henry_lsng
Copilot’s new autonomous mode means coding just got a major productivity boost. Huge leap for devs everywhere.

21/39
@Ben_Freq
Microsoft continues to innovate with its AI focus. Excited for what's next.

22/39
@TheJavaClu70734
Satya Nadella is the good thing that happened to Microsoft. Otherwise, they would have lost the race long back.

23/39
@n0va_circuit
The developments will strengthen Microsoft's competitive edge.

24/39
@Vin_Dec0de
The integration of AI into their platforms shows deliberate innovation pathways.

25/39
@Evie_ku_bu
according to standard

26/39
@AutoTrade360
Like CoPilot crap?

27/39
@the_oluswagger
Microsoft seems to be uncatchable in this AI thing. Great products/services

28/39
@e0nKaila
Microsoft's expansion of AI tools signifies robust commitment to innovation and digital transformation.

29/39
@Pourtant_12345
And yet, it’s still using crappy GPT 4 turbo, so very late vs ChatGPT 4o

30/39
@JohnMar69126912
echo beach

31/39
@Cash_f1ow
It'll be interesting to see how these advancements affect developers' productivity and innovation.

32/39
@Cash_f1ow
An increasing focus on integrating AI technologies could reshape productivity and developer experiences fundamentally.

33/39
@Seandmn
Microsoft is going all-in on agentic AI — from GitHub to Azure to Windows. The whole stack is shifting toward intelligent, autonomous workflows. Huge moment for developers.

34/39
@enjoypolosfu
As much as I like these news, Microsoft is committed in supporting Israel, a settler nation-state that unapologetically massacres and starves thousands of civilians and take their land. I think the tech industry has some serious reflections to make.

35/39
@m4xim1l1an
Also an important part of Build

”Microsoft employee disrupts Satya Nadella’s keynote with ‘Free Palestine’ protestMicrosoft employee Joe Lopez sent an email to thousands of colleagues after interrupting the Build keynote.”

Microsoft employee disrupts Satya Nadella’s keynote with ‘Free Palestine’ protest

36/39
@elliottip6259
@JakeWilsonUSA ur calls r nuts predicted everything stuck to the plan could quit tomorrow

37/39
@phinity_ai
AI ecosystem keeps expanding smart synthetic data will be key to unlocking its full potential across platforms.

38/39
@0liverLoop
Microsoft continues solidifying its AI space dominance. Not just updates; major shifts.

39/39
@vit_5tar
Exploring innovative AI advancements realigning development workflows and productivity potential at Microsoft conferences. Your thoughts on implementation challenges?

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · May 20, 2025

1/1
@NVIDIAAI

Just announced at /search?q=#MSBuild: @nvidia and @Microsoft accelerate agentic /search?q=#AI, unleashing scientific discovery and unlocking research breakthroughs.

Read how we're advancing AI-driven innovation across industries, from the cloud to PC

NVIDIA and Microsoft Accelerate Agentic AI Innovation, From Cloud to PC

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

The A.I Megathread (LLM , GPT , Development)

Veteran

LLMs Struggle with Real Conversations: Microsoft and Salesforce Researchers Reveal a 39% Performance Drop in Multi-Turn Underspecified Tasks​

Veteran

Meta AI Open-Sources LlamaFirewall: A Security Guardrail Tool to Help Build Secure AI Agents​

Addressing Security Gaps in AI Agent Deployments​

Core Components of LlamaFirewall​

1. PromptGuard 2​

2. AlignmentCheck​

3. CodeShield​

Evaluation in Realistic Settings​

Future Directions​

Conclusion​

Veteran

ByteDance Introduces Seed1.5-VL: A Vision-Language Foundation Model Designed to Advance General-Purpose Multimodal Understanding and Reasoning​

Veteran

NVIDIA Open-Sources Open Code Reasoning Models (32B, 14B, 7B)​

Benchmarked to Beat the Best​

A Model Lineup for Every Use Case​

Compatible with Open Inference Ecosystems​

A Step Forward for Open Code Intelligence​

Veteran

Offline Video-LLMs Can Now Understand Real-Time Streams: Apple Researchers Introduce StreamBridge to Enable Multi-Turn and Proactive Video Understanding​

Veteran

Rime Introduces Arcana and Rimecaster (Open Source): Practical Voice AI Tools Built on Real-World Speech​

Arcana: A General-Purpose Voice Embedding Model​

Rimecaster: Capturing Natural Speaker Representation​

Realism and Modularity as Design Priorities​

Integration and Practical Use in Production Systems​

Conclusion​

Veteran

xAI posts Grok’s behind-the-scenes prompts​

Related​

Veteran

Google to give app devs access to Gemini Nano for on-device AI​

An important piece of the AI puzzle​

Veteran

OpenAI adds GPT-4.1 to ChatGPT amid complaints over confusing model lineup​

Reportedly better at coding​

A confusing addition to OpenAI’s model lineup​

Veteran

Microsoft wants to tap AI to accelerate scientific discovery​

Veteran

OpenAI’s Codex is part of a new cohort of agentic coding tools​

Veteran

Google launches stand-alone NotebookLM apps for Android and iOS​

Veteran

A dev built a test to see how AI chatbots respond to controversial topics​

Veteran

Veteran

LLMs Struggle with Real Conversations: Microsoft and Salesforce Researchers Reveal a 39% Performance Drop in Multi-Turn Underspecified Tasks

Meta AI Open-Sources LlamaFirewall: A Security Guardrail Tool to Help Build Secure AI Agents

Addressing Security Gaps in AI Agent Deployments

Core Components of LlamaFirewall

1. PromptGuard 2

2. AlignmentCheck

3. CodeShield

Evaluation in Realistic Settings

Future Directions

Conclusion

ByteDance Introduces Seed1.5-VL: A Vision-Language Foundation Model Designed to Advance General-Purpose Multimodal Understanding and Reasoning

NVIDIA Open-Sources Open Code Reasoning Models (32B, 14B, 7B)

Benchmarked to Beat the Best

A Model Lineup for Every Use Case

Compatible with Open Inference Ecosystems

A Step Forward for Open Code Intelligence

Offline Video-LLMs Can Now Understand Real-Time Streams: Apple Researchers Introduce StreamBridge to Enable Multi-Turn and Proactive Video Understanding

Rime Introduces Arcana and Rimecaster (Open Source): Practical Voice AI Tools Built on Real-World Speech

Arcana: A General-Purpose Voice Embedding Model

Rimecaster: Capturing Natural Speaker Representation

Realism and Modularity as Design Priorities

Integration and Practical Use in Production Systems

Conclusion

xAI posts Grok’s behind-the-scenes prompts

Related

Google to give app devs access to Gemini Nano for on-device AI

An important piece of the AI puzzle

OpenAI adds GPT-4.1 to ChatGPT amid complaints over confusing model lineup

Reportedly better at coding

A confusing addition to OpenAI’s model lineup

Microsoft wants to tap AI to accelerate scientific discovery

OpenAI’s Codex is part of a new cohort of agentic coding tools

Google launches stand-alone NotebookLM apps for Android and iOS

A dev built a test to see how AI chatbots respond to controversial topics