bnew

Veteran
Joined
Nov 1, 2015
Messages
63,451
Reputation
9,692
Daps
173,295



1/11
@GoogleDeepMind
We’re releasing an updated Gemini 2.5 Pro (I/O edition) to make it even better at coding. 🚀

You can build richer web apps, games, simulations and more - all with one prompt.

In @GeminiApp, here's how it transformed images of nature into code to represent unique patterns 🌱



https://video.twimg.com/amplify_video/1919768928928051200/vid/avc1/1080x1920/taCOcXbyaVFwRWLw.mp4

2/11
@GoogleDeepMind
This latest version of Gemini 2.5 Pro leads on the WebDev Arena Leaderboard - which measures how well an AI can code a compelling web app. 🛠️

It also ranks #1 on @LMArena_ai in Coding.



GqRiF1PWIAAaHN0.jpg


3/11
@GoogleDeepMind
Beyond creating beautiful UIs, these improvements extend to tasks such as code transformation and editing as well as developing complex agents.

Now available to try in @GeminiApp, @Google AI Studio and @GoogleCloud’s /search?q=#VertexAI platform. Find out more → Build rich, interactive web apps with an updated Gemini 2.5 Pro



GqRiYqbW0AE6YtX.jpg


4/11
@koltregaskes
Excellent, will we get the non-preview version at I/O?



5/11
@alialobai1
@jacksharkey11 they are cooking …



6/11
@laoddev
that is wild



7/11
@RaniBaghezza
Very cool



8/11
@burny_tech
Gemini is a gift that I can have 100 simple coding ideas per day and draft simple versions of them all



9/11
@thomasxdijkstra
@cursor_ai when



10/11
@shiels_ai
Unreal 🤯🤯🤯



11/11
@LarryPanozzo
Anthropic rn








1/21
@GeminiApp
We just dropped Gemini 2.5 Pro (I/O edition). It’s our most intelligent model that’s even better at coding.

Now, you can build interactive web apps in Canvas with fewer prompts.

Head to ‎Gemini and select “Canvas” in the prompt bar to try it out, and let us know what you’re building in the comments.



https://video.twimg.com/amplify_video/1919768593987727360/vid/avc1/1920x1080/I7FL20DtXMKELQCF.mp4

2/21
@GeminiApp
Interact with the game from our post here: ‎Gemini - lets use noto emoji font https://fonts.google.com/noto/specimen/Noto+Color+Emoji



3/21
@metadjai
Awesome! ✨



4/21
@accrued_int
it's like they are just showing off now ☺️



5/21
@ComputerMichau
For me 2.5 Pro is still experimental.



6/21
@arulPrak_
AI agentic commerce ecosystem for travel industry



7/21
@sumbios
Sweet



8/21
@AIdenAIStar
I'd say it is a good model. Made myself a Gemini defender game



https://video.twimg.com/amplify_video/1919783171723292672/vid/avc1/1094x720/Y9mPukwagcRIr7fK.mp4

9/21
@car_heroes
ok started trial. Basic Pacman works. Anything else useful so far is blank screen after a couple of updates. It can't figure it out. New Mac, Sequoia 15.3.2 and Chrome Version 136.0.7103.92. I want this to work but I can't waste time on stuff that should work at launch.



10/21
@rand_longevity
this week is really heating up



11/21
@reallyoptimized
@avidseries You got your own edition! It's completely not woke, apparently.



12/21
@A_MacLullich
I could also make other simple clinical webapps to help with workflow. For example, if a patient with /search?q=#delirium is distressed, this screen could help doctors and nurses to assess for causes. Clicking on each box would reveal more details.



GqRxGZVWsAAPHxo.png


13/21
@nurullah_kuus
Seems interesting, i ll give it a shot



14/21
@dom_liu__
I used Gemini 2.5 Pro to create a Dragon game, and it was so much fun! The code generation was fast, complete, and worked perfectly on the first try with no extra tweaks needed. I have a small question: is this new model using gemini-2.5-pro-preview-05-06?



GqRoDN1bUAARTNl.jpg


15/21
@ai_for_success
Why is it showing Experimental?



16/21
@G33K13765260
damn. it fukked my entire code.. ran back to claude :smile:



17/21
@A_MacLullich
Would like to develop a 4AT /search?q=#delirium assessment tool webapp too.

I already have @replit one here: http://www.the4AT.com/trythe4AT - would be nice to have a webapp option for people too.



GqRwK7aX0AAANEA.png


18/21
@davelalande
I am curious about Internet usage. I mainly use X and AI, and I rarely traverse the web anymore. How many new websites are finding success, and is the rest of the world using the web like it's 1999? Will chat models build an app for one-time use with that chat session?



19/21
@arthurSlee
Using this solar system prompt - I initially got an error. However after the fix, it did create the best looking solar system in one prompt.

‎Gemini - Solar System Visualization HTML Page

Nice work. I also like how easy it is to share executing code.



20/21
@AI_Techie_Arun
Wow!!!! Amazing

But what's the I/O edition?



21/21
@JvShah124
Great 😃










1/11
@slow_developer
now this is very interesting...

the new gemini 2.5 pro model seems to have fallen behind in many areas

coding is the only thing it still handles well.

so, does that mean this model was built mainly for coding?



GqVZr68aMAAf8Y_.jpg


2/11
@Shawnryan96
I have not seen any issues in real world use. In fact image reasoning seems better



3/11
@slow_developer
i haven’t tried anything except for the code, but this is a comparison-based chart with the previous version



4/11
@psv2522
it's not fallen behind; the new model is probably a distillation, trained much better for coding.



5/11
@slow_developer
much like what Anthropic did with 3.5 to their next updated version 3.6?



6/11
@sdmat123
That's how tuning works, yes. You can see the same kind of differences in Sonnet 3.7 vs 3.6.

3.7 normal regressed quantitatively on MMLU and ARC even with base level reasoning skills on 3.6. It is regarded as subjectively worse in many domains outside of coding.



7/11
@slow_developer
agree

[Quoted tweet]
much like what Anthropic did with 3.5 to their next updated version 3.6?


8/11
@mhdfaran
It’s interesting how coding is still the highlight here.



9/11
@NFTMentis
Wait - what?

Is this a response to the @OpenAI news re: @windsurf_ai ?



10/11
@K_to_Macro
This shows the weakness of RL



11/11
@humancore_ai
I don’t care. I want one that is a beast at coding, there are plenty of general purpose ones.















1/11
@OfficialLoganK
Gemini 2.5 Pro just got an upgrade & is now even better at coding, with significant gains in front-end web dev, editing, and transformation.

We also fixed a bunch of function calling issues that folks have been reporting, it should now be much more reliable. More details in 🧵



GqRgjC0WgAAJJsC.jpg


2/11
@OfficialLoganK
The new model, "gemini-2.5-pro-preview-05-06" is the direct successor / replacement of the previous version (03-25), if you are using the old model, no change is needed, it should auto route to the new version with the same price and rate limits.

Gemini 2.5 Pro Preview: even better coding performance- Google Developers Blog



3/11
@OfficialLoganK
And don't just take our word for it:

“The updated Gemini 2.5 Pro achieves leading performance on our junior-dev evals. It was the first-ever model that solved one of our evals involving a larger refactor of a request routing backend. It felt like a more senior developer because it was able to make correct judgement calls and choose good abstractions.”

– Silas Alberti, Founding Team, Cognition



4/11
@OfficialLoganK
Developers really like 2.5 Pro:

“We found Gemini 2.5 Pro to be the best frontier model when it comes to "capability over latency" ratio. I look forward to rolling it out on Replit Agent whenever a latency-sensitive task needs to be accomplished with a high degree of reliability.”

– Michele Catasta, President, Replit



5/11
@OfficialLoganK
Super excited to see how everyone uses the new 2.5 Pro model, and I hope you all enjoy a little pre-IO launch : )

The team has been super excited to get this into the hands of everyone so we decided not to wait until IO.



6/11
@JonathanRoseD
Does gemini-2.5-pro-preview-05-06 improve any other aspects other than coding?



7/11
@OfficialLoganK
Mostly coding !



8/11
@devgovz
Ok, what about 2.0 Flash with image generation? When will the experimental period end?



9/11
@OfficialLoganK
Soon!



10/11
@frantzenrichard
Great! How about that image generation?



11/11
@OfficialLoganK
: )









1/11
@demishassabis
Very excited to share the best coding model we’ve ever built! Today we’re launching Gemini 2.5 Pro Preview 'I/O edition' with massively improved coding capabilities. Ranks no.1 on LMArena in Coding and no.1 on the WebDev Arena Leaderboard.

It’s especially good at building interactive web apps - this demo shows how it can be helpful for prototyping ideas. Try it in @GeminiApp, Vertex AI, and AI Studio Google AI Studio

Enjoy the pre-I/O goodies !



https://video.twimg.com/amplify_video/1919778857193816064/vid/avc1/1920x1080/FtMuHzKJiZuaP5Uy.mp4

2/11
@demishassabis
It’s been amazing to see the response to Gemini 2.5 series so far - and we're continuing to rev in response to feedback, so keep it coming !

https://blog.google/products/gemini/gemini-2-5-pro-updates



3/11
@demishassabis
just a casual +147 elo rating improvement... no big deal 😀



GqRyhq_WAAAsZxS.jpg


4/11
@johnseach
Gemini is now the best coding LLM by far. It is excelling at astrophysics code where all other fail. Google is now the AI coding gold standard.



5/11
@WesRothMoney
love it!

I built a full city traffic simulator in under 20 minutes.

here's the timelapse from v1.0 to (almost) done.



https://video.twimg.com/amplify_video/1919886890997841920/vid/avc1/1280x720/neHj9PPTfPxeaU3U.mp4

6/11
@botanium
This is mind blowing 🤯



7/11
@_philschmid
Lets go 🚀



8/11
@A_MacLullich
Excited to try this - will be interesting to compare with others? Any special use cases?



9/11
@ApollonVisual
congrats on the update. I feel that coding-focused LLMs will accelerate progress exponentially



10/11
@JacobColling
Excited to try this in Cursor!



11/11
@SebastianKits
Loving the single-shot quality, but would love to see more work towards half-autonomous agentic usage. E.g when giving a task to plan and execute a larger MVP, 2.5 pro (and all other models) often do things in a bad order that leads to badly defined styleguides, not very cohesive view specs etc. This is not a problem of 2.5 pro, all models of various providers do this without excessive guidance.




 

bnew

Veteran
Joined
Nov 1, 2015
Messages
63,451
Reputation
9,692
Daps
173,295


1/4
@LechMazur
Gemini 2.5 Pro Preview (05-06) scores 42.5, compared to 54.1 for Gemini 2.5 Pro Exp (03-25) on the Extended NYT Connections Benchmark.

More info: GitHub - lechmazur/nyt-connections: Benchmark that evaluates LLMs using 651 NYT Connections puzzles extended with extra trick words



GqWauNMXoAAuKh8.jpg


2/4
@LechMazur
Mistral Medium 3 scores 12.9.



GqXRed-XgAEHCbJ.jpg


3/4
@akatzzzzz
code sloptimized



4/4
@ficlive
Big fan of your benchmarks, can you test 03-25 Preview as well, as that's where the big decline was for us.

[Quoted tweet]
Gemini 2.5 Pro Preview gives good results, but can't quite match the original experimental version.


GqSKqyaWUAAd-x6.jpg








1/3
@HCSolakoglu
Reviewing recent benchmark data for gemini-2.5-pro. Comparing the 05-07 to the 03-25, we see a roughly 4.2% lower Elo score on EQ-Bench 3 and about a 4.9% lower score on the Longform Creative Writing benchmark. Interesting shifts.



GqXYy2PW0AArINn.jpg

GqXYzGkXkAA_iiI.jpg


2/3
@HCSolakoglu
Tests & images via: @sam_paech



3/3
@MahawarYas27492
@OfficialLoganK @joshwoodward @DynamicWebPaige











1/7
@ChaseBrowe32432
Ran a few times to verify, seeing degraded performance on my visual physics reasoning benchmark for the new Gemini 2.5 Pro



GqSGaMWXUAAIjCU.png


2/7
@ChaseBrowe32432
cbrower



3/7
@random_wander_
nice benchmark! Gwen and Grok would be interesting.



4/7
@ChaseBrowe32432
Grok still has no API vision, I haven’t got to running Qwen bc I don’t know how to deal with providers being wishy-washy about precision



5/7
@figuret20
Most benchmarks this new version is worse. Check the official benchmark results for this new one vs the old one. This is a downgrade on everything but webdev arena.



6/7
@ChaseBrowe32432
Where do you see official benchmark results? I thought they'd come with the new model card but I can still only see the old model card



7/7
@akatzzzzz
Worst timeline ever is overfitting to code slop and calling it AGI







1/2
@r0ck3t23
Performance Analysis: Gemini 2.5 Pro Preview vs Previous Version

Fascinating benchmark comparison here! The data reveals some interesting trends:

The Preview build (05-06) of Gemini 2.5 Pro shows notable improvements in coding metrics (+5.2% on LiveCodeBench for code generation, +2.5% on Aider Polyglot for code editing) compared to the earlier Experimental build (03-25).

However, there are modest performance decreases across most other domains:
- Math: -3.7% on AIME 2025
- Image understanding: -3.8% on Vibe-Eval
- Science: -1.0% on GPQA diamond
- Visual reasoning: -2.1% on MMMU

This raises interesting questions about optimization trade-offs. While it excels at code-related tasks, has this focus come at the expense of other capabilities? Or is this part of a broader optimization strategy that will eventually see improvements across all domains?



GqVnxClWoAEJ7Mo.jpg


2/2
@Lowkeytyc00n1
That's an Improvement




 

bnew

Veteran
Joined
Nov 1, 2015
Messages
63,451
Reputation
9,692
Daps
173,295

Google Releases 76-Page Whitepaper on AI Agents: A Deep Technical Dive into Agentic RAG, Evaluation Frameworks, and Real-World Architectures​


By Sana Hassan

May 6, 2025

Google has published the second installment in its Agents Companion series—an in-depth 76-page whitepaper aimed at professionals developing advanced AI agent systems. Building on foundational concepts from the first release, this new edition focuses on operationalizing agents at scale, with specific emphasis on agent evaluation, multi-agent collaboration, and the evolution of Retrieval-Augmented Generation (RAG) into more adaptive, intelligent pipelines.

Agentic RAG: From Static Retrieval to Iterative Reasoning


At the center of this release is the evolution of RAG architectures. Traditional RAG pipelines typically involve static queries to vector stores followed by synthesis via large language models. However, this linear approach often fails in multi-perspective or multi-hop information retrieval.

Agentic RAG reframes the process by introducing autonomous retrieval agents that reason iteratively and adjust their behavior based on intermediate results. These agents improve retrieval precision and adaptability through:

  • Context-Aware Query Expansion: Agents reformulate search queries dynamically based on evolving task context.
  • Multi-Step Decomposition: Complex queries are broken into logical subtasks, each addressed in sequence.
  • Adaptive Source Selection: Instead of querying a fixed vector store, agents select optimal sources contextually.
  • Fact Verification: Dedicated evaluator agents validate retrieved content for consistency and grounding before synthesis.

The net result is a more intelligent RAG pipeline, capable of responding to nuanced information needs in high-stakes domains such as healthcare, legal compliance, and financial intelligence.
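To make that loop concrete, here is a minimal sketch of an agentic retrieval loop in Python. The AgenticRAGAgent class, its complete()/search() helpers, and the DONE convention are hypothetical illustrations of the ideas above, not APIs from the whitepaper.

# Minimal sketch of an agentic RAG loop: the agent selects a source,
# verifies retrieved passages, and reformulates the query until satisfied.
# llm and sources are injected dependencies (hypothetical interfaces).
from dataclasses import dataclass

@dataclass
class AgenticRAGAgent:
    llm: object        # any client exposing .complete(prompt) -> str
    sources: dict      # name -> retriever exposing .search(query) -> list[str]
    max_steps: int = 4

    def answer(self, question: str) -> str:
        evidence, query = [], question
        for _ in range(self.max_steps):
            # Adaptive source selection: ask the model which source fits this query.
            source = self.llm.complete(f"Pick one of {list(self.sources)} for: {query}").strip()
            retriever = self.sources.get(source, next(iter(self.sources.values())))
            docs = retriever.search(query)
            # Fact verification: keep only passages an evaluator call judges useful.
            verified = [d for d in docs if "yes" in self.llm.complete(
                f"Does this passage support an answer to '{question}'? yes/no:\n{d}").lower()]
            evidence.extend(verified)
            # Context-aware query expansion: stop, or refine the next sub-query.
            query = self.llm.complete(
                f"Question: {question}\nEvidence so far: {evidence}\n"
                "Reply DONE if the evidence is sufficient, otherwise give the next search query.").strip()
            if query == "DONE":
                break
        return self.llm.complete(f"Answer '{question}' using only this evidence: {evidence}")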

Rigorous Evaluation of Agent Behavior


Evaluating the performance of AI agents requires a distinct methodology from that used for static LLM outputs. Google’s framework separates agent evaluation into three primary dimensions:

  1. Capability Assessment: Benchmarking the agent’s ability to follow instructions, plan, reason, and use tools. Tools like AgentBench, PlanBench, and BFCL are highlighted for this purpose.
  2. Trajectory and Tool Use Analysis: Instead of focusing solely on outcomes, developers are encouraged to trace the agent’s action sequence (trajectory) and compare it to expected behavior using precision, recall, and match-based metrics.
  3. Final Response Evaluation: Evaluation of the agent’s output through autoraters—LLMs acting as evaluators—and human-in-the-loop methods. This ensures that assessments include both objective metrics and human-judged qualities like helpfulness and tone.

This process enables observability across both the reasoning and execution layers of agents, which is critical for production deployments.
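For the trajectory dimension in particular, a simple precision/recall comparison of the executed tool-call sequence against a reference trajectory is enough to illustrate the idea. This is a sketch of the general approach, not the whitepaper's exact metric definitions.

# Sketch of trajectory evaluation: compare the agent's executed tool calls
# against an expected reference trajectory with precision, recall, and
# a strict in-order exact match.
def trajectory_metrics(executed: list[str], expected: list[str]) -> dict:
    executed_set, expected_set = set(executed), set(expected)
    true_positives = len(executed_set & expected_set)
    precision = true_positives / len(executed_set) if executed else 0.0
    recall = true_positives / len(expected_set) if expected else 0.0
    return {
        "precision": precision,              # how many executed steps were expected
        "recall": recall,                    # how many expected steps were executed
        "exact_match": executed == expected, # strict in-order agreement
    }

print(trajectory_metrics(
    executed=["plan", "search_flights", "book"],
    expected=["plan", "search_flights", "check_visa", "book"],
))  # {'precision': 1.0, 'recall': 0.75, 'exact_match': False}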

Scaling to Multi-Agent Architectures


As real-world systems grow in complexity, Google’s whitepaper emphasizes a shift toward multi-agent architectures, where specialized agents collaborate, communicate, and self-correct.

Key benefits include:

  • Modular Reasoning: Tasks are decomposed across planner, retriever, executor, and validator agents.
  • Fault Tolerance: Redundant checks and peer hand-offs increase system reliability.
  • Improved Scalability: Specialized agents can be independently scaled or replaced.

Evaluation strategies adapt accordingly. Developers must track not only final task success but also coordination quality, adherence to delegated plans, and agent utilization efficiency. Trajectory analysis remains the primary lens, extended across multiple agents for system-level evaluation.

Real-World Applications: From Enterprise Automation to Automotive AI


The second half of the whitepaper focuses on real-world implementation patterns:

AgentSpace and NotebookLM Enterprise


Google’s AgentSpace is introduced as an enterprise-grade orchestration and governance platform for agent systems. It supports agent creation, deployment, and monitoring, incorporating Google Cloud’s security and IAM primitives. NotebookLM Enterprise, a research assistant framework, enables contextual summarization, multimodal interaction, and audio-based information synthesis.

Automotive AI Case Study


A highlight of the paper is a fully implemented multi-agent system within a connected vehicle context. Here, agents are designed for specialized tasks—navigation, messaging, media control, and user support—organized using design patterns such as:

  • Hierarchical Orchestration: Central agent routes tasks to domain experts.
  • Diamond Pattern: Responses are refined post-hoc by moderation agents.
  • Peer-to-Peer Handoff: Agents detect misclassification and reroute queries autonomously.
  • Collaborative Synthesis: Responses are merged across agents via a Response Mixer.
  • Adaptive Looping: Agents iteratively refine results until satisfactory outputs are achieved.

This modular design allows automotive systems to balance low-latency, on-device tasks (e.g., climate control) with more resource-intensive, cloud-based reasoning (e.g., restaurant recommendations).
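As a rough illustration of the hierarchical orchestration pattern described above, a central router can dispatch utterances either to on-device specialists (low latency) or to cloud specialists (heavier reasoning). The agent names and the keyword classifier below are illustrative stand-ins, not the paper's implementation.

# Sketch of hierarchical orchestration in a connected-vehicle setting:
# a central agent classifies the request and routes it to a specialist.
ON_DEVICE_AGENTS = {
    "climate": lambda text: f"[on-device] adjusting climate: {text}",
    "media":   lambda text: f"[on-device] controlling media: {text}",
}
CLOUD_AGENTS = {
    "navigation":  lambda text: f"[cloud] planning route for: {text}",
    "restaurants": lambda text: f"[cloud] recommending restaurants for: {text}",
}

def classify(text: str) -> str:
    # Stand-in for the orchestrator LLM's intent classification.
    for intent in list(ON_DEVICE_AGENTS) + list(CLOUD_AGENTS):
        if intent in text.lower():
            return intent
    return "navigation"  # default domain expert

def route(text: str) -> str:
    intent = classify(text)
    handler = ON_DEVICE_AGENTS.get(intent) or CLOUD_AGENTS[intent]
    return handler(text)

print(route("turn the climate down to 20 degrees"))
print(route("find good restaurants near the next charging stop"))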




Check out the Full Guide here.


 
Last edited:

bnew

Veteran
Joined
Nov 1, 2015
Messages
63,451
Reputation
9,692
Daps
173,295

Google Launches Gemini 2.5 Pro I/O: Outperforms GPT-4 in Coding, Supports Native Video Understanding and Leads WebDev Arena​


By Asif Razzaq

May 7, 2025

Just ahead of its annual I/O developer conference, Google has released an early preview of Gemini 2.5 Pro (I/O Edition)—a substantial update to its flagship AI model focused on software development and multimodal reasoning and understanding. This latest version delivers marked improvements in coding accuracy, web application generation, and video-based understanding, placing it at the forefront of large model evaluation leaderboards.

With top rankings in LM Arena’s WebDev and Coding categories, Gemini 2.5 Pro I/O emerges as a serious contender in applied AI programming assistance and multimodal intelligence.

Leading in Web App Development: Top of WebDev Arena


The I/O Edition distinguishes itself in frontend software development, achieving the top spot on the WebDev Arena leaderboard—a benchmark based on human evaluation of generated web applications. Compared to its predecessor, the model improves by +147 Elo points, underscoring meaningful progress in quality and consistency.

Key capabilities include:

  • End-to-End Frontend Generation
    Gemini 2.5 Pro I/O generates complete browser-ready applications from a single prompt. Outputs include well-structured HTML, responsive CSS, and functional JavaScript—reducing the need for iterative prompts or post-processing.
  • High-Fidelity UI Generation
    The model interprets structured UI prompts with precision, producing readable and modular code components that are suitable for direct deployment or integration into existing codebases.
  • Consistency Across Modalities
    Outputs remain consistent across various frontend tasks, enabling developers to use the model for layout prototyping, styling, and even component-level rendering.

This makes Gemini particularly valuable in streamlining frontend workflows, from mockup to functional prototype.

General Coding Performance: Outpacing GPT-4 and Claude 3.7


Beyond web development, Gemini 2.5 Pro I/O shows strong general-purpose coding capabilities. It now ranks first in LM Arena’s coding benchmark, ahead of competitors such as GPT-4 and Claude 3.7 Sonnet.

Notable enhancements include:

  • Multi-Step Programming Support
    The model can perform chained tasks such as code refactoring, optimization, and cross-language translation with increased accuracy.
  • Improved Tool Use
    Google reports a reduction in tool-calling errors during internal testing—an important milestone for real-time development scenarios where tool invocation is tightly coupled with model output.
  • Structured Instructions via Vertex AI
    In enterprise environments, the model supports structured system instructions, giving teams greater control over execution flow, especially in multi-agent or workflow-based systems.

Together, these improvements make the I/O Edition a more reliable assistant for tasks that go beyond single-function completions—supporting real-world software development practices.

Native Video Understanding and Multimodal Contexts


In a notable leap toward generalist AI, Gemini 2.5 Pro I/O introduces built-in support for video understanding. The model scores 84.8% on the VideoMME benchmark, indicating robust performance in spatial-temporal reasoning tasks.

Key features include:

  • Direct Video-to-Structure Understanding
    Developers can feed video inputs into AI Studio and receive structured outputs—eliminating the need for manual intermediate steps or model switching.
  • Unified Multimodal Context Window
    The model accepts extended, multimodal sequences—text, image, and video—within a single context. This simplifies the development of cross-modal workflows where continuity and memory retention are essential.
  • Application Readiness
    Video understanding is integrated into AI Studio today, with extended capabilities available through Vertex AI, making the model immediately usable for enterprise-facing tools.

This makes Gemini suitable for a range of new use cases, from video content summarization and instructional QA to dynamic UI adaptation based on video feeds.

Deployment and Integration


Gemini 2.5 Pro I/O is now available across key Google platforms:

  • Google AI Studio: For interactive experimentation and rapid prototyping
  • Vertex AI: For enterprise-grade deployment with support for system-level configuration and tool use
  • Gemini App: For general access via natural language interfaces

While the model does not yet support fine-tuning, it accepts prompt-based customization and structured input/output, making it adaptable for task-specific pipelines without retraining.
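Since customization is prompt-based, a task-specific pipeline can be as small as a single call. Below is a minimal sketch, assuming the google-genai Python SDK and the preview model name cited earlier; the parameter names and placeholder key should be verified against the current SDK documentation.

# Minimal sketch (assumes the google-genai Python SDK; verify against current docs).
# Uses prompt-based customization with a system instruction rather than fine-tuning.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

response = client.models.generate_content(
    model="gemini-2.5-pro-preview-05-06",  # preview model name from the announcement
    contents="Refactor this function to be iterative:\n"
             "def fact(n): return 1 if n == 0 else n * fact(n - 1)",
    config=types.GenerateContentConfig(
        system_instruction="You are a senior Python reviewer. Return only code.",
        temperature=0.2,
    ),
)
print(response.text)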

Conclusion


Gemini 2.5 Pro I/O marks a significant step forward in making large language models practically useful for developers and enterprises alike. Its leadership on both WebDev and coding leaderboards, combined with native support for multimodal input, illustrates Google’s growing emphasis on real-world applicability.

Rather than focusing solely on raw language modeling benchmarks, this release prioritizes functional quality—offering developers structured, accurate, and context-aware outputs across a diverse range of tasks. With Gemini 2.5 Pro I/O, Google continues to shape the future of developer-centric AI systems.




Check out the Technical details and Try it here.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
63,451
Reputation
9,692
Daps
173,295

LLMs Can Now Talk in Real-Time with Minimal Latency: Chinese Researchers Release LLaMA-Omni2, a Scalable Modular Speech Language Model​


By Asif Razzaq

May 6, 2025

Researchers at the Institute of Computing Technology, Chinese Academy of Sciences, have introduced LLaMA-Omni2, a family of speech-capable large language models (SpeechLMs) now available on Hugging Face. This research introduces a modular framework that enables real-time spoken dialogue by integrating speech perception and synthesis with language understanding. Unlike earlier cascaded systems, LLaMA-Omni2 operates in an end-to-end pipeline while retaining modular interpretability and low training cost.

Overview of the LLaMA-Omni2 Architecture


LLaMA-Omni2 encompasses models ranging from 0.5B to 14B parameters, each built atop the Qwen2.5-Instruct series. The architecture consists of:

  • Speech Encoder: Utilizes Whisper-large-v3 to transform input speech into token-level acoustic representations.
  • Speech Adapter: Processes encoder outputs using a downsampling layer and a feed-forward network to align with the language model’s input space.
  • Core LLM: The Qwen2.5 models serve as the main reasoning engine.
  • Streaming TTS Decoder: Converts LLM outputs into speech tokens using an autoregressive Transformer and then generates mel spectrograms through a causal flow matching model inspired by CosyVoice2.

A gating mechanism fuses LLM hidden states with textual embeddings before speech synthesis, enhancing contextual fidelity in the generated audio.



Streaming Generation with Read-Write Scheduling


The model adopts a read-write strategy to facilitate streaming output. Specifically, for every R tokens produced by the LLM, W speech tokens are generated. This enables synchronized textual and acoustic generation, minimizing latency without compromising fluency.

Empirical findings suggest that setting R = 3 and W = 10 provides a favorable trade-off between latency (~583 ms), alignment (ASR-WER: 3.26), and perceptual quality (UTMOS: 4.19).
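As a toy illustration of that schedule (not the model's actual decoding code), the loop below interleaves reading R text tokens with writing W speech tokens; generate_speech_tokens is a hypothetical stand-in for the streaming TTS decoder.

# Toy sketch of the R:W read-write schedule: for every R text tokens read
# from the LLM, emit W speech tokens conditioned on the text seen so far.
def stream_speech(text_tokens, generate_speech_tokens, R=3, W=10):
    context, pending = [], 0
    for tok in text_tokens:
        context.append(tok)
        pending += 1
        if pending == R:                                     # read R text tokens...
            yield from generate_speech_tokens(context, n=W)  # ...write W speech tokens
            pending = 0
    if pending:                                              # flush any trailing text
        yield from generate_speech_tokens(context, n=W)

# Example with a dummy TTS stub:
dummy_tts = lambda ctx, n: [f"spk<{len(ctx)}:{j}>" for j in range(n)]
print(list(stream_speech(["Hello", ",", "how", "are", "you", "?"], dummy_tts, R=3, W=2)))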

Training Approach


Despite achieving competitive performance, LLaMA-Omni2 is trained on a relatively compact corpus—200K multi-turn speech-to-speech dialogue samples. These samples are synthesized from instruction-following text datasets (Alpaca, UltraChat), with diverse input voices and a consistent output voice generated using FishSpeech and CosyVoice2 models.

Training is executed in two stages:

  • Stage I: Independently optimizes the speech-to-text and text-to-speech modules.
  • Stage II: Fine-tunes the speech-to-speech generation path, including the gating and autoregressive decoding components.

Benchmark Results


The models are evaluated on spoken question answering and speech instruction following tasks using both speech-to-text (S2T) and speech-to-speech (S2S) modes.

Model               Llama Q (S2S)   Web Q (S2S)   GPT-4o Score   ASR-WER   Latency (ms)
GLM-4-Voice (9B)    50.7            15.9          4.09           3.48      1562.8
LLaMA-Omni (8B)     49.0            23.7          3.52           3.67      346.7
LLaMA-Omni2-7B      60.7            31.3          4.15           3.26      582.9

The performance scales consistently with model size. Notably, LLaMA-Omni2-14B outperforms all baselines across tasks, even with substantially less training data than native SpeechLMs such as GLM-4-Voice.

Component Analyses


  • Gate Fusion Module: Removing the gating mechanism increases ASR-WER and reduces speech quality, confirming its role in aligning textual and contextual signals.
  • TTS Pretraining: Initializing the TTS model from Qwen2.5 and fine-tuning in a streaming setup yields the best performance. Training from scratch fails to converge effectively.
  • Read/Write Strategies: Adjusting the R:W ratio impacts latency and quality. Larger W improves UTMOS but at the cost of response delay.

Additionally, the study demonstrates that multi-turn dialogue data is more effective than single-turn data in training speech interaction capabilities, and that performance plateaus around 200K samples.

Conclusion


LLaMA-Omni2 demonstrates that high-quality, low-latency spoken interaction with LLMs is feasible without the need for extensive pretraining on massive speech corpora. By combining modular architecture with autoregressive streaming synthesis, the system offers a practical pathway for real-time speech applications.




Check out the Paper, Model on Hugging Face and GitHub Page.


 

bnew

Veteran
Joined
Nov 1, 2015
Messages
63,451
Reputation
9,692
Daps
173,295

OpenAI Releases a Strategic Guide for Enterprise AI Adoption: Practical Lessons from the Field​


By Asif Razzaq

May 5, 2025

OpenAI has published a comprehensive 24-page document titled AI in the Enterprise, offering a pragmatic framework for organizations navigating the complexities of large-scale AI deployment. Rather than focusing on abstract theories, the report presents seven implementation strategies based on field-tested insights from collaborations with leading companies including Morgan Stanley, Klarna, Lowe’s, and Mercado Libre.

The document reads less like promotional material and more like an operational guidebook—emphasizing systematic evaluation, infrastructure readiness, and domain-specific integration.

1. Establish a Rigorous Evaluation Process


The first recommendation is to initiate AI adoption through well-defined evaluations (“evals”) that benchmark model performance against targeted use cases. Morgan Stanley applied this approach by assessing language translation, summarization, and knowledge retrieval in financial advisory contexts. The outcome was measurable: improved document access, reduced search latency, and broader AI adoption among advisors.

Evals not only validate models for deployment but also help refine workflows with empirical feedback loops, enhancing both safety and model alignment.

2. Integrate AI at the Product Layer


Rather than treating AI as an auxiliary function, the report stresses embedding it directly into user-facing experiences. For instance, Indeed utilized GPT-4o mini to personalize job matching, supplementing recommendations with contextual “why” statements. This increased user engagement and hiring success rates while maintaining cost-efficiency through fine-tuned, token-optimized models.

The key takeaway: model performance alone is insufficient—impact scales when AI is embedded into product logic and tailored to domain-specific needs.

3. Invest Early to Capture Compounding Returns


Klarna’s early investment in AI yielded substantial gains in operational efficiency. A GPT-powered assistant now handles two-thirds of support chats, reducing resolution times from 11 minutes to 2. The company also reports that 90% of employees are using AI in their workflows, a level of adoption that enables rapid iteration and organizational learning.

This illustrates how early engagement not only improves tooling but accelerates institutional adaptation and compound value capture.

4. Leverage Fine-Tuning for Contextual Precision


Generic models can deliver strong baselines, but domain adaptation often requires customization. Lowe’s achieved notable improvements in product search relevance by fine-tuning GPT models on their internal product data. The result: a 20% increase in tagging accuracy and a 60% improvement in error detection.

OpenAI highlights this approach as a low-latency pathway to achieve brand consistency, domain fluency, and efficiency across content generation and search tasks.

5. Empower Internal Experts, Not Just Technologists


BBVA exemplifies a decentralized AI adoption model by enabling non-technical employees to build custom GPT-based tools. In just five months, over 2,900 internal GPTs were created, addressing legal, compliance, and customer service needs without requiring engineering support.

This bottom-up strategy empowers subject-matter experts to iterate directly on their workflows, yielding more relevant solutions and reducing development cycles.

6. Streamline Developer Workflows with Dedicated Platforms


Engineering bandwidth remains a bottleneck in many organizations. Mercado Libre addressed this by building Verdi, a platform powered by GPT-4o mini, enabling 17,000 developers to prototype and deploy AI applications using natural language interfaces. The system integrates guardrails, APIs, and reusable components—allowing faster, standardized development.

The platform now supports high-value functions such as fraud detection, multilingual translation, and automated content tagging, demonstrating how internal infrastructure can accelerate AI velocity.

7. Automate Deliberately and Systematically


OpenAI emphasizes setting clear automation targets. Internally, they developed an automation platform that integrates with tools like Gmail to draft support responses and trigger actions. This system now handles hundreds of thousands of tasks monthly, reducing manual workload and enhancing responsiveness.

Their broader vision includes Operator, a browser agent capable of autonomously interacting with web-based interfaces to complete multi-step processes—signaling a move toward agent-based, API-free automation.

Final Observations


The report concludes with a central theme: effective AI adoption requires iterative deployment, cross-functional alignment, and a willingness to refine strategies through experimentation. While the examples are enterprise-scale, the core principles—starting with evals, integrating deeply, and customizing with context—are broadly applicable.

Security and data governance are also addressed explicitly. OpenAI reiterates that enterprise data is not used for training, offers SOC 2 and CSA STAR compliance, and provides granular access control for regulated environments.

In an increasingly AI-driven landscape, OpenAI’s guide serves as both a mirror and a map—reflecting current best practices and helping enterprises chart a more structured, sustainable path forward.




Check out the Full Guide here.


 

bnew

Veteran
Joined
Nov 1, 2015
Messages
63,451
Reputation
9,692
Daps
173,295

Is Automated Hallucination Detection in LLMs Feasible? A Theoretical and Empirical Investigation​


By Sana Hassan

May 6, 2025

Recent advancements in LLMs have significantly improved natural language understanding, reasoning, and generation. These models now excel at diverse tasks like mathematical problem-solving and generating contextually appropriate text. However, a persistent challenge remains: LLMs often generate hallucinations—fluent but factually incorrect responses. These hallucinations undermine the reliability of LLMs, especially in high-stakes domains, prompting an urgent need for effective detection mechanisms. While using LLMs to detect hallucinations seems promising, empirical evidence suggests they fall short compared to human judgment and typically require external, annotated feedback to perform better. This raises a fundamental question: Is the task of automated hallucination detection intrinsically difficult, or could it become more feasible as models improve?

Theoretical and empirical studies have sought to answer this. Building on classic learning theory frameworks like Gold-Angluin and recent adaptations to language generation, researchers have analyzed whether reliable and representative generation is achievable under various constraints. Some studies highlight the intrinsic complexity of hallucination detection, linking it to limitations in model architectures, such as transformers’ struggles with function composition at scale. On the empirical side, methods like SelfCheckGPT assess response consistency, while others leverage internal model states and supervised learning to flag hallucinated content. Although supervised approaches using labeled data significantly improve detection, current LLM-based detectors still struggle without robust external guidance. These findings suggest that while progress is being made, fully automated hallucination detection may face inherent theoretical and practical barriers.

Researchers at Yale University present a theoretical framework to assess whether hallucinations in LLM outputs can be detected automatically. Drawing from the Gold-Angluin model for language identification, they show that hallucination detection is equivalent to identifying whether an LLM’s outputs belong to a correct language K. Their key finding is that detection is fundamentally impossible when training uses only correct (positive) examples. However, when negative examples—explicitly labeled hallucinations—are included, detection becomes feasible. This underscores the necessity of expert-labeled feedback and supports methods like reinforcement learning with human feedback for improving LLM reliability.

The approach begins by showing that any algorithm capable of identifying a language in the limit can be transformed into one that detects hallucinations in the limit. This involves using a language identification algorithm to compare the LLM’s outputs against a known language over time. If discrepancies arise, hallucinations are detected. Conversely, the second part proves that language identification is no harder than hallucination detection. Combining a consistency-checking method with a hallucination detector, the algorithm identifies the correct language by ruling out inconsistent or hallucinating candidates, ultimately selecting the smallest consistent and non-hallucinating language.

The study defines a formal model where a learner interacts with an adversary to detect hallucinations—statements outside a target language—based on sequential examples. Each target language is a subset of a countable domain, and the learner observes elements over time while querying a candidate set for membership. The main result shows that detecting hallucinations within the limit is as hard as identifying the correct language, which aligns with Angluin’s characterization. However, if the learner also receives labeled examples indicating whether items belong to the language, hallucination detection becomes universally achievable for any countable collection of languages.



In conclusion, the study presents a theoretical framework to analyze the feasibility of automated hallucination detection in LLMs. The researchers prove that detecting hallucinations is equivalent to the classic language identification problem, which is typically infeasible when using only correct examples. However, they show that incorporating labeled incorrect (negative) examples makes hallucination detection possible across all countable languages. This highlights the importance of expert feedback, such as RLHF, in improving LLM reliability. Future directions include quantifying the amount of negative data required, handling noisy labels, and exploring relaxed detection goals based on hallucination density thresholds.






 

bnew

Veteran
Joined
Nov 1, 2015
Messages
63,451
Reputation
9,692
Daps
173,295

This AI Paper Introduces WebThinker: A Deep Research Agent that Empowers Large Reasoning Models (LRMs) for Autonomous Search and Report Generation


By Sajjad Ansari

May 6, 2025

Large reasoning models (LRMs) have shown impressive capabilities in mathematics, coding, and scientific reasoning. However, they face significant limitations when addressing complex information-research needs, because they rely solely on internal knowledge. These models struggle to conduct thorough web information retrieval and to generate accurate scientific reports through multi-step reasoning. Deeply integrating LRMs’ reasoning capabilities with web information exploration is therefore a practical demand, and it has initiated a series of deep-research efforts. However, existing open-source deep search agents use RAG techniques with rigid, predefined workflows, restricting LRMs’ ability to explore deeper web information and hindering effective interaction between LRMs and search engines.

LRMs like OpenAI-o1, Qwen-QwQ, and DeepSeek-R1 enhance performance through extended reasoning capabilities. Various strategies have been proposed to achieve advanced reasoning capabilities, including intentional errors in reasoning during training, distilled training data, and reinforcement learning approaches to develop long chain-of-thought abilities. However, these methods are fundamentally limited by their static, parameterized architectures that lack access to external world knowledge. RAG integrates retrieval mechanisms with generative models, enabling access to external knowledge. Recent advances span multiple dimensions, including retrieval necessity, query reformulation, document compression, denoising, and instruction-following.

Researchers from Renmin University of China, BAAI, and Huawei Poisson Lab have proposed a deep research agent called WebThinker that empowers LRMs to autonomously search the web, navigate web pages, and draft research reports during the reasoning process. WebThinker introduces a Deep Web Explorer module that enables LRMs to dynamically search, navigate, and extract information from the web when they encounter knowledge gaps. It employs an Autonomous Think-Search-and-Draft strategy, allowing models to combine reasoning, information gathering, and report writing in real time smoothly. Moreover, an RL-based training strategy is implemented to enhance research tool utilization through iterative online Direct Preference Optimization.



WebThinker framework operates in two primary modes: Problem-Solving Mode and Report Generation Mode. In Problem-Solving Mode, WebThinker addresses complex tasks using the Deep Web Explorer tool, which the LRM can invoke during reasoning. In Report Generation Mode, the LRM autonomously produces detailed reports and employs an assistant LLM to implement report-writing tools. To improve LRMs with research tools via RL, WebThinker generates diverse reasoning trajectories by applying its framework to an extensive set of complex reasoning and report generation datasets, including SuperGPQA, WebWalkerQA, OpenThoughts, NaturalReasoning, NuminaMath, and Glaive. For each query, the initial LRM produces multiple distinct trajectories.

The WebThinker-32B-Base model outperforms prior methods like Search-o1 across all benchmarks on complex problem-solving, with 22.9% improvement on WebWalkerQA and 20.4% on HLE. WebThinker achieves the highest overall score of 8.0, surpassing RAG baselines and advanced deep research systems in scientific report generation tasks, including Gemini-Deep Research (7.9). The adaptability across different LRM backbones is remarkable, with R1-based WebThinker models outperforming direct reasoning and standard RAG baselines. With the DeepSeek-R1-7B backbone, it achieves relative improvements of 174.4% on GAIA and 422.6% on WebWalkerQA compared to direct generation, and 82.9% on GAIA and 161.3% on WebWalkerQA over standard RAG implementations.

In conclusion, researchers introduced WebThinker, which provides LRMs with deep research capabilities, addressing their limitations in knowledge-intensive real-world tasks such as complex reasoning and scientific report generation. The framework enables LRMs to autonomously explore the web and produce comprehensive outputs through continuous reasoning processes. The findings highlight WebThinker’s potential to advance the deep research capabilities of LRMs, creating more powerful intelligent systems capable of addressing complex real-world challenges. Future work includes incorporating multimodal reasoning capabilities, exploring advanced tool learning mechanisms, and investigating GUI-based web exploration.




Check out the Paper.


 

bnew

Veteran
Joined
Nov 1, 2015
Messages
63,451
Reputation
9,692
Daps
173,295

A Coding Guide to Compare Three Stability AI Diffusion Models (v1.5, v2-Base & SD3-Medium) Diffusion Capabilities Side-by-Side in Google Colab Using Gradio​


By Nikhil

May 5, 2025

In this hands-on tutorial, we’ll unlock the creative potential of Stability AI’s industry-leading diffusion models, Stable Diffusion v1.5, Stability AI’s v2-base, and the cutting-edge Stable Diffusion 3 Medium, to generate eye-catching imagery. Running entirely in Google Colab with a Gradio interface, we’ll experience side-by-side comparisons of three powerful pipelines, rapid prompt iteration, and seamless GPU-accelerated inference. Whether we’re a marketer looking to elevate our brand’s visual narrative or a developer eager to prototype AI-driven content workflows, this tutorial showcases how Stability AI’s open-source models can be deployed instantly and at no infrastructure cost, allowing you to focus on storytelling, engagement, and driving real-world results.

We install the huggingface_hub library and then import and invoke the notebook_login() function, which prompts you to authenticate your notebook session with your Hugging Face account, allowing you to seamlessly access and manage models, datasets, and other hub resources.
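The original notebook cells are not reproduced in this repost, so treat the following as a plausible sketch of the standard Hugging Face login flow described above.

# Sketch of the authentication cell: install the Hub client and log in
# so gated Stability AI checkpoints can be downloaded.
!pip install -q huggingface_hub

from huggingface_hub import notebook_login
notebook_login()  # opens a token prompt tied to your Hugging Face account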

We first force-uninstall any existing torchvision to clear potential conflicts, then reinstall torch and torchvision from the CUDA 11.8–compatible PyTorch wheels, and finally upgrade key libraries, diffusers, transformers, accelerate, safetensors, gradio, and pillow, to ensure you have the latest versions for building and running GPU-accelerated generative pipelines and web demos.
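The dependency cell likely resembles the following sketch; exact package pins in the original Colab may differ.

# Sketch of the dependency cell: clear the preinstalled torchvision, install
# CUDA 11.8 wheels for torch/torchvision, then upgrade the generative stack.
!pip uninstall -y torchvision
!pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
!pip install -U diffusers transformers accelerate safetensors gradio pillow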

We import PyTorch alongside both the Stable Diffusion v1 and v3 pipelines from the Diffusers library, as well as Gradio for building interactive demos. We then check for CUDA availability and set the device variable to “cuda” if a GPU is present; otherwise we fall back to “cpu”, ensuring the models run on the optimal hardware.
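A minimal sketch of that cell, assuming current diffusers (which provides both pipeline classes):

# Sketch of the imports-and-device cell.
import torch
import gradio as gr
from diffusers import StableDiffusionPipeline, StableDiffusion3Pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"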

We load the Stable Diffusion v1.5 model in half-precision (float16) without the built-in safety checker, transfer it to your selected device (GPU, if available), and then enable attention slicing to reduce peak VRAM usage during image generation.

We load the Stable Diffusion v2 “base” model in 16-bit precision without the default safety filter, transfer it to your chosen device, and activate attention slicing to optimize memory usage during inference.

We pull in Stability AI’s Stable Diffusion 3 “medium” checkpoint in 16-bit precision (skipping the built-in safety checker), transfer it to your selected device, and enable attention slicing to reduce GPU memory usage during generation.
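The three loading cells described above might look like this sketch; the Hugging Face repo IDs are assumptions based on the public checkpoints, not a quote from the notebook.

# Sketch of the three model-loading cells.
pipe1 = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16, safety_checker=None,
).to(device)
pipe1.enable_attention_slicing()

pipe2 = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-base",
    torch_dtype=torch.float16, safety_checker=None,
).to(device)
pipe2.enable_attention_slicing()

# SD3 is a gated checkpoint; the earlier notebook_login() covers access.
pipe3 = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype=torch.float16,
).to(device)
if hasattr(pipe3, "enable_attention_slicing"):  # guard: availability varies by diffusers version
    pipe3.enable_attention_slicing()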

Now, this function runs the same text prompt through all three loaded pipelines (pipe1, pipe2, pipe3) using the specified inference steps and guidance scale, then returns the first image from each, making it perfect for comparing outputs across Stable Diffusion v1.5, v2-base, and v3-medium.
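A sketch of that comparison function, reusing the pipe1/pipe2/pipe3 objects from the loading sketch above (the parameter defaults are assumptions):

# Run one prompt through all three pipelines and return the first image from each.
def generate_all(prompt, steps=30, guidance=7.5):
    images = []
    for pipe in (pipe1, pipe2, pipe3):
        result = pipe(prompt, num_inference_steps=int(steps), guidance_scale=float(guidance))
        images.append(result.images[0])
    return images  # [sd15_image, sd2_image, sd3_image]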

Finally, this Gradio app builds a three-column UI where you can enter a text prompt, adjust inference steps and guidance scale, then generate and display images from SD v1.5, v2-base, and v3-medium side by side. It also features a radio selector, allowing you to select your preferred model output, and displays a simple confirmation message when a choice is made.
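A plausible reconstruction of that interface, assuming the Gradio Blocks API and the generate_all() sketch above; layout details in the original notebook may differ.

# Sketch of the Gradio app: prompt box, sliders, three output images,
# and a radio selector for the preferred model.
with gr.Blocks() as demo:
    prompt = gr.Textbox(label="Prompt")
    steps = gr.Slider(10, 60, value=30, step=1, label="Inference steps")
    guidance = gr.Slider(1.0, 15.0, value=7.5, step=0.5, label="Guidance scale")
    btn = gr.Button("Generate")
    with gr.Row():
        out1 = gr.Image(label="SD v1.5")
        out2 = gr.Image(label="SD v2-base")
        out3 = gr.Image(label="SD3-medium")
    choice = gr.Radio(["SD v1.5", "SD v2-base", "SD3-medium"], label="Preferred output")
    note = gr.Markdown()
    btn.click(generate_all, inputs=[prompt, steps, guidance], outputs=[out1, out2, out3])
    choice.change(lambda c: f"You picked {c}.", inputs=choice, outputs=note)

demo.launch(share=True)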

A web interface to compare the three Stability AI models’ output

In conclusion, by integrating Stability AI’s state-of-the-art diffusion architectures into an easy-to-use Gradio app, you’ve seen how effortlessly you can prototype, compare, and deploy stunning visuals that resonate on today’s platforms. From A/B-testing creative directions to automating campaign assets at scale, Stability AI provides the performance, flexibility, and vibrant community support to transform your content pipeline.




Check out the Colab Notebook.


 

bnew

Veteran
Joined
Nov 1, 2015
Messages
63,451
Reputation
9,692
Daps
173,295

NVIDIA Open Sources Parakeet TDT 0.6B: Achieving a New Standard for Automatic Speech Recognition (ASR) and Transcribing an Hour of Audio in One Second


By Asif Razzaq

May 5, 2025

NVIDIA has unveiled Parakeet TDT 0.6B, a state-of-the-art automatic speech recognition (ASR) model that is now fully open-sourced on Hugging Face. With 600 million parameters, a commercially permissive CC-BY-4.0 license, and a staggering real-time factor (RTF) of 3386, this model sets a new benchmark for performance and accessibility in speech AI.

Blazing Speed and Accuracy


At the heart of Parakeet TDT 0.6B’s appeal is its unmatched speed and transcription quality. The model can transcribe 60 minutes of audio in just one second, a performance that’s over 50x faster than many existing open ASR models. On Hugging Face’s Open ASR Leaderboard, Parakeet V2 achieves a 6.05% word error rate (WER)—the best-in-class among open models.

This performance represents a significant leap forward for enterprise-grade speech applications, including real-time transcription, voice-based analytics, call center intelligence, and audio content indexing.

Technical Overview


Parakeet TDT 0.6B builds on a transformer-based architecture fine-tuned with high-quality transcription data and optimized for inference on NVIDIA hardware. Here are the key highlights:

  • 600M parameter encoder-decoder model
  • Quantized and fused kernels for maximum inference efficiency
  • Optimized for the TDT (Token-and-Duration Transducer) architecture
  • Supports accurate timestamp formatting, numerical formatting, and punctuation restoration
  • Pioneers song-to-lyrics transcription, a rare capability in ASR models

The model’s high-speed inference is powered by NVIDIA’s TensorRT and FP8 quantization, enabling it to reach a real-time factor of RTF = 3386, meaning it processes audio 3386 times faster than real time.

Benchmark Leadership


On the Hugging Face Open ASR Leaderboard—a standardized benchmark for evaluating speech models across public datasets—Parakeet TDT 0.6B leads with the lowest WER recorded among open-source models. This positions it well above comparable models like Whisper from OpenAI and other community-driven efforts.

Data based on May 5, 2025

This performance makes Parakeet V2 not only a leader in quality but also in deployment readiness for latency-sensitive applications.

Beyond Conventional Transcription


Parakeet is not just about speed and word error rate. NVIDIA has embedded unique capabilities into the model:

  • Song-to-lyrics transcription: Unlocks transcription for sung content, expanding use cases into music indexing and media platforms.
  • Numerical and timestamp formatting: Improves readability and usability in structured contexts like meeting notes, legal transcripts, and health records.
  • Punctuation restoration: Enhances natural readability for downstream NLP applications.

These features elevate the quality of transcripts and reduce the burden on post-processing or human editing, especially in enterprise-grade deployments.

Strategic Implications


The release of Parakeet TDT 0.6B represents another step in NVIDIA’s strategic investment in AI infrastructure and open ecosystem leadership. With strong momentum in foundational models (e.g., Nemotron for language and BioNeMo for protein design), NVIDIA is positioning itself as a full-stack AI company—from GPUs to state-of-the-art models.

For the AI developer community, this open release could become the new foundation for building speech interfaces in everything from smart devices and virtual assistants to multimodal AI agents.

Getting Started


Parakeet TDT 0.6B is available now on Hugging Face, complete with model weights, tokenizer, and inference scripts. It runs optimally on NVIDIA GPUs with TensorRT, but support is also available for CPU environments with reduced throughput.

Whether you’re building transcription services, annotating massive audio datasets, or integrating voice into your product, Parakeet TDT 0.6B offers a compelling open-source alternative to commercial APIs.
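As a quick-start sketch, the snippet below assumes the NeMo toolkit interface that NVIDIA's ASR model cards typically show; the repo ID, file name, and transcribe() usage are assumptions to verify against the actual model card.

# Sketch: load Parakeet TDT 0.6B through NVIDIA NeMo and transcribe a file.
# pip install -U "nemo_toolkit[asr]"
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="nvidia/parakeet-tdt-0.6b-v2")
transcripts = asr_model.transcribe(["meeting_recording.wav"])  # hypothetical local audio file
print(transcripts[0])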




Check out the Model on Hugging Face.


 