bnew

Veteran
Joined
Nov 1, 2015
Messages
63,832
Reputation
9,783
Daps
174,065


1/10
@_akhaliq
Google presents LightLab

Controlling Light Sources in Images with Diffusion Models



https://video.twimg.com/amplify_video/1923135795163963392/vid/avc1/1280x720/RHt56hduR4WiOtG2.mp4

2/10
@_akhaliq
discuss with author: Paper page - LightLab: Controlling Light Sources in Images with Diffusion Models



3/10
@GiulioAprin
Wow



4/10
@jaimemguajardo
Wow



5/10
@JonathanKorstad
@Google Stadia is going to be lit



6/10
@jclotetdomingo
How can I use lightlab @grok



7/10
@zhaoyan9394
Interesting use of diffusion models! Shows how AI's reshaping tech tools. Excited to see how this influences future roles in the industry.



8/10
@REVOLVO_OCELOTS
Kinda similar to relight from SD



9/10
@GlaiveSong
cool



10/10
@C12s_AI
Game changer for image editing.




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew



1/2
@HuggingPapers
Marigold was just published on Hugging Face

Affordable Adaptation of Diffusion-Based Image Generators for Image Analysis



https://video.twimg.com/ext_tw_video/1923108204906676225/pu/vid/avc1/720x720/6cBQLHktVSb-P_d3.mp4

2/2
@HuggingPapers
Discuss with author: Paper page - Marigold: Affordable Adaptation of Diffusion-Based Image Generators for Image Analysis




 

bnew




1/10
@victormustar
🤯 It’s here: sub-10 second video generation is now real with LTX-Video-13B-distilled!

⬇️ Try it now on Hugging Face



https://video.twimg.com/amplify_video/1922926265511604224/vid/avc1/1352x1080/BY4lfJy5In-8VFlN.mp4

2/10
@victormustar
LTX Video Fast - a Hugging Face Space by Lightricks



3/10
@kingnish24
What's the prompt ??



4/10
@victormustar
something like "fpv gameplay" (image-to-video)



5/10
@Hathibel
Just tried LTX-Video-13B-distilled out. Took about 30 seconds to generate this.



https://video.twimg.com/amplify_video/1923267445260943363/vid/avc1/768x768/JDNsJk948jm-p9od.mp4

6/10
@Ren_Simmons
It’s incredible



7/10
@kasznare
Is this open source?



8/10
@bradsmithcoach
Sub-10 second video generation is a game changer!



9/10
@turbotardo
How soon? 🐋®️2️⃣



10/10
@picatrix_picori
prompt share, bro 😆




 

bnew



1/2
@HuggingPapers
DeepSeek's Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures





2/2
@HuggingPapers
Paper: Paper page - Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures

Highlights hardware-aware model co-design & innovations like Multi-head Latent Attention (MLA).




 

bnew




1/7
@_akhaliq
AM-Thinking-v1 just dropped on Hugging Face

Advancing the Frontier of Reasoning at 32B Scale





2/7
@_akhaliq
discuss: Paper page - AM-Thinking-v1: Advancing the Frontier of Reasoning at 32B Scale



3/7
@_akhaliq
model: a-m-team/AM-Thinking-v1 · Hugging Face



4/7
@DanielMizr43248
This is cracked



5/7
@unclemusclez
yuge



6/7
@OmarBessa
wow



7/7
@tobeniceman
Beating DeepSeek R1 with just 32B parameters, that's seriously impressive




 

bnew


1/11
@AIatMeta
Announcing the newest releases from Meta FAIR. We’re releasing new groundbreaking models, benchmarks, and datasets that will transform the way researchers approach molecular property prediction, language processing, and neuroscience.

1️⃣ Open Molecules 2025 (OMol25): A dataset for molecular discovery with simulations of large atomic systems.

2️⃣ Universal Model for Atoms: A machine learning interatomic potential for modeling atom interactions across a wide range of materials and molecules.

3️⃣ Adjoint Sampling: A scalable algorithm for training generative models based on scalar rewards.

4️⃣ FAIR and the Rothschild Foundation Hospital partnered on a large-scale study that reveals striking parallels between language development in humans and LLMs.

Read more ➡️ Sharing new breakthroughs and artifacts supporting molecular property prediction, language processing, and neuroscience



https://video.twimg.com/amplify_video/1922686195848613888/vid/avc1/1920x1080/lfaeSnC6ndDdpek9.mp4

2/11
@sidneywright_
@grok describe what this means in simple terms



3/11
@grok
Meta's AI team announced tools to boost science. Open Molecules 2025 is a huge dataset for predicting chemical properties, speeding up drug and material discovery. The Universal Model for Atoms uses this data to design new materials. Adjoint Sampling is a faster way to train AI for creating molecules. A language study with a hospital shows how kids learn speech, like AI learns language. These could transform research but face issues like data bias, high computing costs, and ethical concerns about brain data privacy.

(349 characters)



4/11
@inizkhan
❤️



5/11
@IterIntellectus
what?!



6/11
@amt_c42
@grok explain me the adjoint sampling part



7/11
@jnyryl
Me in 2028 with my glasses…

“Hey, Meta compile a compound that tastes like chocolate chip cookies with the nutritional value of Salad”



8/11
@JeffKirdeikis
Our knowledge of the universe and how it works is now on an exponential trajectory



9/11
@LegalPrimes
Looking forward to exploring these models especially the ones for material science



10/11
@DirtyWaterDegen
wtf timeline am I on?



11/11
@Thorsday008
This is wild!





1/10
@AriWagen
Meta's FAIR Chemistry (@OpenCatalyst) just released Open Molecules 2025 (OMol25), a massive high-accuracy dataset + models spanning small molecules, biomolecules, metal complexes, and electrolytes, including 83 elements + charged & open-shell species.

And it's live on @RowanSci!



https://video.twimg.com/amplify_video/1923086044624269312/vid/avc1/1410x720/bR-MA7vUkrRcuqAQ.mp4

2/10
@AriWagen
This is insanely impressive, and a huge push in the right direction—here's why I think it's so timely:

Access to high-quality data to train NNPs on has been limited. Folks have been training models on all the data they can and working hard to squeeze out little improvements.





3/10
@AriWagen
OMol25 is a lot of data, and it's a big step towards bridging the divide in ML for chemistry between the molecular+organic realm (think SPICE) and the periodic+inorganic realm (think Materials Project).

I also love the inclusion of charge + spin.





4/10
@AriWagen
Open sourcing this data will help researchers test ideas in NNP architectures, dataset cleaning, and model training strategies, propelling the whole field forward and making atomistic simulation more useful than ever before.

From myself: a huge congrats and thanks, OMol25 team!



5/10
@AriWagen
To read more about OMol25, check out some of these posts from the team behind the project!

From @mshuaibii:

[Quoted tweet]
Excited to share our latest releases to the FAIR Chemistry’s family of open datasets and models: OMol25 and UMA! @AIatMeta @OpenCatalyst

OMol25: huggingface.co/facebook/OMol…
UMA: huggingface.co/facebook/UMA
Blog: ai.meta.com/blog/meta-fair-s…
Demo: huggingface.co/spaces/facebo…


https://video.twimg.com/amplify_video/1922693624245985281/vid/avc1/1280x720/8wDzePyt7_kUqYo6.mp4

6/10
@AriWagen
From @SamMBlau:

[Quoted tweet]
The Open Molecules 2025 dataset is out! With >100M gold-standard ωB97M-V/def2-TZVPD calcs of biomolecules, electrolytes, metal complexes, and small molecules, OMol is by far the largest, most diverse, and highest quality molecular DFT dataset for training MLIPs ever made 1/N




7/10
@AriWagen
From @nc_frey:

[Quoted tweet]
Introducing Open Molecules 25, a foundational quantum chemistry dataset including >100M DFT calculations across 83M unique molecules, built with 6B core hours of compute!

What does this mean for drug discovery, biology, and BioML?

1/




8/10
@AriWagen
And, of course, check out the paper (The Open Molecules 2025 (OMol25) Dataset, Evaluations, and Models) as well as the models (facebook/OMol25 · Hugging Face).

And—to quickly run simulations with them—you can use Rowan's web comp chem platform at Rowan Labs.



9/10
@Andrew_S_Rosen
This speaks highly to your software stack with how fast you're all able to implement things!



10/10
@mccrinbc
you guys ship like no other team -- massive respects




 

bnew


This AI paper from DeepSeek-AI Explores How DeepSeek-V3 Delivers High-Performance Language Modeling by Minimizing Hardware Overhead and Maximizing Computational Efficiency​


By Sana Hassan

May 16, 2025

The growth in developing and deploying large language models (LLMs) is closely tied to architectural innovations, large-scale datasets, and hardware improvements. Models like DeepSeek-V3, GPT-4o, Claude 3.5 Sonnet, and LLaMA-3 have demonstrated how scaling enhances reasoning and dialogue capabilities. However, as their performance increases, so do computing, memory, and communication bandwidth demands, placing substantial strain on hardware. Without parallel progress in model and infrastructure co-design, these models risk becoming accessible only to organizations with massive resources. This makes optimizing training cost, inference speed, and memory efficiency a critical area of research.

A core challenge is the mismatch between model size and hardware capabilities. LLM memory consumption grows over 1000% annually, while high-speed memory bandwidth increases by less than 50%. During inference, caching prior context in Key-Value (KV) stores adds to memory strain and slows processing. Dense models activate all parameters per token, escalating computational costs, particularly for models with hundreds of billions of parameters. This results in billions of floating-point operations per token and high energy demands. Time Per Output Token (TPOT), a key performance metric, also suffers, impacting user experience. These problems call for solutions beyond simply adding more hardware.

Techniques like Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) reduce memory usage by sharing key and value heads across query heads. Windowed KV caching lowers memory use by storing only recent tokens, but can limit long-context understanding. Quantized compression with low-bit formats like 4-bit and 8-bit cuts memory further, though sometimes with trade-offs in accuracy. Precision formats such as BF16 and FP8 improve training speed and efficiency. While useful, these techniques each tackle an individual bottleneck rather than offering a comprehensive answer to the scaling challenge.

Researchers from DeepSeek-AI introduced a more integrated and efficient strategy with the development of DeepSeek-V3, designed to scale intelligently rather than excessively. Utilizing 2,048 NVIDIA H800 GPUs, the model achieves state-of-the-art performance while focusing on cost-efficiency. Instead of depending on expansive infrastructure, the team engineered the model architecture to work harmoniously with hardware constraints. Central to this effort are innovations such as Multi-head Latent Attention (MLA) for memory optimization, a Mixture of Experts (MoE) framework for computational efficiency, and FP8 mixed-precision training to accelerate performance without sacrificing accuracy. A custom Multi-Plane Network Topology was also employed to minimize inter-device communication overhead. Collectively, these components make DeepSeek-V3 a scalable and accessible solution, capable of rivaling much larger systems while operating on significantly leaner resources.



The architecture achieves memory efficiency by reducing the KV cache requirement per token to just 70 KB using MLA, compared to 327 KB and 516 KB in Qwen-2.5 and LLaMA-3.1, respectively. This reduction is accomplished by compressing attention heads into a smaller latent vector jointly trained with the model. Computational efficiency is further boosted with the MoE model, which increases total parameters to 671 billion but only activates 37 billion per token. This contrasts sharply with dense models that require full parameter activation. For example, LLaMA-3.1 needs 2,448 GFLOPS per token, while DeepSeek-V3 operates at just 250 GFLOPS. Also, the architecture integrates a Multi-Token Prediction (MTP) module, enabling the generation of multiple tokens in a single step. The system achieves up to 1.8x improvement in generation speed, and real-world measurements show 80-90% token acceptance for speculative decoding.
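
To make the memory and compute comparison concrete, here is a small back-of-the-envelope calculation using the per-token figures quoted above. The 32K context length is an assumption chosen purely for illustration.

```python
# Back-of-the-envelope arithmetic using the per-token figures quoted above.
# The 32K context length is an illustrative assumption, not from the paper.

KV_PER_TOKEN_KB = {"DeepSeek-V3 (MLA)": 70, "Qwen-2.5": 327, "LLaMA-3.1": 516}
CONTEXT_TOKENS = 32_768  # assumed context length for illustration

for model, kb_per_token in KV_PER_TOKEN_KB.items():
    total_gb = kb_per_token * CONTEXT_TOKENS / 1024 / 1024  # KB -> GB
    print(f"{model:22s} KV cache at 32K context: {total_gb:5.1f} GB")

# Sparse activation: only 37B of the 671B parameters are active per token.
active_fraction = 37 / 671
print(f"Active parameter fraction per token: {active_fraction:.1%}")  # ~5.5%

# Per-token compute comparison quoted in the article (GFLOPS per token).
print(f"Compute ratio LLaMA-3.1 / DeepSeek-V3: {2448 / 250:.1f}x")  # ~9.8x
```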



Using a system interconnected by CX7 400 Gbps InfiniBand NICs, DeepSeek-V3 achieves a theoretical TPOT of 14.76 milliseconds, equal to 67 tokens per second. With higher-bandwidth setups like NVIDIA GB200 NVL72 offering 900 GB/s, this number can be reduced to 0.82 milliseconds TPOT, potentially achieving 1,200 tokens per second. The practical throughput is lower due to compute-communication overlap and memory limitations, but the framework lays the foundation for future high-speed implementations. FP8 precision further adds to the speed gains. The training framework applies tile-wise 1×128 and block-wise 128×128 quantization, with less than 0.25% accuracy loss compared to BF16. These results were validated on smaller 16B and 230B parameter versions before integration into the 671B model.
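
The throughput figures quoted above follow directly from the reciprocal relationship between TPOT and tokens per second, as this small sketch shows.

```python
# TPOT (time per output token) and decode throughput are reciprocals.
def tokens_per_second(tpot_ms: float) -> float:
    """Convert time-per-output-token in milliseconds to tokens per second."""
    return 1000.0 / tpot_ms

print(tokens_per_second(14.76))  # ~67.8 tok/s on 400 Gbps InfiniBand (theoretical)
print(tokens_per_second(0.82))   # ~1219.5 tok/s with GB200 NVL72-class bandwidth
```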



Key takeaways from the research on DeepSeek-V3 include:

  1. MLA compression reduces KV cache size per token from 516 KB to 70 KB, significantly lowering memory demands during inference.
  2. Only 37 billion of the 671 billion total parameters are activated per token, dramatically reducing compute and memory requirements without compromising model performance.
  3. DeepSeek-V3 requires just 250 GFLOPS per token, compared to 2,448 GFLOPS for dense models like LLaMA-3.1, highlighting its computational efficiency.
  4. Achieves up to 67 tokens per second (TPS) on a 400 Gbps InfiniBand network, with the potential to scale to 1,200 TPS using advanced interconnects like NVL72.
  5. Multi-Token Prediction (MTP) improves generation speed by 1.8×, with a token acceptance rate of 80-90%, enhancing inference throughput.
  6. FP8 mixed-precision training enables faster computation with less than 0.25% accuracy degradation, validated through extensive small-scale ablations.
  7. Capable of running on a $10,000 server equipped with a consumer-grade GPU, delivering nearly 20 TPS, making high-performance LLMs more accessible.

In conclusion, the research presents a well-rounded framework for building powerful and resource-conscious large-scale language models. By directly addressing fundamental constraints, such as memory limitations, high computational costs, and inference latency, the researchers demonstrate that intelligent architecture-hardware co-design can unlock high performance without relying on vast infrastructure. DeepSeek-V3 is a clear example of how efficiency and scalability coexist, enabling broader adoption of cutting-edge AI capabilities across diverse organizations. This approach shifts the narrative from scaling through brute force to scaling through smarter engineering.




Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 90k+ ML SubReddit.

 

bnew


AI Agents Now Write Code in Parallel: OpenAI Introduces Codex, a Cloud-Based Coding Agent Inside ChatGPT​


By Asif Razzaq

May 16, 2025

OpenAI has introduced Codex , a cloud-native software engineering agent integrated into ChatGPT, signaling a new era in AI-assisted software development. Unlike traditional coding assistants, Codex is not just a tool for autocompletion—it acts as a cloud-based agent capable of autonomously performing a wide range of programming tasks, from writing and debugging code to running tests and generating pull requests.

A Shift Toward Parallel, Agent-Driven Development


At the core of Codex is codex-1 , a fine-tuned version of OpenAI’s reasoning model, optimized specifically for software engineering workflows. Codex can handle multiple tasks simultaneously, operating inside isolated cloud sandboxes that are preloaded with the user’s codebase. Each request is handled in its own environment, allowing users to delegate different coding operations in parallel without disrupting their local development environment.

This architecture introduces a fundamentally new approach to software engineering—developers now interact with an agent that behaves more like a collaborative teammate than a static code tool. You can ask Codex to “fix a bug,” “add logging,” or “refactor this module,” and it will return a verifiable response, including diffs, terminal logs, and test results. If the output looks good, you can copy the patch directly into your repository—or ask for revisions.

Embedded Within ChatGPT, Accessible to Teams


Codex lives in the ChatGPT interface, currently available to Pro, Team, and Enterprise users , with broader access expected soon. The interface includes a dedicated sidebar where developers can describe what they want in natural language. Codex then interprets the intent and handles the coding behind the scenes, surfacing results for review and feedback.

This integration offers a significant boost to developer productivity. As OpenAI notes, Codex is designed to take on many of the repetitive or boilerplate-heavy aspects of coding—allowing developers to focus on architecture, design, and higher-order problem solving. In one case, an OpenAI staffer even “checked in two bug fixes written entirely by Codex,” all while working on unrelated tasks.

Codex Understands Your Codebase


What makes Codex more than just a smart code generator is its context-awareness. Each instance runs with full access to your project’s file structure, coding conventions, and style. This allows it to write code that aligns with your team’s standards—whether you’re using Flask or FastAPI, React or Vue, or a custom internal framework.

Codex’s ability to adapt to a codebase makes it particularly useful for large-scale enterprise teams and open-source maintainers. It supports workflows like branch-based pull request generation, test suite execution, and static analysis—all initiated by simple English prompts. Over time, it learns the nuances of the repository it works in, leading to better suggestions and more accurate code synthesis.

Broader Implications: Lowering the Barrier to Software Creation


OpenAI frames Codex as a research preview, but its long-term vision is clear: AI will increasingly take over much of the routine work involved in building software. The aim isn’t to replace developers but to democratize software creation , allowing more people—especially non-traditional developers—to build working applications using natural language alone.

In this light, Codex is not just a coding tool, but a stepping stone toward a world where software development is collaborative between humans and machines. It brings software creation closer to the realm of design and ideation, and further away from syntax and implementation details.

What’s Next?


Codex is rolling out gradually, with usage limits in place during the preview phase. OpenAI is gathering feedback to refine the agent’s capabilities, improve safety, and optimize its performance across different environments and languages.

Whether you’re a solo developer, part of a DevOps team, or leading an enterprise platform, Codex represents a significant shift in how code is written, tested, and shipped. As AI agents continue to mature, the future of software engineering will be less about writing every line yourself—and more about knowing what to build, and asking the right questions.




Check out the Details here. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 90k+ ML SubReddit.

 

bnew


LLMs Struggle with Real Conversations: Microsoft and Salesforce Researchers Reveal a 39% Performance Drop in Multi-Turn Underspecified Tasks​


By Nikhil

May 16, 2025

Conversational artificial intelligence is centered on enabling large language models (LLMs) to engage in dynamic interactions where user needs are revealed progressively. These systems are widely deployed in tools that assist with coding, writing, and research by interpreting and responding to natural language instructions. The aspiration is for these models to flexibly adjust to changing user inputs over multiple turns, adapting their understanding with each new piece of information. This contrasts with static, single-turn responses and highlights a major design goal: sustaining contextual coherence and delivering accurate outcomes in extended dialogues.

A persistent problem in conversational AI is the model’s inability to handle user instructions distributed across multiple conversation turns. Rather than receiving all necessary information simultaneously, LLMs must extract and integrate key details incrementally. However, when the task is not specified upfront, models tend to make early assumptions about what is being asked and attempt final solutions prematurely. This leads to errors that persist through the conversation, as the models often stick to their earlier interpretations. The result is that once an LLM makes a misstep in understanding, it struggles to recover, resulting in incomplete or misguided answers.



Most current tools evaluate LLMs using single-turn, fully-specified prompts, where all task requirements are presented in one go. Even in research claiming multi-turn analysis, the conversations are typically episodic, treated as isolated subtasks rather than an evolving flow. These evaluations fail to account for how models behave when the information is fragmented and context must be actively constructed from multiple exchanges. Consequently, evaluations often miss the core difficulty models face: integrating underspecified inputs over several conversational turns without explicit direction.

Researchers from Microsoft Research and Salesforce Research introduced a simulation setup that mimics how users reveal information in real conversations. Their “sharded simulation” method takes complete instructions from high-quality benchmarks and splits them into smaller, logically connected parts or “shards.” Each shard delivers a single element of the original instruction, which is then revealed sequentially over multiple turns. This simulates the progressive disclosure of information that happens in practice. The setup includes a simulated user powered by an LLM that decides which shard to reveal next and reformulates it naturally to fit the ongoing context. This setup also uses classification mechanisms to evaluate whether the assistant’s responses attempt a solution or require clarification, further refining the simulation of genuine interaction.
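
As a rough illustration of the sharded setup described above, the sketch below reveals one shard of a full instruction per turn. The `call_llm` placeholder, the answer-detection heuristic, and the example shards are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of a "sharded" multi-turn simulation, following the description
# above. `call_llm` is a placeholder for any chat-completion client; the shard
# texts and the answer-detection heuristic are illustrative, not the paper's code.

def call_llm(messages: list[dict]) -> str:
    raise NotImplementedError("plug in your chat-completion client here")

def sharded_conversation(shards: list[str], max_turns: int = 10) -> str:
    """Reveal one shard of the full instruction per turn until the assistant answers."""
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    for turn, shard in enumerate(shards[:max_turns]):
        messages.append({"role": "user", "content": shard})
        reply = call_llm(messages)
        messages.append({"role": "assistant", "content": reply})
        # Crude stand-in for the paper's answer/clarification classifier.
        if turn == len(shards) - 1 or "final answer" in reply.lower():
            return reply
    return messages[-1]["content"]

# A fully specified task split into logically connected shards (illustrative).
shards = [
    "I need a Python function that processes some order data.",
    "It should total the 'price' field of a list of dicts.",
    "Orders with status 'cancelled' must be skipped.",
    "Return the total rounded to 2 decimal places.",
]
# sharded_conversation(shards) would drive the multi-turn episode with a real client.
```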



The simulation framework covers five conversation types, including single-turn full instructions and several multi-turn setups. In SHARDED simulations, LLMs received instructions one shard at a time, forcing them to wait before proposing a complete answer. This setup evaluated 15 LLMs across six generation tasks: coding, SQL queries, API actions, math problems, data-to-text descriptions, and document summaries. Each task drew from established datasets such as GSM8K, Spider, and ToTTo. For every LLM and instruction, 10 simulations were conducted, totaling over 200,000 simulations. Aptitude, unreliability, and average performance were computed using a percentile-based scoring system, allowing direct comparison of best-case and worst-case outcomes per model.
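
One plausible way to compute percentile-based scores over the 10 simulations per model-task pair is sketched below. Treating aptitude as the 90th percentile and unreliability as the gap between the 90th and 10th percentiles is an assumption made for illustration; consult the paper for the exact definitions.

```python
# Sketch of percentile-based scoring over the 10 simulations per (model, task).
# Using the 90th percentile for aptitude and the 90th-10th percentile gap for
# unreliability is an assumption for illustration only.
import statistics

def score_simulations(scores: list[float]) -> dict[str, float]:
    qs = statistics.quantiles(scores, n=10, method="inclusive")  # decile cut points
    p10, p90 = qs[0], qs[-1]
    return {
        "average": statistics.mean(scores),
        "aptitude": p90,             # best-case behaviour
        "unreliability": p90 - p10,  # spread between best and worst cases
    }

print(score_simulations([0.9, 0.2, 0.85, 0.4, 0.95, 0.3, 0.8, 0.6, 0.1, 0.7]))
```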

Across all tasks and models, a consistent decline in performance was observed in the SHARDED setting. On average, performance dropped from 90% in single-turn to 65% in multi-turn scenarios—a 25-point decline. The main cause was not reduced capability but a dramatic rise in unreliability. While aptitude dropped by 16%, unreliability increased by 112%, revealing that models varied wildly in how they performed when information was presented gradually. For example, even top-performing models like GPT-4.1 and Gemini 2.5 Pro exhibited 30-40% average degradations. Additional compute at generation time or lowering randomness (temperature settings) offered only minor improvements in consistency.



This research clarifies that even state-of-the-art LLMs are not yet equipped to manage complex conversations where task requirements unfold gradually. The sharded simulation methodology effectively exposes how models falter in adapting to evolving instructions, highlighting the urgent need to improve reliability in multi-turn settings. Enhancing the ability of LLMs to process incomplete instructions over time is essential for real-world applications where conversations are naturally unstructured and incremental.




Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 90k+ ML SubReddit.

 

bnew


Georgia Tech and Stanford Researchers Introduce MLE-Dojo: A Gym-Style Framework Designed for Training, Evaluating, and Benchmarking Autonomous Machine Learning Engineering (MLE) Agents​


By Nikhil

May 15, 2025

Machine learning engineering (MLE) involves developing, tuning, and deploying machine learning systems that require iterative experimentation, model optimization, and robust handling of data pipelines. As model complexity increases, so do the challenges associated with orchestrating end-to-end workflows efficiently. Researchers have explored the automation of MLE tasks using AI agents to handle these demands. Large Language Models (LLMs), particularly those with strong coding and problem-solving abilities, have shown potential to enhance this process significantly. Their role in automating structured workflows is now being tested through rigorous benchmarks and environments tailored to emulate real-world MLE scenarios.

A primary hurdle in automating machine learning engineering lies in the work’s inherently iterative and feedback-driven nature. Tasks such as hyperparameter tuning, model debugging, and data preprocessing cannot be resolved in one step; they require repeated modifications and evaluations. Traditional evaluation tools for AI models often rely on static datasets and do not allow for real-time error feedback or interactive problem-solving. This limitation prevents LLM agents from learning through trial and error, an essential component for mastering engineering tasks that evolve or require multiple attempts for success.



Earlier tools to evaluate LLMs in engineering or coding tasks have mostly focused on individual subtasks or isolated challenges. These include tools like MLAgentBench and DSBench, which rely on narrow test cases sourced from Kaggle competitions or synthetic datasets. While they cover more than basic tasks, they do not enable agents to perform code execution, debugging, or results interpretation in a live setting. Other environments, like SWE-Gym, focus exclusively on software engineering and lack support for machine learning-specific workflows. These limitations have slowed the creation of versatile, high-performing MLE agents that can handle real-time project complexities.

Researchers from Georgia Institute of Technology and Stanford University have introduced MLE-Dojo, a framework with an interactive environment that connects LLM agents with real-world machine learning tasks derived from over 200 Kaggle competitions. This framework supports tabular data analysis, computer vision, natural language processing, and time-series forecasting challenges. MLE-Dojo was built to let agents write, execute, and revise code in a sandboxed, feedback-rich setting. The goal was to replicate the interactive cycles that human engineers follow, enabling structured learning for agents. The environment ships with pre-installed dependencies and evaluation metrics, and supports both supervised fine-tuning and reinforcement learning strategies.



MLE-Dojo’s structure consists of modular components that support a wide range of MLE challenges. Each task runs within its own Docker container, isolating it for safety and reproducibility. Agents interact with the environment through a Partially Observable Markov Decision Process, receiving observations, performing actions, and gaining rewards based on performance. The environment supports five primary action types: requesting task information, validating code, executing code, retrieving interaction history, and resetting the environment. It also provides a detailed observation space that includes datasets, execution results, and error messages. The agent receives structured feedback after every interaction, allowing for step-wise improvement. This modular setup helps maintain interoperability and simplifies adding new tasks to the system.

The evaluation included eight frontier LLMs—Gemini-2.5-Pro, DeepSeek-r1, o3-mini, GPT-4o, GPT-4o-mini, Gemini-2.0-Pro, Gemini-2.0-Flash, and DeepSeek-v3—across four core machine learning domains. Gemini-2.5-Pro achieved the highest Elo rating of 1257, followed by DeepSeek-r1 at 1137 and o3-mini at 1108. Regarding HumanRank, Gemini-2.5-Pro led with 61.95%, indicating its superior performance over human benchmarks. Models like GPT-4o-mini executed code only 20% of the time, adopting conservative strategies, while o3-mini performed executions in over 90% of the cases. The average failure rate for Gemini-2.5-Pro remained the lowest across validation and execution phases, reinforcing its robustness. Among domains, computer vision posed the greatest challenge, with most models scoring under 60 in HumanRank. Reasoning models generally produced longer outputs and maintained stronger performance consistency across iterations.



The research highlights the difficulty of applying LLMs to full machine learning workflows. It outlines a comprehensive solution in MLE-Dojo that enables learning through interaction, not just completion. MLE-Dojo sets a new standard for training and evaluating autonomous MLE agents by simulating engineering environments more accurately.




Check out the Paper, Project Page, and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 90k+ ML SubReddit.

 

bnew


Meta AI Open-Sources LlamaFirewall: A Security Guardrail Tool to Help Build Secure AI Agents​


By Asif Razzaq

May 8, 2025

As AI agents become more autonomous—capable of writing production code, managing workflows, and interacting with untrusted data sources—their exposure to security risks grows significantly. Addressing this evolving threat landscape, Meta AI has released LlamaFirewall , an open-source guardrail system designed to provide a system-level security layer for AI agents in production environments.

Addressing Security Gaps in AI Agent Deployments


Large language models (LLMs) embedded in AI agents are increasingly integrated into applications with elevated privileges. These agents can read emails, generate code, and issue API calls—raising the stakes for adversarial exploitation. Traditional safety mechanisms, such as chatbot moderation or hardcoded model constraints, are insufficient for agents with broader capabilities.

LlamaFirewall was developed in response to three specific challenges:

  1. Prompt Injection Attacks : Both direct and indirect manipulations of agent behavior via crafted inputs.
  2. Agent Misalignment : Deviations between an agent’s actions and the user’s stated goals.
  3. Insecure Code Generation : Emission of vulnerable or unsafe code by LLM-based coding assistants.

Core Components of LlamaFirewall


LlamaFirewall introduces a layered framework composed of three specialized guardrails, each targeting a distinct class of risks:

1. PromptGuard 2


PromptGuard 2 is a classifier built using BERT-based architectures to detect jailbreaks and prompt injection attempts. It operates in real time and supports multilingual input. The 86M parameter model offers strong performance, while a 22M lightweight variant provides low-latency deployment in constrained environments. It is designed to identify high-confidence jailbreak attempts with minimal false positives.

2. AlignmentCheck


AlignmentCheck is an experimental auditing tool that evaluates whether an agent’s actions remain semantically aligned with the user’s goals. It operates by analyzing the agent’s internal reasoning trace and is powered by large language models such as Llama 4 Maverick. This component is particularly effective in detecting indirect prompt injection and goal hijacking scenarios.

3. CodeShield


CodeShield is a static analysis engine that inspects LLM-generated code for insecure patterns. It supports syntax-aware analysis across multiple programming languages using Semgrep and regex rules. CodeShield enables developers to catch common coding vulnerabilities—such as SQL injection risks—before code is committed or executed.
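
As a toy illustration of regex-based insecure-pattern scanning in the spirit of CodeShield, the sketch below flags a string-concatenated SQL query. The two patterns are simplistic examples, not the Semgrep and regex rule set that ships with LlamaFirewall.

```python
# Toy illustration of regex-based insecure-pattern scanning in the spirit of
# CodeShield. These two patterns are simplistic examples, not the actual
# Semgrep/regex rules shipped with LlamaFirewall.
import re

INSECURE_PATTERNS = {
    "possible SQL injection (string-built query)":
        re.compile(r"execute\(\s*[\"'].*(%s|\{|\+)", re.IGNORECASE),
    "use of eval on dynamic input":
        re.compile(r"\beval\s*\("),
}

def scan(code: str) -> list[str]:
    findings = []
    for lineno, line in enumerate(code.splitlines(), start=1):
        for label, pattern in INSECURE_PATTERNS.items():
            if pattern.search(line):
                findings.append(f"line {lineno}: {label}")
    return findings

snippet = 'cursor.execute("SELECT * FROM users WHERE name = \'" + name + "\'")'
print(scan(snippet))  # flags the string-concatenated SQL query
```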

Evaluation in Realistic Settings


Meta evaluated LlamaFirewall using AgentDojo , a benchmark suite simulating prompt injection attacks against AI agents across 97 task domains. The results show a clear performance improvement:

  • PromptGuard 2 (86M) alone reduced attack success rates (ASR) from 17.6% to 7.5% with minimal loss in task utility.
  • AlignmentCheck achieved a lower ASR of 2.9%, though with slightly higher computational cost.
  • Combined , the system achieved a 90% reduction in ASR, down to 1.75%, with a modest utility drop to 42.7%.

In parallel, CodeShield achieved 96% precision and 79% recall on a labeled dataset of insecure code completions, with average response times suitable for real-time usage in production systems.

Future Directions


Meta outlines several areas of active development:

  • Support for Multimodal Agents : Extending protection to agents that process image or audio inputs.
  • Efficiency Improvements : Reducing the latency of AlignmentCheck through techniques like model distillation.
  • Expanded Threat Coverage : Addressing malicious tool use and dynamic behavior manipulation.
  • Benchmark Development : Establishing more comprehensive agent security benchmarks to evaluate defense effectiveness in complex workflows.

Conclusion


LlamaFirewall represents a shift toward more comprehensive and modular defenses for AI agents. By combining pattern detection, semantic reasoning, and static code analysis, it offers a practical approach to mitigating key security risks introduced by autonomous LLM-based systems. As the industry moves toward greater agent autonomy, frameworks like LlamaFirewall will be increasingly necessary to ensure operational integrity and resilience.




Check out the Paper, Code, and Project Page. Also, don’t forget to follow us on Twitter.



 

bnew


ByteDance Introduces Seed1.5-VL: A Vision-Language Foundation Model Designed to Advance General-Purpose Multimodal Understanding and Reasoning​


By Sana Hassan

May 15, 2025

VLMs have become central to building general-purpose AI systems capable of understanding and interacting in digital and real-world settings. By integrating visual and textual data, VLMs have driven advancements in multimodal reasoning, image editing, GUI agents, robotics, and more, influencing sectors like education and healthcare. Despite this progress, VLMs still lag behind human capabilities, particularly in tasks involving 3D reasoning, object counting, creative visual interpretation, and interactive gameplay. A challenge lies in the scarcity of rich, diverse multimodal datasets, unlike the abundant textual resources available to LLMs. Additionally, multimodal data complexity poses significant training and evaluation hurdles.

Researchers at ByteDance have developed Seed1.5-VL, a compact yet powerful vision-language foundation model featuring a 532 M-parameter vision encoder and a 20 B-parameter Mixture-of-Experts LLM. Despite its efficient architecture, Seed1.5-VL achieves top results on 38 out of 60 public VLM benchmarks, excelling in tasks like GUI control, video understanding, and visual reasoning. It is trained on trillions of multimodal tokens using advanced data synthesis and post-training techniques, including human feedback. Innovations in training, such as hybrid parallelism and vision token redistribution, optimize performance. The model’s efficiency and strong reasoning capabilities suit real-world interactive applications like chatbots.

The Seed1.5-VL architecture features a vision encoder, an MLP adapter, and an LLM. Its custom vision encoder, Seed-ViT, supports native-resolution image input using 2D RoPE and processes images through 14×14 patches, followed by average pooling and an MLP. Pretraining involves masked image modeling, contrastive learning, and omni-modal alignment using images, text, and video-audio-caption pairs. The model uses a Dynamic Frame-Resolution Sampling approach for video encoding that adapts frame rates and resolutions based on content complexity, balancing efficiency and detail. This method enables effective spatial-temporal understanding within a token budget, ensuring comprehensive video representation across varied lengths and complexities.
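
A rough sketch of how 14×14 patching and a fixed vision-token budget interact is shown below; the token budget and the candidate frame resolutions are assumptions for illustration, not values from the paper.

```python
# Rough illustration of native-resolution patching (14x14 patches, as described
# above) and of trading frame count against per-frame resolution under a fixed
# vision-token budget. The budget and candidate settings are assumptions.
import math

def tokens_per_image(width: int, height: int, patch: int = 14) -> int:
    return math.ceil(width / patch) * math.ceil(height / patch)

print(tokens_per_image(896, 672))  # 64 * 48 = 3072 tokens for one image

def frames_within_budget(token_budget: int, frame_w: int, frame_h: int) -> int:
    """How many video frames fit if each frame costs tokens_per_image tokens?"""
    return token_budget // tokens_per_image(frame_w, frame_h)

# With a hypothetical 16K-token budget: more, smaller frames vs fewer, larger ones.
print(frames_within_budget(16_384, 448, 448))  # 16 frames at 448x448
print(frames_within_budget(16_384, 896, 896))  # 4 frames at 896x896
```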

The pre-training of Seed1.5-VL involved curating 3 trillion high-quality tokens across diverse domains. Image-text pairs from the web were filtered using CLIP scores, size/aspect ratio checks, and deduplication to reduce noise. Using domain-based sampling and duplication strategies, rare visual concepts were overrepresented to address class imbalance. Specialized datasets were added for OCR using annotated and synthetic text-rich images, charts, and tables—object grounding and counting tasks utilized bounding boxes, points, and auto-labeled web data. Additional tasks included 3D spatial understanding using depth annotations, and video understanding through multi-frame captioning, QA, and temporal grounding to support dynamic content analysis.

The evaluation highlights Seed-ViT and Seed1.5-VL’s competitive performance across vision-language tasks. Seed-ViT, despite having significantly fewer parameters, matches or outperforms larger models like InternVL-C and EVA-CLIP on zero-shot image classification tasks, showing high accuracy and robustness on datasets such as ImageNet-A and ObjectNet. Seed1.5-VL demonstrates strong capabilities in multimodal reasoning, general VQA, document understanding, and grounding. It achieves state-of-the-art benchmarks, particularly in complex reasoning, counting, and chart interpretation tasks. The model’s “thinking” mode, which incorporates longer reasoning chains, further enhances performance, indicating its strong ability in detailed visual understanding and task generalization.



In conclusion, Seed1.5-VL is a vision-language foundation model featuring a 532 M-parameter vision encoder and a 20 B-parameter Mixture-of-Experts language model. Despite its compact size, it achieves state-of-the-art results on 38 of 60 public benchmarks and excels in complex reasoning, OCR, diagram interpretation, 3D spatial understanding, and video analysis. It also performs well in agent-driven tasks like GUI control and gameplay, surpassing models like OpenAI CUA and Claude 3.7. The model shows strong generalization to tasks beyond its training scope. The study outlines its architecture, data pipeline, and training methods and identifies future directions, including enhancing tool-use and visual reasoning capabilities.




Check out the Paper and Project Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 90k+ ML SubReddit.

 

bnew


Stability AI Introduces Adversarial Relativistic-Contrastive (ARC) Post-Training and Stable Audio Open Small: A Distillation-Free Breakthrough for Fast, Diverse, and Efficient Text-to-Audio Generation Across Devices​


By Mohammad Asjad

May 15, 2025

Text-to-audio generation has emerged as a transformative approach for synthesizing sound directly from textual prompts, offering practical use in music production, gaming, and virtual experiences. Under the hood, these models typically employ Gaussian flow-based techniques such as diffusion or rectified flows. These methods model the incremental steps that transition from random noise to structured audio. While highly effective in producing high-quality soundscapes, the slow inference speeds have posed a barrier to real-time interactivity. It is particularly limiting when creative users expect an instrument-like responsiveness from these tools.

Latency is the primary issue with these systems. Current text-to-audio models can take several seconds or even minutes to generate a few seconds of audio. The core bottleneck lies in their step-based inference architecture, requiring between 50 and 100 iterations per output. Previous acceleration strategies focus on distillation methods where smaller models are trained under the supervision of larger teacher models to replicate multi-step inference in fewer steps. However, these distillation methods are computationally expensive. They demand large-scale storage for intermediate training outputs or require simultaneous operation of several models in memory, which hinders their adoption, especially on mobile or edge devices. Also, such methods often sacrifice output diversity and introduce over-saturation artifacts.

While a few adversarial post-training methods have been attempted to bypass the cost of distillation, their success has been limited. Most existing implementations rely on partial distillation for initialization or do not scale well to complex audio synthesis. Also, audio applications have seen fewer fully adversarial solutions. Tools like Presto integrate adversarial objectives but still depend on teacher models and CFG-based training for prompt adherence, which restricts their generative diversity.

Researchers from UC San Diego, Stability AI, and Arm introduced Adversarial Relativistic-Contrastive (ARC) post-training . This approach sidesteps the need for teacher models, distillation, or classifier-free guidance. Instead, ARC enhances an existing pre-trained rectified flow generator by integrating two novel training objectives: a relativistic adversarial loss and a contrastive discriminator loss. These help the generator produce high-fidelity audio in fewer steps while maintaining strong alignment with text prompts. When paired with the Stable Audio Open (SAO) framework, the result was a system capable of generating 12 seconds of 44.1 kHz stereo audio in only 75 milliseconds on an H100 GPU and around 7 seconds on mobile devices.

With ARC methodology, they introduced Stable Audio Open Small , a compact and efficient version of SAO tailored for resource-constrained environments. This model contains 497 million parameters and uses an architecture built on a latent diffusion transformer. It consists of three main components: a waveform-compressing autoencoder, a T5-based text embedding system for semantic conditioning, and a DiT (Diffusion Transformer) that operates within the latent space of the autoencoder. Stable Audio Open Small can generate stereo audio up to 11 seconds long at 44.1 kHz. It is designed to be deployed using the ‘stable-audio-tools’ library and supports ping-pong sampling, enabling efficient few-step generation. The model demonstrated exceptional inference efficiency, achieving generation speeds of under 7 seconds on a Vivo X200 Pro phone after applying dynamic Int8 quantization, which also cut RAM usage from 6.5GB to 3.6 GB. This makes it especially viable for on-device creative applications like mobile audio tools and embedded systems.
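
Dynamic Int8 quantization of the kind mentioned above can be illustrated with PyTorch's generic API on a stand-in model; the article does not state that this exact API was used for the phone deployment, so treat the sketch as illustrative only.

```python
# Generic PyTorch dynamic int8 quantization, of the kind referenced above.
# Whether this exact API was used for the on-device deployment is not stated,
# so this is an illustrative sketch on a stand-in model.
import torch
import torch.nn as nn

model = nn.Sequential(  # stand-in for the DiT / text-encoder stack
    nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)
)

quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8  # int8 weights, activations stay float
)

x = torch.randn(1, 1024)
print(quantized(x).shape)  # torch.Size([1, 1024])
```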



The ARC training approach involves replacing the traditional L2 loss with an adversarial formulation where generated and real samples, paired with identical prompts, are evaluated by a discriminator trained to distinguish between them. A contrastive objective teaches the discriminator to rank accurate audio-text pairs higher than mismatched ones to improve prompt relevance. These paired objectives eliminate the need for CFG while achieving better prompt adherence. Also, ARC adopts ping-pong sampling to refine the audio output through alternating denoising and re-noising cycles, reducing inference steps without compromising quality.
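
A schematic of a relativistic adversarial loss paired with a contrastive ranking term is sketched below with placeholder discriminator scores; it is a generic paired formulation in the spirit of ARC, not the paper's exact objectives.

```python
# Schematic of a relativistic adversarial loss plus a contrastive ranking term,
# in the spirit of the ARC objectives described above. Generic paired formulation
# with placeholder score tensors, not the paper's exact losses.
import torch
import torch.nn.functional as F

def relativistic_d_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    # Discriminator: real audio should outscore the fake paired with the same prompt.
    return F.softplus(-(d_real - d_fake)).mean()

def relativistic_g_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    # Generator: push fakes to outscore the paired reals.
    return F.softplus(-(d_fake - d_real)).mean()

def contrastive_d_loss(d_matched: torch.Tensor, d_mismatched: torch.Tensor,
                       margin: float = 1.0) -> torch.Tensor:
    # Rank audio paired with its own prompt above audio paired with a wrong prompt.
    return F.relu(margin - (d_matched - d_mismatched)).mean()

# Placeholder discriminator scores for a batch of 8 prompt-audio pairs.
d_real, d_fake, d_mismatched = torch.randn(8), torch.randn(8), torch.randn(8)
print(relativistic_d_loss(d_real, d_fake) + contrastive_d_loss(d_real, d_mismatched))
```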

ARC’s performance was evaluated extensively. In objective tests, it achieved an FDopenl3 score of 84.43, a KLpasst score of 2.24, and a CLAP score of 0.27, indicating balanced quality and semantic precision. Diversity was notably strong, with a CLAP Conditional Diversity Score (CCDS) of 0.41. Real-Time Factor reached 156.42, reflecting outstanding generation speed, while GPU memory usage remained at a practical 4.06 GB. Subjectively, ARC scored 4.4 for diversity, 4.2 for quality, and 4.2 for prompt adherence in human evaluations involving 14 participants. Unlike distillation-based models like Presto, which scored higher on quality but dropped to 2.7 on diversity, ARC presented a more balanced and practical solution.



Several Key Takeaways from the Research by Stability AI on Adversarial Relativistic-Contrastive (ARC) post-training and Stable Audio Open Small include:

  • ARC post-training avoids distillation and CFG, relying on adversarial and contrastive losses.
  • ARC generates 12s of 44.1 kHz stereo audio in 75ms on H100 and 7s on mobile CPUs.
  • It achieves 0.41 CLAP Conditional Diversity Score, the highest among tested models.
  • Subjective scores: 4.4 (diversity), 4.2 (quality), and 4.2 (prompt adherence).
  • Ping-pong sampling enables few-step inference while refining output quality.
  • Stable Audio Open Small offers 497M parameters, supports 8-step generation, and is compatible with mobile deployments.
  • On Vivo X200 Pro, inference latency dropped from 15.3s to 6.6s with half the memory.
  • ARC and SAO Small provide real-time solutions for music, games, and creative tools.

In conclusion, the combination of ARC post-training and Stable Audio Open Small eliminates the reliance on resource-intensive distillation and classifier-free guidance, enabling researchers to deliver a streamlined adversarial framework that accelerates inference without compromising output quality or prompt adherence. ARC enables fast, diverse, and semantically rich audio synthesis in high-performance and mobile environments. With Stable Audio Open Small optimized for lightweight deployment, this research lays the groundwork for integrating responsive, generative audio tools into everyday creative workflows, from professional sound design to real-time applications on edge devices.




Check out the Paper, GitHub Page, and Model on Hugging Face. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 90k+ ML SubReddit.

 

bnew


NVIDIA Open-Sources Open Code Reasoning Models (32B, 14B, 7B)​


By Sana Hassan

May 8, 2025

NVIDIA continues to push the boundaries of open AI development by open-sourcing its Open Code Reasoning (OCR) model suite, a trio of high-performance large language models purpose-built for code reasoning and problem-solving. The 32B, 14B, and 7B variants are all released under the Apache 2.0 license.

Benchmarked to Beat the Best


The Open Code Reasoning (OCR) models come with notable benchmark achievements , outperforming OpenAI’s o3-Mini and o1 (low) models on the LiveCodeBench benchmark. LiveCodeBench is a comprehensive evaluation suite for code reasoning tasks such as debugging, code generation, and logic completion in real-world developer environments. In direct comparison, NVIDIA’s 32B OCR model tops the leaderboard in reasoning capability for open models.

This leap in performance is attributed not only to model architecture, but to NVIDIA’s custom “OCR dataset” — a high-quality, code-centric training corpus designed to emphasize instruction-following, reasoning, and multi-step code problem solving. According to NVIDIA, this results in a 30% improvement in token efficiency , allowing the models to produce accurate code and logical outputs with fewer tokens.

A Model Lineup for Every Use Case


The Open Code Reasoning suite comes in three parameter scales :

  • OpenCodeReasoning-Nemotron-32B
  • OpenCodeReasoning-Nemotron-14B
  • OpenCodeReasoning-Nemotron-7B

Each model balances scale with performance. The 32B variant delivers state-of-the-art results for high-performance inference and research; the 14B model provides strong reasoning capabilities with reduced compute requirements, and the 7B variant is ideal for resource-constrained environments while retaining competitive performance on benchmarks.

All models are trained using the Nemotron architecture , NVIDIA’s transformer-based backbone optimized for multilingual, multi-task learning. The model weights and configurations are available on Hugging Face:


Compatible with Open Inference Ecosystems


A key feature of these models is out-of-the-box compatibility with popular inference frameworks:

  • llama.cpp for lightweight CPU/GPU inference
  • vLLM for optimized GPU serving and speculative decoding
  • Transformers by Hugging Face for training and evaluation pipelines
  • TGI (Text Generation Inference) for scalable API deployment

This flexibility allows developers, researchers, and enterprises to plug these models into existing code AI infrastructure with minimal overhead.
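
As a minimal sketch, loading one of the models with Hugging Face Transformers might look like the following; the repo id follows NVIDIA's naming above but should be verified on the Hub, and the generation settings are ordinary defaults rather than recommendations.

```python
# Minimal Hugging Face Transformers sketch for trying one of the OCR models.
# The repo id follows NVIDIA's naming above but should be checked on the Hub
# before use; generation settings are ordinary defaults, not recommendations.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/OpenCodeReasoning-Nemotron-7B"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "Write a Python function that checks whether a string is a palindrome."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```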

A Step Forward for Open Code Intelligence


With this release, NVIDIA contributes significantly to the growing ecosystem of open code models. By targeting code reasoning — a domain historically dominated by proprietary models — and releasing under a fully open and permissive license, NVIDIA empowers the broader AI and developer community to build, fine-tune, and deploy advanced reasoning models in production.

The Open Code Reasoning suite adds to NVIDIA’s growing portfolio of open LLMs and strengthens its stance on accessible, transparent AI development. Whether you’re building developer copilots, automated code review agents, or code generation services, these models offer a high-performing, cost-effective, and community-friendly alternative to closed solutions.




Check out the 32B Model, 14B Model, 7B Model, and 32B Instruction-Tuned Variant. Also, don’t forget to follow us on Twitter.



 

bnew


PrimeIntellect Releases INTELLECT-2: A 32B Reasoning Model Trained via Distributed Asynchronous Reinforcement Learning​


By Asif Razzaq

May 12, 2025

As language models scale in parameter count and reasoning complexity, traditional centralized training pipelines face increasing constraints. High-performance model training often depends on tightly coupled compute clusters with fast interconnects, which are costly, limited in availability, and prone to scalability bottlenecks. Furthermore, centralized architectures restrict the possibility of widespread collaboration and experimentation, particularly in open-source research environments. A shift toward decentralized methods could mitigate these challenges, enabling broader participation and more fault-tolerant training regimes.

PrimeIntellect Open Sources INTELLECT-2, a 32B Reasoning Model


PrimeIntellect has released INTELLECT-2, a 32-billion-parameter reasoning model post-trained using Group Relative Policy Optimization (GRPO) within a fully decentralized, asynchronous reinforcement learning framework. Licensed under Apache 2.0, the release includes not only the model weights but also the full codebase and training logs. INTELLECT-2 exceeds the performance of the previously leading QwQ-32B model on key reasoning benchmarks. The open-source nature of the release is intended to support reproducibility, extensibility, and ongoing research.



Architecture and Technical Innovations


INTELLECT-2 is developed within a novel training stack purpose-built for distributed environments. Three primary components underpin this system:

  • PRIME-RL : An asynchronous RL engine that separates the stages of rollout generation, training, and parameter distribution. This decoupling removes the need for synchronous updates and allows the system to operate over variable and unreliable network conditions.
  • SHARDCAST : A tree-topology HTTP protocol that supports rapid propagation of model weights across distributed workers, improving communication efficiency without requiring specialized infrastructure.
  • TOPLOC : A verification mechanism based on locality-sensitive hashing, which detects modifications in inference outputs. This is critical for ensuring integrity in distributed and potentially non-deterministic hardware environments.

This architecture enables INTELLECT-2 to be trained across heterogeneous systems with minimal coordination overhead while preserving model quality and inference consistency.

Training Data, Methodology, and Performance


The post-training process for INTELLECT-2 used approximately 285,000 verifiable tasks with a focus on reasoning, coding, and mathematical problem solving. Sources included datasets such as NuminaMath-1.5, Deepscaler, and SYNTHETIC-1. The model underwent reinforcement learning fine-tuning using GRPO with asynchronous updates.

The system applied a two-phase training strategy: new policy weights were broadcast while the existing rollout and training pipelines remained active, minimizing idle time across the network. Stability was improved through two-sided clipping of token probability ratios, reducing the variance associated with large updates.
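
A generic two-sided clipped policy-ratio objective of the kind described above is sketched below; the clip range and the advantage values are placeholders, and the exact formulation used for INTELLECT-2 may differ.

```python
# Generic two-sided clipping of token probability ratios, of the kind described
# above. The clip range and the advantage values are placeholders, not the
# hyperparameters actually used for INTELLECT-2.
import torch

def clipped_policy_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                        advantages: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    ratio = torch.exp(logp_new - logp_old)            # per-token probability ratio
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    # Take the more pessimistic of the clipped/unclipped surrogate objectives.
    return -torch.min(ratio * advantages, clipped * advantages).mean()

logp_new, logp_old = torch.randn(16), torch.randn(16)
advantages = torch.randn(16)
print(clipped_policy_loss(logp_new, logp_old, advantages))
```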

A combination of heuristics and automated filters was used to select high-quality demonstrations, and a tailored reward model was employed to rank completions. The reinforcement learning loop consistently favored completions with better reasoning structure, contributing to measurable performance improvements over baseline models.

In terms of evaluation, INTELLECT-2 outperforms QwQ-32B on multiple reasoning-centric benchmarks, indicating improved generalization and reasoning accuracy. The gains are particularly evident in math and coding tasks, where the use of asynchronous GRPO fine-tuning and curated reward modeling produced more structured and verifiable outputs. These results suggest that decentralized post-training pipelines can achieve comparable or superior performance to traditional RLHF pipelines while offering improved flexibility and scalability.



Conclusion


INTELLECT-2 represents a methodologically sound step toward decentralizing large-scale model training. By demonstrating that a 32B parameter model can be post-trained with high performance using distributed, asynchronous reinforcement learning, PrimeIntellect contributes a practical and extensible alternative to centralized RLHF pipelines. The architecture’s modular components—PRIME-RL, SHARDCAST, and TOPLOC—address key challenges in scalability, communication efficiency, and inference verification. As research interest grows in open, decentralized AI development, INTELLECT-2 serves as a reproducible benchmark and a framework for further experimentation in distributed model training.




Check out the Paper, Model on Hugging Face, and Official Release. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 90k+ ML SubReddit.



 