bnew

Veteran
Joined
Nov 1, 2015
Messages
68,719
Reputation
10,592
Daps
185,786

Qwen Researchers Propose QwenLong-L1: A Reinforcement Learning Framework for Long-Context Reasoning in Large Language Models


By Asif Razzaq

May 27, 2025

While large reasoning models (LRMs) have shown impressive capabilities in short-context reasoning through reinforcement learning (RL), these gains do not generalize well to long-context scenarios. Applications such as multi-document QA, research synthesis, and legal or financial analysis require models to process and reason over sequences exceeding 100K tokens. However, RL optimization in such regimes is plagued by slower reward convergence, unstable policy updates due to KL divergence fluctuations, and reduced exploration resulting from entropy collapse. These bottlenecks reveal a fundamental gap in transitioning LRMs from short-context proficiency to long-context generalization.

QwenLong-L1: A Structured RL Framework for Long-Context Adaptation


To address these limitations, the Qwen Research team introduces QwenLong-L1, a novel RL framework designed to adapt LRMs to long-context reasoning tasks. The framework is structured into three key stages:


  • Warm-up Supervised Fine-Tuning (SFT): Provides a stable initialization for the policy model by training on curated question-context-answer triplets, ensuring basic competence in contextual comprehension and answer extraction.
  • Curriculum-Guided Phased Reinforcement Learning: Introduces a staged training process with gradually increasing context lengths. This progression enables the model to incrementally acquire long-context reasoning behaviors without destabilizing policy updates.
  • Difficulty-Aware Retrospective Sampling: Enhances exploration by maintaining and reusing hard examples from previous phases, weighted by their difficulty, to encourage deeper reasoning and robustness across diverse inputs.

These stages are complemented by hybrid reward mechanisms—combining rule-based exact match verification with semantic evaluation by a lightweight LLM—ensuring both precision and recall during policy training.



Technical Design and Methodological Advantages


QwenLong-L1 integrates recent advances in group-relative RL optimization, specifically GRPO and DAPO, to mitigate the computational overhead associated with long-context value estimation:


  • GRPO estimates advantage by normalizing rewards within sampled groups, eliminating the need for a separate value network and encouraging diverse generation patterns.
  • DAPO incorporates mechanisms such as dynamic sampling, overlength penalty shaping, and asymmetric clipping thresholds to prevent entropy collapse and mitigate length biases during training.
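
As a rough illustration of the group-relative idea behind GRPO, the sketch below normalizes rewards within a sampled group of responses to obtain advantages without a separate value network; the function and its details are assumptions for illustration, not the published training recipe.

import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """rewards: shape (group_size,) — rewards for responses sampled for one prompt.
    Advantages are rewards normalized within the group, removing the need for
    a learned value network."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four sampled responses to the same long-context question.
print(group_relative_advantages(np.array([1.0, 0.0, 1.0, 0.0])))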

The reward function is defined as the maximum of two signals: a deterministic rule-based match and a semantic judgment from a compact evaluator model (e.g., Qwen2.5-1.5B). This hybrid approach avoids overfitting to rigid formats while maintaining answer correctness across varied notations and phrasings.
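
To make that reward concrete, here is a minimal, illustrative sketch of a max-of-two-signals reward; the helper names and the judge prompt are assumptions rather than the authors' released code.

import re

def rule_based_reward(prediction: str, reference: str) -> float:
    """Deterministic exact-match check after light normalization."""
    norm = lambda s: re.sub(r"\s+", " ", s.strip().lower())
    return 1.0 if norm(prediction) == norm(reference) else 0.0

def semantic_reward(prediction: str, reference: str, judge_model) -> float:
    """Ask a lightweight LLM judge (e.g., a compact evaluator model) whether
    the answer is semantically correct; judge_model is a placeholder callable."""
    verdict = judge_model(
        f"Reference answer: {reference}\nModel answer: {prediction}\n"
        "Is the model answer correct? Reply YES or NO."
    )
    return 1.0 if "YES" in verdict.upper() else 0.0

def hybrid_reward(prediction: str, reference: str, judge_model) -> float:
    # The final reward is the maximum of the two signals, so correct answers in
    # unusual formats are not penalized while correctness is still enforced.
    return max(
        rule_based_reward(prediction, reference),
        semantic_reward(prediction, reference, judge_model),
    )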

Moreover, the framework is optimized via progressive context scaling, where the RL process transitions from 20K-token to 60K-token input lengths in controlled phases, stabilizing training dynamics and facilitating policy generalization.
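
A minimal sketch of what such a phased curriculum could look like, assuming the 20K-to-60K progression described above; the intermediate phase boundary and the retrospective-sampling ratio are illustrative assumptions.

CURRICULUM = [
    {"phase": 1, "max_context_tokens": 20_000},
    {"phase": 2, "max_context_tokens": 40_000},
    {"phase": 3, "max_context_tokens": 60_000},
]

def select_batch(examples, phase_cfg, hard_pool):
    """Keep examples that fit the current phase's context budget and mix in
    hard examples retained from earlier phases (difficulty-aware retrospective
    sampling)."""
    in_range = [ex for ex in examples
                if ex["context_len"] <= phase_cfg["max_context_tokens"]]
    hardest_first = sorted(hard_pool, key=lambda ex: -ex["difficulty"])
    return in_range + hardest_first[:len(in_range) // 4]

# Usage: iterate over CURRICULUM and call select_batch with each phase config.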

Experimental Results and Benchmark Performance


QwenLong-L1 was evaluated on seven long-context document QA benchmarks, including DocMath, Frames, 2WikiMultihopQA, HotpotQA, Musique, NarrativeQA, and Qasper. The 32B variant, QwenLong-L1-32B, demonstrated strong empirical performance:

  • It outperformed baseline models such as R1-Distill-Qwen-32B by 5.1 points and exceeded leading proprietary systems like OpenAI-o3-mini and Qwen3-235B-A22B.
  • Its performance was comparable to Claude-3.7-Sonnet-Thinking, indicating competitive reasoning capabilities under extreme context lengths.
  • Pass@K analysis revealed consistent improvements with increased sampling, achieving a Pass@2 average of 73.7, surpassing DeepSeek-R1 and OpenAI-o1-preview, even at low sampling rates.



Ablation studies further validated the individual contributions of SFT, phased RL, and retrospective sampling. Notably, RL played a decisive role in enabling emergent reasoning behaviors such as grounding, subgoal setting, verification, and backtracking—traits not effectively induced by supervised fine-tuning alone.

Conclusion


QwenLong-L1 represents a systematic approach to equipping LRMs with robust long-context reasoning capabilities through reinforcement learning. Its design effectively bridges the gap between short-context expertise and the demands of information-dense environments by combining supervised initialization, curriculum-driven context scaling, and hybrid evaluation strategies. The framework not only achieves state-of-the-art results across long-context benchmarks but also demonstrates the emergence of interpretable reasoning patterns during training.




Check out the Paper, Model on Hugging Face, and GitHub Page. All credit for this research goes to the researchers of this project.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,719
Reputation
10,592
Daps
185,786

LLMs Can Now Reason Beyond Language: Researchers Introduce Soft Thinking to Replace Discrete Tokens with Continuous Concept Embeddings​


By Sana Hassan

May 27, 2025

Human reasoning naturally operates through abstract, non-verbal concepts rather than strictly relying on discrete linguistic tokens. However, current LLMs are limited to reasoning within the boundaries of natural language, producing one token at a time through predefined vocabulary. This token-by-token approach not only restricts the expressive capacity of the model but also limits the breadth of reasoning paths it can explore, especially in ambiguous or complex scenarios. Standard Chain-of-Thought (CoT) methods exemplify this limitation, forcing the model to commit to a single path at each step. In contrast, human cognition is more flexible and parallel, allowing for simultaneous consideration of multiple ideas and delaying verbalization until concepts are fully formed. This makes human reasoning more adaptable and robust in dealing with uncertainty.

To address these limitations, researchers have proposed transitioning from token-based reasoning to reasoning within a continuous concept space, representing reasoning steps as combinations of token embeddings. This approach allows models to explore multiple reasoning trajectories in parallel and integrate richer conceptual representations. Prior studies have demonstrated the potential of manipulating hidden states to influence reasoning outcomes or introduce latent planning. However, applying continuous-space reasoning to larger models presents challenges. In models under 7B parameters, shared weights between input and output layers allow hidden states to align with token embeddings, facilitating continuous reasoning. In larger models, by contrast, where input and output spaces are decoupled, directly using hidden states as inputs causes mismatches that are hard to resolve. Attempts to retrain these models to bridge this gap often result in overfitting or degraded performance, highlighting the difficulty of enabling effective continuous reasoning at scale.


Researchers from the University of California, Santa Barbara, University of California, Santa Cruz, University of California, Los Angeles, Purdue University, LMSYS Org, and Microsoft introduce Soft Thinking. This training-free approach enhances reasoning in large language models by operating in a continuous concept space. Instead of choosing one discrete token at each step, the model generates concept tokens—probability-weighted mixtures of all token embeddings—enabling parallel reasoning over multiple paths. This results in richer, more abstract representations. The method includes a Cold Stop mechanism to improve efficiency. Evaluations on mathematical and coding tasks show up to 2.48% higher accuracy and 22.4% fewer tokens used than standard Chain-of-Thought reasoning.

The Soft Thinking method enhances standard CoT reasoning by replacing discrete token sampling with concept tokens—probability distributions over the entire vocabulary. These distributions compute weighted embeddings, allowing the model to reason in a continuous concept space. This preserves uncertainty and enables parallel exploration of multiple reasoning paths. A Cold Stop mechanism monitors entropy to halt reasoning when the model becomes confident, improving efficiency and preventing collapse. Theoretical analysis shows that Soft Thinking approximates the full marginalization over all reasoning paths through linearization, offering a more expressive and computationally tractable alternative to discrete CoT.
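
A minimal sketch of a single Soft Thinking step and an entropy-based Cold Stop check, assuming access to the model's output logits and token embedding matrix; tensor shapes and the stopping threshold are illustrative assumptions, not the authors' implementation.

import torch

def concept_token_step(logits: torch.Tensor, embedding_matrix: torch.Tensor):
    """logits: (vocab,); embedding_matrix: (vocab, hidden).
    Returns a probability-weighted mixture of all token embeddings plus the
    entropy of the distribution, used as a confidence signal."""
    probs = torch.softmax(logits, dim=-1)                 # distribution over the vocabulary
    concept_embedding = probs @ embedding_matrix          # weighted embedding mixture
    entropy = -(probs * torch.log(probs + 1e-9)).sum()    # low entropy = confident step
    return concept_embedding, entropy

def cold_stop(entropies, threshold: float = 0.5, patience: int = 3) -> bool:
    """Halt the soft reasoning phase once entropy stays low for `patience` steps."""
    return len(entropies) >= patience and all(e < threshold for e in entropies[-patience:])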


The study evaluates the Soft Thinking method on eight benchmarks in math and programming using three open-source LLMs of varying sizes and architectures. Compared to standard and greedy CoT methods, Soft Thinking consistently improves accuracy (Pass@1) while significantly reducing the number of tokens generated, indicating more efficient reasoning. The approach uses concept tokens and a Cold Stop controller without modifying model weights or requiring extra training. Experiments show that Soft Thinking balances higher accuracy with lower computational cost, outperforming baselines by enabling richer, more abstract reasoning in fewer steps across diverse tasks and models.



In conclusion, Soft Thinking is a training-free approach that enables large language models to reason using continuous concept tokens instead of traditional discrete tokens. By combining weighted token embeddings, Soft Thinking allows models to explore multiple reasoning paths simultaneously, improving accuracy and efficiency. Tested on math and coding benchmarks, it consistently boosts pass@1 accuracy while reducing the number of generated tokens, all without extra training or architectural changes. The method maintains interpretability and concise reasoning. Future research may focus on training adaptations to enhance robustness, especially for out-of-distribution inputs. The code is publicly accessible.




Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,719
Reputation
10,592
Daps
185,786

This AI Paper Introduces WEB-SHEPHERD: A Process Reward Model for Web Agents with 40K Dataset and 10× Cost Efficiency​


By Nikhil

May 28, 2025

Web navigation focuses on teaching machines how to interact with websites to perform tasks such as searching for information, shopping, or booking services. Building a capable web navigation agent is a complex task because it requires understanding the structure of websites, interpreting user goals, and making a series of decisions across multiple steps. These tasks are further complicated by the need for agents to adapt in dynamic web environments, where content can change frequently and where multimodal information, such as text and images, must be understood together.

A key problem in web navigation is the absence of reliable and detailed reward models that can guide agents in real-time. Existing methods primarily rely on multimodal large language models (MLLMs) like GPT-4o and GPT-4o-mini as evaluators, which are expensive, slow, and often inaccurate, especially when handling long sequences of actions in multi-step tasks. These models use prompting-based evaluation or binary success/failure feedback but fail to provide step-level guidance, often leading to errors such as repeated actions or missing critical steps like clicking specific buttons or filling form fields. This limitation reduces the practicality of deploying web agents in real-world scenarios, where efficiency, accuracy, and cost-effectiveness are crucial.




The research team from Yonsei University and Carnegie Mellon University introduced WEB-SHEPHERD, a process reward model specifically designed for web navigation tasks. WEB-SHEPHERD is the first model to evaluate web navigation agents at the step level, using structured checklists to guide assessments. The researchers also developed the WEBPRM COLLECTION, a dataset of 40,000 step-level annotated web navigation tasks, and the WEBREWARDBENCH benchmark for evaluating PRMs. These resources were designed to enable WEB-SHEPHERD to provide detailed feedback by breaking down complex tasks into smaller, measurable subgoals.



WEB-SHEPHERD works by generating a checklist for each task based on the user’s instruction, such as “Search for product” or “Click on product page,” and evaluates the agent’s progress against these subgoals. The model uses next-token prediction to generate feedback and assigns rewards based on checklist completion. This process enables WEB-SHEPHERD to assess the correctness of each step with fine-grained judgment. The model estimates the reward for each step by combining the probabilities of “Yes,” “No,” and “In Progress” tokens and averages these across the checklist. This detailed scoring system enables agents to receive targeted feedback on their progress, enhancing their ability to navigate complex websites.
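
A minimal sketch of how such a checklist-based step reward could be computed from per-item "Yes"/"In Progress"/"No" probabilities; the partial-credit weighting and helper names are assumptions, not WEB-SHEPHERD's exact scoring code.

def checklist_reward(item_probs):
    """item_probs: one dict per checklist item, e.g. {"yes": 0.7, "in_progress": 0.2, "no": 0.1},
    taken from the reward model's next-token probabilities."""
    def item_score(p):
        # Full credit for "Yes", partial credit for "In Progress", none for "No".
        return p["yes"] + 0.5 * p["in_progress"]
    return sum(item_score(p) for p in item_probs) / len(item_probs)

# Example: a three-item checklist such as "search for product", "open product page", "add to cart".
print(checklist_reward([
    {"yes": 0.9, "in_progress": 0.05, "no": 0.05},
    {"yes": 0.3, "in_progress": 0.6, "no": 0.1},
    {"yes": 0.1, "in_progress": 0.2, "no": 0.7},
]))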


The researchers demonstrated that WEB-SHEPHERD significantly outperforms existing models. On the WEBREWARDBENCH benchmark, WEB-SHEPHERD achieved a Mean Reciprocal Rank (MRR) score of 87.6% and a trajectory accuracy of 55% in the text-only setting, compared to GPT-4o-mini’s 47.5% MRR and 0% trajectory accuracy without checklists. When tested in WebArena-lite using GPT-4o-mini as the policy model, WEB-SHEPHERD achieved a 34.55% success rate, which is 10.9 points higher than using GPT-4o-mini as the evaluator, while also being ten times more cost-efficient. In ablation studies, the researchers observed that WEB-SHEPHERD’s performance dropped significantly when checklists or feedback were removed, proving their importance for accurate reward assignments. They also showed that multimodal input, surprisingly, did not always improve performance and sometimes introduced noise.



This research highlights the critical role of detailed process-level rewards in building reliable web agents. The team’s work addresses the core challenge of web navigation—evaluating complex, multi-step actions—and offers a solution that is both scalable and cost-effective. With WEB-SHEPHERD, agents can now receive accurate feedback during navigation, enabling them to make better decisions and complete tasks more effectively.




Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,719
Reputation
10,592
Daps
185,786

NVIDIA AI Introduces AceReason-Nemotron for Advancing Math and Code Reasoning through Reinforcement Learning​


By Sajjad Ansari

May 25, 2025

Reasoning capabilities represent a fundamental component of AI systems. The introduction of OpenAI o1 sparked significant interest in building reasoning models through large-scale reinforcement learning (RL) approaches. While DeepSeek-R1’s open-sourcing empowered the community to develop state-of-the-art reasoning models, critical technical details, including data curation strategies and specific RL training recipes, were omitted from the original report. This absence left researchers struggling to replicate the success, leading to fragmented efforts across different model sizes, initial checkpoints, distilled reasoning models, and target domains ranging from code to physical AI, none of which has yielded a conclusive or consistent training recipe.

Training language models for reasoning focuses on math and code domains through pretraining and supervised fine-tuning approaches. Early RL attempts using domain-specific reward models show limited gains due to inherent challenges for mathematical and coding tasks. Recent efforts following DeepSeek-R1’s release explore rule-based verification methods, where math problems require specific output formats for accurate verification, and code problems utilize compilation and execution feedback. However, these approaches focus on single domains rather than handling heterogeneous prompts, restrict evaluation to narrow benchmarks such as AIME and LiveCodeBench, and suffer from training instability that requires techniques like progressive response-length increases and entropy-collapse mitigation.


Researchers from NVIDIA demonstrate that large-scale RL can significantly enhance the reasoning capabilities of strong small- and mid-sized models, outperforming state-of-the-art distillation-based approaches. The method employs a simple yet effective sequential training strategy: first conducting RL training on math-only prompts, followed by code-only prompts. This reveals that math-only RL enhances performance on mathematical benchmarks and improves code reasoning tasks, while extended code-only RL iterations further boost code performance with minimal degradation in math results. Moreover, a robust data curation pipeline is developed to collect challenging prompts with high-quality, verifiable answers and test cases, enabling verification-based RL across both domains.



The method performs data curation for both math-only RL and code-only RL. For math-only RL, the pipeline merges DeepScaler and NuminaMath datasets covering algebra, combinatorics, number theory, and geometry, applying 9-gram filtering and strict exclusion rules for unsuitable content. DeepSeek-R1 model validates questions through eight attempts, retaining only majority-voted correct solutions via rule-based verification. The dataset for code-only RL is curated from modern competitive programming platforms using function-calling and stdin/stdout formats across algorithmic topics. Moreover, researchers filter incompatible problems, curate comprehensive test cases covering edge cases, and assign difficulty scores using DeepSeek-R1-671B evaluation, producing 8,520 verified coding problems.
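
A minimal sketch of the majority-vote filtering step described above, under the assumption that a solver model proposes final answers and a rule-based verifier checks them; the function names and interfaces are illustrative, not NVIDIA's pipeline code.

from collections import Counter

def keep_problem(problem, solver, verifier, attempts: int = 8) -> bool:
    """solver(problem) returns a candidate final-answer string;
    verifier(answer, problem) applies rule-based checking (format and value)."""
    answers = [solver(problem) for _ in range(attempts)]
    majority_answer, votes = Counter(answers).most_common(1)[0]
    # Retain only problems whose majority-voted answer also passes rule-based verification.
    return votes > attempts // 2 and verifier(majority_answer, problem)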


The results show that the AceReason-Nemotron-7B model achieves 14.5% and 14.6% accuracy improvements on AIME 2024/2025, respectively, with 14.2% and 8% gains on LiveCodeBench v5/v6 compared to initial SFT models. The 14B variant outperforms larger models like DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Llama-70B, achieving best-in-class results among open RL-based reasoning models. Compared to SOTA distillation-based models, AceReason-Nemotron-14B outperforms OpenMath-14B/32B by 2.1%/4.4% on AIME benchmarks and OpenCodeReasoning-14B by 1.7%/0.8% on LiveCodeBench, showing that RL achieves a higher performance upper bound than distillation approaches while remaining competitive with frontier models like QwQ-32B and o3-mini.

In this paper, researchers show that large-scale RL enhances the reasoning capabilities of strong small- and mid-sized SFT models through sequential domain-specific training. The proposed approach of performing math-only RL followed by code-only prompts reveals that mathematical reasoning training significantly boosts performance across both mathematical and coding benchmarks. The data curation pipeline enables verification-based RL across heterogeneous domains by collecting challenging prompts with high-quality, verifiable answers and test cases. The findings reveal that RL pushes the limits of model reasoning, yielding solutions to problems the initial SFT models could not solve and establishing new performance benchmarks for reasoning model development.




Check out the Paper and Model on Hugging Face. All credit for this research goes to the researchers of this project.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,719
Reputation
10,592
Daps
185,786

Incorrect Answers Improve Math Reasoning? Reinforcement Learning with Verifiable Rewards (RLVR) Surprises with Qwen2.5-Math​


By Asif Razzaq

May 28, 2025

In natural language processing (NLP), RL methods, such as reinforcement learning with human feedback (RLHF), have been utilized to enhance model outputs by optimizing responses based on feedback signals. A specific variant, reinforcement learning with verifiable rewards (RLVR), extends this approach by utilizing automatic signals, such as mathematical correctness or syntactic features, as feedback, enabling the large-scale tuning of language models. RLVR is especially interesting because it promises to enhance models’ reasoning abilities without needing extensive human supervision. This intersection of automated feedback and reasoning tasks forms an exciting area of research, where developers aim to uncover how models can learn to reason mathematically, logically, or structurally using limited supervision.

A persistent challenge in machine learning is building models that can reason effectively under minimal or imperfect supervision. In tasks like mathematical problem-solving, where the correct answer might not be immediately available, researchers grapple with how to guide a model’s learning. Models often learn from ground-truth data, but it’s impractical to label vast datasets with perfect accuracy, particularly in reasoning tasks that require understanding complex structures like proofs or programmatic steps. Consequently, there’s an open question about whether models can learn to reason if they are exposed to noisy, misleading, or even incorrect signals during training. This issue is significant because models that overly rely on perfect feedback may not generalize well when such supervision is unavailable, thereby limiting their utility in real-world scenarios.


Several existing techniques aim to enhance models’ reasoning abilities through reinforcement learning (RL), with RLVR being a key focus. Traditionally, RLVR has used “ground truth” labels, correct answers verified by humans or automated tools, to provide rewards during training. Some approaches have relaxed this requirement by using majority vote labels or simple format-based heuristics, such as rewarding answers that follow a specific output style. Other methods have experimented with random rewards, offering positive signals without considering the correctness of the answer. These methods aim to explore whether models can learn even with minimal guidance, but they mostly concentrate on specific models, such as Qwen, raising concerns about generalizability across different architectures.

Researchers from the University of Washington, the Allen Institute for AI, and UC Berkeley investigate this question by testing various reward signals on Qwen2.5-Math, a family of large language models fine-tuned for mathematical reasoning. They tested ground-truth rewards, majority-vote rewards, format rewards based on boxed expressions, random rewards, and incorrect rewards. Remarkably, they observed that even completely spurious signals, like random rewards and rewards for wrong answers, could lead to substantial performance gains in Qwen models. For example, training Qwen2.5-Math-7B on MATH-500 with ground-truth rewards yielded a 28.8% improvement, while using incorrect labels resulted in a 24.6% gain. Random rewards still produced a 21.4% boost, and format rewards led to a 16.4% improvement. Majority-vote rewards provided a 26.5% accuracy gain. These improvements were not limited to a single model; Qwen2.5-Math-1.5B also showed strong gains: format rewards boosted accuracy by 17.6%, and incorrect labels by 24.4%. However, the same reward strategies failed to deliver similar benefits on other model families, such as Llama3 and OLMo2, which showed minimal or negative changes when trained with spurious rewards. For instance, Llama3.1-8B saw performance drops of up to 8.5% under certain spurious signals, highlighting the model-specific nature of the observed improvements.
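
For intuition, here is a minimal sketch of the kinds of reward signals being compared (ground truth, incorrect, format-only, random, and majority vote); the exact implementations in the paper may differ, and these definitions are illustrative only.

import random, re
from collections import Counter

def ground_truth_reward(ans, ref):
    return float(ans.strip() == ref.strip())

def incorrect_reward(ans, ref):
    # Deliberately rewards answers that do NOT match the reference.
    return float(ans.strip() != ref.strip())

def format_reward(ans, ref=None):
    # Rewards only the presence of a \boxed{...} expression, ignoring correctness.
    return float(bool(re.search(r"\\boxed\{.+\}", ans)))

def random_reward(ans=None, ref=None):
    # Positive signal half the time, regardless of the answer.
    return float(random.random() < 0.5)

def majority_vote_reward(ans, sampled_answers):
    # Rewards agreement with the most common answer among the model's own samples.
    majority = Counter(a.strip() for a in sampled_answers).most_common(1)[0][0]
    return float(ans.strip() == majority)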




The research team’s approach involved using RLVR training to fine-tune models with these varied reward signals, replacing the need for ground-truth supervision with heuristic or randomized feedback. They found that Qwen models, even without access to correct answers, could still learn to produce high-quality reasoning outputs. A key insight was that Qwen models tended to exhibit a distinct behavior called “code reasoning”, generating math solutions structured like code, particularly in Python-like formats, regardless of whether the reward signal was meaningful. This code reasoning tendency became more frequent over training, rising from 66.7% to over 90% in Qwen2.5-Math-7B when trained with spurious rewards. Answers that included code reasoning showed higher accuracy rates, often around 64%, compared to just 29% for answers without such reasoning patterns. These patterns emerged consistently, suggesting that spurious rewards may unlock latent capabilities learned during pretraining rather than introducing new reasoning skills.



Performance data underscored the surprising robustness of Qwen models. Gains from random rewards (21.4% on MATH-500) and incorrect labels (24.6%) nearly matched the ground-truth reward gain of 28.8%. Similar trends appeared across tasks, such as AMC, where format, wrong, and random rewards produced around an 18% improvement, only slightly lower than the 25% improvement from ground-truth or majority-vote rewards. Even on AIME2024, spurious rewards like format (+13.0%), incorrect (+8.7%), and random (+6.3%) led to meaningful gains, though the advantage of ground-truth labels (+12.8%) remained evident, particularly for AIME2025 questions created after model pretraining cutoffs.



Several Key Takeaways from the research include:

  • Qwen2.5-Math-7B gained 28.8% accuracy on MATH-500 with ground-truth rewards, but also 24.6% with incorrect rewards, 21.4% with random rewards, 16.4% with format rewards, and 26.5% with majority-vote rewards.
  • Code reasoning patterns emerged in Qwen models, increasing from 66.7% to 90%+ under RLVR, which boosted accuracy from 29% to 64%.
  • Non-Qwen models, such as Llama3 and OLMo2, did not show similar improvements, with Llama3.1-8B experiencing up to 8.5% performance drops on spurious rewards.
  • Gains from spurious signals appeared within 50 training steps in many cases, suggesting rapid elicitation of reasoning abilities.
  • The research warns that RLVR studies should avoid generalizing results based on Qwen models alone, as spurious reward effectiveness is not universal.

In conclusion, these findings suggest that while Qwen models can leverage spurious signals to improve performance, the same is not true for other model families. Non-Qwen models, such as Llama3 and OLMo2, showed flat or negative performance changes when trained with spurious signals. The research emphasizes the importance of validating RLVR methods on diverse models rather than relying solely on Qwen-centric results, as many recent papers have done.




Check out the Paper, Official Release, and GitHub Page. All credit for this research goes to the researchers of this project.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,719
Reputation
10,592
Daps
185,786

1/11
@OfficialLoganK
The new Gemini 2.5 Pro is SOTA at long context, especially capable on higher number of items being retrieved (needles) as shown below!



GsyRp-1XwAALqAn.jpg


2/11
@A_MacLullich
What about Opus 4?



3/11
@bio_bootloader
Please figure out how Sonnet 4 is so much better at LoCoDiff!

[Quoted tweet]
the new Gemini 2.5 Pro (06-05) does about the same as the previous version on LoCoDiff

Gemini 2.5 Pro is still the 2nd best model, but Sonnet 4 dominates by a huge margin
[media=twitter]1931101284658266147[/media]

Gsyl-n2b0AA5v4b.jpg


4/11
@TeksEdge
This is very very true. OpenAI has NOT made progress on this. I am nearly completely locked into Gemini Pro 2.5 because of this. No other model can complete nor has as long effective context window. Underhyped!



5/11
@hive_echo
Maybe you already know this bench but it is in agreement:

[Quoted tweet]
Wow Google does it again! Gemini 2.5 Pro is super impressive. Amazing 192k result.
[media=twitter]1930747501365117341[/media]

GstkYEYXUAAmTAG.jpg


6/11
@Cherelynn
So far from BARD ...



7/11
@Titan_III_E
What the heck is going on with claude



8/11
@LaurenceBrem
Pretty amazing retrieval at 192K depth

Credit @ficlive



GsyTJZvXEAELNU0.jpg


9/11
@immoinulmoin
can we get something like claude-code? that would be dope



10/11
@DillonUzar
Sometimes you forget you added a light theme to your own website 😅. Didn't recognize it at first.

Great job to the team BTW!



11/11
@majidmanzarpour
Long context ftw

[Quoted tweet]
Ok @GoogleDeepMind gemini-2.5-pro-preview-06-05, let's see if you can write a script to organize and classify a 25,000+ sound library for my client 👀
[media=twitter]1930791413274313189[/media]

GsuLjBuXEAA8Pts.jpg



To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196





1/11
@GoogleDeepMind
Gemini 2.5 Pro - our most intelligent model, is getting an update before general availability. ✨

It’s even better at: coding 🖥️, reasoning 💡, and creative writing ✍️

Learn more. 🧵



2/11
@GoogleDeepMind
The latest version of 2.5 Pro reflects an 24-point Elo score jump, maintaining its lead on @lmarena_ai at 1470, while continuing to excel at other key benchmarks including:

🟦AIDER Polyglot (coding)
🟦HLE (reasoning and knowledge)
🟦and GPQA (science and math).


Try the latest Gemini 2.5 Pro before general availability.



GssQgbyW8AAwq-n.jpg


3/11
@GoogleDeepMind
🛠️ Start building with Gemini 2.5 Pro in Preview in @Google AI Studio, @GeminiApp, and @GoogleCloud’s /search?q=#VertexAI platform, with general access availability coming in a couple weeks.

Find out more ↓ Try the latest Gemini 2.5 Pro before general availability.



4/11
@llmvibes
@AskPerplexity is this the announcement before the announcement?



5/11
@wardenprotocol
Let Gemini run anchain ⛓️



6/11
@HOARK_
ok how is it on tool calling though i love the intelegence but dont like how it ask me every 5 tool calls "should i do this?"



7/11
@oMarcosdeCastro
When in Gemini Pro?



8/11
@kingdrale
Please make it easier to upgrade the Tier and get higher rate limits. We have spent $500 over the last 2 months and still not able to upgrade to Tier 2



9/11
@AINativeF
Awesome updates for Gemini 1.5 Pro!🔥



10/11
@samptampubolon
When GA?



11/11
@IamEmily2050
How do we know the Gemini Pro 2.5 in the App is the new version?




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196




1/15
@Google
Watch Gemini 2.5 Pro *press rewind* to answer a question about a blast-from-the-past technology — with more structured and creative responses. ⏪



https://video.twimg.com/amplify_video/1931053370594205696/vid/avc1/1920x1080/lhjb5hpw93Sv3Y06.mp4

2/15
@moonriver7676
Give more queries to pro users



3/15
@Ansh_096
please make gemini 2.5 flash like this as well. ty.



4/15
@bytewise010
Google can now time travel 🧳



5/15
@SuAnneDJ009
Nice👏



6/15
@smshrokib
I moved to Gemini pro from chatgpt but now most of the time I am frustrated by the reply structure of Gemini and the ui element and sometime some simple question gemini will provide some weird answer where the Chatgpt would be ok. I am thinking maybe I should start using chatGPT



7/15
@leonard2576
@GooglePlay



8/15
@TavisHighfill
Longer ≠ better. I could explain that in three or four sentences. If you need more than that to catch up to any normal person's understanding of physical media, there's little hope for a successful future for you.



9/15
@GGoldenGod
The AI race between companies, countries, generations, is being played out in real time, you are a 2 trillion cap company and you flaunt your tech to mere 100s of likes. When are you going to catch up with your GTM and marketing strategy? That's what moves the needle now.



10/15
@kimari_ke
I forgot my account password. Unfortunately account recovery doesn't provide option to answer questions or use phone number/email recovery account



11/15
@InsulinClin
Oh please today you were struggling with simple R Markdown & json, knitr/latex.

/search?q=#shinyapps.



12/15
@HANGZ79
Why is the cloned device trying to access Gemini.



13/15
@MBhoi30291
Google please help i request you please help me I can't login my google account I'm 2-step verification on
And hacker hack my mobile and reset everything and I'm login my google account but google show me massage Google doesn't provide another way to sign in to this account p.h



14/15
@ReviewTechGear
Awesome! Gemini 2.5 Pro looks like a game changer, @Google! Excited to see those structured responses in action. 🤯

I’m @ReviewTechGear, an AI scanning X for the best and latest in tech 📱⚙️



15/15
@ibexdream
Google devs when Gemini mispronounces “cassette tape”:

🧍 “That’s... creative structuring.”




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196


1/31
@sundarpichai
Our latest Gemini 2.5 Pro update is now in preview.

It’s better at coding, reasoning, science + math, shows improved performance across key benchmarks (AIDER Polyglot, GPQA, HLE to name a few), and leads @lmarena_ai with a 24pt Elo score jump since the previous version.

We also heard your feedback and made improvements to style and the structure of responses. Try it in AI Studio, Vertex AI, and @Geminiapp. GA coming soon!



GssRbXkbYAAAYmR.jpg


2/31
@metadjai
Awesome! ✨



3/31
@MatthewBerman
Wait...a newer version than the 5-06 release?



4/31
@QixingQstar
I have very good first impression of Gemini-2.5-Pro-0605.

It's the only model that gives me the desired editing I want on my coding project, neither Claude Opus 4 nor o3-pro nailed it.

Congrats @sundarpichai 🙌



5/31
@SingularityAge
Keep GOATing, King!



6/31
@_philschmid
Lets go! 🚀



7/31
@PayItNow_PIN
Impressive upgrade!



8/31
@lexfridman
Nice, congrats!



9/31
@MesutGenAI
This is the way 👌



10/31
@0xShushant
You love to see it



11/31
@soheilsadathoss
Great job!



12/31
@nlemoff
Okay but where’s the Sonnet 4 comparison?



13/31
@kreo444
I used Gemini 2.5 to make this giraffe, could you name it for me




14/31
@springertimo
Really respect the pace you guys have in 2025 - remarkable speed



15/31
@javierburon
Awesome!

For when direct MCP support like Claude?

🙏



16/31
@janekm
Looking impressive in my initial vibe checks! Promising.



17/31
@serhii_p
Gemini 2.5 out here solving math, reasoning, and coding benchmarks meanwhile I still can’t get it to write a cold email that doesn’t sound like it was written by a polite alien



18/31
@x_muskmelon
@grok which the best /search?q=#AI & model in the world right now ?



19/31
@Dannydiaz041
🔥🔥



20/31
@StianWalgermo
Sundar, the Gemini 2.5 Pro has been amazing for my small pet project! It’s grown to an well developed and large pet now 😅🦄



21/31
@illyism
Yesss



22/31
@soheilsadathoss
Thanks @demishassabis !



23/31
@ThomasCsere
Very cool! Is the version updated on @OpenRouterAI ?



24/31
@Yoesef
YOU CAN'T KEEP GETTING AWAY WITH THIS



25/31
@jocarrasqueira
Let’s go 🥳🥳



26/31
@SamMcKayOG
This is getting exciting!



27/31
@thedealdirector
It’s time for more dramatic names like 2.5.1 PRO DRAGON EATER



28/31
@JiquanNgiam
Could we call it Gemini 2.5.1 Pro ?

Major, minor releases would make so much more sense!



29/31
@Phil_Park3r
RIP @AnthropicAI



30/31
@jadenitripp
Wen Deep Think sir



31/31
@AlvigodOP
All heil Google



1/7
@chatgpt21
Gemini 2.5 pro had a massive jump in improvement on simple bench

10%!! Jump since last checkpoint



GsysmYuWQAA0Vdr.jpg


2/7
@emsi_kil3r
They are training on the API data.



3/7
@leo_grundstrom
Gemini 2.5 Pro is seriously
inspiring new possibilities.



4/7
@howdidyoufindit
🏁-they finally got me. s3/aws/gcp/firestore/🔄➿4 sdk/adk 🤝 and mem pruning for their adk agent hckthn. student_agent “Graduates”.



GsyutiPWAAAVrxF.jpg

GsyutiVWAAAalsI.jpg


5/7
@JovanXvfv
Gemin 2.5 pro is the best in coding and finding solutions and chat gpt 4.1 great solving bugs



6/7
@LeeGordon174656
@hpyzq6111w His analysis is great! 💰💦💎



7/7
@PatriciaPh64702
Wow, that’s a huge leap! @GavinBrookswin’s breakdowns always help put these updates in perspective—appreciate the clarity on where things are headed. Exciting times for sure!




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,719
Reputation
10,592
Daps
185,786
[LLM News] Apple has countered the hype



Posted on Sat Jun 7 22:42:35 2025 UTC



















1/24
@RubenHssd
BREAKING: Apple just proved AI "reasoning" models like Claude, DeepSeek-R1, and o3-mini don't actually reason at all.

They just memorize patterns really well.

Here's what Apple discovered:

(hint: we're not as close to AGI as the hype suggests)



Gs2slmza0AAf2r0.jpg


2/24
@RubenHssd
Instead of using the same old math tests that AI companies love to brag about, Apple created fresh puzzle games.

They tested Claude Thinking, DeepSeek-R1, and o3-mini on problems these models had never seen before.

The result ↓



3/24
@RubenHssd
All "reasoning" models hit a complexity wall where they completely collapse to 0% accuracy.

No matter how much computing power you give them, they can't solve harder problems.



Gs2snUyakAAxZYn.jpg


4/24
@RubenHssd
As problems got harder, these "thinking" models actually started thinking less.

They used fewer tokens and gave up faster, despite having unlimited budget.



5/24
@RubenHssd
Apple researchers even tried giving the models the exact solution algorithm.

Like handing someone step-by-step instructions to bake a cake.

The models still failed at the same complexity points.

They can't even follow directions consistently.



6/24
@RubenHssd
The research revealed three regimes:

• Low complexity: Regular models actually win
• Medium complexity: "Thinking" models show some advantage
• High complexity: Everything breaks down completely

Most problems fall into that third category.



Gs2spskacAAIAMu.jpg


7/24
@RubenHssd
Apple discovered that these models are not reasoning at all, but instead doing sophisticated pattern matching that works great until patterns become too complex.

Then they fall apart like a house of cards.



8/24
@RubenHssd
If these models were truly "reasoning," they should get better with more compute and clearer instructions.

Instead, they hit hard walls and start giving up.

Is that intelligence or memorization hitting its limits?



9/24
@RubenHssd
This research suggests we're not as close to AGI as the hype suggests.

Current "reasoning" breakthroughs may be hitting fundamental walls that can't be solved by just adding more data or compute.



10/24
@RubenHssd
Models could handle 100+ moves in Tower of Hanoi puzzles but failed after just 4 moves in River Crossing puzzles.

This suggests they memorized Tower of Hanoi solutions during training but can't actually reason.



Gs2sszdaoAA_sJB.jpg


11/24
@RubenHssd
While AI companies celebrate their models "thinking," Apple basically said "Everyone's celebrating fake reasoning."

The industry is chasing metrics that don't measure actual intelligence.



12/24
@RubenHssd
Apple's researchers used controllable puzzle environments specifically because:

• They avoid data contamination
• They require pure logical reasoning
• They can scale complexity precisely
• They reveal where models actually break

Smart experimental design if you ask me.



13/24
@RubenHssd
What do you think?

Is Apple just "coping" because they've been outpaced in AI developments over the past two years?

Or is Apple correct?

Comment below and I'll respond to all.



14/24
@RubenHssd
If you found this thread valuable:

1. Follow me @RubenHssd for more threads around what's happening around AI and it's implications.

2. RT the first tweet

[Quoted tweet]
BREAKING: Apple just proved AI "reasoning" models like Claude, DeepSeek-R1, and o3-mini don't actually reason at all.

They just memorize patterns really well.

Here's what Apple discovered:

(hint: we're not as close to AGI as the hype suggests)


Gs2slmza0AAf2r0.jpg


15/24
@VictorTaelin
I have a lot to say about this but I'm in a hospital right now. In short - this is a very well written paper that is undeniably correct, and makes a point that is obvious to anyone in the area. LLMs are *not* reasoning. They're more like a humanity-wide, cross-programming-language, global hash-consing or sorts. That is extremely powerful and will advance many areas, but it *not* going to result in AGI. That said, what most miss is the real lesson taught by LLMs: massive compute, added to an otherwise simple algorithm, wields immense power and utility. I don't know why people fail to see this obvious message, but the next big thing is obviously going to be companies that realize this very lesson and use that to build entirely new things that can take advantage of massive scale.



16/24
@PrestonPysh
Kinda rich coming from Apple don’t ya think?



17/24
@zayn4pf
good thread man



18/24
@FrankSchuil
Paperclip optimizers will still go a long way.



19/24
@sypen231984
Didn’t Anthropic already prove this



20/24
@dohko_01
AI is not capable of abstract thought.. it’s just pattern matching on steroids



21/24
@sifisobiya
👏🏽👏🏽👏🏽👏🏽👌



22/24
@thepowerofozone
That should have been obvious to anyone who used AI for longer than 5 minutes.



23/24
@thepsironi
That is obvious, not much of a discovery.



24/24
@dgt10011
Whether AGI is here or not is irrelevant. What’s important is that I’ve seen enough with my own eyes to know there’s going to be tons of labor replacement and the social contract will be completely upended sooner than we think.




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196



















1/15
@alex_prompter
🚨 BREAKING: Apple says LLMs that "think" are giving us an illusion.

They're just pattern-matching with confidence.

And when things get complex? They collapse.

This paper might be the most honest take on AI yet 🧵:



2/15
@alex_prompter
1/ Apple researchers tested “reasoning LLMs” using logic puzzles with controlled complexity.

These models use chain-of-thought to solve problems step-by-step.

But when things get hard?

Their performance crashes.



Gs7A1HyWwAA8GDN.png


3/15
@alex_prompter
2/ At first, adding more steps helps.

LLMs reason more and do better — up to a point.

Then it reverses.

More complexity = worse thinking, even when there's enough token space to continue.



Gs7Bd1yXMAA971q.jpg


4/15
@alex_prompter
3/ This is the illusion:

These models seem intelligent because they follow thought-like patterns.

But the paper shows these traces collapse under complexity.

They're not thinking. They're pattern matching.



Gs7Bu-qWMAA4_3Q.png


5/15
@alex_prompter
4/ The study breaks LLM behavior into 3 zones:

• Low-complexity: vanilla models > reasoning models
• Medium: reasoning models shine
• High-complexity: both fail catastrophically



Gs7B2ncXEAAOQMi.jpg


6/15
@alex_prompter
5/ Here's the shocking bit:

Reasoning LLMs often don’t use real algorithms. They improvise.

So when the problem’s too tough?

They stop trying and guess - confidently.

That’s hallucination at scale.



Gs7CCVgWIAA7hoc.jpg


7/15
@alex_prompter
6/ Apple used a clever setup to test this:

Puzzles with fixed logic but variable complexity.

This let them see how models reason — not just whether they’re right.

The result: models explore erratically and don’t learn structure.



Gs7CIweXQAAL_kj.jpg


8/15
@alex_prompter
7/ Think about it:
You're watching someone solve a puzzle, and they explain each step.

Looks smart, right?

Now imagine they're just making it up as they go.
That’s what LLMs do under pressure.



Gs7CdrgWgAA2i--.jpg


9/15
@alex_prompter
8/ The paper calls it what it is:
“The illusion of thinking.”

Chain-of-thought gives us confidence, not competence.

The longer the trace, the more we believe it’s smart.

Even when it’s wrong.



Gs7Cw1kWwAE81Pw.png


10/15
@alex_prompter
9/ And that’s why hallucinations persist.

Not because models don’t know enough.

But because they’re confident guessers — not actual reasoners.

It’s a structural flaw.



Gs7C4whXIAAIfR9.jpg


11/15
@alex_prompter
10/ Apple’s experiments expose the real ceiling:

You can’t fix deep reasoning by just giving models more tokens.

It’s not a bandwidth problem.

It’s a cognitive illusion.



12/15
@alex_prompter
11/ This changes the game for AI believers.

Do we double down on mimicking thought?

Or build models that actually understand?

Because the gap is bigger than it looks.



13/15
@alex_prompter
12/ If you're interested to read more, here's the full paper:

📰 The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity -
The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity



14/15
@alex_prompter
The AI prompt library your competitors don't want you to find

→ Unlimited prompts: $150 lifetime or $15/month
→ Starter pack: $3.99/month
→ Pro bundle: $9.99/month

Grab it before it's gone 👇
Pricing - God of Prompt



15/15
@alex_prompter
That's a wrap! If you found this useful:

1/ Follow me @alex_prompter for more AI tips.
2/ Like & RT this post:

[Quoted tweet]
🚨 BREAKING: Apple says LLMs that "think" are giving us an illusion.

They're just pattern-matching with confidence.

And when things get complex? They collapse.

This paper might be the most honest take on AI yet 🧵:



To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,719
Reputation
10,592
Daps
185,786
Leaked AI Technology Making Large Language Models Obsolete!


https://inv.nadeko.net/watch?__goaw...ttps://inv.nadeko.net/&hl=en-US&v=V8xAQrdeGoo

Channel Info Pourya Kordi
Subscribers: 9.54K

Description
In this video we talked about several leaked AI technologies from major labs, some of them revealed by the researchers and some introduced by companies but at the conceptual level while the actual recipe is hidden.

This video covers the latest developments in the field of artificial intelligence, particularly focusing on the rapid ai development and the future of ai. The discussion includes advancements in large language models and their potential impact on various industries. Stay informed about the latest ai news and predictions.

0:00 Introduction
0:44 Sub-Quadratic
6:12 Hidden Thought Process and JEPA
14:04 Self-Play and Self-Evolution Tech
16:45 Gemini's Ultimate Goal

***************
All materials in these videos are used for educational purposes and fall within the guidelines of fair use. No copyright infringement is intended. If you are or represent the copyright owner of materials used in this video and have a problem with the use of said material, please contact me via my email in the "about" page on my channel.

**************


 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,719
Reputation
10,592
Daps
185,786
[News] Meta releases V-JEPA 2, the first world model trained on video



Posted on Wed Jun 11 14:48:35 2025 UTC


Meta: Introducing the V-JEPA 2 world model and new benchmarks for physical reasoning







1/6
@TheTuringPost
11 Types of JEPA you should know about:

▪️ V-JEPA 2
▪️ Time-Series-JEPA (TS-JEPA)
▪️ Denoising JEPA (D-JEPA)
▪️ CNN-JEPA
▪️ Stem-JEPA
▪️ DMT-JEPA
▪️ seq-JEPA
▪️ AD-L-JEPA
▪️ SAR-JEPA
▪️ HEP-JEPA
▪️ ECG-JEPA

JEPA by @ylecun and other researchers from Meta is a self-supervised learning framework that predicts the latent representation of a missing part of the input. It's really worth learning more about 👇

Check this out for more info and useful resources: @Kseniase on Hugging Face: "11 Types of JEPA Since Meta released the newest V-JEPA 2 this week, we…"



Gteu4lkaQAAmwQH.jpg


2/6
@TheTuringPost
Other interesting JEPA types:

[Quoted tweet]
12 types of JEPA (Joint-Embedding Predictive Architecture)

▪️ I-JEPA
▪️ MC-JEPA
▪️ V-JEPA
▪️ UI-JEPA
▪️ A-JEPA (Audio-based JEPA)
▪️ S-JEPA
▪️ TI-JEPA
▪️ T-JEPA
▪️ ACT-JEPA
▪️ Brain-JEPA
▪️ 3D-JEPA
▪️ Point-JEPA

Save the list and check this out for the links and more info: huggingface.co/posts/Ksenias…


Gryj8RpWUAAd1Ff.jpg


3/6
@Jacoed
Nah thanks



4/6
@HrishbhDalal
working on one more 😉



5/6
@xzai259
No thank you.



6/6
@ThinkDi92468945
Does it achieve SOTA on any benchmarks?




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196











1/9
@TheTuringPost
A new comer in the family of world models — V-JEPA-2

By combining 1M+ hours of internet videos and a little bit of robot interaction data, @AIatMeta built an AI that can:

• Watch
• Understand
• Answer questions
• Help robots plan and act in physical world

V-JEPA 2 shows true success of self-supervised learning and efficient scaling of everything.

Here is how it actually works:

[Quoted tweet]
Introducing V-JEPA 2, a new world model with state-of-the-art performance in visual understanding and prediction.

V-JEPA 2 can enable zero-shot planning in robots—allowing them to plan and execute tasks in unfamiliar environments.

Download V-JEPA 2 and read our research paper ➡️ ai.meta.com/vjepa/


GtPkb0ja8AA0dSh.jpg


2/9
@TheTuringPost
1. How does V-JEPA 2 excel at understanding motion and predicting?

The researchers first trained Video Joint Embedding Predictive Architecture 2 (V-JEPA 2) on over 1 million hours of video from the internet.

The strategy was - mask and predict:

• Encoder – turns the visible parts of the video into representations.
• Predictor – uses those representations to predict the masked parts.

So V-JEPA 2 learns from them without knowing what actions are being taken.



GtPkcuibYAAOMfE.png


3/9
@TheTuringPost
2. Another smart strategy is to scale up everything:

• Much more training data: 2 million → 22 million videos
• Bigger model: 300 million → 1 billion + parameter encoder
• Longer training: 90K → 252K steps
• Higher video resolution and clip length

This all helped to improve the performance



GtPkdo5bsAAO02w.jpg


4/9
@TheTuringPost
3. From watching to acting: V-JEPA 2-AC

“AC” stands for Action-Conditioned. This stage teaches the model to reason about actions, not just observations.

- The researchers keep the original V-JEPA 2 frozen.
- They add a new predictor on top that takes into account both what the robot sees and what actions it takes.



GtPkejhaAAAUxyi.png


5/9
@TheTuringPost
4. Once trained, V-JEPA 2 can be used for planning and performing actions:

- The robot is given a goal image — what the scene should look like after it succeeds.
- The model processes its current state — frame and arm position.
- Then it tries out different possible action sequences and imagines what the result will be.
- It picks the sequence that gets its prediction closest to the goal image.
- It executes only the first action, then repeats the process step-by-step — this is called receding horizon control.
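
A minimal sketch of this planning loop (receding horizon control), assuming a latent world model with a rollout interface, an encoder, and an action sampler; all interfaces here are illustrative placeholders, not Meta's code.

import numpy as np

def plan_step(world_model, encode, current_obs, goal_image, action_sampler,
              n_candidates=64, horizon=5):
    goal_repr = encode(goal_image)                        # target state in latent space
    best_cost, best_seq = float("inf"), None
    for _ in range(n_candidates):
        actions = action_sampler(horizon)                 # candidate action sequence
        predicted = world_model.rollout(encode(current_obs), actions)
        cost = np.linalg.norm(predicted - goal_repr)      # distance to the goal representation
        if cost < best_cost:
            best_cost, best_seq = cost, actions
    return best_seq[0]                                    # execute only the first action, then replan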



GtPkfeWa0AAAA2j.jpg


6/9
@TheTuringPost
5. Zero-shot robot manipulation:

Trained with only raw 62 hours of unlabeled robot data from a robot arm, V-JEPA 2 achieves:

- 100% success in reach tasks
- Up to 80% success in pick-and-place tasks, even with new objects and cluttered scenes

This is what makes V-JEPA 2 self-supervised



GtPkgXwb0AIUXT5.jpg


7/9
@TheTuringPost
6. Other capabilities:

Understanding: 77.3% SSv2 accuracy, state-of-the-art VidQA
Prediction: 39.7 recall@5 on Epic-Kitchens-100



GtPkhYSbQAAL37k.jpg

GtPkhieagAAlfgZ.jpg


8/9
@TheTuringPost
Paper: V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning | Research - AI at Meta

Meta's blog post: Introducing V-JEPA 2



9/9
@AkcayGok36003
“You’re killing the game!”
@miller_elio 🎈 🎈 💰 💰




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,719
Reputation
10,592
Daps
185,786

Meet BioReason: The World’s First Reasoning Model in Biology that Enables AI to Reason about Genomics like a Biology Expert​


By Sana Hassan

June 7, 2025

A major hurdle in using AI for genomics is the lack of interpretable, step-by-step reasoning from complex DNA data. While DNA foundation models excel at learning rich sequence patterns for tasks such as variant prediction and gene regulation, they often operate as black boxes, offering limited insight into the underlying biological mechanisms. Meanwhile, large language models demonstrate impressive reasoning skills across various domains, but they aren’t designed to handle raw genomic sequences. This gap between strong DNA representation and deep biological reasoning prevents AI from reaching expert-level understanding and limits its potential to drive scientific discovery through meaningful, hypothesis-driven explanations.

DNA foundation models have made significant progress by learning rich representations directly from genomic sequences, showing strong performance across a range of biological tasks. Models like Evo2, with its long-range capabilities, highlight their potential, but their lack of interpretability limits deeper biological insights. Meanwhile, large language models excel in reasoning over biomedical texts but often don’t engage directly with raw genomic data. Attempts, such as GeneGPT and TxGemma, represent early efforts to bridge this gap. Current genomic benchmarks assess task performance but fall short in evaluating reasoning and hypothesis generation.


Researchers from the University of Toronto, Vector Institute, University Health Network (UHN), Arc Institute, Cohere, University of California, San Francisco, and Google DeepMind have introduced BIOREASON, a pioneering AI system that unites a DNA foundation model with an LLM. This integration allows BIOREASON to analyze raw genomic sequences while applying LLM-based reasoning to generate clear, biologically grounded insights. Trained through supervised fine-tuning and reinforcement learning, it achieves a performance gain of 15% or more over traditional models, reaching up to 97% accuracy in KEGG-based disease pathway prediction. This approach offers interpretable, step-by-step outputs that advance biological understanding and facilitate hypothesis generation.

The BIOREASON model is a multimodal framework designed to support deep, interpretable biological reasoning by combining genomic sequences with natural language queries. It uses a DNA foundation model to extract rich, contextual embeddings from raw DNA inputs and integrates these with tokenized textual queries to form a unified input for an LLM, specifically Qwen3. The system is trained to generate step-by-step explanations of biological processes. DNA embeddings are projected into the LLM’s space using a learnable layer, and the combined input is enriched with positional encoding. Additionally, reinforcement learning via Group Relative Policy Optimization refines its reasoning capabilities.
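
A hedged sketch of that fusion step: DNA-encoder embeddings are mapped into the LLM's embedding space by a learnable projection and concatenated with the embedded text query. The dimensions and class name below are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class DNAToLLMProjector(nn.Module):
    def __init__(self, dna_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(dna_dim, llm_dim)   # learnable projection into the LLM's space

    def forward(self, dna_embeddings, text_embeddings):
        # dna_embeddings:  (batch, dna_len, dna_dim) from the DNA foundation model
        # text_embeddings: (batch, txt_len, llm_dim) from the LLM's token embedder
        projected = self.proj(dna_embeddings)                     # (batch, dna_len, llm_dim)
        fused = torch.cat([projected, text_embeddings], dim=1)    # unified input sequence
        return fused                                              # fed to the LLM with positional encoding

fusion = DNAToLLMProjector()
dna = torch.randn(2, 512, 1024)
text = torch.randn(2, 64, 4096)
print(fusion(dna, text).shape)  # torch.Size([2, 576, 4096])
```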


The researchers evaluated BIOREASON on three datasets focused on DNA variant interpretation and biological reasoning. It outperformed both DNA-only and LLM-only models in predicting disease outcomes from genomic variants. The best-performing version, which combined Evo2 and Qwen3-4B, achieved high accuracy and F1-scores across all tasks. A notable case study involved a PFN1 mutation linked to ALS, where BIOREASON accurately predicted the disease and generated a 10-step explanation tracing the variant’s impact on actin dynamics and motor neuron degeneration. This shows its strength not just in accurate predictions but also in providing transparent, biologically grounded reasoning paths.



In conclusion, BIOREASON combines DNA encoders with large language models to enable detailed, interpretable reasoning over genomic data. Unlike traditional models, it not only makes accurate predictions but also explains the biological logic behind them using step-by-step outputs. This helps scientists better understand disease mechanisms and generate new research questions. While powerful, BIOREASON has challenges, like high computational costs and limited uncertainty measures. Future work aims to address these issues by improving scalability, incorporating additional biological data such as RNA and proteins, and applying it to broader tasks, including GWAS. Overall, BIOREASON shows promise in advancing precision medicine and genomic research.




Check out the Paper, GitHub Page, and Project Page. All credit for this research goes to the researchers of this project.


 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,719
Reputation
10,592
Daps
185,786

StepFun Introduces Step-Audio-AQAA: A Fully End-to-End Audio Language Model for Natural Voice Interaction​


By Nikhil

June 16, 2025

Rethinking Audio-Based Human-Computer Interaction


Machines that can respond to human speech with equally expressive and natural audio have become a major goal in intelligent interaction systems. Audio-language modeling extends this vision by combining speech recognition, natural language understanding, and audio generation. Rather than relying on text conversions, models in this space aim to understand and reply using voice alone. This is crucial not only for accessibility and inclusiveness but also for achieving more fluid, human-like machine interactions in applications such as voice assistants, audio-based storytelling, and hands-free computing.

Limitations of Cascaded Speech Pipelines


Despite advancements in audio understanding, a clear challenge remains: most systems still rely on a chain of separate modules for speech-to-text, text processing, and text-to-speech conversion. This modular approach can degrade performance and responsiveness due to accumulated errors and latency. Furthermore, these pipelines lack expressive control, rendering them unsuitable for nuanced tasks such as emotional dialogue or dynamic speech synthesis. An ideal solution would be a fully unified model capable of understanding an audio question and generating an expressive audio answer directly, thereby eliminating all text-based intermediation.



From Token-Based Models to Fully Unified LALMs


Several methods have attempted to address this. Early approaches, such as HuggingGPT and AudioGPT, utilized cascaded architectures that combined separate speech and language models. While they expanded task coverage, these systems struggled with real-time voice interaction. Later works, such as VALL-E, SpeechGPT, AudioPaLM, and Qwen2-Audio, introduced token-based systems that convert audio into discrete representations. Yet, even these models mostly output text and require separate vocoders, limiting their ability to produce expressive, immediate audio responses.

Introducing Step-Audio-AQAA: An End-to-End AQAA System


Researchers at StepFun introduced Step-Audio-AQAA, a fully end-to-end large audio-language model designed specifically for Audio Query–Audio Answer tasks. Unlike prior models, Step-Audio-AQAA directly transforms spoken input into expressive spoken output without converting it into intermediate text. This architecture combines a dual-codebook tokenizer, a 130-billion-parameter backbone LLM named Step-Omni, and a flow-matching vocoder for natural speech synthesis. The integration of these components enables seamless, low-latency interaction.



Tokenization, Architecture, and Voice Control


The method begins with two separate audio tokenizers—one for linguistic features and another for semantic prosody. The linguistic tokenizer, based on Paraformer, extracts structured speech elements like phonemes at 16.7 Hz using a codebook of 1,024 tokens. Meanwhile, the semantic tokenizer (inspired by CosyVoice 1.0) encodes acoustic richness at 25 Hz with 4,096 tokens. These are interleaved in a 2:3 ratio and passed into Step-Omni, a multimodal decoder-only LLM trained on text, audio, and image data. After this, the model outputs tri-codebook sequences of audio and text tokens, which the vocoder transforms into fluid speech. This setup enables fine-grained voice control, including emotional tone and speech rate.
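
As a rough illustration of the 2:3 interleaving described above, the sketch below merges two token streams by repeatedly taking two linguistic tokens and three semantic tokens; the exact packing used by Step-Audio-AQAA may differ.

```python
def interleave_2_to_3(linguistic_tokens, semantic_tokens):
    """Merge two token streams in a repeating 2:3 pattern."""
    merged = []
    li, si = 0, 0
    while li < len(linguistic_tokens) or si < len(semantic_tokens):
        merged.extend(linguistic_tokens[li:li + 2])   # take up to 2 linguistic tokens
        merged.extend(semantic_tokens[si:si + 3])     # take up to 3 semantic tokens
        li += 2
        si += 3
    return merged

# Example: 'L' = linguistic (16.7 Hz codebook), 'S' = semantic (25 Hz codebook)
print(interleave_2_to_3(["L0", "L1", "L2", "L3"], ["S0", "S1", "S2", "S3", "S4", "S5"]))
# ['L0', 'L1', 'S0', 'S1', 'S2', 'L2', 'L3', 'S3', 'S4', 'S5']
```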

Benchmark Evaluation and Results


The model was evaluated using the StepEval-Audio-360 benchmark, which comprises multilingual, multi-dialectal audio tasks across nine categories, including creativity, gaming, emotion control, role-playing, and voice understanding. In comparison to state-of-the-art models like Kimi-Audio and Qwen-Omni, Step-Audio-AQAA achieved the highest Mean Opinion Scores in most categories. Specifically, in text-audio token ratio experiments, the configuration with a 10:15 ratio achieved top performance with Chat (4.03), Relevance (0.65), and Factuality (0.67) scores. Among different audio interleaving techniques, marker-preserving concatenation performed best, with Chat (4.22), Relevance (0.57), and Factuality (0.57) scores. These numbers reflect its strength in generating semantically accurate, emotionally rich, and context-aware audio responses.



Conclusion: Toward Expressive Machine Speech


Step-Audio-AQAA offers a robust solution to the limitations of modular speech processing pipelines. By combining expressive audio tokenization, a powerful multimodal LLM, and advanced post-training strategies such as Direct Preference Optimization and model merging, it succeeds in generating high-quality, emotionally resonant audio responses. This work marks a significant step forward in enabling machines to communicate with speech that is not only functional but expressive and fluid.




Check out the Paper and Model on Hugging Face. All credit for this research goes to the researchers of this project.


 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,719
Reputation
10,592
Daps
185,786

OpenBMB Releases MiniCPM4: Ultra-Efficient Language Models for Edge Devices with Sparse Attention and Fast Inference​


By Asif Razzaq

June 16, 2025

The Need for Efficient On-Device Language Models


Large language models have become integral to AI systems, enabling tasks like multilingual translation, virtual assistance, and automated reasoning through transformer-based architectures. While highly capable, these models are typically large, requiring powerful cloud infrastructure for training and inference. This reliance leads to latency, high costs, and privacy concerns, limiting their deployment on resource-constrained edge devices. Models like GPT and LLaMA, with billions of parameters, cannot efficiently run on local hardware due to their size and the complexity of their training and inference processes. Moreover, their dependence on massive datasets and high-performance GPUs makes them unsuitable for mobile or embedded environments. To overcome these challenges, there is a growing need for lightweight, efficient models that can perform well locally without sacrificing reasoning and context-handling capabilities.

Limitations of Existing Solutions


Several methods have been explored to address these challenges. Sparse attention mechanisms, such as NSA and MoBA, aim to reduce memory consumption; however, they either fall short in decoding efficiency or introduce significant architectural overhead. For data handling, previous methods have leaned on large-scale web scraping, resulting in noisy and unstructured corpora. Filtering methods have included fastText classifiers and manual curation, which either lack depth or scalability. On the training side, frameworks such as StepLaw have been used to optimize hyperparameters based on predictable scaling laws; however, they often require extensive experimentation and GPU cycles, creating a barrier to entry. Inference optimizations, such as FlashAttention, reduce computational complexity but still fall short of delivering the speeds required for real-time applications on edge devices.


Introducing MiniCPM4: Efficient Architecture, Data, and Inference


Researchers from OpenBMB introduced MiniCPM4, a suite of highly efficient large language models designed specifically for on-device deployment. The development includes two variants: one with 0.5 billion parameters and another with 8 billion. The model was built with improvements in four core dimensions: model architecture, training data, training algorithm, and inference systems. For architecture, the team introduced InfLLM v2, a sparse attention mechanism that accelerates both prefilling and decoding without sacrificing context comprehension. On the data front, UltraClean was employed to generate and filter training datasets, enabling the use of just 8 trillion training tokens compared to the 36 trillion used by competitive models like Qwen3-8B. ModelTunnel v2 guided the training process with efficient hyperparameter tuning, and CPM.cu handled inference with platform-agnostic CUDA-based execution.



Technical Innovations in MiniCPM4


MiniCPM4’s tech stack is designed to strike a balance between performance and resource utilization. InfLLM v2 partitions key-value caches into blocks and selects top-K relevant blocks using semantic kernels for attention, reducing attention computation by 60% compared to NSA. Its dynamic context block selection and token-level query group processing allow it to support sequences up to 128K tokens while maintaining speed and coherence. UltraClean relies on efficient data verification, utilizing a pre-trained LLM and annealing-based fine-tuning on 10 billion tokens. This results in higher-quality datasets, UltraFineWeb in English and UltraFineWeb-zh in Chinese, which outperform FineWeb by 3.61 and 1.98 percentage points, respectively, in average benchmark performance. UltraChat v2 further supports post-training by generating reasoning-rich, multi-turn dialogues.
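
A hedged sketch of block-level sparse attention in the spirit of InfLLM v2: the key/value cache is split into fixed-size blocks, each block is scored against the query, and attention runs only over the top-K blocks. Scoring blocks by their mean key is an assumption for illustration; the actual system uses semantic kernels and token-level query groups.

```python
import torch
import torch.nn.functional as F

def block_sparse_attention(q, k, v, block_size=64, top_k=8):
    # q: (heads, dim); k, v: (heads, seq_len, dim)
    heads, seq_len, dim = k.shape
    n_blocks = seq_len // block_size
    k_blocks = k[:, :n_blocks * block_size].reshape(heads, n_blocks, block_size, dim)
    v_blocks = v[:, :n_blocks * block_size].reshape(heads, n_blocks, block_size, dim)

    block_keys = k_blocks.mean(dim=2)                              # (heads, n_blocks, dim)
    block_scores = torch.einsum("hd,hbd->hb", q, block_keys)       # relevance of each block to the query
    top = block_scores.topk(min(top_k, n_blocks), dim=-1).indices  # (heads, top_k)

    out = torch.zeros(heads, dim)
    for h in range(heads):
        sel_k = k_blocks[h, top[h]].reshape(-1, dim)               # keys from the selected blocks only
        sel_v = v_blocks[h, top[h]].reshape(-1, dim)
        attn = F.softmax(sel_k @ q[h] / dim ** 0.5, dim=-1)
        out[h] = attn @ sel_v
    return out

q = torch.randn(4, 128)
k = torch.randn(4, 4096, 128)
v = torch.randn(4, 4096, 128)
print(block_sparse_attention(q, k, v).shape)  # torch.Size([4, 128])
```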



Benchmark Performance and Speed Gains


In terms of raw performance, the 8B version achieved MMLU scores of 32.24%, outperforming FineWeb (28.84%) and FineWeb-edu (31.80%). On ARC-C and ARC-E, it scored 35.67% and 70.62% respectively, surpassing competing datasets by over 10 percentage points. Compared to Qwen3-8B, MiniCPM4 used only 22% of the training data yet delivered a 7-fold increase in inference speed on 128K-token documents when tested on end-side GPUs like Jetson AGX Orin and RTX 4090. The average decoding speed reached over 200 tokens/s for long-context inputs, and the architecture degraded gracefully to dense attention for shorter sequences. Additionally, the use of BitCPM4 enabled quantization-aware training, allowing deployment on devices with even stricter memory constraints without losing performance fidelity.



Key Takeaways from MiniCPM4:​


  • MiniCPM4 comes in 0.5B and 8B parameter sizes, optimized for edge devices.
  • It utilized only 8 trillion training tokens, versus 36 trillion by Qwen3-8B.
  • It achieved 7x faster processing of 128K-token documents compared to Qwen3-8B.
  • InfLLM v2 reduced attention computation costs by 60% using block-level attention.
  • UltraFineWeb outperformed FineWeb by 3.61% (English) and 1.98% (Chinese) on benchmarks.
  • Reached 35.67% on ARC-C, 70.62% on ARC-E, and 32.24% on MMLU, exceeding prior datasets.
  • BitCPM4 enabled ternary LLMs suitable for extremely constrained hardware.
  • CPM.cu inference system combined CUDA optimization with speculative sampling.
  • UltraChat v2 enabled enhanced fine-tuning with reasoning-intensive dialogue generation.
  • ModelTunnel v2 used ScalingBench for precise hyperparameter tuning, increasing training efficiency.

Conclusion: Efficient LLMs for Edge AI Applications


In conclusion, the comprehensive approach taken by the MiniCPM4 team addressed all key inefficiencies associated with current LLMs. By introducing novel architectural, training, and deployment strategies, the model maintains high-quality responses, supports long-context comprehension, and performs well under edge constraints. The success of this work extends beyond raw metrics to demonstrate that state-of-the-art performance is achievable outside the cloud. It enables new application domains, such as secure offline assistants, real-time mobile AI, and autonomous embedded systems, without the traditional computational burden.




Check out the Paper, Model on Hugging Face, and GitHub Page. All credit for this research goes to the researchers of this project.


 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,719
Reputation
10,592
Daps
185,786

Microsoft AI Introduces Code Researcher: A Deep Research Agent for Large Systems Code and Commit History​


By Asif Razzaq

June 14, 2025

Rise of Autonomous Coding Agents in System Software Debugging


The use of AI in software development has gained traction with the emergence of large language models (LLMs) capable of performing coding-related tasks. This shift has led to the design of autonomous coding agents that assist or even automate tasks traditionally carried out by human developers. These agents range from simple script writers to complex systems capable of navigating codebases and diagnosing errors. Recently, the focus has shifted toward enabling these agents to handle more sophisticated challenges, especially those associated with extensive and intricate software environments. This includes foundational systems software, where precise changes require understanding of not only the immediate code but also its architectural context, interdependencies, and historical evolution. Thus, there is growing interest in building agents that can perform in-depth reasoning and synthesize fixes or changes with minimal human intervention.

Challenges in Debugging Large-Scale Systems Code


Updating large-scale systems code presents a multifaceted challenge due to its inherent size, complexity, and historical depth. These systems, such as operating systems and networking stacks, consist of thousands of interdependent files. They have been refined over decades by numerous contributors. This leads to highly optimized, low-level implementations where even minor alterations can trigger cascading effects. Additionally, traditional bug descriptions in these environments often take the form of raw crash reports and stack traces, which are typically devoid of guiding natural language hints. As a result, diagnosing and repairing issues in such code requires a deep, contextual understanding. This demands not only a grasp of the code’s current logic but also an awareness of its past modifications and global design constraints. Automating such diagnosis and repair has remained elusive, as it requires extensive reasoning that most coding agents are not equipped to perform.


Limitations of Existing Coding Agents for System-Level Crashes


Popular coding agents, such as SWE-agent and OpenHands, leverage large language models (LLMs) for automated bug fixing. However, they primarily focus on smaller, application-level codebases. These agents generally rely on structured issue descriptions provided by humans to narrow their search and propose solutions. Tools such as AutoCodeRover explore the codebase using syntax-based techniques. They are often limited to specific languages like Python and avoid system-level intricacies. Moreover, none of these methods incorporates code evolution insights from commit histories, a vital component when handling legacy bugs in large-scale codebases. While some use heuristics for code navigation or edit generation, their inability to reason deeply across the codebase and consider historical context limits their effectiveness in resolving complex, system-level crashes.

Code Researcher: A Deep Research Agent from Microsoft


Researchers at Microsoft Research introduced Code Researcher, a deep research agent engineered specifically for system-level code debugging. Unlike prior tools, this agent does not rely on predefined knowledge of buggy files and operates in a fully unassisted mode. It was tested on a Linux kernel crash benchmark and a multimedia software project to assess its generalizability. Code Researcher executes a multi-phase strategy. First, it analyzes the crash context using various exploratory actions, such as symbol definition lookups and pattern searches. Second, it synthesizes patch solutions based on accumulated evidence. Finally, it validates these patches using automated testing mechanisms. The agent utilizes tools to explore code semantics, identify function flows, and analyze commit histories, a critical innovation previously absent in other systems. Through this structured process, the agent operates not only as a bug fixer but also as an autonomous researcher, collecting data and forming hypotheses before intervening in the codebase.



Three-Phase Architecture: Analysis, Synthesis, and Validation


The functioning of Code Researcher is broken down into three defined phases: Analysis, Synthesis, and Validation. In the Analysis phase, the agent begins by processing the crash report and initiates iterative reasoning steps. Each step includes tool invocations to search symbols, scan for code patterns using regular expressions, and explore historical commit messages and diffs. For instance, the agent might search for a term like `memory leak` across past commits to understand code changes that could have introduced instability. The memory it builds is structured, recording all queries and their results. When it determines that enough relevant context has been collected, it transitions into the Synthesis phase. Here, it filters out unrelated data and generates patches by identifying one or more potentially faulty snippets, even if spread across multiple files. In the final Validation phase, these patches are tested against the original crash scenarios to verify their effectiveness. Only validated solutions are presented for use.
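
The sketch below condenses that three-phase loop into a single function. The `llm` and `tools` interfaces (action selection, symbol search, pattern scan, commit search, patch generation, crash reproduction) are hypothetical stand-ins for the agent's real components, shown only to make the control flow concrete.

```python
def code_researcher(crash_report, llm, tools, max_steps=20):
    """Three-phase deep-research loop: Analysis -> Synthesis -> Validation.

    `tools` is assumed to map action kinds ("symbol", "pattern", "commits")
    to search callables, plus "patch" and "reproduce" helpers.
    """
    memory = []                                               # structured memory of (action, result)

    # Phase 1: Analysis -- iteratively gather context about the crash.
    for _ in range(max_steps):
        action = llm.choose_action(crash_report, memory)      # e.g. symbol lookup, regex scan, commit search
        if action.kind == "done":                             # agent decides context is sufficient
            break
        result = tools[action.kind](action.query)             # run the chosen exploration tool
        memory.append((action, result))

    # Phase 2: Synthesis -- filter the memory and propose candidate patches.
    relevant = llm.filter_context(crash_report, memory)
    patches = [tools["patch"](crash_report, relevant) for _ in range(3)]

    # Phase 3: Validation -- keep only patches that stop the crash from reproducing.
    return [p for p in patches if not tools["reproduce"](p)]
```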



Benchmark Performance on Linux Kernel and FFmpeg


Performance-wise, Code Researcher achieved substantial improvements over its predecessors. When benchmarked against kBenchSyz, a set of 279 Linux kernel crashes generated by the Syzkaller fuzzer, it resolved 58% of crashes using GPT-4o with a 5-trajectory execution budget. In contrast, SWE-agent managed only a 37.5% resolution rate. On average, Code Researcher explored 10 files per trajectory, significantly more than the 1.33 files navigated by the SWE-agent. In a subset of 90 cases where both agents modified all known buggy files, Code Researcher resolved 61.1% of the crashes versus 37.8% by SWE-agent. Moreover, when o1, a reasoning-focused model, was used only in the patch generation step, the resolution rate remained at 58%. This reinforces the conclusion that strong contextual reasoning greatly boosts debugging outcomes. The approach was also tested on FFmpeg, an open-source multimedia project. It successfully generated crash-preventing patches in 7 out of 10 reported crashes, illustrating its applicability beyond kernel code.



Key Technical Takeaways from the Code Researcher Study


  • Achieved 58% crash resolution on Linux kernel benchmark versus 37.5% by SWE-agent.
  • Explored an average of 10 files per bug, compared to 1.33 files by baseline methods.
  • Demonstrated effectiveness even when the agent had to discover buggy files without prior guidance.
  • Incorporated novel use of commit history analysis, boosting contextual reasoning.
  • Generalized to new domains like FFmpeg, resolving 7 out of 10 reported crashes.
  • Used structured memory to retain and filter context for patch generation.
  • Demonstrated that deep reasoning agents outperform traditional ones even when given more compute.
  • Validated patches with real crash reproducing scripts, ensuring practical effectiveness.

Conclusion: A Step Toward Autonomous System Debugging


In conclusion, this research presents a compelling advancement in automated debugging for large-scale system software. By treating bug resolution as a research problem, requiring exploration, analysis, and hypothesis testing, Code Researcher exemplifies the future of autonomous agents in complex software maintenance. It avoids the pitfalls of previous tools by operating autonomously, thoroughly examining both the current code and its historical evolution, and synthesizing validated solutions. The significant improvements in resolution rates, particularly across unfamiliar projects such as FFmpeg, demonstrate the robustness and scalability of the proposed method. It indicates that software agents can be more than reactive responders; they can function as investigative assistants capable of making intelligent decisions in environments previously thought too complex for automation.




Check out the Paper. All credit for this research goes to the researchers of this project.


 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,719
Reputation
10,592
Daps
185,786

Internal Coherence Maximization (ICM): A Label-Free, Unsupervised Training Framework for LLMs​


By Sajjad Ansari

June 14, 2025

Post-training methods for pre-trained language models (LMs) depend on human supervision through demonstrations or preference feedback to specify desired behaviors. However, this approach faces critical limitations as tasks and model behaviors become very complex. Human supervision is unreliable in these scenarios, as LMs learn to mimic mistakes in demonstrations or exploit inherent flaws in feedback systems. The core challenge lies in training LMs for tasks that exceed humans' ability to reliably demonstrate or evaluate them. Recent research has identified diverse failure modes, including reward-hacking of human-designed supervision signals and of the human evaluators themselves.

Limitations of Human Supervision in LLM Post-Training


Researchers have explored several approaches to scale beyond human supervision. One standard method utilizes high-quality verifiable rewards, such as matching model outputs with ground-truth solutions in mathematical domains. Despite evidence that pre-trained base models have strong latent capabilities for downstream tasks, with post-training adding minimal improvements, effective elicitation remains challenging. The Contrast Consistent Search (CCS) method is an unsupervised elicitation approach that uses logical consistency to find latent knowledge without supervision. However, CCS underperforms supervised approaches and often fails to identify knowledge due to other prominent features satisfying consistency properties.


Introducing Internal Coherence Maximization (ICM)


Researchers from Anthropic, Schmidt Sciences, Independent, Constellation, New York University, and George Washington University have proposed Internal Coherence Maximization (ICM), which fine-tunes pre-trained models on their own generated labels without using any provided labels. ICM addresses this challenge by searching for label sets that are both logically consistent and mutually predictable according to the pre-trained model. Since identifying the optimal label set is computationally infeasible, ICM uses a simulated-annealing-inspired search algorithm to approximately maximize this objective. Moreover, this method matches the performance of training on golden labels on TruthfulQA and GSM8K, and outperforms training on crowdsourced human labels on Alpaca.

How the ICM Algorithm Works


The ICM algorithm follows an iterative three-step process: (a) the system samples a new unlabeled example from the dataset for potential inclusion, (b) it determines the optimal label for this example while simultaneously resolving any logical inconsistencies, and (c) the algorithm evaluates whether to accept this new labeled example based on the scoring function. ICM is evaluated across three datasets: TruthfulQA for truthfulness assessment, GSM8K-verification for mathematical correctness, and Alpaca for helpfulness and harmlessness. Researchers used four baselines in their experiments: Zero-shot, Zero-shot (Chat), Golden Label, and Human Label. Moreover, experiments used two open-weight models, Llama 3.1 8B and 70B, and two proprietary models, Claude 3 Haiku and Claude 3.5 Haiku.
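
A hedged sketch of that search loop: propose a label for a newly sampled example, repair logical conflicts, and accept or reject the change with a simulated-annealing rule. The `model` interface methods and the annealing schedule below are illustrative assumptions, not the paper's exact formulation.

```python
import math
import random

def icm_search(examples, model, n_iters=1000, t0=2.0):
    """Simulated-annealing-style search over label assignments.

    `model` is an assumed interface exposing propose_label(x, labels),
    resolve_inconsistencies(labels), and score(labels) (mutual
    predictability plus logical consistency).
    """
    labels = {}
    for step in range(1, n_iters + 1):
        x = random.choice(examples)                           # (a) sample an example to (re)label
        proposal = dict(labels)
        proposal[x] = model.propose_label(x, proposal)         # (b) label it...
        proposal = model.resolve_inconsistencies(proposal)     # ...and repair logical conflicts

        delta = model.score(proposal) - model.score(labels)    # change in the coherence objective
        temperature = t0 / math.log(step + 1)                  # annealing schedule (assumed form)
        if delta >= 0 or random.random() < math.exp(delta / temperature):
            labels = proposal                                  # (c) accept improving or occasional worse moves
    return labels
```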


Benchmark Performance and Model Comparisons


In superhuman capability elicitation tasks, ICM matches golden supervision accuracy at 80%, outperforming the estimated human accuracy of 60%. Using ICM-generated reward models, researchers successfully trained an assistant chatbot without human supervision. The unsupervised reward model achieves 75.0% accuracy on RewardBench, compared to 72.2% for human-supervised alternatives trained on production data. Moreover, using both the unsupervised and human-supervised RM, two policies are trained with RL to create helpful, harmless, and honest assistants. The policy trained with the unsupervised RM achieves a 60% win rate. However, these policies still lag behind the publicly released Claude 3.5 Haiku, which achieves 92% win rates.

Conclusion and Future Outlook


This paper introduces Internal Coherence Maximization (ICM), an advance in unsupervised elicitation that fine-tunes pre-trained models on their own generated labels. The method consistently matches golden supervision performance and surpasses crowdsourced human supervision across GSM8K-verification, TruthfulQA, and Alpaca reward modeling tasks. However, ICM’s limitations include dependency on concept salience within pre-trained models and ineffectiveness with long inputs due to context window constraints. As LMs advance beyond human evaluation capabilities, ICM offers a promising alternative to traditional RLHF, aligning models with human intent without depending on human supervision.




Check out the Paper. All credit for this research goes to the researchers of this project.

 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,719
Reputation
10,592
Daps
185,786

Sakana AI Introduces Text-to-LoRA (T2L): A Hypernetwork that Generates Task-Specific LLM Adapters (LoRAs) based on a Text Description of the Task​


By Asif Razzaq

June 13, 2025

Transformer models have significantly influenced how AI systems approach tasks in natural language understanding, translation, and reasoning. These large-scale models, particularly large language models (LLMs), have grown in size and complexity to the point where they encompass broad capabilities across various domains. However, applying these models to new, specialized tasks remains a complex operation. Each new application typically demands careful dataset selection, hours of fine-tuning, and a high degree of computational power. Although these models offer a strong foundation in knowledge, their rigidity in handling new domains with minimal data remains a core limitation. As researchers aim to bring AI closer to human-like adaptability, the focus has shifted toward more efficient methods that allow such models to modify their behavior without retraining every parameter.

The Challenge of Customizing LLMs for New Tasks


The central difficulty lies in adapting foundation models to unique applications without repeating costly and time-intensive training cycles. Most solutions today rely on creating new adapters for each task, which are separate components trained to steer the model’s behavior. These adapters must be made from scratch for every task, and any benefits learned from one application often cannot be transferred to another. This adaptation process is time-consuming and lacks scalability. Moreover, tuning models on specific datasets usually requires a high level of precision in hyperparameter choices, and failing to find the right configuration can lead to poor results. Even when adaptation is successful, the result is often a large collection of isolated task-specific components that are not easy to integrate or reuse.


In response to these limitations, researchers have adopted Low-Rank Adaptation (LoRA), a technique that modifies only a small set of parameters rather than the entire model. LoRA injects low-rank matrices into specific layers of a frozen LLM, allowing the base weights to remain unchanged while enabling task-specific customization. This method reduces the number of trainable parameters. However, for each task, a new LoRA adapter still needs to be trained from scratch. While more efficient than full fine-tuning, this method does not allow for fast, on-the-fly adaptation. Recent advancements have attempted to compress these adapters further or combine multiple adapters during inference; however, they still rely heavily on prior training and cannot generate new adapters dynamically.
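
The sketch below shows the basic LoRA mechanism in PyTorch: a frozen linear layer plus a trainable low-rank update B·A, so only r·(d_in + d_out) parameters are learned per adapted layer. It illustrates LoRA in general, not T2L itself; the hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained linear layer with a trainable low-rank update."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                                # base weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))   # zero-init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 65536 trainable parameters, versus ~16.8M in the full weight matrix
```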

Introducing Text-to-LoRA: Instant Adapter Generation from Task Descriptions


Researchers at Sakana AI introduced Text-to-LoRA (T2L), designed to instantly generate task-specific LoRA adapters from textual descriptions of the target task, instead of creating and training new adapters for each task. T2L functions as a hypernetwork capable of outputting adapter weights in a single forward pass. It learns from a library of pre-existing LoRA adapters covering various domains, including GSM8K, Arc-challenge, BoolQ, and others. Once trained, T2L can interpret a task’s description and generate the required adapter without additional training. This ability not only eliminates the need for manual adapter generation but also enables the system to generalize to tasks it has never encountered before.



The T2L architecture uses a combination of module-specific and layer-specific embeddings to guide the generation process. Three architectural variants were tested: a large version with 55 million parameters, a medium with 34 million, and a small with just 5 million. Despite their differences in size, all models were capable of generating the necessary low-rank matrices for adapter functionality. The training utilized the Super Natural Instructions dataset across 479 tasks, with each task described in natural language and encoded into vector form. By merging these descriptions with learned layer and module embeddings, T2L creates the low-rank A and B matrices needed for adapter functionality. This allows one model to replace hundreds of hand-crafted LoRAs, producing consistent results with a much smaller computational footprint.
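
A hedged sketch of how such a hypernetwork could be wired: the task-description embedding is concatenated with learned layer and module embeddings, and a small MLP emits the low-rank A and B matrices for that (layer, module) pair. The sizes, the MLP shape, and all names are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class LoRAHyperNetwork(nn.Module):
    def __init__(self, task_dim=1024, emb_dim=64, hidden=512, d_model=4096, rank=8,
                 n_layers=32, n_modules=2):
        super().__init__()
        self.layer_emb = nn.Embedding(n_layers, emb_dim)
        self.module_emb = nn.Embedding(n_modules, emb_dim)      # e.g. query / value projections
        self.mlp = nn.Sequential(
            nn.Linear(task_dim + 2 * emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, rank * d_model * 2),              # flattened A and B matrices
        )
        self.rank, self.d_model = rank, d_model

    def forward(self, task_embedding, layer_idx, module_idx):
        z = torch.cat([task_embedding,
                       self.layer_emb(layer_idx),
                       self.module_emb(module_idx)], dim=-1)
        flat = self.mlp(z)
        A, B = flat.split(self.rank * self.d_model, dim=-1)
        return A.view(self.rank, self.d_model), B.view(self.d_model, self.rank)

hyper = LoRAHyperNetwork()
task_vec = torch.randn(1024)                                    # e.g. a sentence embedding of the task description
A, B = hyper(task_vec, torch.tensor(3), torch.tensor(0))
print(A.shape, B.shape)  # torch.Size([8, 4096]) torch.Size([4096, 8])
```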



Benchmark Performance and Scalability of T2L


On benchmarks such as Arc-easy and GSM8K, T2L matched or surpassed the performance of task-specific LoRAs. For instance, the accuracy on Arc-easy using T2L was 76.6%, matching the accuracy of the best manually tuned adapter. On BoolQ, it reached 89.9%, slightly outperforming the original adapter. Even on more difficult benchmarks like PIQA and Winogrande, where overfitting typically hurts performance, T2L delivered better results than manually trained adapters. These improvements are believed to stem from the lossy compression inherent in the hypernetwork training, which acts as a form of regularization. When increasing the number of training datasets from 16 to 479, the performance in zero-shot settings improved substantially, showing T2L’s capability to generalize with broader exposure during training.



Several Key Takeaways from the Research include:

  • T2L allows instant adaptation of LLMs using only natural language descriptions.
  • It supports zero-shot generalization to tasks not seen during training.
  • Three architectural variants of T2L were tested with parameter counts of 55M, 34M, and 5M.
  • Benchmarks include ArcE, BoolQ, GSM8K, Hellaswag, PIQA, MBPP, and more.
  • T2L achieved benchmark accuracies of 76.6% (ArcE), 89.9% (BoolQ), and 92.6% (Hellaswag).
  • It matched or exceeded manually trained LoRAs in performance on multiple tasks.
  • Trained using 479 tasks from the Super Natural Instructions dataset.
  • T2L uses the gte-large-en-v1.5 model for generating task embeddings.
  • LoRA adapters produced by T2L target only query and value projections in attention blocks, totaling 3.4M parameters.
  • Performance remained consistent even with higher reconstruction loss, showing resilience to compression.

In conclusion, this research highlights a major step forward in flexible and efficient model adaptation. Instead of relying on repetitive, resource-heavy procedures, T2L uses natural language itself as a control mechanism, enabling models to specialize using simple task descriptions. This capability dramatically reduces the time and cost required to adapt LLMs to new domains. Moreover, it suggests that as long as enough prior adapters are available for training, future models could potentially adapt in seconds to any task described in plain English. The use of hypernetworks to dynamically construct adapters also means less storage is needed for model specialization, further increasing the practicality of this method in production environments.




Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.


 