
Meta AI Introduces ReasonIR-8B: A Reasoning-Focused Retriever Optimized for Efficiency and RAG Performance


By Asif Razzaq

April 30, 2025

Addressing the Challenges in Reasoning-Intensive Retrieval


Despite notable progress in retrieval-augmented generation (RAG) systems, retrieving relevant information for complex, multi-step reasoning tasks remains a significant challenge. Most retrievers today are trained on datasets composed of short factual questions, which align well with document-level lexical or semantic overlaps. However, they fall short when faced with longer, abstract, or cross-domain queries that require synthesizing dispersed knowledge. In such cases, retrieval errors can propagate through the pipeline, impairing downstream reasoning by large language models (LLMs). While LLM-based rerankers can improve relevance, their substantial computational cost often renders them impractical in real-world deployments.

Meta AI Introduces ReasonIR-8B, a Retriever Built for Reasoning


Meta AI has released ReasonIR-8B, a retriever model designed explicitly for reasoning-intensive information retrieval. Trained from LLaMA3.1-8B, the model establishes new performance standards on the BRIGHT benchmark, achieving a normalized Discounted Cumulative Gain (nDCG@10) of 36.9 when used with a lightweight Qwen2.5 reranker. Notably, it surpasses leading reranking models such as Rank1-32B while offering 200× lower inference-time compute, making it significantly more practical for scaled RAG applications.

ReasonIR-8B is trained using a novel data generation pipeline, ReasonIR-SYNTHESIZER, which constructs synthetic queries and document pairs that mirror the challenges posed by real-world reasoning tasks. The model is released open-source on Hugging Face, along with training code and synthetic data tools, enabling further research and reproducibility.

Model Architecture, Training Pipeline, and Key Innovations


ReasonIR-8B employs a bi-encoder architecture, where queries and documents are encoded independently into embeddings and scored via cosine similarity. The model’s training relies heavily on synthetically generated data tailored to reasoning scenarios. The ReasonIR-SYNTHESIZER pipeline produces two primary types of training instances:

  • Varied-Length (VL) Queries: These are long, information-rich queries (up to 2000 tokens), paired with corresponding documents, encouraging the retriever to handle extended contexts effectively.
  • Hard Queries (HQ): Derived from curated documents with high educational value, these queries are designed to require logical inference. Multi-turn prompts are used to construct hard negatives: documents that appear superficially relevant but do not contain the necessary reasoning pathways.

This approach contrasts with conventional negative sampling methods, which often rely on lexical overlap and are less effective for abstract or multi-hop questions.
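
To make the bi-encoder scoring concrete, here is a minimal sketch with random vectors standing in for model embeddings; in the real system the encoder maps each text to a fixed-size vector, and document embeddings are precomputed offline so only the query is embedded at search time.

```python
import numpy as np

def cosine_scores(query_emb: np.ndarray, doc_embs: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and a matrix of document vectors."""
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    return d @ q

def retrieve(query_emb, doc_embs, k=5):
    """Return the indices and scores of the top-k documents for one query."""
    scores = cosine_scores(query_emb, doc_embs)
    topk = np.argsort(-scores)[:k]
    return topk, scores[topk]

# Toy usage: random vectors stand in for encoder outputs.
rng = np.random.default_rng(0)
doc_embs = rng.normal(size=(1000, 4096))  # 1,000 cached document embeddings
query_emb = rng.normal(size=4096)
idx, scores = retrieve(query_emb, doc_embs)
print(idx, scores)
```

Because queries and documents are encoded independently, the document side can be indexed once and reused, which is what keeps bi-encoders far cheaper at inference time than cross-encoder rerankers.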

Additionally, the model’s attention mask is modified from LLaMA’s causal configuration to a bi-directional one, allowing the encoder to consider the full query context symmetrically, which is beneficial for non-sequential semantic alignment.
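
The masking difference is easy to see in a few lines of PyTorch; this is a toy illustration of the two schemes, not the model's actual implementation:

```python
import torch

seq_len = 8
scores = torch.randn(seq_len, seq_len)  # raw attention scores for one head

# Causal (decoder-style) mask: token i attends only to positions <= i.
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
attn_causal = torch.softmax(scores.masked_fill(~causal, float("-inf")), dim=-1)

# Bi-directional (encoder-style) mask: every token sees the whole sequence,
# so each position's embedding can condition on later context as well.
attn_bi = torch.softmax(scores, dim=-1)

print(attn_causal[0])  # first row: all attention mass on position 0
print(attn_bi[0])      # first row: mass spread across the full sequence
```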

Empirical Results on IR and RAG Benchmarks


ReasonIR-8B achieves strong performance across several benchmarks:

  • BRIGHT Benchmark (Reasoning-Intensive Retrieval):
    • 24.4 nDCG@10 on original queries
    • 29.9 with GPT-4 rewritten queries
    • 36.9 with Qwen2.5 reranking, outperforming larger LLM rerankers at a fraction of the cost
  • Retrieval-Augmented Generation (RAG) Tasks:
    • +6.4% improvement on MMLU over a closed-book baseline
    • +22.6% improvement on GPQA

These gains are consistent across both standard and rewritten queries, with further improvements observed when combining ReasonIR-8B with a sparse retriever like BM25 or a lightweight reranker.
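
The hybrid combination can be approximated by a simple score-fusion sketch; the min-max normalization and the interpolation weight below are illustrative assumptions, since the article does not specify the exact fusion rule:

```python
def hybrid_scores(dense, bm25, alpha=0.5):
    """Min-max normalize each score list, then linearly interpolate."""
    def norm(xs):
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo + 1e-9) for x in xs]
    d, b = norm(dense), norm(bm25)
    return [alpha * di + (1 - alpha) * bi for di, bi in zip(d, b)]

dense = [0.82, 0.40, 0.77]  # cosine similarities from the bi-encoder
bm25 = [12.1, 25.3, 3.7]    # lexical scores from BM25
print(hybrid_scores(dense, bm25))  # per-document fused scores
```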

Importantly, the model continues to improve as query lengths scale, unlike other retrievers whose performance plateaus or declines. This suggests that ReasonIR-8B can better exploit information-rich queries, making it particularly well-suited for test-time techniques such as query rewriting.

Conclusion


ReasonIR-8B addresses a key bottleneck in reasoning-focused information retrieval by introducing a retriever optimized not only for relevance but also for computational efficiency. Its design—rooted in synthetic training tailored for reasoning, coupled with architectural and data-centric improvements—enables consistent gains in both retrieval and RAG tasks.

By releasing the model, codebase, and training data generation pipeline as open-source tools, Meta AI encourages the research community to extend this work toward more robust, multilingual, and multimodal retrievers. For applications requiring cost-effective and high-quality retrieval under reasoning constraints, ReasonIR-8B represents a compelling and practical solution.




Check out the Paper, Hugging Face Page and GitHub Page.

Training LLM Agents Just Got More Stable: Researchers Introduce StarPO-S and RAGEN to Tackle Multi-Turn Reasoning and Collapse in Reinforcement Learning


By Mohammad Asjad

May 1, 2025

Large language models (LLMs) face significant challenges when trained as autonomous agents in interactive environments. Unlike static tasks, agent settings require sequential decision-making, cross-turn memory maintenance, and adaptation to stochastic environmental feedback. These capabilities are essential for developing effective planning assistants, robotics applications, and tutoring agents that can self-improve through experience. While reinforcement learning (RL) has been applied to LLMs using rule-based rewards, training self-evolving agents that can reason and adapt remains underexplored. Current approaches suffer from training instability, complex reward signal interpretation, and limited generalisation across varying prompts or changing environments, particularly during multi-turn interactions with unpredictable feedback. The fundamental question emerges: which design elements are crucial for creating LLM agents that learn effectively and maintain stability throughout their evolution?

Through diverse methodologies, RL has significantly advanced LLMs’ reasoning capabilities. PPO maintains training stability by clipping policy updates, while GRPO enhances systematic problem-solving abilities. SAC employs entropy-regularised objectives for robust exploration, and meta tokens facilitate structured thinking. PRM and MCTS-based approaches have further improved systematic reasoning. Chain-of-thought techniques like STaR iteratively utilise small rationale examples alongside larger datasets, while DAPO, Dr. GRPO, and Open Reasoner Zero demonstrate that minimalist RL techniques with decoupled clipping and simple reward schemes can substantially enhance reasoning performance.

LLM agent architectures have evolved from basic reasoning-action frameworks to structured planning approaches and complex multi-agent systems. Testing environments range from specialised platforms like Sokoban and FrozenLake to general-purpose frameworks like HuggingGPT, enabling applications from web navigation to coding assistance and embodied tasks. Despite these advances, challenges persist in architectural complexity and self-correction, particularly for diverse multi-step reasoning tasks where maintaining coherence across interactions remains problematic.

Researchers have approached agent learning through StarPO (State-Thinking-Actions-Reward Policy Optimisation), a unified framework for trajectory-level agent training with flexible control over reasoning processes, reward mechanisms, and prompt structures. Building on this framework, they developed RAGEN, a modular system implementing complete training loops for analysing LLM agent dynamics in multi-turn stochastic environments. To isolate learning factors from confounding variables like pretrained knowledge, evaluation focuses on three controlled gaming environments: Bandit (single-turn, stochastic), Sokoban (multi-turn, deterministic), and Frozen Lake (multi-turn, stochastic). These minimalistic environments require policy learning through interaction rather than relying on pre-existing knowledge. The analysis reveals three critical dimensions of agent learning: gradient stability issues in multi-turn reinforcement learning, the importance of rollout frequency and diversity in shaping agent evolution, and the need for carefully designed reward signals to develop genuine reasoning capabilities rather than shallow action selection or hallucinated thinking processes.

StarPO represents a unique framework designed specifically for optimising multi-turn interaction trajectories in LLM agents. Unlike traditional approaches that treat each action independently, StarPO optimises entire trajectories—including observations, reasoning traces, actions, and feedback—as coherent units. This trajectory-level approach is particularly suited for interactive environments where agents must maintain memory across turns and adapt to stochastic feedback. StarPO’s objective function focuses on maximising expected rewards across complete trajectories rather than individual steps, making it directly compatible with autoregressive LLMs through decomposition into token-level likelihoods. The framework integrates reasoning-guided structured outputs that combine both intermediate thinking processes and executable actions, enabling agents to develop more sophisticated decision-making capabilities while maintaining learning stability in complex environments.
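
To make the trajectory-level objective concrete, here is a minimal REINFORCE-style sketch in PyTorch, assuming one scalar reward per rollout and omitting the baselines, KL penalties, and clipping that a full implementation would add:

```python
import torch

def trajectory_loss(logits, token_ids, traj_reward):
    """REINFORCE-style loss for one trajectory: the entire multi-turn rollout
    (thoughts, actions, feedback) is scored as a single unit, and its scalar
    return is decomposed onto token-level log-likelihoods.

    logits: (T, vocab) model outputs; token_ids: (T,) sampled tokens.
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    token_logp = log_probs.gather(1, token_ids.unsqueeze(1)).squeeze(1)
    return -(traj_reward * token_logp.sum())  # one reward for the whole trajectory

# Toy usage with random tensors standing in for a model forward pass.
T, vocab = 16, 100
logits = torch.randn(T, vocab, requires_grad=True)
tokens = torch.randint(0, vocab, (T,))
loss = trajectory_loss(logits, tokens, traj_reward=1.0)
loss.backward()
```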

Experimental results reveal that StarPO-S significantly outperforms vanilla StarPO across multiple agent tasks. By implementing uncertainty-based instance filtering, KL term removal, and asymmetric clipping, StarPO-S effectively delays performance collapse and enhances final task outcomes. The stabilised approach demonstrates particular effectiveness in complex environments like FrozenLake and Sokoban, where retaining only 25-50% of high-variance rollouts dramatically improves training stability while reducing computational requirements by up to 50%.
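
Two of these stabilizers lend themselves to a short sketch: uncertainty-based filtering keeps only the prompts whose rollout rewards vary the most, and asymmetric clipping widens the upper PPO clip bound; removing the KL term simply means dropping the usual KL penalty from the loss. The keep fraction and clip thresholds below are illustrative assumptions, not the paper's exact hyperparameters:

```python
import torch

def filter_by_variance(groups, keep_frac=0.25):
    """Keep the prompts whose rollout rewards vary most; groups where every
    rollout gets the same reward carry almost no learning signal."""
    ranked = sorted(groups, key=lambda g: -g["rewards"].var().item())
    return ranked[: max(1, int(len(ranked) * keep_frac))]

def asymmetric_clip_loss(ratio, adv, eps_low=0.2, eps_high=0.28):
    """PPO-style surrogate with a wider upper bound (eps_high > eps_low), so
    strongly positive-advantage actions can move the policy further."""
    clipped = torch.clamp(ratio, 1 - eps_low, 1 + eps_high)
    return -torch.min(ratio * adv, clipped * adv).mean()

# Toy usage: 4 prompts with 8 rollouts each.
groups = [{"rewards": torch.rand(8)} for _ in range(4)]
kept = filter_by_variance(groups, keep_frac=0.5)
loss = asymmetric_clip_loss(torch.rand(32) + 0.5, torch.randn(32))
```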

Task diversity and interaction granularity significantly impact performance. Models trained with higher task diversity and 4-6 actions per turn demonstrate superior generalisation capabilities across novel vocabulary and larger environments. Also, frequent rollout updates prove critical for maintaining alignment between optimisation targets and policy behavior. Agents trained with up-to-date rollouts every 1-10 updates achieve faster convergence and higher success rates compared to those relying on outdated trajectory data.

Symbolic reasoning benefits vary substantially between single-turn and multi-turn tasks. While reasoning traces significantly improve generalisation in single-turn Bandit environments, they provide limited advantage in complex multi-turn settings like Sokoban and FrozenLake. Analysis shows reasoning length consistently declines during training, suggesting models gradually suppress their thought processes when rewards are sparse and delayed. This highlights the need for reward mechanisms that directly reinforce intermediate reasoning steps rather than relying solely on outcome-based feedback.

This research establishes reinforcement learning as a viable approach for training language agents in complex, stochastic environments. StarPO-S represents a significant advancement in stabilising multi-turn agent training through uncertainty-based sampling and exploration encouragement. By transitioning from human supervision to verifiable outcome-based rewards, this framework creates opportunities for developing more capable AI systems across theorem proving, software engineering, and scientific discovery. Future work should focus on multi-modal inputs, enhanced training efficiency, and applications to increasingly complex domains with verifiable objectives.




Check out the Paper and GitHub Page.

Multimodal AI on Developer GPUs: Alibaba Releases Qwen2.5-Omni-3B with 50% Lower VRAM Usage and Nearly-7B Model Performance


By Asif Razzaq

April 30, 2025

Multimodal foundation models have shown substantial promise in enabling systems that can reason across text, images, audio, and video. However, the practical deployment of such models is frequently hindered by hardware constraints. High memory consumption, large parameter counts, and reliance on high-end GPUs have limited the accessibility of multimodal AI to a narrow segment of institutions and enterprises. As research interest grows in deploying language and vision models at the edge or on modest computing infrastructure, there is a clear need for architectures that offer a balance between multimodal capability and efficiency.

Alibaba Qwen Releases Qwen2.5-Omni-3B: Expanding Access with Efficient Model Design


In response to these constraints, Alibaba has released Qwen2.5-Omni-3B, a 3-billion-parameter variant of its Qwen2.5-Omni model family. Designed for use on consumer-grade GPUs—particularly those with 24GB of memory—this model introduces a practical alternative for developers building multimodal systems without large-scale computational infrastructure.

Available through GitHub, Hugging Face, and ModelScope, the 3B model inherits the architectural versatility of the Qwen2.5-Omni family. It supports a unified interface for language, vision, and audio input, and is optimized to operate efficiently in scenarios involving long-context processing and real-time multimodal interaction.

Model Architecture and Key Technical Features


Qwen2.5-Omni-3B is a transformer-based model that supports multimodal comprehension across text, images, and audio-video input. It shares the same design philosophy as its 7B counterpart, utilizing a modular approach where modality-specific input encoders are unified through a shared transformer backbone. Notably, the 3B model reduces memory overhead substantially, achieving over 50% reduction in VRAM consumption when handling long sequences (~25,000 tokens).

Key design characteristics include:

  • Reduced Memory Footprint: The model has been specifically optimized to run on 24GB GPUs, making it compatible with widely available consumer-grade hardware (e.g., NVIDIA RTX 4090); a rough sizing sketch follows this list.
  • Extended Context Processing: Capable of processing long sequences efficiently, which is particularly beneficial in tasks such as document-level reasoning and video transcript analysis.
  • Multimodal Streaming: Supports real-time audio and video-based dialogue up to 30 seconds in length, with stable latency and minimal output drift.
  • Multilingual Support and Speech Generation: Retains capabilities for natural speech output with clarity and tone fidelity comparable to the 7B model.
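
A back-of-the-envelope calculation suggests why a 3B-parameter model in BF16 fits comfortably on a 24GB card; the numbers are illustrative only, since activations, the KV cache, and framework overhead grow with sequence length:

```python
# Rough VRAM estimate for a 3B-parameter model stored in BF16.
params = 3e9
bytes_per_param_bf16 = 2
weights_gib = params * bytes_per_param_bf16 / 1024**3
print(f"weights alone: ~{weights_gib:.1f} GiB")  # ~5.6 GiB

# At long contexts (~25,000 tokens) the KV cache and activations dominate;
# the reported >50% VRAM reduction versus the 7B variant is what keeps the
# total within a 24GB consumer GPU.
```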

Performance Observations and Evaluation Insights


According to the information available on ModelScope and Hugging Face, Qwen2.5-Omni-3B demonstrates performance that is close to the 7B variant across several multimodal benchmarks. Internal assessments indicate that it retains over 90% of the comprehension capability of the larger model in tasks involving visual question answering, audio captioning, and video understanding.

In long-context tasks, the model remains stable across sequences up to ~25k tokens, making it suitable for applications that demand document-level synthesis or timeline-aware reasoning. In speech-based interactions, the model generates consistent and natural output over 30-second clips, maintaining alignment with input content and minimizing latency—a requirement in interactive systems and human-computer interfaces.

While the smaller parameter count naturally leads to a slight degradation in generative richness or precision under certain conditions, the overall trade-off appears favorable for developers seeking a high-utility model with reduced computational demands.

Conclusion


Qwen2.5-Omni-3B represents a practical step forward in the development of efficient multimodal AI systems. By optimizing performance per memory unit, it opens opportunities for experimentation, prototyping, and deployment of language and vision models beyond traditional enterprise environments.

This release addresses a critical bottleneck in multimodal AI adoption—GPU accessibility—and provides a viable platform for researchers, students, and engineers working with constrained resources. As interest grows in edge deployment and long-context dialogue systems, compact multimodal models such as Qwen2.5-Omni-3B will likely form an important part of the applied AI landscape.




Check out the model on GitHub, Hugging Face, and ModelScope.

Exploring the Sparse Frontier: How Researchers from Edinburgh, Cohere, and Meta Are Rethinking Attention Mechanisms for Long-Context LLMs


By Sana Hassan

April 30, 2025

Sparse attention is emerging as a compelling approach to improve the ability of Transformer-based LLMs to handle long sequences. This is particularly important because the standard self-attention mechanism, central to LLMs, scales poorly with sequence length—its computational cost grows quadratically during the prefilling phase, increasing time-to-first-token and making deployment expensive. During the decoding phase, dense attention leads to a cache that expands linearly with the sequence length, resulting in significant memory bandwidth usage for accessing key-value pairs. These inefficiencies pose substantial challenges for both long-context modeling and scaling at inference time.

Sparse attention attempts to reduce this computational burden by approximating dense attention using only a subset of key-query pairs. This has the potential to significantly accelerate long-sequence processing and reduce memory requirements, while still preserving model accuracy. However, despite its promise, sparse attention has yet to be thoroughly evaluated at scale. Existing studies have only scratched the surface, often focusing on limited model sizes, restricted sequence lengths, and specific applications such as multi-turn dialogue. Furthermore, the datasets used in these studies usually vary in length, making it difficult to analyze how performance scales with longer sequences. As a result, the practical viability and robustness of sparse attention strategies remain underexplored.

Researchers from the University of Edinburgh, Cohere, and Meta conducted an extensive evaluation of training-free sparse attention methods across various model sizes, sequence lengths, and sparsity levels. Their study involved nine long-context tasks, including new natural language-based benchmarks designed for controlled and realistic testing. Key findings reveal that for long sequences, large, sparse models outperform smaller, dense ones under fixed computational budgets. While higher sparsity is more tolerable during decoding, no single sparse strategy works universally across tasks. They also introduce scaling laws for sparse attention and release standardized implementations to support reproducible research and guide informed deployment decisions.

Sparse attention aims to reduce computational and memory costs in Transformers by selectively computing only important query–key interactions. This helps speed up full-sequence “prefilling” and reduce memory load during “decoding.” Key techniques include selecting which parts of the attention matrix to retain (e.g., blocks, windows), estimating importance using fixed or dynamic patterns, and allocating computational budgets either uniformly or adaptively across layers and heads. For decoding, methods either evict less useful key–value pairs to conserve memory or maintain the full cache and load only the necessary parts, balancing speed, memory efficiency, and information retention during generation.
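
As a toy illustration of the query–key selection idea, the sketch below keeps only the top-k scoring keys per query and masks out the rest before the softmax; real methods select structured sets (blocks, windows, chunks) rather than arbitrary positions, since structured selections map better onto hardware:

```python
import torch

def topk_sparse_attention(q, k, v, keep=8):
    """For each query, score all keys, keep the `keep` highest-scoring ones,
    and mask the remainder to -inf before the softmax."""
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5  # (T, T)
    thresh = scores.topk(keep, dim=-1).values[..., -1:]    # k-th best per query
    sparse = scores.masked_fill(scores < thresh, float("-inf"))
    return torch.softmax(sparse, dim=-1) @ v

T, d = 32, 64
q, k, v = (torch.randn(T, d) for _ in range(3))
out = topk_sparse_attention(q, k, v, keep=8)  # each query attends to ~8 keys
```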

The study investigates sparse attention methods in long-context models, analyzing performance under fixed computational budgets. At shorter sequence lengths (32k tokens), smaller dense models perform more efficiently, while at longer lengths (128k), larger sparse models are preferable. Compression tolerance varies by model size and task, with larger models maintaining performance even at 20× sparsity. However, some tasks remain sensitive to high compression. No single method consistently excels; chunk-based methods, such as Quest, perform best in decoding, while Vertical-Slash works well in prefilling for simple tasks. A log-linear scaling law effectively predicts accuracy trends across model size, sequence length, and compression ratio.

In conclusion, the study presents a comprehensive evaluation of sparse attention methods across various model sizes (up to 72 billion parameters), sequence lengths (up to 128K tokens), and sparsity levels (up to 95%) on diverse long-sequence tasks. It finds that, under fixed compute (isoFLOPS), large sparse models outperform smaller dense ones for long contexts. While high sparsity (10–15×) can retain accuracy, performance drops significantly on some tasks even at moderate compression. The best sparsity strategy varies by task and phase (prefilling versus decoding), highlighting the absence of a universal solution. The authors also propose reliable scaling laws, suggesting sparse attention is promising but requires careful, task-specific application.




Check out the Paper.

LLMs Can Now Learn without Labels: Researchers from Tsinghua University and Shanghai AI Lab Introduce Test-Time Reinforcement Learning (TTRL) to Enable Self-Evolving Language Models Using Unlabeled Data


By Sana Hassan

April 22, 2025

Despite significant advances in reasoning capabilities through reinforcement learning (RL), most large language models (LLMs) remain fundamentally dependent on supervised data pipelines. RL frameworks such as RLHF have pushed model alignment and instruction-following performance but rely heavily on human feedback and labeled datasets. As LLMs are increasingly applied in dynamic environments—ranging from educational settings to scientific workflows—they are required to generalize beyond curated training data.

However, existing models often exhibit performance gaps when confronted with distribution shifts or novel reasoning tasks. While techniques like Test-Time Scaling (TTS) and Test-Time Training (TTT) have been proposed to mitigate this, the absence of reliable reward signals during inference poses a core challenge for deploying RL in unsupervised settings.

Test-Time Reinforcement Learning (TTRL): Leveraging Model Priors for Self-Adaptation


Researchers from Tsinghua University and Shanghai AI Lab introduced Test-Time Reinforcement Learning (TTRL). TTRL is a training framework that applies RL during inference, using only unlabeled test data. It leverages the intrinsic priors of pre-trained language models to estimate pseudo-rewards through majority voting across sampled outputs.

Instead of relying on explicit labels, TTRL constructs reward functions by aggregating multiple model-generated responses to a given query. A consensus answer, obtained via majority voting, is treated as a pseudo-label. Model responses that align with this pseudo-label are positively reinforced. This formulation transforms test-time inference into an adaptive, self-supervised learning process, allowing LLMs to improve over time without additional supervision.

TTRL follows a two-stage approach:

  • Label Estimation via Majority Voting: For each prompt, the model samples multiple outputs. The most frequent prediction is treated as the estimated label.
  • Reward Assignment and Policy Optimization: A binary reward is assigned based on whether each sampled response matches the estimated label. The model is updated using gradient-based RL algorithms (e.g., PPO or GRPO) to maximize agreement with the pseudo-labels.

This approach is notable for its simplicity and compatibility with standard RL methods. The reward function, though approximate, provides sufficient learning signal when aggregated over multiple samples. Experimental setups used temperature-controlled sampling (typically temperature = 1.0), with 64 samples for voting and 16 subsampled responses for training updates. No ground-truth labels are involved at any stage.
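
The reward construction reduces to a few lines; this sketch assumes the final answers have already been extracted from the sampled responses, and the resulting binary rewards would then feed a standard PPO or GRPO update:

```python
from collections import Counter

def majority_vote_rewards(answers):
    """TTRL-style pseudo-labeling: the most frequent final answer across the
    sampled outputs becomes the pseudo-label, and each sample receives a
    binary reward for matching it. No ground-truth labels are used."""
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards

# Toy usage: extracted final answers from 8 sampled responses.
samples = ["42", "42", "41", "42", "7", "42", "41", "42"]
label, rewards = majority_vote_rewards(samples)
print(label, rewards)  # "42", [1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```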

Empirical Findings across Mathematical Reasoning Tasks


TTRL was evaluated on three mathematical benchmarks: AIME 2024, AMC, and MATH-500. The results are consistent across both smaller and larger models:

  • For Qwen2.5-Math-7B, performance on AIME 2024 increased from 16.7% to 43.3% (pass@1), an improvement of 159.3% without any labeled data.
  • On average, across the three benchmarks, the same model achieved a relative gain of 84.1%.
  • Notably, even a smaller model, Qwen2.5-Math-1.5B, improved from 33.0% to 80.0% on MATH-500.

These gains demonstrate that TTRL supports model improvement even in the absence of supervised training signals. Moreover, TTRL often outperforms the upper bound implied by its own training signal—i.e., the accuracy of the majority-voted predictions. This suggests a self-reinforcing learning loop that can extract richer supervision from noisy consensus signals.

Additional analyses showed that TTRL generalizes beyond the dataset it was applied to. When trained on one benchmark and evaluated on others, performance improvements persisted. This cross-task transfer indicates that TTRL does not lead to narrow overfitting but supports broader generalization.

Conclusion: Toward Self-Adaptive and Label-Free Learning


TTRL represents a novel shift in how reinforcement learning can be applied to LLMs in real-world settings. By reusing the model’s own generations as a proxy for supervision, it removes the need for expensive human annotations while enabling continual adaptation. The approach scales naturally with model size, is compatible with different RL algorithms, and shows promising robustness across tasks of varying difficulty.

While this study focuses on mathematical reasoning, the underlying ideas—self-estimated supervision, test-time adaptation, and reinforcement learning without labels—may generalize to other domains. As language models increasingly encounter tasks beyond their pre-training distribution, frameworks like TTRL offer a scalable path forward.

Further exploration is needed to understand the theoretical convergence properties of TTRL and to evaluate its applicability in interactive or multi-agent scenarios. Nonetheless, TTRL provides a technically sound and computationally efficient foundation for enabling LLMs to evolve continuously from their own outputs.




Check out the Paper and GitHub Page.

AgentA/B: A Scalable AI System Using LLM Agents that Simulate Real User Behavior to Transform Traditional A/B Testing on Live Web Platforms


By Sana Hassan

April 25, 2025

Designing and evaluating web interfaces is one of the most critical tasks in today’s digital-first world. Every change in layout, element positioning, or navigation logic can influence how users interact with websites. This becomes even more crucial for platforms that rely on extensive user engagement, such as e-commerce or content streaming services. One of the most trusted methods for assessing the impact of design changes is A/B testing. In A/B testing, two or more versions of a webpage are shown to different user groups to measure their behavior and determine which variant performs better. It’s not just about aesthetics but also functional usability. This method enables product teams to gather user-centered evidence before fully rolling out a feature, allowing businesses to optimize user interfaces systematically based on observed interactions.

Despite being a widely accepted tool, the traditional A/B testing process brings several inefficiencies that have proven problematic for many teams. The most significant challenge is the volume of real-user traffic needed to yield statistically valid results. In some scenarios, hundreds of thousands of users must interact with webpage variants to identify meaningful patterns. For smaller websites or early-stage features, securing this level of user interaction can be nearly impossible. The feedback cycle is also notably slow. Even after launching an experiment, it might take weeks to months before results can be confidently assessed due to the requirement of long observation periods. Also, these tests are resource-heavy; only a few variants can be evaluated due to the time and manpower required. Consequently, numerous promising ideas go untested because there’s simply no capacity to explore them all.

Several methods have been explored to overcome these limitations; however, each has its shortcomings. For example, offline A/B testing techniques depend on rich historical interaction logs, which are not always available or reliable. Tools that enable prototyping and experimentation, such as Apparition and Fuse, have accelerated early design exploration but are primarily useful for prototyping physical interfaces. Algorithms that reframe A/B testing as a search problem through evolutionary models help automate some aspects but still depend on historical or real-user deployment data. Other strategies, like cognitive modeling with GOMS or ACT-R frameworks, require high levels of manual configuration and do not easily adapt to the complexities of dynamic web behavior. These tools, although innovative, have not provided the scalability and automation necessary to address the deeper structural limitations in A/B testing workflows.

Researchers from Northeastern University, Pennsylvania State University, and Amazon introduced a new automated system named AgentA/B. This system offers an alternative approach to traditional user testing, utilizing Large Language Model (LLM)-based agents. Rather than depending on live user interaction, AgentA/B simulates human behavior using thousands of AI agents. These agents are assigned detailed personas that mimic characteristics such as age, educational background, technical proficiency, and shopping preferences. These personas enable agents to simulate a wide range of user interactions on real websites. The goal is to provide researchers and product managers with an efficient and scalable method for testing multiple design variants without relying on live user feedback or extensive traffic coordination.

The system architecture of AgentA/B is structured into four main components. First, it generates agent personas based on the input demographics and behavioral diversity specified by the user. These personas are fed into the second stage, where testing scenarios are defined—this includes assigning agents to control and treatment groups and specifying which two webpage versions should be tested. The third component executes the interactions: agents are deployed into real browser environments, where they process the content through structured web data (converted into JSON observations) and take action like real users. They can search, filter, click, and even simulate purchases. The fourth and final component involves analyzing the results, where the system provides metrics like the number of clicks, purchases, or interaction durations to assess design effectiveness.
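
A hypothetical sketch of one simulation step is shown below; the function names, prompt format, and action schema are illustrative stand-ins rather than the paper's actual interface:

```python
import json

def agent_step(llm, persona, page_state):
    """One interaction turn: serialize the page to JSON, ask the persona-
    conditioned LLM for an action, and parse the structured reply."""
    prompt = (
        f"You are shopping online. Persona: {json.dumps(persona)}\n"
        f"Current page (JSON): {json.dumps(page_state)}\n"
        'Reply with one action as JSON: {"action": "search|filter|click|purchase", "target": "..."}'
    )
    return json.loads(llm(prompt))

persona = {"age": 34, "tech_savvy": "medium", "goal": "buy running shoes"}
page = {"variant": "reduced_filters", "filters": ["brand", "price"],
        "results": ["shoe_a", "shoe_b"]}
# action = agent_step(call_model, persona, page)
# e.g. {"action": "filter", "target": "price"} -- executed in the browser,
# after which the updated page state is serialized for the next turn.
```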

During their testing phase, researchers used Amazon.com to demonstrate the tool’s practical value. A total of 100,000 virtual customer personas were generated, and 1,000 were randomly selected from this pool to act as LLM agents in the simulation. The experiment compared two different webpage layouts: one with all product filter options shown in a left-hand panel and another with only a reduced set of filters. The outcome was compelling. The agents interacting with the reduced-filter version performed more purchases and filter-based actions than those with the full list. Also, these virtual agents were significantly more efficient. Compared with one million real user interactions, LLM agents took fewer actions on average to complete tasks, indicating more goal-oriented behavior. These results mirrored the behavioral direction observed in human A/B tests, strengthening the case for AgentA/B as a valid complement to traditional testing.

This research demonstrates a compelling advancement in interface evaluation. It doesn’t aim to replace live user A/B testing but instead proposes a supplementary method that offers rapid feedback, cost efficiency, and broader experimental coverage. By using AI agents instead of live participants, the system enables product teams to test numerous interface variations that would otherwise be infeasible. This model can significantly compress the design cycle, allowing ideas to be validated or rejected at a much earlier stage. It addresses the practical concerns of long wait times, traffic limitations, and testing resource constraints, making the web design process more data-informed and less prone to bottlenecks.

Some Key Takeaways from the Research on AgentA/B include:

  • AgentA/B uses LLM-based agents to simulate realistic user behavior on live webpages.
  • The system allows automated A/B testing with no need for live user deployment.
  • 100,000 user personas were generated, and 1,000 were selected for live testing simulation.
  • The system compared two webpage variants on Amazon.com: full filter panel vs. reduced filters.
  • LLM agents in the reduced-filter group made more purchases and performed more filtering actions.
  • Compared to 1 million human users, LLM agents showed shorter action sequences and more goal-directed behavior.
  • AgentA/B can help evaluate interface changes before real user testing, saving months of development time.
  • The system is modular and extensible, allowing it to be adaptable to various web platforms and testing goals.
  • It directly addresses three core A/B testing challenges: long cycles, high user traffic needs, and experiment failure rates.




Check out the Paper.

NVIDIA AI Releases OpenMath-Nemotron-32B and 14B-Kaggle: Advanced AI Models for Mathematical Reasoning that Secured First Place in the AIMO-2 Competition and Set New Benchmark Records


By Asif Razzaq

April 24, 2025

Mathematical reasoning has long presented a formidable challenge for AI, demanding not only an understanding of abstract concepts but also the ability to perform multi-step logical deductions with precision. Traditional language models, while adept at generating fluent text, often struggle when tasked with solving complex mathematical problems that require both deep domain knowledge and structured reasoning. This gap has driven research toward specialized architectures and training regimens designed to imbue models with robust mathematical capabilities. By focusing on targeted datasets and fine-tuning strategies, AI developers aim to bridge the gap between natural language understanding and formal mathematical problem-solving.

NVIDIA has introduced OpenMath-Nemotron-32B and OpenMath-Nemotron-14B-Kaggle , each meticulously engineered to excel in mathematical reasoning tasks. Building on the success of the Qwen family of transformer models, these Nemotron variants utilize large-scale fine-tuning on an extensive corpus of mathematical problems, collectively known as the OpenMathReasoning dataset. The design philosophy underlying both releases centers on maximizing accuracy across competitive benchmarks while maintaining practical considerations for inference speed and resource efficiency. By offering multiple model sizes and configurations, NVIDIA provides researchers and practitioners with a flexible toolkit for integrating advanced math capabilities into diverse applications.

OpenMath-Nemotron-32B represents the flagship of this series, featuring 32.8 billion parameters and leveraging BF16 tensor operations for efficient hardware utilization. It is built by fine-tuning Qwen2.5-32B on the OpenMathReasoning dataset, a curated collection that emphasizes challenging problems drawn from mathematical Olympiads and standardized exams. This model achieves state-of-the-art results on several rigorous benchmarks, including the American Invitational Mathematics Examination (AIME) 2024 and 2025, the Harvard–MIT Mathematics Tournament (HMMT) 2024-25, and HLE-Math, the mathematics subset of Humanity’s Last Exam. In its tool-integrated reasoning (TIR) configuration, OpenMath-Nemotron-32B achieves an average pass@1 score of 78.4 percent on AIME24, with a majority-voting accuracy of 93.3 percent, surpassing previous top-performing models by notable margins.

To accommodate different inference scenarios, OpenMath-Nemotron-32B supports three distinct modes: chain-of-thought (CoT), tool-integrated reasoning (TIR), and generative solution selection (GenSelect). In CoT mode, the model generates intermediate reasoning steps before presenting a final answer, achieving a pass@1 accuracy of 76.5% on AIME24. When augmented with GenSelect, which produces multiple candidate solutions and selects the most consistent answer, the model’s performance improves further, achieving a remarkable 93.3% accuracy on the same benchmark. These configurations enable users to balance between explanation richness and answer precision, catering to research environments that require transparency as well as production settings that prioritize speed and reliability.

Complementing the 32 billion-parameter variant, NVIDIA has also released OpenMath-Nemotron-14B-Kaggle, a 14.8 billion-parameter model fine-tuned on a strategically selected subset of the OpenMathReasoning dataset to optimize for competitive performance. This version served as the cornerstone of NVIDIA’s first-place solution in the AIMO-2 Kaggle competition, a contest that focused on automated problem-solving techniques for advanced mathematical challenges. By calibrating the training data to emphasize problems reflective of the competition’s format and difficulty, the 14B-Kaggle model demonstrated exceptional adaptability, outpacing rival approaches and securing the top leaderboard position.

Performance benchmarks for OpenMath-Nemotron-14B-Kaggle mirror those of its larger counterpart, with the model achieving a pass@1 accuracy of 73.7% on AIME24 in CoT mode and improving to 86.7% under GenSelect protocols. On the AIME25 benchmark, it achieves a pass@1 rate of 57.9 percent (73.3 percent with majority voting over 64 samples), and on HMMT 24-25 it attains 50.5 percent (64.8 percent with majority voting over 64 samples). These figures highlight the model’s ability to deliver high-quality solutions, even with a more compact parameter footprint, making it well-suited for scenarios where resource constraints or inference latency are critical factors.

Both OpenMath-Nemotron models are accompanied by an open-source pipeline, enabling full reproducibility of data generation, training procedures, and evaluation protocols. NVIDIA has integrated these workflows into its NeMo-Skills framework, providing reference implementations for CoT, TIR, and GenSelect inference modes. With example code snippets that demonstrate how to instantiate a transformer pipeline, configure dtype and device mapping, and parse model outputs, developers can rapidly prototype applications that query these models for step-by-step solutions or streamlined final answers.
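
In that spirit, here is a minimal example using the standard Hugging Face text-generation pipeline; the model ID and generation settings follow the public release pages but should be treated as assumptions rather than the official reference code:

```python
import torch
from transformers import pipeline

# Load the checkpoint in BF16 and shard it across available GPUs.
pipe = pipeline(
    "text-generation",
    model="nvidia/OpenMath-Nemotron-32B",  # assumed model ID
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user",
             "content": "What is the sum of the first 100 positive integers?"}]
out = pipe(messages, max_new_tokens=512)
print(out[0]["generated_text"])
```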

Under the hood, both models are optimized to run efficiently on NVIDIA GPU architectures, ranging from the Ampere to the Hopper microarchitectures, leveraging highly tuned CUDA libraries and TensorRT optimizations. For production deployments, users can serve models via Triton Inference Server, enabling low-latency, high-throughput integrations in web services or batch processing pipelines. The adoption of BF16 tensor formats strikes an ideal balance between numerical precision and memory footprint, enabling these large-scale models to fit within GPU memory constraints while maintaining robust performance across various hardware platforms.

Several Key Takeaways from the release of OpenMath-Nemotron-32B and OpenMath-Nemotron-14B-Kaggle include:

  1. NVIDIA’s OpenMath-Nemotron series addresses the longstanding challenge of equipping language models with robust mathematical reasoning through targeted fine-tuning on the OpenMathReasoning dataset.
  2. The 32B-parameter variant achieves state-of-the-art accuracy on benchmarks like AIME24/25 and HMMT, offering three inference modes (CoT, TIR, GenSelect) to balance explanation richness and precision.
  3. The 14B-parameter “Kaggle” model, fine-tuned on a competition-focused subset, secured first place in the AIMO-2 Kaggle competition while maintaining high pass@1 scores, demonstrating efficiency in a smaller footprint.
  4. Both models are fully reproducible via an open-source pipeline integrated into NVIDIA’s NeMo-Skills framework, with reference implementations for all inference modes.
  5. Optimized for NVIDIA GPUs (Ampere and Hopper), the models leverage BF16 tensor operations, CUDA libraries, TensorRT, and Triton Inference Server for low-latency, high-throughput deployments.
  6. Potential applications include AI-driven tutoring systems, academic competition preparation tools, and integration into scientific computing workflows requiring formal or symbolic reasoning.
  7. Future directions may expand to advanced university-level mathematics, multimodal inputs (e.g., handwritten equations), and tighter integration with symbolic computation engines to verify and augment generated solutions.




Check out the OpenMath-Nemotron-32B and OpenMath-Nemotron-14B-Kaggle.

Google DeepMind Research Introduces QuestBench: Evaluating LLMs’ Ability to Identify Missing Information in Reasoning Tasks


By Mohammad Asjad

April 25, 2025

Large language models (LLMs) have gained significant traction in reasoning tasks, including mathematics, logic, planning, and coding. However, a critical challenge emerges when applying these models to real-world scenarios. While current implementations typically operate under the assumption that all necessary information is provided upfront in well-specified tasks, reality often presents incomplete or ambiguous situations. Users frequently omit crucial details when formulating math problems, and autonomous systems like robots must function in environments with partial observability. This fundamental mismatch between idealised complete-information settings and the incomplete nature of real-world problems necessitates LLMs to develop proactive information-gathering capabilities. Recognising information gaps and generating relevant clarifying questions represents an essential but underdeveloped functionality for LLMs to effectively navigate ambiguous scenarios and provide accurate solutions in practical applications.

Various approaches have attempted to address the challenge of information gathering in ambiguous scenarios. Active learning strategies acquire sequential data through methods like Bayesian optimisation, reinforcement learning, and robot planning with partially observable states. Research on ambiguity in natural language has explored semantic uncertainties, factual question-answering, task-oriented dialogues, and personalised preferences. Question-asking methods for LLMs include direct prompting techniques, information gain computation, and multi-stage clarification frameworks. However, most existing benchmarks focus on subjective tasks where multiple valid clarifying questions exist, making objective evaluation difficult. These approaches address ambiguous or knowledge-based tasks rather than underspecified reasoning problems, where an objectively correct question is determinable.

QuestBench presents a robust approach to evaluating LLMs’ ability to identify and acquire missing information in reasoning tasks. The methodology formalises underspecified problems as Constraint Satisfaction Problems (CSPs) where a target variable cannot be determined without additional information. Unlike semantic ambiguity, where multiple interpretations exist but each yields a solvable answer, underspecification renders problems unsolvable without supplementary data. QuestBench specifically focuses on “1-sufficient CSPs” – problems requiring knowledge of just one unknown variable’s value to solve for the target variable. The benchmark comprises three distinct domains: Logic-Q (logical reasoning tasks), Planning-Q (blocks world planning problems with partially observed initial states), and GSM-Q/GSME-Q (grade-school math problems in verbal and equation forms). The framework strategically categorises problems along four axes of difficulty: number of variables, number of constraints, search depth required, and expected guesses needed by brute-force search. This classification offers insights into LLMs’ reasoning strategies and performance limitations.

QuestBench employs a formal Constraint Satisfaction Problem framework to precisely identify and evaluate information gaps in reasoning tasks. A CSP is defined as a tuple ⟨X, D, C, A, y⟩ where X represents variables, D denotes their domains, C encompasses constraints, A consists of variable assignments, and y is the target variable to solve. The framework introduces the “Known” predicate, indicating when a variable’s value is determinable either through direct assignment or derivation from existing constraints. A CSP is classified as underspecified when the target variable y cannot be determined from available information. The methodology focuses specifically on “1-sufficient CSPs”, where knowing just one additional variable is sufficient to solve for the target.

The benchmark measures model performance along four difficulty axes that correspond to algorithmic complexity: total number of variables (|X|), total number of constraints (|C|), depth of backwards search tree (d), and expected number of random guesses needed (𝔼BF). These metrics provide quantitative measures of problem complexity and help differentiate between semantic ambiguity (multiple valid interpretations) and underspecification (missing information). For each task, models must identify the single sufficient variable that, when known, enables solving for the target variable, requiring both recognition of information gaps and strategic reasoning about constraint relationships.
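
The 1-sufficient check can be illustrated on a toy CSP, with constraints encoded as simple derivation rules; this brute-force sketch is for intuition only and ignores the benchmark's difficulty axes:

```python
rules = {
    frozenset({"a", "b"}): "y",  # y = f(a, b)
    frozenset({"c"}): "b",       # b = g(c)
}

def derivable(known, target):
    """Forward-chain the rules to a fixed point; report whether target is Known."""
    known, changed = set(known), True
    while changed:
        changed = False
        for needs, out in rules.items():
            if needs <= known and out not in known:
                known.add(out)
                changed = True
    return target in known

def sufficient_singletons(assigned, unknowns, target="y"):
    """Which single unknown variable, once revealed, solves the target?"""
    if derivable(assigned, target):
        return []  # already well-specified
    return [u for u in unknowns if derivable(set(assigned) | {u}, target)]

print(sufficient_singletons({"a"}, ["b", "c"]))  # ['b', 'c']: either one suffices
```

In QuestBench proper, the model's job is the inverse of this brute-force search: given the underspecified problem, it must ask for exactly the variable that makes the target solvable.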

Experimental evaluation of QuestBench reveals varying capabilities among leading large language models in information-gathering tasks. GPT-4o, GPT-4-o1 Preview, Claude 3.5 Sonnet, Gemini 1.5 Pro/Flash, Gemini 2.0 Flash Thinking Experimental, and open-sourced Gemma models were tested across zero-shot, chain-of-thought, and four-shot settings. Tests were conducted on representative subsets of 288 GSM-Q and 151 GSME-Q tasks between June 2024 and March 2025. Performance analysis along the difficulty axes demonstrates that models struggle most with problems featuring high search depths and complex constraint relationships. Chain-of-thought prompting generally improved performance across all models, suggesting that explicit reasoning pathways help identify information gaps. Among the evaluated models, Gemini 2.0 Flash Thinking Experimental achieved the highest accuracy, particularly on planning tasks, while open-source models showed competitive performance on logical reasoning tasks but struggled with complex math problems requiring deeper search.

QuestBench provides a unique framework for evaluating LLMs’ ability to identify underspecified information and generate appropriate clarifying questions in reasoning tasks. Current state-of-the-art models demonstrate reasonable performance on simple algebra problems but struggle significantly with complex logic and planning tasks. Performance deteriorates as problem complexity increases along key dimensions like search depth and expected number of brute-force guesses. These findings highlight that while reasoning ability is necessary for effective question-asking, it alone may not be sufficient. Significant advancement opportunities exist in developing LLMs that can better recognize information gaps and request clarification when operating under uncertainty.




Check out the Paper.

Long-Context Multimodal Understanding No Longer Requires Massive Models: NVIDIA AI Introduces Eagle 2.5, a Generalist Vision-Language Model that Matches GPT-4o on Video Tasks Using Just 8B Parameters


By Asif Razzaq

April 21, 2025

In recent years, vision-language models (VLMs) have advanced significantly in bridging image, video, and textual modalities. Yet, a persistent limitation remains: the inability to effectively process long-context multimodal data such as high-resolution imagery or extended video sequences. Many existing VLMs are optimized for short-context scenarios and struggle with performance degradation, inefficient memory usage, or loss of semantic detail when scaled to handle longer inputs. Addressing these limitations requires not only architectural flexibility but also dedicated strategies for data sampling, training, and evaluation.

Eagle 2.5: A Generalist Framework for Long-Context Learning


NVIDIA introduces Eagle 2.5, a family of vision-language models designed for long-context multimodal learning. Unlike models that simply accommodate more input tokens, Eagle 2.5 demonstrates measurable and consistent performance improvements as input length increases. The system is developed with a focus on both video and image understanding at scale, targeting tasks where the richness of long-form content is critical.

Eagle 2.5 operates with a relatively compact 8B parameter count and yet achieves strong results across established benchmarks. On Video-MME (with 512-frame input), the model scores 72.4%, approaching or matching results from significantly larger models such as Qwen2.5-VL-72B and InternVL2.5-78B. Notably, these gains are achieved without relying on task-specific compression modules, reflecting the model’s generalist design philosophy.

Training Strategy: Context-Aware Optimization


The effectiveness of Eagle 2.5 stems from two complementary training strategies: information-first sampling and progressive post-training.

  • Information-First Sampling prioritizes retention of critical visual and semantic content. It introduces Image Area Preservation (IAP), a tiling scheme that maintains over 60% of the original image area while minimizing aspect ratio distortion; a toy sketch of this selection follows the list. Additionally, Automatic Degradation Sampling (ADS) dynamically balances visual and textual inputs based on context length constraints, preserving full textual sequences and adaptively optimizing visual granularity.
  • Progressive Post-Training incrementally increases the model’s context window, moving through 32K, 64K, and 128K token stages. This gradual exposure allows the model to develop consistent capabilities across input lengths. The method avoids overfitting to any single context range and helps maintain stable performance in diverse inference scenarios.
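
A toy sketch of an IAP-style grid choice follows; the tile size, candidate grids, and 60% area threshold are taken loosely from the description above, and the actual procedure is likely more involved:

```python
def pick_grid(img_w, img_h, tile=448, max_tiles=12, min_area=0.6):
    """Choose a (cols, rows) tiling so the resized image keeps at least
    `min_area` of its pixels while the grid's aspect ratio stays closest
    to the source image's."""
    src_ratio = img_w / img_h
    best, best_dist = None, float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            scale = min(cols * tile / img_w, rows * tile / img_h)
            area_kept = min(scale, 1.0) ** 2  # fraction of pixels surviving
            ratio_dist = abs(cols / rows - src_ratio)
            if area_kept >= min_area and ratio_dist < best_dist:
                best, best_dist = (cols, rows), ratio_dist
    return best

print(pick_grid(1920, 1080))  # (4, 2): a wide grid close to the 16:9 source
```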

These approaches are underpinned by an architecture based on SigLIP for vision encoding and MLP projection layers for alignment with the language model backbone. The system forgoes domain-specific compression components to retain flexibility across varied task types.

Eagle-Video-110K: Structured Data for Extended Video Comprehension


A key component of Eagle 2.5 is its training data pipeline, which integrates both open-source resources and a custom-curated dataset: Eagle-Video-110K. This dataset is constructed to support long-form video understanding and adopts a dual annotation scheme:

  • A top-down approach introduces story-level segmentation using human-annotated chapter metadata and GPT-4-generated dense captions and question-answer pairs.
  • A bottom-up method generates QA pairs for short clips using GPT-4o, augmented with time and textual context anchors to capture spatiotemporal detail.

The dataset collection emphasizes diversity over redundancy. A cosine similarity-based selection process filters novel content from sources such as InternVid, Shot2Story, and VidChapters. This results in a corpus with both narrative coherence and granular annotations, enabling models to capture hierarchical information across time.
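
The novelty filter amounts to a greedy cosine-similarity pass over candidate embeddings; the similarity threshold below is an assumed value for illustration:

```python
import numpy as np

def select_diverse(embs, threshold=0.85):
    """Greedily keep a candidate only if it is not too similar (cosine)
    to anything already selected."""
    kept = []
    for e in embs:
        e = e / np.linalg.norm(e)
        if all(float(e @ k) < threshold for k in kept):
            kept.append(e)
    return kept

# Toy usage: random vectors stand in for clip/caption embeddings; with real
# embeddings, near-duplicate clips would score above the threshold and be dropped.
rng = np.random.default_rng(0)
candidates = rng.normal(size=(500, 256))
print(len(select_diverse(candidates)))
```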



Performance and Benchmarking


Eagle 2.5-8B exhibits robust performance across multiple video and image understanding tasks. On video benchmarks, it scores 74.8 on MVBench, 77.6 on MLVU, and 66.4 on LongVideoBench. On image benchmarks, the model attains 94.1 on DocVQA, 87.5 on ChartQA, and 80.4 on InfoVQA, among others.

Ablation studies confirm the importance of Eagle’s sampling strategies. Removal of IAP leads to performance degradation in high-resolution benchmarks, while omitting ADS reduces effectiveness in tasks requiring dense supervision. The model also benefits from progressive training: sequentially increasing context lengths provides more stable gains compared to one-shot long-context training. Importantly, the addition of Eagle-Video-110K notably enhances performance at higher frame counts (≥128 frames), underscoring the value of dedicated long-form datasets.

Conclusion


Eagle 2.5 presents a technically grounded approach to long-context vision-language modeling. Its emphasis on preserving contextual integrity, gradual training adaptation, and dataset diversity enables it to achieve strong performance while maintaining architectural generality. Without relying on model scaling alone, Eagle 2.5 demonstrates that careful training strategies and data design can yield competitive, efficient systems for complex multimodal understanding tasks. This positions Eagle 2.5 as a valuable step forward in building more context-aware AI systems suited for real-world multimedia applications.




Check out the Paper, GitHub Page, and Project Page.


 


LLMs Can Now Simulate Massive Societies: Researchers from Fudan University Introduce SocioVerse, an LLM-Agent-Driven World Model for Social Simulation with a User Pool of 10 Million Real Individuals​


By Sajjad Ansari

April 26, 2025

Human behavior research seeks to understand how individuals and groups act in social contexts, and it forms a foundational element of the social sciences. Traditional methodologies such as surveys, interviews, and observations face significant challenges, including high costs, limited sample sizes, and ethical concerns. These challenges have pushed researchers toward alternative approaches to studying human behavior. Social simulation is one such approach: it uses agents to model human behavior, observes their reactions, and translates the findings into meaningful insights.

Recent studies have explored social simulation across various levels, from mimicking specific individuals to modeling large-scale social dynamics. However, these simulations consistently face a critical challenge of maintaining alignment between the simulated environment and the real world. This alignment issue manifests across multiple dimensions and raises the following questions:

  • How should the simulated environment be aligned with the real world?
  • How can simulated agents be aligned precisely with their target users?
  • How should interaction mechanisms be aligned with the real world across different scenarios?
  • How should behavioral patterns be aligned with those of real-world groups?

Researchers from Fudan University, Shanghai Innovation Institute, University of Rochester, Indiana University, and Xiaohongshu Inc. have proposed SocioVerse, a world model for social simulation powered by LLM-based agents built upon a large-scale real-world user pool. Modular components are designed to address the above four questions. The Social Environment component incorporates up-to-date external real-world information into simulations, while the User Engine and Scenario Engine reconstruct realistic user contexts and arrange simulation processes to align with reality. Based on this rich contextual setup, the Behavior Engine drives agents to reproduce human behaviors. To support this framework, researchers have constructed a massive user pool containing 10 million individuals based on real social media data, comparable to the entire populations of Hungary or Greece.
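
A schematic of how those components could fit together, with class and function names invented here for illustration; the released framework's actual interfaces may differ.

```python
# Hypothetical sketch of SocioVerse's modular flow (names are assumptions).
from dataclasses import dataclass

@dataclass
class SimulationContext:
    world_facts: list[str]     # Social Environment: fresh real-world signals
    user_profiles: list[dict]  # User Engine: agents drawn from the user pool
    scenario: str              # Scenario Engine: e.g., an election poll setup

def run_simulation(ctx: SimulationContext, behavior_engine) -> list[str]:
    """Drive each agent (any LLM callable) to respond within the context."""
    responses = []
    for profile in ctx.user_profiles:
        prompt = (f"Background: {' '.join(ctx.world_facts)}\n"
                  f"You are: {profile}\nScenario: {ctx.scenario}\n"
                  f"Respond as this person would.")
        responses.append(behavior_engine(prompt))  # Behavior Engine call
    return responses

# Example with a trivial stand-in for the Behavior Engine:
ctx = SimulationContext(
    world_facts=["Inflation eased this quarter."],
    user_profiles=[{"age": 34, "region": "urban"}],
    scenario="consumer_confidence_survey",
)
print(run_simulation(ctx, behavior_engine=lambda p: "Cautiously optimistic."))
```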



The SocioVerse framework is validated through three simulations: presidential election prediction, breaking news feedback, and a national economic survey. For the U.S. presidential election prediction, researchers designed a questionnaire based on established polls from various media outlets and research institutes; the evaluation metrics are accuracy rate and Root Mean Square Error (RMSE). The breaking news feedback simulation uses the ABC attitude model (Affect, Behavior, Cognition) combined with a 5-point Likert scale, evaluated with Normalized RMSE (NRMSE) and KL-divergence. For the national economic survey of China, spending details from the China Statistical Yearbook 2024 are categorized into eight parts, including food, clothing, and housing; the evaluation metrics are again NRMSE and KL-divergence.
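
For reference, the two distribution-level metrics can be implemented in a few lines. The range normalization below is one common convention and is assumed here; the paper may normalize differently.

```python
import numpy as np

def nrmse(pred: np.ndarray, true: np.ndarray) -> float:
    """Root-mean-square error normalized by the range of the true values."""
    rmse = np.sqrt(np.mean((pred - true) ** 2))
    return float(rmse / (true.max() - true.min()))

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """KL(p || q) between two discrete distributions, with smoothing."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Example: comparing a simulated answer distribution against a survey's.
p = np.array([0.2, 0.5, 0.3])
q = np.array([0.25, 0.45, 0.30])
print(kl_divergence(p, q))
```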

For the presidential election prediction, GPT-4o-mini and Qwen2.5-72b show competitive performance on the accuracy and RMSE metrics. Following the winner-takes-all rule, over 90% of state voting results are predicted correctly, achieving high-precision macroscopic alignment with real-world election outcomes. In the breaking news feedback scenario, GPT-4o and Qwen2.5-72b align most closely with real-world perspectives on KL-divergence and NRMSE, successfully capturing public trends and opinions. For the national economic survey, Llama3-70b shows superior performance. Models generally perform better in developed regions (top-10 GDP regions) than overall, showing SocioVerse’s ability to reproduce individual spending habits accurately.

In conclusion, researchers introduce a generalized social simulation framework called SocioVerse and evaluate its performance across three distinct real-world scenarios. Their findings indicate that state-of-the-art LLMs show a notable ability to simulate human responses in complex social contexts. Future research needs to incorporate a broader range of scenarios and develop more fine-grained evaluations built upon the current analytic engine to explore and expand the boundaries of LLMs’ simulation capabilities further. Such efforts could pave the way for establishing LLMs as reliable tools for large-scale social simulation, transforming how researchers approach the study of human behavior in diverse social environments.




Check out the Paper and GitHub Page.


 


This AI Paper from China Proposes a Novel Training-Free Approach DEER that Allows Large Reasoning Language Models to Achieve Dynamic Early Exit in Reasoning​


By Sana Hassan

April 26, 2025

Recent progress in large reasoning language models (LRLMs), such as DeepSeek-R1 and GPT-O1, has greatly improved complex problem-solving abilities by extending the length of chain-of-thought (CoT) generation during inference. These models benefit from test-time scaling laws, allowing richer and more diverse reasoning paths. However, generating overly long CoT sequences leads to computational inefficiency and increased latency, making real-world deployment challenging. Moreover, excessive reasoning often introduces redundant or irrelevant steps, which can cause models to deviate from correct answers, ultimately reducing accuracy. This overthinking problem stems from traditional supervised fine-tuning and reinforcement learning approaches that do not prioritize dynamic control over reasoning length. Research has shown that in many cases, reasoning could be halted earlier, at what the authors call “pearl reasoning” points, without sacrificing correctness. Identifying and stopping at these critical points could significantly improve efficiency while maintaining model performance.

Existing approaches to improve inference efficiency generally fall into three categories: post-training, prompt-based, and output-based methods. Post-training techniques involve retraining models with variable-length CoT examples or length rewards, but they are often computationally intensive and risk overfitting. Prompt-based methods adjust CoT length by modifying the input prompts based on task difficulty, achieving more concise reasoning without sacrificing much accuracy. Output-based methods typically focus on sampling techniques, such as early stopping when multiple outputs converge on the same answer. However, with newer models like R1, reliance on best-of-N sampling has decreased. Recent works have explored early exiting strategies, but they often require separate verification models or are only effective in limited settings. In contrast, the discussed approach aims to empower models to recognize optimal stopping points during their reasoning process, providing a more seamless and generalizable solution.

Researchers from the Institute of Information Engineering, the University of Chinese Academy of Sciences, and Huawei Technologies have proposed DEER, a simple, training-free method to enable LRLMs to dynamically exit early during reasoning. DEER monitors key transition points, such as the generation of “Wait” tokens, and prompts the model to produce trial answers at these moments. If the model shows high confidence, reasoning is halted; otherwise, it continues. This approach integrates seamlessly with existing models, such as DeepSeek, and reduces CoT length by 31–43%, while improving accuracy by 1.7–5.7% across benchmarks including MATH-500, AIME 2024, and GPQA Diamond.

The DEER (Dynamic Early Exit in Reasoning) method enables large reasoning language models to exit reasoning early by evaluating their confidence in trial answers at key transition points. It uses three modules: a reasoning transition monitor to detect “thought switch” signals, an answer inducer to prompt a trial conclusion, and a confidence evaluator to assess if the reasoning is sufficient. If confidence exceeds a threshold, reasoning stops; otherwise, it continues. To reduce latency from trial answer generation, DEER also employs branch-parallel decoding with dynamic cache management, thereby improving efficiency without sacrificing accuracy, particularly for tasks such as code generation.
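
Putting those three modules together, the control flow can be sketched roughly as follows. The `model` interface, the stop strings, and the 0.95 threshold are illustrative assumptions rather than the authors' implementation.

```python
# Toy sketch of DEER-style dynamic early exit (interface is hypothetical).
THOUGHT_SWITCH = ("Wait",)  # transition token(s) that trigger a trial answer

def generate_with_deer(model, prompt: str, threshold: float = 0.95,
                       max_rounds: int = 64) -> str:
    reasoning, trial = "", ""
    for _ in range(max_rounds):
        # Reasoning transition monitor: decode until a "thought switch".
        reasoning += model.generate(prompt + reasoning, stop=THOUGHT_SWITCH)
        # Answer inducer: prompt a trial conclusion from the current trace.
        trial, conf = model.answer_with_confidence(
            prompt + reasoning + "\nFinal answer:")
        if conf >= threshold:   # confidence evaluator: exit reasoning early
            break
        reasoning += "Wait"     # otherwise, resume thinking past the switch
    return trial

class DummyModel:
    """Stand-in exposing the two hooks DEER needs (real use: an LRLM)."""
    def generate(self, text, stop):
        return " one more step of reasoning."
    def answer_with_confidence(self, text):
        return "42", 0.97  # (trial answer, e.g. mean token probability)

print(generate_with_deer(DummyModel(), "Solve: 6*7 ="))  # -> "42"
```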

The experiments evaluated models on four major reasoning benchmarks: MATH-500, AMC 2023, AIME 2024, and GPQA Diamond, as well as programming benchmarks HumanEval and BigCodeBench. Tests were conducted using DeepSeek-R1-Distill-Qwen models of varying sizes (1.5B to 32B parameters) under a Zero-shot Chain-of-Thought setup. DEER significantly improved performance by reducing reasoning length by 31–43% while increasing accuracy by 1.7–5.7% compared to standard CoT. A detailed analysis revealed that DEER corrected more responses through early exits, particularly for smaller models and simpler tasks. On programming benchmarks, DEER also reduced reasoning length by over 60% with minimal or no loss in accuracy, demonstrating its robustness across various tasks.



In conclusion, the study validates the idea of using early exits during CoT generation through pilot studies. Based on these findings, it introduces a training-free dynamic early exit method that enables models to stop reasoning once enough information is gathered. Tested across various model sizes and six major reasoning benchmarks, the method achieves better accuracy with fewer tokens, effectively balancing efficiency and performance. Unlike traditional approaches that rely on long CoT for complex tasks, this method dynamically monitors model confidence to determine when to stop reasoning, thereby avoiding unnecessary steps. Experiments show significant reductions in reasoning length while boosting overall accuracy.




Check out the Paper.


 


Tiny Models, Big Reasoning Gains: USC Researchers Introduce Tina for Cost-Effective Reinforcement Learning with LoRA​


By Sana Hassan

April 27, 2025

Achieving strong, multi-step reasoning in language models (LMs) remains a major challenge, despite notable progress in general task performance. Such reasoning is crucial for complex problem-solving domains such as scientific research and strategic planning. Traditionally, enhancing reasoning skills involves supervised fine-tuning (SFT), where models learn by imitating step-by-step reasoning demonstrations from more advanced models, such as o1. While effective, this method depends heavily on the availability of high-quality reasoning traces, which are costly to obtain and risk promoting shallow mimicry over genuine logical exploration. Reinforcement learning (RL) offers an alternative by enabling models to learn directly from reward signals, encouraging broader reasoning exploration. However, RL approaches are often resource-heavy and complex, raising the question of how to build reasoning-capable models cost-effectively.

Following the release of strong models like o1-preview, several open-source efforts such as STILL, Sky-T1, SimpleRL, PRIME, and DeepScaleR have explored efficient strategies to replicate or surpass o1’s reasoning capabilities. Techniques include lightweight imitation learning, scalable instruction tuning, and simplified RL methods. Meanwhile, newer innovations, such as Group Relative Policy Optimization (GRPO), enhance RL training efficiency by eliminating the need for separate value networks, as seen in models like DeepSeek-R1. To further lower training costs, researchers are also investigating Low-Rank Adaptation (LoRA) methods, which update only a small subset of model parameters, maintaining modularity while preserving reasoning ability. This approach enables efficient fine-tuning without the computational demands of full-parameter updates.

Researchers from the University of Southern California introduce Tina, a family of compact reasoning models that achieve strong performance with minimal cost. Using RL enhanced by LoRA on a 1.5B parameter base model, Tina models outperform or match state-of-the-art models at a fraction of the computational expense. Their best model improves reasoning performance by over 20% and achieves 43.33% Pass@1 on AIME24, with a post-training cost of just $9. By leveraging LoRA’s efficiency to adapt reasoning formats while preserving base knowledge, Tina highlights a highly accessible, cost-effective approach, with all resources fully open-sourced.

Tina is a family of tiny reasoning models built by post-training the DeepSeek-R1-Distill-Qwen-1.5B model with LoRA during reinforcement learning, using a GRPO-style approach. The framework emphasizes minimalism—tiny models, small parameter updates, and a low hardware and budget footprint. Tina models were trained on public datasets with setups replicated from models like STILL-3, DeepScaleR, and Open-RS. Training leveraged the OpenR1 codebase, minimal hyperparameter tuning, and just two NVIDIA L40S GPUs (occasionally RTX 6000 Ada GPUs). Training and evaluation costs were low, averaging well under $100 per experiment, making Tina a highly accessible platform for reasoning research.
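
As a sketch of the LoRA setup, attaching low-rank adapters to the distilled base model with Hugging Face PEFT looks roughly like the following. The rank, scaling factor, and target modules are assumed values for illustration, not Tina's reported configuration.

```python
# Minimal LoRA attachment sketch (hyperparameters are assumptions).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")

lora_cfg = LoraConfig(
    r=16,                      # low-rank dimension (assumed)
    lora_alpha=32,             # scaling factor (assumed)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only a small fraction is trainable
```

Because only the adapter weights are updated, the RL loop (e.g., a GRPO-style trainer) fits on modest hardware while the base model's knowledge stays frozen.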

To ensure fair comparisons, the authors reevaluated baseline reasoning models using a consistent setup with the LightEval framework and vLLM engine, thereby eliminating variations introduced by previous studies. Six reasoning benchmarks, including AIME 24/25, AMC 23, MATH 500, GPQA, and Minerva, were utilized. They then evaluated Tina models—small, LoRA-trained versions of baseline models—showing that Tina models often outperformed their full-parameter counterparts despite using minimal training (19–57% of an epoch). Further ablation studies revealed that smaller, high-quality datasets, appropriate learning rates, moderate LoRA ranks, and careful choice of RL algorithm significantly impacted performance, confirming the efficiency and robustness of their LoRA-based reasoning approach.

AD_4nXd5trs8FbRtPbAizMokJDQ4VM4RHoCoq-iYwGQE8wqfxUdSaXTOKREsuJVU06XZpEBT5uZQ5ZduUQEJgmZl5vMg8RINwj5ENefq3W_9tfWeZZY5_sE5aaT3ukxSI5pEtknWxkj7Kw


In conclusion, Tina is a series of lightweight reasoning models that achieve strong performance using minimal computational resources. By applying LoRA during RL on a 1.5B-parameter base model, the authors achieve reasoning abilities competitive with larger state-of-the-art models at a post-training cost of just $9. Tina models show over a 20% improvement in reasoning and 43.33% Pass@1 accuracy on AIME24. While showcasing impressive cost-performance efficiency, limitations remain, including the smaller model scale, limited diversity in reasoning tasks, and minimal hyperparameter tuning. All code, logs, and model checkpoints are open-sourced to promote accessible research and further exploration.




Check out the Paper and GitHub Page.


 


Alibaba Qwen Team Just Released Qwen3: The Latest Generation of Large Language Models in Qwen Series, Offering a Comprehensive Suite of Dense and Mixture-of-Experts (MoE) Models​


By Asif Razzaq

April 28, 2025

Despite the remarkable progress in large language models (LLMs), critical challenges remain. Many models exhibit limitations in nuanced reasoning, multilingual proficiency, and computational efficiency. Often, models are either highly capable in complex tasks but slow and resource-intensive, or fast but prone to superficial outputs. Furthermore, scalability across diverse languages and long-context tasks continues to be a bottleneck, particularly for applications requiring flexible reasoning styles or long-horizon memory. These issues limit the practical deployment of LLMs in dynamic real-world environments.

Qwen3 Just Released: A Targeted Response to Existing Gaps


Qwen3, the latest release in the Qwen family of models developed by Alibaba Group, aims to systematically address these limitations. Qwen3 introduces a new generation of models specifically optimized for hybrid reasoning, multilingual understanding, and efficient scaling across parameter sizes.

The Qwen3 series expands upon the foundation laid by earlier Qwen models, offering a broader portfolio of dense and Mixture of Experts (MoE) architectures. Designed for both research and production use cases, Qwen3 models target applications that require adaptable problem-solving across natural language, coding, mathematics, and broader multimodal domains.



Technical Innovations and Architectural Enhancements


Qwen3 distinguishes itself with several key technical innovations:

  • Hybrid Reasoning Capability :
    A core innovation is the model’s ability to dynamically switch between “thinking” and “non-thinking” modes. In “thinking” mode, Qwen3 engages in step-by-step logical reasoning—crucial for tasks like mathematical proofs, complex coding, or scientific analysis. In contrast, “non-thinking” mode provides direct and efficient answers for simpler queries, optimizing latency without sacrificing correctness (see the usage sketch after this list).
  • Extended Multilingual Coverage :
    Qwen3 significantly broadens its multilingual capabilities, supporting over 100 languages and dialects, improving accessibility and accuracy across diverse linguistic contexts.
  • Flexible Model Sizes and Architectures :
    The Qwen3 lineup includes models ranging from 0.6 billion parameters (dense) to 235 billion parameters (MoE). The flagship model, Qwen3-235B-A22B, activates only 22 billion parameters per inference, enabling high performance while maintaining manageable computational costs.
  • Long Context Support :
    Certain Qwen3 models support context windows up to 128,000 tokens , enhancing their ability to process lengthy documents, codebases, and multi-turn conversations without degradation in performance.
  • Advanced Training Dataset :
    Qwen3 leverages a refreshed, diversified corpus with improved data quality control, aiming to minimize hallucinations and enhance generalization across domains.
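
As promised above, here is a sketch of toggling the hybrid reasoning mode through the Hugging Face chat template. The `enable_thinking` flag follows Qwen's published usage notes; the model size and prompt are arbitrary choices for illustration.

```python
# Sketch of Qwen3 thinking-mode toggling (model ID and prompt illustrative).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,   # False skips the step-by-step reasoning trace
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:]))
```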

Additionally, the Qwen3 base models are released under an open license (subject to specified use cases), enabling the research and open-source community to experiment and build upon them.

Empirical Results and Benchmark Insights


Benchmarking results illustrate that Qwen3 models perform competitively against leading contemporaries:

  • The Qwen3-235B-A22B model achieves strong results across coding (HumanEval, MBPP), mathematical reasoning (GSM8K, MATH), and general knowledge benchmarks, rivaling DeepSeek-R1 and Gemini 2.5 Pro series models.
  • The Qwen3-72B and Qwen3-72B-Chat models demonstrate solid instruction-following and chat capabilities, showing significant improvements over the earlier Qwen1.5 and Qwen2 series.
  • Notably, the Qwen3-30B-A3B , a smaller MoE variant with 3 billion active parameters, outperforms Qwen2-32B on multiple standard benchmarks, demonstrating improved efficiency without a trade-off in accuracy.



Early evaluations also indicate that Qwen3 models exhibit lower hallucination rates and more consistent multi-turn dialogue performance compared to previous Qwen generations.

Conclusion


Qwen3 represents a thoughtful evolution in large language model development. By integrating hybrid reasoning, scalable architecture, multilingual robustness, and efficient computation strategies, Qwen3 addresses many of the core challenges that continue to affect LLM deployment today. Its design emphasizes adaptability—making it equally suitable for academic research, enterprise solutions, and future multimodal applications.

Rather than offering incremental improvements, Qwen3 redefines several important dimensions in LLM design, setting a new reference point for balancing performance, efficiency, and flexibility in increasingly complex AI systems.




Check out the Blog, Models on Hugging Face, and GitHub Page.


 