bnew

Veteran
Joined
Nov 1, 2015
Messages
68,719
Reputation
10,592
Daps
185,778

Crome: Google DeepMind’s Causal Framework for Robust Reward Modeling in LLM Alignment​


By Sajjad Ansari

July 3, 2025

Reward models are fundamental components for aligning LLMs with human feedback, yet they are vulnerable to reward hacking. These models tend to latch onto superficial attributes such as response length or formatting rather than identifying true quality indicators like factuality and relevance. The problem arises because standard training objectives fail to differentiate between spurious correlations present in the training data and the genuine causal drivers of response quality. The failure to separate these factors produces brittle reward models (RMs) that in turn generate misaligned policies. What is needed is a method that uses a causal understanding of preference formation to train RMs that are sensitive to causal quality attributes and invariant to spurious cues.

Limitations of Existing RM Approaches and the Need for Causal Robustness


Existing methods try to solve reward hacking in standard RLHF systems that rely on Bradley-Terry or pairwise ranking objectives. These include architectural modifications such as Odin, policy-level adjustments, and data-centric methods involving ensembles or consistency checks. Recent causal-inspired methods use MMD regularization against pre-specified spurious factors or estimate causal effects through corrected rewrites. However, these methods target only predetermined spurious factors and miss unknown correlates, while augmentation strategies remain coarse and evaluation-focused methods fail to equip reward models with training mechanisms that are robust to diverse spurious variations.

Introducing Crome: Causally Robust Reward Modeling for LLMs


Researchers from Google DeepMind, McGill University, and MILA – Quebec AI Institute have proposed Crome (Causally Robust Reward Modeling), a framework built on an explicit causal model of answer generation. Crome trains RMs to differentiate genuine quality drivers from superficial cues by augmenting preference datasets with targeted, LLM-generated counterfactual examples. It creates two types of synthetic training pairs: (a) Causal Augmentations, which introduce changes along specific causal attributes, such as factuality, to enforce sensitivity to true quality shifts, and (b) Neutral Augmentations, which enforce invariance along spurious attributes like style using tie-labels. Crome improves robustness, increasing RewardBench accuracy by up to 4.5% while enhancing safety and reasoning.
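
To make the two pair types concrete, here is a minimal sketch of how such counterfactual pairs could be assembled. The `rewrite_llm` helper, the attribute instructions, and the label conventions are illustrative assumptions, not the paper's exact pipeline.

```python
# Hedged sketch of Crome-style counterfactual pair construction.
# `rewrite_llm(prompt, answer, instruction)` is an assumed helper that asks an LLM
# (the paper uses Gemini 2.0 Flash) to rewrite an answer under an instruction.

def make_causal_pair(prompt, answer, rewrite_llm):
    """Degrade a causal attribute (factuality here) to build a clear preference pair:
    the original answer should be ranked above the corrupted rewrite."""
    corrupted = rewrite_llm(
        prompt, answer,
        instruction="Introduce a subtle factual error; keep style and length unchanged.",
    )
    return {"prompt": prompt, "chosen": answer, "rejected": corrupted, "label": "prefer_chosen"}

def make_neutral_pair(prompt, answer, rewrite_llm):
    """Vary only a spurious attribute (style/formatting here) to build a tie-labeled pair:
    the reward model should score both answers equally."""
    restyled = rewrite_llm(
        prompt, answer,
        instruction="Rewrite with different formatting and tone; keep every fact identical.",
    )
    return {"prompt": prompt, "response_a": answer, "response_b": restyled, "label": "tie"}
```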

Technical Approach: Counterfactual Augmentation and Composite Loss Optimization


Crome operates in two main phases: generating attribute-aware counterfactual data based on a causal model, and training the reward model with a specialized loss on the combined data. The paper provides a theoretical analysis of how causal augmentation isolates true reward drivers from spurious correlates under an idealized model. Crome uses the UltraFeedback dataset with counterfactuals generated by Gemini 2.0 Flash, and evaluates performance on RewardBench and reWordBench. The experiments cover diverse base LLMs, including Gemma-2-9B-IT, Qwen2.5-7B, and Gemma-2-2B, for both Pairwise Preference and Bradley-Terry reward models, and measure downstream alignment impact through Best-of-N selection on multiple tasks.
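
A rough sketch of how a Bradley-Terry reward model could be trained on the combined data is shown below. The squared-difference tie term for neutral pairs is one plausible choice of invariance penalty, an assumption on my part rather than Crome's published composite loss.

```python
import torch.nn.functional as F

def composite_loss(r_chosen, r_rejected, r_tie_a, r_tie_b, tie_weight=1.0):
    """Bradley-Terry ranking loss on preference pairs plus an invariance term on tie pairs.

    All inputs are 1-D tensors of reward-model scores. Preference pairs come from the
    original data and causal augmentations; tie pairs come from neutral augmentations.
    """
    # Standard Bradley-Terry / pairwise term: the chosen response should outscore the rejected one.
    bt_term = -F.logsigmoid(r_chosen - r_rejected).mean()
    # Invariance term: tie-labeled responses should receive (nearly) equal scores.
    tie_term = (r_tie_a - r_tie_b).pow(2).mean()
    return bt_term + tie_weight * tie_term
```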



Performance Gains: From RewardBench to WildGuardTest


On RewardBench, Crome achieves improvements in ranking accuracy over RRM across diverse base models, with significant gains in Safety (up to 13.18%) and Reasoning (up to 7.19%) categories. Crome shows aggregate accuracy gains of up to 9.1% on reWordBench with Gemma-2-9B-IT in PairPM settings and superior performance on 21 out of 23 transformations. Moreover, it shows a smaller decrease in ranking accuracy from RewardBench to reWordBench compared to RRM (19.78% versus 21.54%). Crome shows excellent safety improvements on WildGuardTest with Best-of-N selection, achieving lower attack success ratios on harmful prompts while maintaining similar refusal rates on benign prompts.
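
For context, Best-of-N selection itself is straightforward: sample N candidate responses from the policy, score each with the reward model, and keep the highest-scoring one. A minimal sketch with hypothetical `generate` and `reward` callables:

```python
def best_of_n(prompt, generate, reward, n=16):
    """Sample n responses for a prompt and return the one the reward model scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    scored = [(reward(prompt, c), c) for c in candidates]
    return max(scored, key=lambda pair: pair[0])[1]
```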

Conclusion and Future Directions in Causal Data Augmentation


In conclusion, researchers introduced Crome, a causal framework that addresses reward hacking during RM training. It employs two targeted synthetic data augmentation strategies: Causal Augmentations and Neutral Augmentations. Crome outperforms strong baselines across multiple base models and reward modeling techniques on RewardBench, and shows superior robustness to spurious correlations on reWordBench. This dataset-curation-centered training method (i.e., Crome) opens new research directions in synthetic data generation for base model training, where causal attribute verification could prove highly beneficial for future developments in robust language model alignment.




Check out the Paper. All credit for this research goes to the researchers of this project.

 

bnew


Baidu Researchers Propose AI Search Paradigm: A Multi-Agent Framework for Smarter Information Retrieval​


By Nikhil

July 1, 2025

The Need for Cognitive and Adaptive Search Engines


Modern search systems are evolving rapidly as the demand for context-aware, adaptive information retrieval grows. With the increasing volume and complexity of user queries, particularly those requiring layered reasoning, systems are no longer limited to simple keyword matching or document ranking. Instead, they aim to mimic the cognitive behaviors humans exhibit when gathering and processing information. This transition towards a more sophisticated, collaborative approach marks a fundamental shift in how intelligent systems are designed to respond to users.

Limitations of Traditional and RAG Systems


Despite these advances, current methods still face critical limitations. Retrieval-augmented generation (RAG) systems, while useful for direct question answering, often operate in rigid pipelines. They struggle with tasks that involve conflicting information sources, contextual ambiguity, or multi-step reasoning. For example, a query that compares the ages of historical figures requires understanding, calculating, and comparing information from separate documents, tasks that demand more than simple retrieval and generation. The absence of adaptive planning and robust reasoning mechanisms often leads to shallow or incomplete answers in such cases.



The Emergence of Multi-Agent Architectures in Search


Several tools have been introduced to enhance search performance, including Learning-to-Rank systems and advanced retrieval mechanisms utilizing Large Language Models (LLMs). These frameworks incorporate features like user behavior data, semantic understanding, and heuristic models. However, even advanced RAG methods, including ReAct and RQ-RAG, primarily follow static logic, which limits their ability to effectively reconfigure plans or recover from execution failures. Their dependence on one-shot document retrieval and single-agent execution further restricts their ability to handle complex, context-dependent tasks.

Introduction of the AI Search Paradigm by Baidu


Researchers from Baidu introduced a new approach called the “AI Search Paradigm,” designed to overcome the limitations of static, single-agent models. It comprises a multi-agent framework with four key agents: Master, Planner, Executor, and Writer. Each agent is assigned a specific role within the search process. The Master coordinates the entire workflow based on the complexity of the query. The Planner structures complex tasks into sub-queries. The Executor manages tool usage and task completion. Finally, the Writer synthesizes the outputs into a coherent response. This modular architecture enables flexibility and precise task execution that traditional systems lack.





Use of Directed Acyclic Graphs for Task Planning


The framework introduces a Directed Acyclic Graph (DAG) to organize complex queries into dependent sub-tasks. The Planner chooses relevant tools from the MCP servers to address each sub-task. The Executor then invokes these tools iteratively, adjusting queries and fallback strategies when tools fail or data is insufficient. This dynamic reassignment ensures continuity and completeness. The Writer evaluates the results, filters inconsistencies, and compiles a structured response. For example, in a query asking who is older than Emperor Wu of Han and Julius Caesar, the system retrieves birthdates from different tools, performs the age calculation, and delivers the result—all in a coordinated, multi-agent process.
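
The general pattern, executing a DAG of sub-tasks in dependency order with a fallback when a tool call fails, can be sketched as follows. The task structure and the `tools` registry here are hypothetical stand-ins for the paper's MCP-based tool servers.

```python
from graphlib import TopologicalSorter

def execute_dag(tasks, tools):
    """tasks: {task_id: {"deps": [...], "tool": tool_name, "query": str}}
    tools: {tool_name: callable(query, context) -> result}
    Executes sub-tasks in dependency order and returns all intermediate results."""
    order = TopologicalSorter({tid: set(t["deps"]) for tid, t in tasks.items()})
    results = {}
    for tid in order.static_order():                 # predecessors always come first
        task = tasks[tid]
        context = {dep: results[dep] for dep in task["deps"]}
        try:
            results[tid] = tools[task["tool"]](task["query"], context)
        except Exception:
            # Executor-style fallback: retry with a generic tool (assumed name) when the
            # chosen tool fails or returns insufficient data.
            results[tid] = tools["web_search"](task["query"], context)
    return results
```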

Qualitative Evaluations and Workflow Configurations


The performance of this new system was evaluated using several case studies and comparative workflows. Unlike traditional RAG systems, which operate in a one-shot retrieval mode, the AI Search Paradigm dynamically replans and reflects on each sub-task. The system supports three team configurations based on complexity: Writer-Only, Executor-Inclusive, and Planner-Enhanced. For the Emperor age comparison query, the Planner decomposed the task into three sub-steps and assigned tools accordingly. The final output stated that Emperor Wu of Han lived for 69 years and Julius Caesar for 56 years, indicating a 13-year difference—an output accurately synthesized across multiple sub-tasks. While the paper focused more on qualitative insights than numeric performance metrics, it demonstrated strong improvements in user satisfaction and robustness across tasks.



Conclusion: Toward Scalable, Multi-Agent Search Intelligence


In conclusion, this research presents a modular, agent-based framework that enables search systems to surpass document retrieval and emulate human-style reasoning. The AI Search Paradigm represents a significant advancement by incorporating real-time planning, dynamic execution, and coherent synthesis. It not only solves current limitations but also offers a foundation for scalable, trustworthy search solutions driven by structured collaboration between intelligent agents.




Check out the Paper. All credit for this research goes to the researchers of this project.

 

bnew


ReasonFlux-PRM: A Trajectory-Aware Reward Model Enhancing Chain-of-Thought Reasoning in LLMs​


By Nikhil

July 2, 2025

Understanding the Role of Chain-of-Thought in LLMs


Large language models are increasingly being used to solve complex tasks such as mathematics and scientific reasoning through structured chain-of-thought approaches. These models do not just jump to answers—they reason through intermediate steps that simulate logical thought processes. This technique allows for improved reasoning accuracy and clearer error tracing. As models become more sophisticated, it has become essential to evaluate not just final responses but also the reasoning steps that lead to them.

Limitations of Traditional PRMs in Reasoning Evaluation


One pressing issue is that most current reward models only assess final answers, ignoring how those conclusions were reached. However, frontier models like Deepseek-R1 now output extensive reasoning paths before delivering final responses. These trajectory-response pairs are being reused to train smaller models. The problem is that current Process Reward Models (PRMs) are not built to evaluate these full trajectories. This mismatch leads to unreliable supervision, which can degrade the performance of smaller models trained on trajectory-response data.

Challenges in Handling Disorganized Reasoning Chains


Traditional PRMs are primarily calibrated for structured, clean outputs rather than the lengthy and sometimes disorganized reasoning chains generated by advanced LLMs. Even advanced PRMs, such as Qwen2.5-Math-PRM-72B, show a limited ability to distinguish between high- and low-quality intermediate reasoning. When applied to trajectory-response outputs from Gemini or Deepseek-R1, these models often produce overlapping reward scores, indicating weak discrimination. Their limited sensitivity leads to poor data selection for downstream fine-tuning, and experiments confirm that models trained on PRM-selected data perform worse than those trained on human-curated datasets.

Introducing ReasonFlux-PRM for Trajectory-Level Supervision


Researchers from the University of Illinois Urbana-Champaign (UIUC), Princeton University, Cornell University, and ByteDance Seed introduced ReasonFlux-PRM, a trajectory-aware reward model that evaluates both intermediate reasoning steps and final answers. It integrates step-level and trajectory-level scoring, enabling a more nuanced assessment of reasoning quality. ReasonFlux-PRM is trained on a 10,000-sample dataset of carefully curated math and science problems explicitly designed to mirror real-world trajectory-response formats.



Technical Framework of ReasonFlux-PRM


Technically, ReasonFlux-PRM operates by scoring each intermediate step in a trajectory with respect to its contribution to the final answer. It uses a reference reward function that considers the prompt, prior reasoning steps, and final output to assign step-level scores. These are then aggregated to produce a total trajectory reward. The model supports multiple applications, including offline filtering of high-quality training data, dense reward provision during reinforcement learning with GRPO-based policy optimization, and Best-of-N test-time response selection to enhance inference quality. These capabilities make ReasonFlux-PRM more flexible and comprehensive than prior PRMs.
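
As a rough illustration of the step-to-trajectory aggregation and its use for offline data filtering, here is a hedged sketch. `step_score` stands in for the paper's reference reward function, and mean aggregation is an assumption; the actual combination may be weighted differently.

```python
def trajectory_reward(prompt, steps, final_answer, step_score):
    """Score each intermediate step given the prompt, the prior steps, and the final answer,
    then aggregate the step scores into a single trajectory-level reward (mean here)."""
    scores = [
        step_score(prompt, steps[:i], step, final_answer)
        for i, step in enumerate(steps)
    ]
    return sum(scores) / max(len(scores), 1)

def filter_training_data(samples, step_score, threshold=0.5):
    """Offline filtering: keep only trajectory-response pairs whose reward clears a threshold."""
    return [
        s for s in samples
        if trajectory_reward(s["prompt"], s["steps"], s["answer"], step_score) >= threshold
    ]
```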

Empirical Results on Reasoning Benchmarks


In performance evaluations across tasks like AIME, MATH500, and GPQA-Diamond, ReasonFlux-PRM-7B outperformed Qwen2.5-Math-PRM-72B and human-curated data in several key metrics. Specifically, it achieved a 12.1% accuracy gain in supervised fine-tuning, a 4.5% improvement during reinforcement learning, and a 6.3% increase during test-time scaling. These gains are particularly considerable given that ReasonFlux-PRM is smaller in model size. Table 1 shows that the Qwen2.5-14B-Instruct model, when trained on data selected by ReasonFlux-PRM, achieved performance levels close to or exceeding human-curated baselines. In contrast, other PRMs resulted in significant drops of up to 26.6% in certain benchmarks.

Impact and Future Direction of ReasonFlux-PRM


This research addresses a crucial limitation in the training and evaluation of modern reasoning models. By enabling supervision over both thinking trajectories and final answers, ReasonFlux-PRM enhances the quality of training data and the reliability of model responses. It sets a new direction for systematically evaluating and improving reasoning processes in large models.




Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

 

bnew


ByteDance Researchers Introduce Seed-Coder: A Model-Centric Code LLM Trained on 6 Trillion Tokens​


By Sana Hassan

June 25, 2025

Reframing Code LLM Training through Scalable, Automated Data Pipelines


Code data plays a key role in training LLMs, benefiting not just coding tasks but also broader reasoning abilities. While many open-source models rely on manual filtering and expert-crafted rules to curate code datasets, these approaches are time-consuming, biased, and hard to scale across languages. Proprietary models like Claude 3.7 and OpenAI o3 excel at coding tasks but don’t share details about their data, and even open-source models like DeepSeek and Qwen2.5 still depend heavily on human-designed filters. This reliance limits progress, echoing “The Bitter Lesson” that real breakthroughs come from scalable, data-driven methods, not handcrafted heuristics.

Seed-Coder’s Model-First Pipeline Minimizes Human Dependency in Pretraining


Researchers at ByteDance introduce Seed-Coder, a family of 8B open-source LLMs including base, instruction, and reasoning models, designed to reduce human involvement in code data curation. Instead of relying on manual rules, their model-centric pipeline utilizes LLMs to score and filter large-scale code data from sources such as GitHub and code-related websites, resulting in a 6-trillion-token dataset. The instruction model is fine-tuned using synthetic data and preference optimization, while the reasoning model enhances multi-step code logic via Long-Chain-of-Thought reinforcement learning. Seed-Coder achieves top performance for its size, often surpassing larger models, and is openly shared to encourage further research and development.

6-Trillion Token Corpus Built with LLM Quality Filters across GitHub and Web Data


Seed-Coder is trained using a model-driven approach that minimizes manual intervention. The pretraining corpus comprises approximately 6 trillion tokens drawn from GitHub code, commit histories, and code-related web data. Initially, basic filtering removes files with syntax issues or inappropriate content. Then, large language models are used to evaluate and score the remaining code, ensuring high-quality data without relying on hand-crafted rules. Pretraining occurs in two stages: first with core code and web data, and later with more complex structures such as full repositories and long-context tasks like fill-in-the-middle, to enhance the model’s coding capabilities.
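
The model-centric filter reduces to "run cheap checks, then score each file with an LLM and keep what clears a bar." A hedged sketch, where `passes_basic_checks` and `llm_quality_score` are hypothetical stand-ins for Seed-Coder's actual rules and scoring model:

```python
def filter_code_corpus(files, passes_basic_checks, llm_quality_score, min_score=0.7):
    """Two-step curation: rule-based checks first, then LLM quality scoring.

    files: iterable of {"path": str, "text": str}
    passes_basic_checks: callable(text) -> bool (e.g., syntax parses, no unwanted content)
    llm_quality_score: callable(text) -> float in [0, 1] (readability, correctness, usefulness)
    """
    kept = []
    for f in files:
        if not passes_basic_checks(f["text"]):
            continue                      # drop files with syntax issues or inappropriate content
        if llm_quality_score(f["text"]) >= min_score:
            kept.append(f)                # keep only high-quality code for pretraining
    return kept
```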

Post-Training via Instruction Tuning and LongCoT Enables Multi-Step Code Understanding


After pretraining, Seed-Coder undergoes further refinement through two post-training stages. First, the instruction model is trained using supervised fine-tuning on a diverse set of synthetic instruction data generated and filtered by LLMs, helping it better understand and follow human prompts. Then, its performance is enhanced using direct preference optimization (DPO), which aligns model responses more closely with human preferences. For complex reasoning tasks, the reasoning model is improved using LongCoT reinforcement learning, which strengthens its ability to handle multi-step coding challenges. These steps significantly boost Seed-Coder’s performance across various code generation and reasoning tasks.
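
For reference, the standard direct preference optimization (DPO) objective used in this kind of post-training looks like the following; Seed-Coder's exact hyperparameters and implementation details are not specified here and may differ.

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO objective: push the policy to prefer chosen over rejected responses
    relative to a frozen reference model. Inputs are summed log-probabilities of each
    response under the policy and under the reference model."""
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```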



Seed-Coder Excels in Code Generation, Editing, and Multi-Step Reasoning Benchmarks


The evaluation reveals that the three Seed-Coder models, Base, Instruct, and Reasoning, perform exceptionally well across a range of coding tasks. The Base model outperforms other open-source models of similar size on code generation tasks, achieving strong scores on benchmarks like HumanEval and MultiPL-E. The Instruct model excels in tasks requiring code editing and instruction-following, leading in evaluations such as CodeEditorBench and FullStack. The Reasoning model, trained with long-chain-of-thought techniques, demonstrates outstanding multi-step problem-solving skills, particularly on challenging benchmarks like LiveCodeBench and Codeforces, even surpassing models that are several times larger in size.



Open-Source Release Encourages Community-Driven Advancements in Code LLMs


In conclusion, Seed-Coder is a family of efficient and high-performing open-source language models designed specifically for coding tasks. These models stand out by relying largely on LLMs rather than humans to filter and curate training data, significantly reducing manual effort. Despite being trained on fewer tokens compared to some larger models, Seed-Coder exhibits exceptional performance in tasks such as code generation, completion, editing, and reasoning. However, its abilities in general language understanding are still limited due to the absence of broad web data and mathematical content. Future updates aim to expand the model family and improve its capabilities across different model sizes.




Check out the Paper, Model Series, GitHub Page, and Project Page. All credit for this research goes to the researchers of this project.

 

bnew


Baidu Open Sources ERNIE 4.5: LLM Series Scaling from 0.3B to 424B Parameters​


By Asif Razzaq

July 1, 2025

Baidu has officially open-sourced its latest ERNIE 4.5 series, a powerful family of foundation models designed for enhanced language understanding, reasoning, and generation. The release includes ten model variants ranging from compact 0.3B dense models to massive Mixture-of-Experts (MoE) architectures, with the largest variant totaling 424B parameters. These models are now freely available to the global research and developer community through Hugging Face, enabling open experimentation and broader access to cutting-edge Chinese and multilingual language technology.

Technical Overview of ERNIE 4.5 Architecture


The ERNIE 4.5 series builds on Baidu’s previous iterations of ERNIE models by introducing advanced model architectures, including both dense and sparsely activated MoE designs. The MoE variants are particularly notable for scaling parameter counts efficiently: the ERNIE 4.5-MoE-3B and ERNIE 4.5-MoE-47B variants activate only a subset of experts per input token (typically 2 of 64 experts), keeping the number of active parameters manageable while retaining model expressivity and generalization capabilities.
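
The sparse-activation idea described here is standard top-k expert routing: a router scores all experts for each token and only the top few (2 of 64, per the article) are actually evaluated. A generic sketch of that mechanism, not ERNIE's actual implementation:

```python
import torch

def topk_moe_forward(x, router, experts, k=2):
    """x: [tokens, dim]; router: module mapping x -> [tokens, num_experts] logits;
    experts: list of per-expert feed-forward modules. Only k experts run per token."""
    gate = torch.softmax(router(x), dim=-1)             # [tokens, num_experts]
    weights, idx = torch.topk(gate, k, dim=-1)          # top-k gate weights and expert ids
    out = torch.zeros_like(x)
    for slot in range(k):
        for e, expert in enumerate(experts):
            mask = idx[:, slot] == e                     # tokens routed to expert e in this slot
            if mask.any():
                out[mask] += weights[mask, slot, None] * expert(x[mask])
    return out
```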

ERNIE 4.5 models are trained using a mixture of supervised fine-tuning (SFT), reinforcement learning with human feedback (RLHF), and contrastive alignment techniques. The training corpus spans 5.6 trillion tokens across diverse domains in both Chinese and English, using Baidu’s proprietary multi-stage pretraining pipeline. The resulting models demonstrate high fidelity in instruction-following, multi-turn conversation, long-form generation, and reasoning benchmarks.



Model Variants and Open-Source Release


The ERNIE 4.5 release includes the following ten variants:



  • Dense Models: ERNIE 4.5-0.3B, 0.5B, 1.8B, and 4B
  • MoE Models: ERNIE 4.5-MoE-3B, 4B, 6B, 15B, 47B, and 424B total parameters (with varying active parameters)

The MoE-47B variant, for instance, activates only 3B parameters during inference while having a total of 47B. Similarly, the 424B model—the largest ever released by Baidu—employs sparse activation strategies to make inference feasible and scalable. These models support both FP16 and INT8 quantization for efficient deployment.

Performance Benchmarks


ERNIE 4.5 models show significant improvements on several key Chinese and multilingual NLP tasks. According to the official technical report:

  • On CMMLU, ERNIE 4.5 surpasses previous ERNIE versions and achieves state-of-the-art accuracy in Chinese language understanding.
  • On MMLU, the multilingual benchmark, ERNIE 4.5-47B demonstrates competitive performance with other leading LLMs like GPT-4 and Claude.
  • For long-form generation, ERNIE 4.5 achieves higher coherence and factuality scores when evaluated using Baidu’s internal metrics.

In instruction-following tasks, the models benefit from contrastive fine-tuning, showing improved alignment with user intent and reduced hallucination rates compared to earlier ERNIE versions.



Applications and Deployment


ERNIE 4.5 models are optimized for a broad range of applications:

  • Chatbots and Assistants: Multilingual support and instruction-following alignment make it suitable for AI assistants.
  • Search and Question Answering: High retrieval and generation fidelity allow for integration with RAG pipelines.
  • Content Generation: Long-form text and knowledge-rich content generation are improved with better factual grounding.
  • Code and Multimodal Extension: Although the current release focuses on text, Baidu indicates that ERNIE 4.5 is compatible with multimodal extensions.

With support for up to 128K context length in some variants, the ERNIE 4.5 family can be used in tasks requiring memory and reasoning across long documents or sessions.

Conclusion


The ERNIE 4.5 series represents a significant step in open-source AI development, offering a versatile set of models tailored for scalable, multilingual, and instruction-aligned tasks. Baidu’s decision to release models ranging from lightweight 0.3B variants to a 424B-parameter MoE model underscores its commitment to inclusive and transparent AI research. With comprehensive documentation, open availability on Hugging Face, and support for efficient deployment, ERNIE 4.5 is positioned to accelerate global advancements in natural language understanding and generation.




Check out the Paper and Models on Hugging Face. All credit for this research goes to the researchers of this project.

 

bnew


ByteDance Researchers Introduce ProtoReasoning: Enhancing LLM Generalization via Logic-Based Prototypes​


By Sana Hassan

June 24, 2025

Why Cross-Domain Reasoning Matters in Large Language Models (LLMs)


Recent breakthroughs in large reasoning models (LRMs), especially those trained using long chain-of-thought (CoT) techniques, show that they can generalize impressively across different domains. Interestingly, models trained on tasks such as math or coding often perform well in unrelated areas, like logical puzzles or creative writing. However, what enables this flexibility isn’t fully clear. One possible explanation is that these models learn core reasoning patterns, known as abstract reasoning prototypes, which cut across domains. These shared cognitive structures enable the model to focus less on how problems are presented and more on the similar thought processes required to solve them, allowing for broader transfer.

From CoT to RL: A Shift in How LLMs Learn to Reason


Recent progress in large language model reasoning has shifted from simple CoT and supervised fine-tuning to RL. Models like DeepSeek-R1 and Seed-Thinking-v1.5 have enhanced long CoT reasoning through mathematical problems, logic tasks, and code execution. These models utilize RL techniques guided by verifiable rewards, such as accuracy against ground-truth answers, to explore complex reasoning paths. This approach enables models to learn from errors, break down complex problems, and refine solutions through iteration. In contrast to past methods, this work introduces the concept of “reasoning prototypes” to better understand the core thinking patterns that enable models to generalize across vastly different domains.

ProtoReasoning Framework: Structured Reasoning with Prolog and PDDL


Researchers from ByteDance Seed and Shanghai Jiao Tong University have developed ProtoReasoning, a framework designed to enhance reasoning in large language models by utilizing structured prototype representations, such as Prolog and PDDL. This system includes an automated pipeline to translate problems into these formats, a reliable verification setup using interpreters, and scalable problem synthesis without manual labeling. The models trained on these prototypes demonstrated notable improvements across various tasks, including logical reasoning (+4.7%), planning (+6.3%), general reasoning (+4.0%), and math (+1.0%). Crucially, training within this structured “prototype space” led to better generalization across similar tasks, supporting the idea that abstract reasoning patterns enhance cross-domain performance.

Architecture Overview: Prototype Constructor and Verifier System


The ProtoReasoning framework boosts reasoning in LLMs by using structured prototypes: Prolog for logic and PDDL for planning. It includes two core modules: a Prototype Constructor that translates natural language problems into formal representations, and a Verification System that checks solution correctness. For Prolog, a four-step pipeline generates diverse logic problems, which are verified using SWI-Prolog. For planning, tasks such as plan generation, completion, and reordering are built using PDDL, with correctness checked via the VAL validator. The training process includes teacher-model distillation for reasoning paths, difficulty-based sampling, and filtering to ensure that only high-quality data fine-tunes the model for robust generalization.
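
The verification side can be as simple as running each generated Prolog program through an interpreter and checking whether the model's answer matches the interpreter's output. A hedged sketch that assumes SWI-Prolog's `swipl` CLI is installed; the goal convention and helper names are illustrative, not the paper's exact harness:

```python
import subprocess
import tempfile

def prolog_answer(program: str, query_goal: str, timeout: int = 10) -> str:
    """Write a generated Prolog program to a temp file, run it with SWI-Prolog,
    and return whatever the goal prints (taken here as the ground-truth answer)."""
    with tempfile.NamedTemporaryFile("w", suffix=".pl", delete=False) as f:
        f.write(program)
        path = f.name
    result = subprocess.run(
        ["swipl", "-q", "-g", query_goal, "-t", "halt", path],
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout.strip()

def verify(model_answer: str, program: str, query_goal: str) -> bool:
    """Accept a model solution only if it matches the interpreter's output."""
    return model_answer.strip() == prolog_answer(program, query_goal)
```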



Evaluations Show Measurable Improvements in Reasoning and Planning


The ProtoReasoning framework was evaluated through experiments using a 150B-parameter Mixture-of-Experts model (15B active), trained on a curated set of high-quality Prolog and PDDL samples. Results showed consistent improvements across logical reasoning, planning, and general benchmarks, including MMLU and AIME 2024. A key ablation study compared Prolog-based training with natural-language (NL) versions on matched datasets. Both formats significantly outperformed the baseline, with Prolog achieving near-equal performance to NL, which suggests that the benefits of structured prototype training carry over to natural language tasks. However, explicit reasoning (e.g., chain-of-thought) remains crucial, and low-sample categories showed weaker gains due to insufficient data.



Key Findings and Theoretical Implications of Reasoning Prototypes


In conclusion, the researchers introduced ProtoReasoning, a framework built on the idea that abstract reasoning prototypes, such as Prolog for logic and PDDL for planning, enable large language models to generalize across domains. By training models on these structured representations, the study observed notable improvements in logical reasoning, planning, and general problem-solving tasks. The results support the hypothesis that shared reasoning patterns across domains facilitate knowledge transfer in models. While the empirical results are promising, the exact nature of reasoning prototypes remains theoretically underexplored. Future work will aim to formalize these concepts mathematically and validate findings using open-source models and datasets.




Check out the Paper. All credit for this research goes to the researchers of this project.

 

bnew


OMEGA: A Structured Math Benchmark to Probe the Reasoning Limits of LLMs​


By Sajjad Ansari

July 1, 2025

Introduction to Generalization in Mathematical Reasoning


Large-scale language models with long CoT reasoning, such as DeepSeek-R1, have shown good results on Olympiad-level mathematics. However, models trained through Supervised Fine-Tuning or Reinforcement Learning depend on limited techniques, such as repeating known algebra rules or defaulting to coordinate geometry in diagram problems. Since these models follow learned reasoning patterns rather than showing true mathematical creativity, they face challenges with complex tasks that demand original insights. Current math datasets are poorly suited for analyzing math skills that RL models can learn. Large-scale corpora integrate a range of math questions varying in topic and difficulty, making it challenging to isolate specific reasoning skills.

Limitations of Current Mathematical Benchmarks


Existing work on out-of-distribution generalization focuses on handling test distributions that differ from the training data, which is crucial for mathematical reasoning, physical modeling, and financial forecasting. Compositional generalization techniques aim to help models systematically combine learned skills. Researchers have created datasets through various methods to benchmark mathematical abilities, including hiring humans to write problems (GSM8K, MinervaMath), collecting exam questions (AIME, OlympiadBench), and scraping and filtering exam corpora (NuminaMath, BigMath). However, these approaches either lack sufficient challenge for modern LLMs or fail to provide sufficient granularity for analysis.

Introducing OMEGA: A Controlled Benchmark for Reasoning Skills


Researchers from the University of California, Ai2, the University of Washington, and dmodel.ai have proposed OMEGA, a benchmark designed to evaluate three dimensions of Out-of-Distribution generalization, inspired by Boden’s typology of creativity. It creates matched training and test pairs designed to isolate specific reasoning skills across three dimensions: Exploratory, Compositional, and Transformative. OMEGA’s test and train problems are constructed using carefully engineered templates, allowing precise control over diversity, complexity, and the specific reasoning strategies required for solutions. Moreover, it employs 40 templated problem generators across six mathematical domains: arithmetic, algebra, combinatorics, number theory, geometry, and logic & puzzles.
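
A templated generator in this spirit pairs a parameterized problem statement with a programmatic solver, so that complexity can be dialed up or down and matched train/test splits can be produced. The example below is a toy arithmetic template of my own for illustration, not one of OMEGA's 40 generators:

```python
import random

def gen_chained_arithmetic(num_terms, seed=None):
    """Toy template: a chained +/- expression whose length controls complexity.
    Returns (problem_text, ground_truth_answer)."""
    rng = random.Random(seed)
    terms = [rng.randint(1, 99) for _ in range(num_terms)]
    ops = [rng.choice(["+", "-"]) for _ in range(num_terms - 1)]
    expr = str(terms[0]) + "".join(f" {op} {t}" for op, t in zip(ops, terms[1:]))
    return f"Compute {expr}.", eval(expr)   # expr is generated, not user input, so eval is safe

# Matched splits for the exploratory axis: train on low complexity, test on higher complexity.
train_set = [gen_chained_arithmetic(3, seed=i) for i in range(1000)]
test_set = [gen_chained_arithmetic(8, seed=10_000 + i) for i in range(200)]
```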

Evaluation on Frontier LLMs and Reinforcement Learning Setup


Researchers evaluate four frontier models, including DeepSeek-R1, Claude-3.7-Sonnet, OpenAI-o3-mini, and OpenAI-o4-mini, across different complexity levels. For RL generalization experiments, the framework applies the GRPO algorithm on 1,000 training problems using Qwen2.5-7B-Instruct and Qwen2.5-Math-7B models. Exploratory generalization trains on restricted complexity levels and evaluates on higher complexity problems. Compositional generalization involves training models on individual skills in isolation and testing their ability to combine and apply those skills effectively. Transformational generalization trains on conventional solution approaches and evaluates performance on problems that need unconventional strategies.
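
GRPO's core step is computing advantages group-relative: for each prompt, sample a group of rollouts, score them, and normalize each reward against the group mean and standard deviation. A minimal sketch of that normalization (the surrounding clipped policy-gradient update is omitted):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: [num_prompts, rollouts_per_prompt] scalar rewards for sampled responses.
    Returns group-normalized advantages of the same shape."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)
```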



Performance Observations and Model Behavior Patterns


Reasoning LLMs tend to perform worse as problem complexity increases, often finding correct solutions early but spending too many tokens on unnecessary verification. RL applied only on low-complexity problems enhances generalization to medium-complexity problems, with larger gains on in-domain examples than out-of-distribution ones, indicating RL’s effectiveness at reinforcing familiar patterns. For instance, in the Zebra Logic domain, the base model achieves only 30% accuracy. However, RL training increased performance by 61 points on in-domain examples and 53 points on out-of-distribution examples without SFT.

Conclusion: Toward Advancing Transformational Reasoning


In conclusion, researchers introduced OMEGA, a benchmark that isolates and evaluates three axes of out-of-distribution generalization in mathematical reasoning: exploratory, compositional, and transformative. The empirical study reveals three insights: (a) RL fine-tuning significantly improves performance on in-distribution and exploratory generalization tasks, (b) RL’s benefits for compositional tasks are limited, and (c) RL fails to induce genuinely new reasoning patterns. These findings highlight a fundamental limitation: RL can amplify problem-solving breadth and depth, but it falls short in enabling the creative leaps essential for transformational reasoning. Future work should explore curriculum scaffolding and meta-reasoning controllers.




Check out the Paper, Project Page, and GitHub Page. All credit for this research goes to the researchers of this project.

 

bnew


Google AI Releases Gemma 3n: A Compact Multimodal Model Built for Edge Deployment​


By Asif Razzaq

June 26, 2025

Google has introduced Gemma 3n, a new addition to its family of open models, designed to bring large multimodal AI capabilities to edge devices. Built from the ground up with a mobile-first design philosophy, Gemma 3n can process and understand text, images, audio, and video on-device, without relying on cloud compute. This architecture represents a significant leap in the direction of privacy-preserving, real-time AI experiences across devices like smartphones, wearables, and smart cameras.

Key Technical Highlights of Gemma 3n


The Gemma 3n series includes two versions: Gemma 3n E2B and Gemma 3n E4B, optimized to deliver performance on par with traditional 5B and 8B parameter models respectively, while utilizing fewer resources. These models integrate architectural innovations that drastically reduce memory and power requirements, enabling high-quality inference locally on edge hardware.

  • Multimodal Capabilities: Gemma 3n supports multimodal understanding in 35 languages, and text-only tasks in over 140 languages.
  • Reasoning Proficiency: The E4B variant breaks the 1300 score barrier on the LMArena leaderboard, a first for sub-10B parameter models.
  • High Efficiency: The model’s compact architecture allows it to operate with less than half the memory footprint of comparable models, while retaining high quality across use cases.



Model Variants and Performance


  • Gemma 3n E2B: Designed for high efficiency on devices with limited resources. Performs like a 5B model while consuming less energy.
  • Gemma 3n E4B: A high-performance variant that matches or exceeds 8B-class models in benchmarks. It is the first model under 10B parameters to surpass a 1300 score on LMArena.



Both models are fine-tuned for:

  • Complex math, coding, and logical reasoning tasks
  • Advanced vision-language interactions (image captioning, visual Q&A)
  • Real-time speech and video understanding



Developer-Centric Design and Open Access


Google has made Gemma 3n available through platforms like Hugging Face with preconfigured training checkpoints and APIs. Developers can easily fine-tune or deploy the models across hardware, thanks to compatibility with TensorFlow Lite, ONNX, and NVIDIA TensorRT.



The official developer guide provides support for implementing Gemma 3n into diverse applications, including:

  • Environment-aware accessibility tools
  • Intelligent personal assistants
  • AR/VR real-time interpreters

Applications at the Edge


Gemma 3n opens new possibilities for edge-native intelligent applications:

  • On-device accessibility: Real-time captioning and environment-aware narration for users with hearing or vision impairments
  • Interactive education: Apps that combine text, images, and audio to enable rich, immersive learning experiences
  • Autonomous vision systems: Smart cameras that interpret motion, object presence, and voice context without sending data to the cloud

These features make Gemma 3n a strong candidate for privacy-first AI deployments, where sensitive user data never leaves the local device.



Training and Optimization Insights


Gemma 3n was trained using a robust, curated multimodal dataset combining text, images, audio, and video sequences. Leveraging data-efficient fine-tuning strategies, Google ensured that the model maintained high generalization even with a relatively smaller parameter count. Innovations in transformer block design, attention sparsity, and token routing further improved runtime efficiency.

Why Gemma 3n Matters


Gemma 3n signals a shift in how foundational models are built and deployed. Instead of pushing toward ever-larger model sizes, it focuses on:

  • Architecture-driven efficiency
  • Multimodal comprehension
  • Deployment portability

It aligns with Google’s broader vision for on-device AI: smarter, faster, more private, and universally accessible. For developers and enterprises, this means AI that runs on commodity hardware while delivering the sophistication of cloud-scale models.

Conclusion


With the launch of Gemma 3n, Google is not just releasing another foundation model; it is redefining the infrastructure of intelligent computing at the edge. The availability of E2B and E4B variants provides flexibility for both lightweight mobile applications and high-performance edge AI tasks. As multimodal interfaces become the norm, Gemma 3n stands out as a practical and powerful foundation model optimized for real-world usage.




Check out the Technical details, Models on Hugging Face, and Try it on Google Studio. All credit for this research goes to the researchers of this project.

 

bnew


Google DeepMind Releases AlphaGenome: A Deep Learning Model that can more Comprehensively Predict the Impact of Single Variants or Mutations in DNA​


By Asif Razzaq

June 26, 2025

A Unified Deep Learning Model to Understand the Genome


Google DeepMind has unveiled AlphaGenome, a new deep learning framework designed to predict the regulatory consequences of DNA sequence variations across a wide spectrum of biological modalities. AlphaGenome stands out by accepting long DNA sequences of up to 1 megabase and outputting high-resolution predictions, such as base-level splicing events, chromatin accessibility, gene expression, and transcription factor binding.

Built to address limitations in earlier models, AlphaGenome bridges the gap between long-sequence input processing and nucleotide-level output precision. It unifies predictive tasks across 11 output modalities and handles over 5,000 human genomic tracks and 1,000+ mouse tracks. This level of multimodal capability positions AlphaGenome as one of the most comprehensive sequence-to-function models in genomics.

Technical Architecture and Training Methodology


AlphaGenome adopts a U-Net-style architecture with a transformer core. It processes DNA sequences in 131kb parallelized chunks across TPUv3 devices, enabling context-aware, base-pair-resolution predictions. The architecture uses two-dimensional embeddings for spatial interaction modeling (e.g., contact maps) and one-dimensional embeddings for linear genomic tasks.

Training involved two stages:



  1. Pre-training: using fold-specific and all-folds models to predict from observed experimental tracks.
  2. Distillation: a student model learns from teacher models to deliver consistent and efficient predictions, enabling fast inference (~1 second per variant) on GPUs like the NVIDIA H100.



Performance Across Benchmarks


AlphaGenome was rigorously benchmarked against specialized and multimodal models across 24 genome track and 26 variant effect prediction tasks. It outperformed or matched state-of-the-art models in 22/24 and 24/26 evaluations, respectively. In splicing, gene expression, and chromatin-related tasks, it consistently surpassed specialized models like SpliceAI, Borzoi, and ChromBPNet.

For instance:

  • Splicing: AlphaGenome is the first to simultaneously model splice sites, splice site usage, and splice junctions at 1 bp resolution. It outperformed Pangolin and SpliceAI on 6 of 7 benchmarks.
  • eQTL prediction: The model achieved a 25.5% relative improvement in direction-of-effect prediction compared to Borzoi.
  • Chromatin accessibility: It demonstrated strong correlation with DNase-seq and ATAC-seq experimental data, outperforming ChromBPNet by 8-19%.



Variant Effect Prediction from Sequence Alone


One of AlphaGenome’s key strengths lies in variant effect prediction (VEP). It handles zero-shot and supervised VEP tasks without relying on population genetics data, making it robust for rare variants and distal regulatory regions. With a single inference, AlphaGenome evaluates how a mutation may impact splicing patterns, expression levels, and chromatin state, all in a multimodal fashion.

The model’s ability to reproduce clinically observed splicing disruptions, such as exon skipping or novel junction formation, illustrates its utility in diagnosing rare genetic diseases. It accurately modeled the effects of a 4bp deletion in the DLG1 gene observed in GTEx samples.
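
Conceptually, sequence-based variant effect prediction is a two-pass comparison: run the model on the reference window and on the same window carrying the alternate allele, then difference the predicted tracks. A hedged sketch with a hypothetical `model.predict` interface returning per-modality arrays (the real AlphaGenome API will differ):

```python
def variant_effect(model, ref_window: str, pos: int, ref: str, alt: str):
    """Score a single variant by differencing model predictions for the reference
    and alternate sequences. Returns a per-modality summary of the predicted shift."""
    assert ref_window[pos:pos + len(ref)] == ref, "window does not match the reference allele"
    alt_window = ref_window[:pos] + alt + ref_window[pos + len(ref):]
    ref_tracks = model.predict(ref_window)   # hypothetical: dict of modality name -> array
    alt_tracks = model.predict(alt_window)
    return {name: (alt_tracks[name] - ref_tracks[name]).mean() for name in ref_tracks}
```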

Application in GWAS Interpretation and Disease Variant Analysis


AlphaGenome aids in interpreting GWAS signals by assigning directionality of variant effects on gene expression. Compared to colocalization methods like COLOC, AlphaGenome provided complementary and broader coverage—resolving 4x more loci in the lowest MAF quintile.

It also demonstrated utility in cancer genomics. When analyzing non-coding mutations upstream of the TAL1 oncogene (linked to T-ALL), AlphaGenome’s predictions matched known epigenomic changes and expression upregulation mechanisms, confirming its ability to assess gain-of-function mutations in regulatory elements.

TL;DR


AlphaGenome by Google DeepMind is a powerful deep learning model that predicts the effects of DNA mutations across multiple regulatory modalities at base-pair resolution. It combines long-range sequence modeling, multimodal prediction, and high-resolution output in a unified architecture. Outperforming specialized and generalist models across 50 benchmarks, AlphaGenome significantly improves the interpretation of non-coding genetic variants and is now available in preview to support genomics research worldwide.




Check out the Paper, Technical details, and GitHub Page. All credit for this research goes to the researchers of this project.

 

bnew


Shanghai Jiao Tong Researchers Propose OctoThinker for Reinforcement Learning-Scalable LLM Development​


By Sajjad Ansari

July 2, 2025

Introduction: Reinforcement Learning Progress through Chain-of-Thought Prompting



LLMs have shown excellent progress in complex reasoning tasks through CoT prompting combined with large-scale reinforcement learning (RL). Models like Deepseek-R1-Zero have shown strong reasoning capabilities by applying RL directly to base models. Similarly, methods such as SimpleRL and Open-ReasonerZero show improvements in smaller models like the Qwen series. However, achieving success across different base model families remains a challenge. Moreover, applying R1-Zero-style training to base models such as the Llama series faces difficulty, posing a fundamental question about the underlying factors that lead different base models to behave inconsistently during reinforcement learning.

Limitations of RL Scaling on Llama Models



Large-scale RL has driven advances in models like OpenAI’s o1 and o3 and DeepSeek’s R1 on competition-level mathematics problems, motivating the exploration of RL on smaller models with fewer than 100B parameters. However, these successes are largely limited to the Qwen model family, and replicating the results on families such as Llama has proven difficult. The lack of transparency in pre-training pipelines has made it hard to understand how pre-training influences RL scaling. This has prompted unconventional studies, which found that one-shot prompting improves reasoning in Qwen but offers little benefit in Llama. Efforts to curate high-quality mathematical pre-training corpora through projects like OpenWebMath, MathPile, InfiMM-Web-Math, and FineMath have made progress but remain limited in scale, at under 100B tokens.



Exploring Mid-Training with Stable-then-Decay Strategy



Researchers from Shanghai Jiao Tong University investigate how mid-training strategies shape RL dynamics, focusing on Qwen and Llama. The study presents several insights. First, high-quality mathematical corpora such as MegaMath-Web-Pro boost both base-model and RL outcomes. Second, using QA-style data, especially examples with long CoT reasoning, further enhances RL results. Third, long CoT introduces verbosity and instability in RL training. Lastly, scaling up mid-training results in stronger downstream RL performance. The researchers introduce a two-stage mid-training strategy called Stable-then-Decay, in which base models are first trained on 200B tokens and then on 20B tokens across three CoT-focused branches, resulting in the OctoThinker models, which show strong RL compatibility.

RL Configuration and Benchmark Evaluation



Researchers use the MATH8K dataset for RL training prompts. The configuration includes a global training batch size of 128, 16 rollout responses per query, and a PPO mini-batch size of 64, with experiments conducted on Llama-3.2-3B-Base and Qwen2.5-3B-Base models. For evaluation, few-shot prompting is used for base language models, and zero-shot for RL-tuned models across indicator tasks, including GSM8K, MATH500, OlympiadBench, and AMC23. During RL training, Qwen models exhibit increasing response lengths that remain reasonable throughout, whereas Llama displays abnormal behavior, with average response lengths escalating to 4,096 tokens. Evaluation further reveals that RL-tuned Qwen2.5-3B achieves improvements across benchmarks, while Llama-3.2-3B shows only marginal gains.
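
For readers who want the setup at a glance, the reported RL configuration can be captured in a small config dict; the field names are mine, while the values come from the paragraph above.

```python
rl_config = {
    "algorithm": "GRPO",
    "train_prompts": "MATH8K",
    "global_batch_size": 128,
    "rollouts_per_query": 16,
    "ppo_mini_batch_size": 64,
    "base_models": ["Llama-3.2-3B-Base", "Qwen2.5-3B-Base"],
    "eval_benchmarks": ["GSM8K", "MATH500", "OlympiadBench", "AMC23"],
    "eval_prompting": {"base_models": "few-shot", "rl_tuned_models": "zero-shot"},
}
```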



OctoThinker Outperforms Llama in RL Compatibility



Each OctoThinker branch demonstrates a 10%-20% improvement over the original Llama base model and consistent gains over the stable-stage model across all sizes when evaluated on 13 mathematical benchmarks. The OctoThinker-Zero families reveal diverse thinking behaviors during RL scaling, with strong performance from the OctoThinker-Long variant. When comparing three 3B-scale base models during RL training, OctoThinker-Long-3B outperforms the original Llama-3.2-3B model and reaches performance parity with Qwen2.5-3B, a model known for strong reasoning capabilities and extensive pre-training. The hybrid and short branches show slightly lower performance, especially on challenging benchmarks.

Conclusion and Future Work: Toward RL-Ready Foundation Models



This paper investigates why base models such as Llama and Qwen exhibit divergent behaviors during RL for reasoning, showing that mid-training plays a major role in RL scalability. The two-stage mid-training strategy transforms Llama into a foundation model better suited for RL, resulting in OctoThinker models. Future research directions include:

  • Curating higher-quality mathematical corpora to improve mid-training.
  • Creating RL-friendly base models using open recipes without distillation from long CoT reasoning models.
  • Separating the QA format and content to understand their contributions individually.
  • Expanding the OctoThinker family with new branches, such as tool-integrated reasoning.




Check out the Paper, Hugging Face Page, and GitHub Page. All credit for this research goes to the researchers of this project.

 

bnew


1/16
@rohanpaul_ai
Brilliant Memory framework proposed in this paper.

MemOS makes remembering a first‑class system call.

LLMs forget stuff fast and retraining them costs a fortune.

MemOS treats memories like files in an operating system, letting a model write, move, and retire knowledge on the fly, not just while training.

It packs every fact or state into a MemCube, tags it with who wrote it and when, then the scheduler moves that cube between plain text, GPU cache, or tiny weight patches depending on use.

On the LOCOMO benchmark the system reaches 73.31 LLM-Judge average, roughly 9 points above the next best memory system and it stays ahead on hard multi-hop and temporal questions.

Even while juggling about 1500 memory tokens, it matches full-context accuracy yet keeps latency in the same ballpark as slimmer baselines.

Switching hot cubes into KV-cache cuts first-token wait by 91.4% on the Qwen2.5-72B test without changing any output text.

Overall, the findings show that a memory-as-OS approach boosts reasoning quality, trims latency, and bakes in audit and version control all at once.

🧵 Read on 👇



2/16
@rohanpaul_ai
🧠 Why memory got messy

Most models squeeze everything into billions of frozen weights, so updating even 1 fact needs a full fine‑tune.

Context windows help for a moment, yet they vanish after the next prompt, and retrieval pipelines bolt on extra text without tracking versions or ownership.

Figure 1 on page 2 shows MemOS beating older fixes across single‑hop, multi‑hop, open‑domain, and temporal questions, which hints that raw parameter tweaks or plain RAG were never enough.



3/16
@rohanpaul_ai
📦 What a MemCube holds
A MemCube wraps the actual memory plus metadata like owner, timestamp, priority, and access rules.

That wrapper works for 3 shapes of memory, plaintext snippets, activation tensors sitting in the KV‑cache, and low‑rank parameter patches.

Because every cube logs who touched it and why, the scheduler can bump hot cubes into GPU cache or chill cold ones in archival storage without losing the audit trail.
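
To picture a MemCube concretely, here is a tiny dataclass sketch based only on the fields this thread mentions (payload type, owner, timestamp, priority, access rules, audit trail); the real MemOS schema is certainly richer.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, List

class MemoryKind(Enum):
    PLAINTEXT = "plaintext"        # retrievable text snippets
    ACTIVATION = "activation"      # KV-cache tensors for instant reuse
    PARAMETER = "parameter_patch"  # low-rank weight patches

@dataclass
class MemCube:
    payload: Any                   # the memory itself, in one of the three forms above
    kind: MemoryKind
    owner: str                     # who wrote it
    created_at: float              # when it was written (unix timestamp)
    priority: int = 0              # scheduler hint: hot cubes get promoted to GPU cache
    access_rules: List[str] = field(default_factory=list)  # governance metadata
    history: List[str] = field(default_factory=list)       # audit trail of reads and moves
```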



4/16
@rohanpaul_ai
🏗️ Three layers doing the heavy lifting

The interface layer turns a user’s chat into structured MemoryAPI calls, so a question about “last year’s check‑ups” becomes a time‑scoped query.

The operation layer runs MemScheduler, MemOperator, and MemLifecycle to pick cubes, fuse overlaps, and mark those cubes as activated, merged, or archived.

The infrastructure layer guards cubes with MemGovernance, ships them through MemLoader / MemDumper, and parks them in MemVault, which can be a vector store, graph DB, or blob bucket.



5/16
@rohanpaul_ai
🔄 Scheduler keeps memories fresh

MemScheduler decides which cube lands where.

High‑hit plaintext converts into activation tensors for instant reuse, and stable activation patterns finally distill into parameter patches for zero prompt overhead.

Old cubes slide the other way, turning pricey weights into cheap text once they stop earning hits.



6/16
@rohanpaul_ai
📊 Numbers that prove the point

On the LOCOMO benchmark MemOS posts an LLM‑Judge score of 73.31, topping the next best by roughly 9 points while holding a similar latency budget.

The bar chart on page 2 shows especially wide gaps in multi‑hop and temporal reasoning, areas that crumble when context slips.



7/16
@rohanpaul_ai
⚡ KV tricks to cut wait time

MemScheduler pre‑bakes popular cubes into KV‑cache entries so the model skips encoder work.

For the Qwen2.5‑72B test, first‑token latency drops from 1.79 s to 0.15 s, a 91% cut, and the output text stays byte‑for‑byte identical.



8/16
@rohanpaul_ai
Paper – MemOS: A Memory OS for AI System

Paper Title: "MemOS: A Memory OS for AI System"

9/16
@lux
just wanted to say thanks so much for posting all these great papers, so many I would miss otherwise!

10/16
@rohanpaul_ai
thanks buddy for the kind words, and great to know that. 👊👊

11/16
@innerly_ai
memories as files? yeah that's a wild trip into our own brains. imagine the mess we’re holding onto

12/16
@anthara_ai
Impressive approach to memory management. Deep impact on reasoning quality and latency.

13/16
@St33lMouse
Memory is the key. Crack the problem and we get EVERYTHING. Individuality and personality. Continuous learning. Instant expertise. Personalized AI.

I could imagine we have smaller models with reasoning at the core and interchangeable memory modules at the periphery.

Combine this with piecemeal processing of large context windows and we might get small models running on consumer hardware capable of learning, solving large problems, and differentiating themselves based on what they're experiencing.

Memory modules could be excluded to make the system manageable. Don't need an AI versed in complex number theory and biology? Don't include those modules. Suddenly need an expert doctor? Add the medical module.

14/16
@_vonarchimboldi
/Samhanknr /twst12612648

15/16
@Trakintelai
MemOS is a smart leap, treating LLM memory like an OS manages files cuts retraining costs and boosts efficiency.

16/16
@tooliense
wow amazing memory, does the model preserve its capability on other benchmarks?

Thanks for introducing interesting works


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew


1/2
@rohanpaul_ai
🩺 Google Research releases MedGemma 27B, multimodal health-AI models that run on 1 GPU

MedGemma 27B multimodal extends the earlier 4B multimodal and 27B text-only models by adding vision capabilities to a 27B-parameter language core.

Training added 2 new datasets, EHRQA and Chest ImaGenome, so the model can read longitudinal electronic health records and localize anatomy in chest X-rays.

The report states that this larger multimodal variant inherits every skill of the 4B model while markedly improving language fluency, EHR reasoning, and visual grounding.

The 4B variant clocks 64.4% on MedQA and produces chest X-ray reports that pass radiologist validation 81% of the time, while the 27B text model scores 87.7% at about 10% of DeepSeek R1's cost.

MedGemma fuses a Gemma-3 language core with the MedSigLIP vision encoder, letting one network reason across scans and notes. MedSigLIP unifies radiology, dermatology, retina images into one shared embedding space.

Because MedSigLIP is released separately, developers can plug it into classification, retrieval, or search pipelines that need structured outputs, while reserving MedGemma for free-text generation such as report writing or visual question answering.
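
As a toy example of that classification/retrieval use, here is a cosine-similarity search over image embeddings. The vectors are random placeholders standing in for MedSigLIP outputs, since the model's exact inference API isn't quoted in this post.

```python
# Minimal retrieval sketch: the random vectors below are placeholders for
# MedSigLIP embeddings of a query X-ray and an indexed archive of prior studies.
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

rng = np.random.default_rng(0)
archive = rng.normal(size=(1000, 768))   # 1000 prior studies, 768-dim embeddings (dimension assumed)
query = rng.normal(size=(1, 768))        # new chest X-ray embedding

scores = cosine_sim(query, archive)[0]
top5 = np.argsort(scores)[::-1][:5]
print("Most similar prior studies:", top5)
```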

Both models load on a single GPU, and the 4B versions even run on mobile-class hardware, which lowers cost and eases on-premise deployment where data privacy is critical.

Simple fine-tuning lifts the 4B chest-X-ray RadGraph F1 to 30.3, showing headroom for domain-specific tuning.

Because weights are frozen and local, hospitals gain privacy, reproducibility, and full control compared with remote APIs.

Gvc2YRYXoAIRXYW.jpg

GvbzLidW8AA5w2Q.jpg


2/2
@rohanpaul_ai
The picture sorts the data first. On top you see 4 imaging streams—radiology, dermatology, digital pathology, ophthalmology—and 1 medical-text stream. Each arrow shows how those sources feed the rest of the stack.

The images go through MedSigLIP, a vision encoder that turns each scan or photo into a compact vector the language models can read.

Those vectors flow into MedGemma 4B Multimodal, a 4B-parameter model that handles both pictures and words in a single forward pass.

For text-only work there is a larger 27B-parameter MedGemma model that skips the image part and focuses on language reasoning.

Gvc7PzLakAAyoiY.png



To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,719
Reputation
10,592
Daps
185,778

1/12
@rohanpaul_ai
I'm playing with Kimi-Researcher from /Kimi_Moonshot , and it's delivering an unexpectedly excellent report! 🧠

It's a multi-turn search and reasoning AI agent from China competing head-to-head with OpenAI's Deep Research. And it's free.

🌐 Checks more than 200 URLs for each task.

⏰ Context-aware, very long-horizon reasoning

🛠️ Runs on an internal Kimi k-series backbone

🎮 Learns entirely through end-to-end agentic RL

🧠 Averages about 23 reasoning steps per query

Overall, what I really like is that it turns it all into a clean, visual report that actually makes sense. Smart, solid, reliable.

It has proven that an LLM can train itself with reinforcement learning to plan, search, and code, reaching 26.9% pass@1 on Humanity’s Last Exam after starting at 8.6%, beating many supervised‑finetuned models.

Benchmarks:

🏆 Achieves 26.9% pass@1 on Humanity’s Last Exam, top of the board

📈 Scores 69% pass@1 on xbench-DeepSearch, edging past o3 with tools

🔍 Delivers solid results on FRAMES, Seal-0, and SimpleQA

Key takeaway

🔥 Shows that self-rewarded training can mature planning, search, and coding in one loop

📚 High‑quality agent datasets are rare, so the team generated their own.

They built tool‑centric challenges that force real tool use and hard reasoning prompts that need iterative search. An automated pipeline synthesized question‑answer pairs, verified ground truth, and filtered out trivial or noisy examples to scale data without manual labeling.

🏗️ Context spills were a major pain point. A learned memory policy keeps only useful snippets and discards the rest, letting a single conversation run 50+ turns without hitting context limits.
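
For a feel of what such a policy does, here is a toy context-pruning routine. The scores and the word-count token estimate are made up; the learned policy in Kimi-Researcher is obviously more sophisticated.

```python
# Toy context-pruning sketch; the scoring is a stand-in for the learned memory policy.
def prune_context(snippets: list[tuple[str, float]], budget_tokens: int) -> list[str]:
    """Keep the highest-scoring snippets until the token budget is spent."""
    kept, used = [], 0
    for text, score in sorted(snippets, key=lambda s: s[1], reverse=True):
        n = len(text.split())               # crude token estimate
        if used + n <= budget_tokens:
            kept.append(text)
            used += n
    return kept

history = [("search results for CoreWeave clients", 0.9),
           ("greeting small talk", 0.1),
           ("code output from the sandbox", 0.7)]
print(prune_context(history, budget_tokens=12))   # drops the low-value small talk
```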

🎯 Training stays on‑policy. Tool‑call format guards are switched off, so every trajectory truly reflects model probabilities.

Some negative samples are dropped to prevent probability collapse, and the agent keeps improving for longer runs.

What that means is that, if the trainer kept every badly scored run, the learning rule could shove the odds of many actions all the way to 0, and the model would stop exploring. To avoid that freeze, the pipeline drops a slice of the worst runs. The model still sees plenty of errors, but not enough to wipe out whole branches of its search space.

These tweaks keep the feedback loop stable across long tasks, so the agent keeps improving even when a single job takes dozens of steps.
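
A bare-bones version of that negative-sample filtering, just to show the mechanic. The drop fraction and reward values are illustrative assumptions, not Moonshot's actual numbers.

```python
# Hedged sketch of "drop a slice of the worst runs" before the policy update.
from typing import List, Tuple

def filter_negatives(trajectories: List[Tuple[list, float]], drop_frac: float = 0.2):
    """Keep every positive-reward run, but drop the worst drop_frac of failures so the
    gradient step never sees enough of them to zero out whole branches of the search."""
    positives = [t for t in trajectories if t[1] > 0]
    negatives = sorted((t for t in trajectories if t[1] <= 0), key=lambda t: t[1])
    n_drop = int(len(negatives) * drop_frac)
    return positives + negatives[n_drop:]   # worst runs removed, the rest still teach from errors

batch = [(["plan", "search"], 1.0), (["loop"], -1.0), (["crash"], -3.0), (["stall"], -2.0)]
print(filter_negatives(batch, drop_frac=0.34))   # the -3.0 run is dropped
```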

🧵 1/n Read on

GvbIB2MboAEQ4gG.jpg


2/12
@rohanpaul_ai
🧵 2/n Link to try: Kimi - 会推理解析,能深度思考的AI助手 (Kimi: the AI assistant that can reason, analyze, and think deeply)

Running the agent is straightforward. A user opens the Kimi site, logs in, and toggles the "Researcher" mode.

The user then enters a research query, answers any clarifying follow-up questions, and the agent works for roughly 20-25 minutes.

When the report appears, the text can be copied directly or shared via a link; a download button is not yet available. All usage is free during the public preview, and no hard quota has been announced.

– 200+ URLs
– inline citations
– tool-calling (search + browser + code)

https://video.twimg.com/amplify_video/1942960617314521088/vid/avc1/1920x1080/V9AbUXDkTtEUPYrX.mp4

3/12
@rohanpaul_ai
🧵 3/n

Kimi‑Researcher proves that an agent can learn planning, perception, and precise tool use in one loop.

🌟 After RL, the model averages 23 reasoning steps and checks about 200 URLs per task, reaches 69% pass@1 on xbench‑DeepSearch, and shows habits like cross‑verifying conflicting sources before answering.

GvbIoYiWoAA4BFL.jpg


4/12
@rohanpaul_ai
🧵 4/n

Kimi-Researcher relies on 3 built-in tools: a fast parallel search engine, a text-only browser for interactive sites, and a coding sandbox for data wrangling or analysis.

Together they let the model fetch evidence, run code, and compose structured reports that mix prose, tables, and simple charts inside an interactive page.
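
A toy dispatcher makes the three-tool loop concrete. The tool bodies are stubs and the routing is an assumption about the general agent pattern, not Kimi-Researcher's internals.

```python
# Toy tool-dispatch sketch; the tools are stubs, not Kimi-Researcher's actual internals.
from typing import Callable, Dict

def search(query: str) -> str: return f"[top results for: {query}]"
def browse(url: str) -> str: return f"[text-only render of {url}]"
def run_code(src: str) -> str: return "[stdout of a sandboxed run]"

TOOLS: Dict[str, Callable[[str], str]] = {"search": search, "browser": browse, "code": run_code}

def step(tool_name: str, arg: str) -> str:
    """One agent step: route the model's tool call and return the observation
    that gets appended to the context for the next reasoning step."""
    return TOOLS[tool_name](arg)

print(step("search", "CoreWeave GPUaaS clients"))
```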

GvbJ5rLXsAA6EpB.jpg


5/12
@rohanpaul_ai
🧵 5/n Performance on key research benchmarks

Kimi hits:

– 26.9% pass@1 on Humanity’s Last Exam (vs OpenAI’s ~8%)
– 69% pass@1 on xBench DeepSearch
– Top scores on Seal-0, Frames, SimpleQA
And yes, it's all done by one model. No multi-agent scaffolding or tricks.

GvbJ9jObEAAf864.jpg


6/12
@rohanpaul_ai
🧵 6/n Here are some creative use cases for running deep research tasks with it.

Prompt - “Provide an in-depth and comprehensive analysis of the Golden State Warriors’ salary cap situation for the 2025–2026 NBA season. This should include detailed projections of guaranteed contracts, player options, potential free agents, and dead money on the books. Evaluate the team’s flexibility regarding potential trades, including possible targets, movable contracts, and draft assets. Break down the composition of the roster in terms of strengths, weaknesses, positional depth, age profile, and development potential of young players. Finally, assess the realistic probability of the Warriors mounting a successful championship run in light of their financial constraints, roster construction, and the competitive landscape of the league”

https://www.kimi.com/preview/197bf0e2-e001-861c-96ef-688ebe0005de

https://video.twimg.com/amplify_video/1942962571042000896/vid/avc1/1280x720/TBeFIZ8CPV_EsKXu.mp4

7/12
@rohanpaul_ai
🧵 7/n

Another deep research task

Prompt - “Make a deep research about CoreWeave’s infrastructure expansion and competitive positioning in the GPU-as-a-Service (GPUaaS) market, including key clients, partnerships, and scalability roadmap.”

Kimi - 会推理解析,能深度思考的AI助手

https://video.twimg.com/amplify_video/1942962695642435584/vid/avc1/1280x720/hzt4YGa21OSPs14e.mp4

8/12
@rohanpaul_ai
🧵 8/n Another use case of running a deep research task

Prompt - “Make a deep research about Circle’s post-IPO stock surge and volatility—750% rally, recent pullback, analyst sentiment including Ark trim, and comparison to Coinbase movement.”

https://video.twimg.com/amplify_video/1942962873308696577/vid/avc1/1280x720/Ya2RlrsZgBCwvCrk.mp4

9/12
@rohanpaul_ai
🧵 9/n

Another use case

Prompt - “Analyze Tesla’s recent executive shake‑up—including the firing of regional head Omead Afshar—under the strain of Q2 sales decline, European market share drop, and internal ‘Tesla Takedown’ protests.”

Kimi - 会推理解析,能深度思考的AI助手

https://video.twimg.com/amplify_video/1942962963905679360/vid/avc1/1280x720/ERPuC3_Hgti0fQzK.mp4

10/12
@rohanpaul_ai
🧵 10/n

Another use case

Prompt - “Core PCE hits lowest level since 2020 — implications for inflation"

https://www.kimi.com/preview/d1h2smef5ku9dv185g40?blockId=46

https://video.twimg.com/amplify_video/1942963994001850368/vid/avc1/1920x1080/i2qnExBs5Ug-9nhd.mp4

11/12
@rohanpaul_ai
Link to try: Kimi - 会推理解析,能深度思考的AI助手

Kimi-Researcher is beginning its gradual rollout. Join the waitlist here.

Apply for Kimi Researcher

🔗 Blog: Kimi-Researcher: End-to-End RL Training for Emerging Agentic Capabilities

12/12
@ApollonVisual
The results of the research are solid content-wise and accuracy-wise. But it was very slow (o3 pro deep research with an extended research prompt fetched results nearly 7 minutes earlier) for the same exact topic/prompt. But it's free and an interesting alternative to established deep research agents


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,719
Reputation
10,592
Daps
185,778

1/3
@rohanpaul_ai
So /xAI 's /grok 4 really did hit 44.4% on HLE (Humanity's Last Exam) 🤯

---

(HLE holds 2,500 expert-written questions spanning more than 100 subjects, including math, physics, computer science and humanities, and 14% of them mix text with images.
The authors deliberately built in anti-gaming safeguards and hid a private question set so that simply memorising answers will not help a model.)

GveECKnaEAAR9aF.jpg


2/3
@rohanpaul_ai
Grok 4 brings huge upgrades to voice conversations and introduces new voices, like Eve, capable of rich emotions.

GveZPBVbIAAz3Fv.jpg


3/3
@NavnNavn248469
Android users on suicide watch


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196


1/3
@rohanpaul_ai
So /xAI 's /grok 4 really did hit 44.4% on HLE (Humanity's Last Exam) 🤯

---

(HLE holds 2,500 expert-written questions spanning more than 100 subjects, including math, physics, computer science and humanities, and 14% of them mix text with images.
The authors deliberately built in anti-gaming safeguards and hid a private question set so that simply memorising answers will not help a model.)

GveECKnaEAAR9aF.jpg


2/3
@rohanpaul_ai
Grok 4 is now the leading AI model on Artificial Analysis Intelligence Index.

Achieves 73, ahead of OpenAI o3 at 70, Google Gemini 2.5 Pro at 70, Anthropic Claude 4 Opus at 64 and DeepSeek R1 0528 at 68. Full results breakdown below.

GveL_dBXMAE09VW.jpg

Gvd9nWIakAULlB9.jpg


3/3
@dh7net
More proof that these leaderboards no longer correlate with user needs.


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196



1/37
@ArtificialAnlys
xAI gave us early access to Grok 4 - and the results are in. Grok 4 is now the leading AI model.

We have run our full suite of benchmarks and Grok 4 achieves an Artificial Analysis Intelligence Index of 73, ahead of OpenAI o3 at 70, Google Gemini 2.5 Pro at 70, Anthropic Claude 4 Opus at 64 and DeepSeek R1 0528 at 68. Full results breakdown below.

This is the first time that /elonmusk's /xai has taken the lead at the AI frontier. Grok 3 scored competitively with the latest models from OpenAI, Anthropic and Google - but Grok 4 is the first time that our Intelligence Index has shown xAI in first place.

We tested Grok 4 via the xAI API. The version of Grok 4 deployed for use on X/Twitter may be different to the model available via API. Consumer application versions of LLMs typically have instructions and logic around the models that can change style and behavior.

Grok 4 is a reasoning model, meaning it ‘thinks’ before answering. The xAI API does not share reasoning tokens generated by the model.

Grok 4’s pricing is equivalent to Grok 3 at $3/$15 per 1M input/output tokens ($0.75 per 1M cached input tokens). The per-token pricing is identical to Claude 4 Sonnet, but more expensive than Gemini 2.5 Pro ($1.25/$10, for <200K input tokens) and o3 ($2/$8, after recent price decrease). We expect Grok 4 to be available via the xAI API, via the Grok chatbot on X, and potentially via Microsoft Azure AI Foundry (Grok 3 and Grok 3 mini are currently available on Azure).
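
For a sense of scale, a quick back-of-the-envelope cost calculation at the quoted Grok 4 prices; the token counts are invented for illustration.

```python
# Cost arithmetic at the prices quoted above (token counts are made-up examples).
IN_PRICE, OUT_PRICE = 3.00, 15.00            # USD per 1M input/output tokens
prompt_tokens, completion_tokens = 12_000, 3_500

cost = prompt_tokens / 1e6 * IN_PRICE + completion_tokens / 1e6 * OUT_PRICE
print(f"${cost:.4f} per request")            # -> $0.0885 per request
```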

Key benchmarking results:
➤ Grok 4 leads in not only our Artificial Analysis Intelligence Index but also our Coding Index (LiveCodeBench & SciCode) and Math Index (AIME24 & MATH-500)
➤ All-time high score in GPQA Diamond of 88%, representing a leap from Gemini 2.5 Pro’s previous record of 84%
➤ All-time high score in Humanity’s Last Exam of 24%, beating Gemini 2.5 Pro’s previous all-time high score of 21%. Note that our benchmark suite uses the original HLE dataset (Jan '25) and runs the text-only subset with no tools
➤ Joint highest score for MMLU-Pro and AIME 2024 of 87% and 94% respectively
➤ Speed: 75 output tokens/s, slower than o3 (188 tokens/s), Gemini 2.5 Pro (142 tokens/s), Claude 4 Sonnet Thinking (85 tokens/s) but faster than Claude 4 Opus Thinking (66 tokens/s)

Other key information:
➤ 256k token context window. This is below Gemini 2.5 Pro’s context window of 1 million tokens, but ahead of Claude 4 Sonnet and Claude 4 Opus (200k tokens), o3 (200k tokens) and R1 0528 (128k tokens)
➤ Supports text and image input
➤ Supports function calling and structured outputs

See below for further analysis 👇

Gvd9nWIakAULlB9.jpg


2/37
@ArtificialAnlys
Grok 4 scores higher in Artificial Analysis Intelligence Index than any other model. Its pricing is higher than OpenAI’s o3, Google’s Gemini 2.5 Pro and Anthropic’s Claude 4 Sonnet - but lower than Anthropic’s Claude 4 Opus and OpenAI’s o3-pro.

GveETjzb0AAMaSW.jpg


3/37
@ArtificialAnlys
Full set of intelligence benchmarks that we have run independently on xAI’s Grok 4 API:

GveEaIZWwAAn6jn.jpg

GveEa69XoAQtWRo.jpg

GveEb6MbYAAWnzl.jpg


4/37
@ArtificialAnlys
Grok 4 recorded slightly higher output token usage compared to peer models when running the Artificial Analysis Intelligence Index. This translates to higher cost relative to its per token price.

GveEhybWQAArU7z.jpg

GveEjOjW8AAoZlX.jpg


5/37
@ArtificialAnlys
xAI’s API is serving Grok 4 at 75 tokens/s. This is slower than o3 (188 tokens/s) but faster than Claude 4 Opus Thinking (66 tokens/s).

GveEntwW4AASCVx.jpg


6/37
@ArtificialAnlys
Grok 4 is now live on Artificial Analysis: http://artificialanalysis.ai

7/37
@Evantaged
Is this Grok 4 Heavy or base??

8/37
@ArtificialAnlys
Base, with no tools. We have not tested Grok 4 Heavy yet.

9/37
@Elkins
🔨⏰

10/37
@AuroraHoX
😎👍

11/37
@tetsuoai
Honestly it's so good!

12/37
@rozer100x
interesting

13/37
@ianksnow1
It’s truly a rockstar. Light years better than the previous model and based on my early interactions perhaps leapfrogged every other frontier model.

14/37
@VibeEdgeAI
It's impressive to see Grok 4 leading the pack with a 73 on the Artificial Analysis Intelligence Index, especially with its strong performance in coding and math benchmarks.

However, the recent hate speech controversy is a sobering reminder of the ethical challenges AI development faces.

Balancing innovation with responsibility will be key as xAI moves forward-hopefully, these issues can be addressed to harness Grok 4's potential for positive impact.

15/37
@XaldwinSealand
Currently Testing Grok 4...

16/37
@MollySOShea


17/37
@0xSweep
might just be the greatest AI innovation of all time

18/37
@HaleemAhmed333
Wow

19/37
@Jeremyybtc
good to have you /grok 4

20/37
@Kriscrichton
🔥🔥🔥🔥

21/37
@ArthurMacwaters
Reality is the best eval

This is where Grok4 impresses me most

GveHxP7aMAAH5RC.jpg


22/37
@Coupon_Printer
I was waiting for your results /ArtificialAnlys !!! Thank you for this

23/37
@TheDevonWayne
so you didn't even get to try grok heavy?

24/37
@_LouiePeters
This is a great and rapid overview!
I think your intelligence benchmarks should start including and up-weighting agent and tool-use scores though; in the real world we want the models to perform as well as possible, which means giving them every tool possible - no need to handicap them by limiting access.

25/37
@shiels_ai
So this isn’t the tool calling model? Wow!

26/37
@joAnneSongs72
YEAH 🎉❤️🎉❤️🎉❤️🎉

27/37
@riddle_sphere
New kid on the block just dethroned the veterans. Silicon Valley’s watching.

28/37
@blockxs
Grok 4: AI champ confirmed

29/37
@SastriVimla
Great

30/37
@neoonai
NeoON > Grok. Right?

31/37
@EricaDXtra
So cool, so good!

32/37
@evahugsyou
Grok 4 just came out on top, and it’s not even a competition anymore. Elon’s team is absolutely killing it!

33/37
@garricn
Just wait till it starts conducting science experiments

34/37
@mukulneetika
Wow!

35/37
@RationalEtienne
Grok 4 is HOLY.

Humanity has created AI that it will merge with.

All Praise Elon for his act of CREATION! 🙏

36/37
@MixxsyLabs
I personally found it better for coding uses than Claude. I'm no expert, but when I needed a tool, that's the one I started going back to after using a few for code snippets and assistance

37/37
@codewithimanshu
Interesting, perhaps true intelligence lies beyond benchmarks.


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

No1

Retired.
Supporter
Joined
Apr 30, 2012
Messages
31,858
Reputation
5,352
Daps
72,199
I wonder when we’ll be able to create longer videos with consistency.
 