bnew

Veteran
Joined
Nov 1, 2015
Messages
68,724
Reputation
10,592
Daps
185,801

Highlighted at CVPR 2025: Google DeepMind’s ‘Motion Prompting’ Paper Unlocks Granular Video Control


By Jean-marc Mommessin

June 13, 2025

Key Takeaways:


  • Researchers from Google DeepMind, the University of Michigan, and Brown University have developed “Motion Prompting,” a new method for controlling video generation using specific motion trajectories.
  • The technique uses “motion prompts,” a flexible representation of movement that can be either sparse or dense, to guide a pre-trained video diffusion model.
  • A key innovation is “motion prompt expansion,” which translates high-level user requests, like mouse drags, into detailed motion instructions for the model.
  • This single, unified model can perform a wide array of tasks, including precise object and camera control, motion transfer from one video to another, and interactive image editing, without needing to be retrained for each specific capability.

As generative AI continues to evolve, gaining precise control over video creation is a critical hurdle for its widespread adoption in markets like advertising, filmmaking, and interactive entertainment. While text prompts have been the primary method of control, they often fall short in specifying the nuanced, dynamic movements that make video compelling. A new paper, presented and highlighted at CVPR 2025, from Google DeepMind, the University of Michigan, and Brown University introduces a groundbreaking solution called “Motion Prompting,” which offers an unprecedented level of control by allowing users to direct the action in a video using motion trajectories.

This new approach moves beyond the limitations of text, which struggles to describe complex movements accurately. For instance, a prompt like “a bear quickly turns its head” is open to countless interpretations. How fast is “quickly”? What is the exact path of the head’s movement? Motion Prompting addresses this by allowing creators to define the motion itself, opening the door for more expressive and intentional video content.


(Demo figure. Note: the results are not real time; ~10 min processing time.)

Introducing Motion Prompts


At the core of this research is the concept of a “motion prompt.” The researchers identified that spatio-temporally sparse or dense motion trajectories—essentially tracking the movement of points over time—are an ideal way to represent any kind of motion. This flexible format can capture anything from the subtle flutter of hair to complex camera movements.

To enable this, the team trained a ControlNet adapter on top of a powerful, pre-trained video diffusion model called Lumiere. The ControlNet was trained on a massive internal dataset of 2.2 million videos, each with detailed motion tracks extracted by an algorithm called BootsTAP. This diverse training allows the model to understand and generate a vast range of motions without specialized engineering for each task.
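
The paper does not ship reference code, but the conditioning idea is easy to picture: point tracks become a spatio-temporal volume a ControlNet can read. A minimal NumPy sketch; the array shapes and the per-track identity embedding are illustrative assumptions, not the paper's exact encoding.

Code:
import numpy as np

def rasterize_tracks(tracks, visible, T, H, W, C=3):
    """Encode point tracks into a (T, H, W, C) conditioning volume.
    tracks:  (N, T, 2) (x, y) positions of each tracked point per frame
    visible: (N, T) visibility flags
    Every visible point writes its (toy) identity embedding at its location,
    giving a downstream ControlNet a map of what moves where."""
    cond = np.zeros((T, H, W, C), dtype=np.float32)
    emb = np.random.default_rng(0).normal(size=(len(tracks), C)).astype(np.float32)
    for n in range(len(tracks)):
        for t in range(T):
            x, y = tracks[n, t]
            if visible[n, t] and 0 <= int(x) < W and 0 <= int(y) < H:
                cond[t, int(y), int(x)] = emb[n]
    return cond

# One track dragging a point diagonally across a 16-frame clip.
T, H, W = 16, 64, 64
tracks = np.linspace(5, 50, T).repeat(2).reshape(1, T, 2)
visible = np.ones((1, T), dtype=bool)
print(rasterize_tracks(tracks, visible, T, H, W).shape)  # (16, 64, 64, 3)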




From Simple Clicks to Complex Scenes: Motion Prompt Expansion


Because specifying every point of motion for a complex scene would be impractical for a user, the researchers developed a process they call “motion prompt expansion.” This clever system translates simple, high-level user inputs into the detailed, semi-dense motion prompts the model needs.
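
One way to picture the expansion for the mouse-drag case: take the cursor path and give every first-frame point near the drag a distance-weighted copy of its displacement. A toy sketch; the radius, Gaussian falloff, and anchor grid are assumptions for illustration.

Code:
import numpy as np

def expand_drag(drag_path, anchors, radius=12.0):
    """Expand a single mouse drag into trajectories for nearby points.
    drag_path: (T, 2) cursor positions over time
    anchors:   (N, 2) points sampled on the first frame (e.g., a grid)
    Points near the drag start inherit the cursor's displacement with a
    soft distance falloff; far-away points stay put."""
    start = drag_path[0]
    disp = drag_path - start                      # (T, 2) cursor displacement
    d = np.linalg.norm(anchors - start, axis=-1)  # anchor distance to drag start
    w = np.exp(-(d / radius) ** 2)                # falloff weight per anchor
    return anchors[None] + w[None, :, None] * disp[:, None, :]  # (T, N, 2)

T = 16
drag = np.stack([np.linspace(20, 40, T), np.full(T, 32.0)], axis=-1)
grid = np.stack(np.meshgrid(np.arange(0, 64, 8), np.arange(0, 64, 8)), -1)
tracks = expand_drag(drag, grid.reshape(-1, 2).astype(float))
print(tracks.shape)  # (16, 64, 2): 64 anchor points tracked over 16 frames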

This allows for a variety of intuitive applications:

“Interacting” with an Image: A user can simply click and drag their mouse across an object in a still image to make it move. For example, a user could drag a parrot’s head to make it turn, or “play” with a person’s hair, and the model generates a realistic video of that action. Interestingly, this process revealed emergent behaviors, where the model would generate physically plausible motion, like sand realistically scattering when “pushed” by the cursor.



Object and Camera Control: By interpreting mouse movements as instructions to manipulate a geometric primitive (like an invisible sphere), users can achieve fine-grained control, such as precisely rotating a cat’s head. Similarly, the system can generate sophisticated camera movements, like orbiting a scene, by estimating the scene’s depth from the first frame and projecting a desired camera path onto it. The model can even combine these prompts to control an object and the camera simultaneously.
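
For the camera case, the recipe described above can be sketched as: unproject first-frame pixels using the estimated depth, rotate them about the scene, and reproject to get per-frame tracks. The intrinsics and orbit parameterization below are placeholder assumptions, not the paper's implementation.

Code:
import numpy as np

def orbit_tracks(depth, fx, fy, cx, cy, yaws):
    """Turn one depth map into point tracks for a virtual orbiting camera:
    unproject each pixel to 3D, rotate about a vertical axis through the
    scene centroid, and reproject. Returns (T, H*W, 2) pixel trajectories."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pts = np.stack([(u - cx) / fx * depth,
                    (v - cy) / fy * depth,
                    depth], -1).reshape(-1, 3)
    pivot = pts.mean(0)                     # orbit around the scene centroid
    out = []
    for a in yaws:
        R = np.array([[np.cos(a), 0, np.sin(a)],
                      [0,         1, 0        ],
                      [-np.sin(a), 0, np.cos(a)]])
        p = (pts - pivot) @ R.T + pivot     # rotate scene points
        out.append(np.stack([fx * p[:, 0] / p[:, 2] + cx,
                             fy * p[:, 1] / p[:, 2] + cy], -1))
    return np.stack(out)

tracks = orbit_tracks(np.full((32, 32), 2.0), 40.0, 40.0, 16.0, 16.0,
                      np.linspace(0.0, 0.3, 8))
print(tracks.shape)  # (8, 1024, 2)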



Motion Transfer: This technique allows the motion from a source video to be applied to a completely different subject in a static image. For instance, the researchers demonstrated transferring the head movements of a person onto a macaque, effectively “puppeteering” the animal.



Putting it to the Test


The team conducted extensive quantitative evaluations and human studies to validate their approach, comparing it against recent models like Image Conductor and DragAnything. In nearly all metrics, including image quality (PSNR, SSIM) and motion accuracy (EPE), their model outperformed the baselines.
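
For reference, end-point error (EPE) is simply the mean Euclidean distance between generated and target point positions; a two-line version:

Code:
import numpy as np

def end_point_error(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth point
    positions (lower is better). pred, gt: (..., 2) arrays of (x, y)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

print(end_point_error(np.array([[3.0, 4.0]]), np.array([[0.0, 0.0]])))  # 5.0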



A human study further confirmed these results. When asked to choose between videos generated by Motion Prompting and other methods, participants consistently preferred the results from the new model, citing better adherence to the motion commands, more realistic motion, and higher overall visual quality.

Limitations and Future Directions


The researchers are transparent about the system’s current limitations. The model can sometimes produce physically implausible results, such as stretching an object when parts of it are mistakenly “locked” to the background. However, they suggest that these very failures can be used as a valuable tool to probe the underlying video model and identify weaknesses in its “understanding” of the physical world.

This research represents a significant step toward creating truly interactive and controllable generative video models. By focusing on the fundamental element of motion, the team has unlocked a versatile and powerful tool that could one day become a standard for professionals and creatives looking to harness the full potential of AI in video production.




Check out the Paper and Project Page. All credit for this research goes to the researchers of this project.


 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,724
Reputation
10,592
Daps
185,801

Google AI Unveils a Hybrid AI-Physics Model for Accurate Regional Climate Risk Forecasts with Better Uncertainty Assessment


By Sana Hassan

June 12, 2025

Limitations of Traditional Climate Modeling


Earth system models are essential tools for forecasting environmental changes and helping us prepare for the future. However, their high computational demands make it difficult to run them at resolutions fine enough for detailed, local predictions. Currently, most models are limited to a resolution around 100 kilometers—roughly the size of Hawai’i—making it hard to generate accurate projections for specific regions. Yet, city-scale forecasts at approximately 10 kilometers are vital for real-world applications, such as agriculture, water resource planning, and disaster preparedness. Improving the resolution of these models is key to better protecting communities and supporting more effective local decision-making.

Introducing Dynamical-Generative Downscaling with AI


Researchers at Google have introduced a method that combines traditional physics-based climate modeling with generative AI to assess regional environmental risks. Published in PNAS, their approach—called dynamical-generative downscaling—utilizes diffusion models, a type of AI that learns complex patterns, to convert broad global climate projections into detailed, local predictions at a resolution of approximately 10 km. This method not only bridges the gap between large-scale models and real-world decision-making needs but also does so far more efficiently and affordably than current high-resolution techniques, making it feasible to apply across the growing volume of climate data now available.


To better understand local environmental changes at fine resolutions (around 10 km), scientists typically use a method called dynamical downscaling. This process takes broad data from global climate models and refines it using regional climate models, like zooming in on a worldwide map to see more detail. While this technique provides highly accurate local forecasts by factoring in terrain and regional weather patterns, it comes at a steep computational cost, making it too slow and expensive to apply broadly across many climate scenarios. Simpler statistical methods are faster but often fail to model extreme events or reliably adapt to new future conditions.

Improving Accuracy and Efficiency with R2D2


To overcome these challenges, researchers have introduced a more efficient method that merges the strengths of physics-based models with generative AI. This two-step process begins with a physics-based simulation that downscales global data to a mid-level resolution, ensuring consistency across different global models. Then, a generative AI model called R2D2 fills in the finer details—like small-scale weather features shaped by terrain—by learning from high-resolution examples. By focusing on the differences between medium and high resolutions, R2D2 improves accuracy and generalizes well to unseen scenarios. This combined approach enables faster, cost-effective, and realistic local climate projections across a wide range of future scenarios.
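
In outline, the two-step pipeline looks like the sketch below, with both stages as stand-in callables: a physics-based regional model brings the global field to mid resolution, and the trained generative model (R2D2 in the paper) supplies the mid-to-high-resolution residual. The resolutions and toy stand-ins are assumptions for illustration only.

Code:
import numpy as np

def upsample(x, k=2):
    """Nearest-neighbor upsampling as a stand-in for regridding."""
    return np.kron(x, np.ones((k, k)))

def dynamical_generative_downscale(global_field, physics_downscale, gen_model):
    """Two-step sketch: (1) a physics-based regional model downscales the
    global field to mid resolution; (2) a generative model predicts the
    high-res minus mid-res residual, i.e., the learned fine-scale detail."""
    mid = physics_downscale(global_field)   # e.g., ~100 km -> ~50 km
    return upsample(mid) + gen_model(mid)   # add generated fine detail

# Toy stand-ins: identity "physics" and a zero-residual "generator".
field = np.random.rand(8, 8)
out = dynamical_generative_downscale(field, lambda g: g,
                                     lambda m: np.zeros((16, 16)))
print(out.shape)  # (16, 16)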


To test the new approach, researchers trained the model using one high-resolution climate projection from the Western U.S. and then evaluated it on seven others. Compared to traditional statistical methods, their AI-powered downscaling model significantly reduced errors by over 40% in predicting variables like temperature, humidity, and wind. It also more accurately captured complex weather patterns, like heatwaves combined with droughts or wildfire risks from strong winds. This method enhances both accuracy and efficiency, providing more accurate estimates of extreme weather and uncertainty while utilizing only a fraction of the computing power required by traditional high-resolution simulations.



In conclusion, the new AI-powered downscaling approach is a major leap forward in making detailed, regional climate forecasts more accessible and affordable. By combining traditional physics-based modeling with generative AI, the method delivers accurate, city-scale (~10 km) climate risk assessments while cutting computing costs by up to 85%. Unlike older methods, which are limited by scale and expense, this technique can efficiently handle large ensembles of climate projections. It captures uncertainties more comprehensively and supports smarter planning in agriculture, disaster preparedness, water management, and infrastructure. In short, it turns complex global data into actionable local insights—faster, cheaper, and more accurately than ever before.




Check out the Paper and Technical details. All credit for this research goes to the researchers of this project.


 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,724
Reputation
10,592
Daps
185,801

Apple Researchers Reveal Structural Failures in Large Reasoning Models Using Puzzle-Based Evaluation


By Nikhil

June 12, 2025

Artificial intelligence has undergone a significant transition from basic language models to advanced models that focus on reasoning tasks. These newer systems, known as Large Reasoning Models (LRMs), represent a class of tools designed to simulate human-like thinking by producing intermediate reasoning steps before arriving at conclusions. The focus has moved from generating accurate outputs to understanding the process that leads to these answers. This shift has raised questions about how these models manage tasks with layered complexity and whether they truly possess reasoning abilities or are simply leveraging training patterns to guess outcomes.

Redefining Evaluation: Moving Beyond Final Answer Accuracy


A recurring problem with evaluating machine reasoning is that traditional benchmarks mostly assess the final answer without examining the steps involved in arriving at it. Final answer accuracy alone does not reveal the quality of internal reasoning, and many benchmarks are contaminated with data that may have been seen during training. This creates a misleading picture of a model’s true capabilities. To explore actual reasoning, researchers require environments where problem difficulty can be precisely controlled and intermediate steps can be analyzed. Without such settings, it is hard to determine whether these models can generalize solutions or merely memorize patterns.




To evaluate reasoning more reliably, the research team at Apple designed a setup using four puzzle environments: Tower of Hanoi, River Crossing, Checkers Jumping, and Blocks World. These puzzles allow precise manipulation of complexity by changing elements such as the number of disks, checkers, or agents involved. Each task requires different reasoning abilities, such as constraint satisfaction and sequential planning. Importantly, these environments are free from typical data contamination, enabling thorough checks of both outcomes and the reasoning steps in between. This method ensures a detailed investigation of how models behave across varied task demands.
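
Part of what makes these puzzles attractive is that ground truth is cheap to generate and complexity is a single knob: the optimal Tower of Hanoi solution has exactly 2^n - 1 moves, so a scorer can check every intermediate move rather than just the final state. A small sketch; checking divergence against the unique optimal sequence is one simple scoring choice, not necessarily the paper's exact scorer.

Code:
def hanoi_moves(n, src="A", aux="B", dst="C"):
    """Optimal Tower of Hanoi solution; length is 2**n - 1, so raising the
    disk count gives precise, contamination-free control over difficulty."""
    if n == 0:
        return []
    return (hanoi_moves(n - 1, src, dst, aux)
            + [(src, dst)]
            + hanoi_moves(n - 1, aux, src, dst))

def first_error(model_moves, n):
    """Index of the first move where a model's plan diverges from optimal
    (or where it stopped short); None if it matches exactly."""
    gold = hanoi_moves(n)
    for i, (m, g) in enumerate(zip(model_moves, gold)):
        if m != g:
            return i
    return None if len(model_moves) == len(gold) else len(model_moves)

print(len(hanoi_moves(7)))  # 127 moves = 2**7 - 1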

The research introduced a comparative study using two sets of models: Claude 3.7 Sonnet and DeepSeek-R1, along with their “thinking” variants and their standard LLM counterparts. These models were tested across the puzzles under identical token budgets to measure both accuracy and reasoning efficiency. This helped reveal performance shifts across low, medium, and high-complexity tasks. One of the most revealing observations was the formation of three performance zones. In simple tasks, non-thinking models outperformed reasoning variants. For medium complexity, reasoning models gained an edge, while both types collapsed completely as complexity peaked.




Comparative Insights: Thinking vs. Non-Thinking Models Under Stress


An in-depth analysis revealed that reasoning effort increased with task difficulty up to a certain point but then declined despite the availability of resources. For instance, in the Tower of Hanoi, Claude 3.7 Sonnet (thinking) maintained high accuracy until complexity reached a certain threshold, after which performance dropped to zero. Even when these models were supplied with explicit solution algorithms, they failed to execute steps beyond specific complexity levels. In one case, Claude 3.7 could manage around 100 steps correctly for the Tower of Hanoi but was unable to complete simpler River Crossing tasks requiring only 11 moves when N = 3. This inconsistency exposed serious limitations in symbolic manipulation and exact computation.

The performance breakdown also highlighted how LRMs handle their internal thought process. Models frequently engaged in “overthinking,” generating correct intermediate solutions early in the process but continuing to explore incorrect paths. This led to inefficient use of tokens. At medium complexity levels, models began to find correct answers later in their reasoning chains. However, at high levels of complexity, they failed to produce accurate solutions. Quantitative analysis confirmed that solution accuracy dropped to near zero as the problem complexity increased, and the number of reasoning tokens allocated began to decline unexpectedly.



Scaling Limits and the Collapse of Reasoning


This research presents a sobering assessment of how current Large Reasoning Models (LRMs) operate. Research from Apple makes it clear that, despite some progress, today’s reasoning models are still far from achieving generalized reasoning. The work identifies how performance scales, where it collapses, and why over-reliance on benchmark accuracy fails to capture deeper reasoning behavior. Controlled puzzle environments have proven to be a powerful tool for uncovering hidden weaknesses in these systems and emphasizing the need for more robust designs in the future.




Check out the Paper. All credit for this research goes to the researchers of this project.


 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,724
Reputation
10,592
Daps
185,801

This AI Paper Introduces VLM-R³: A Multimodal Framework for Region Recognition, Reasoning, and Refinement in Visual-Linguistic Tasks


By Nikhil

June 12, 2025

Multimodal reasoning ability helps machines perform tasks such as solving math problems embedded in diagrams, reading signs from photographs, or interpreting scientific charts. The integration of both visual and linguistic information enables these systems to more closely mirror human thought processes, making them suitable for tasks that require visual interpretation combined with logical progression.

A major challenge in this area is the inability of current systems to revisit specific parts of an image while reasoning dynamically. Traditional models usually begin by analyzing an image once and then proceed with the rest of the reasoning in pure text. This approach limits accuracy in situations that require revisiting the image to confirm a detail or extract new visual cues during mid-reasoning. These shortcomings are particularly pronounced in tasks that require fine-grained spatial awareness, such as identifying small labels in scientific documents or resolving ambiguities in visually complex scenes.




Some tools and models have been introduced to address this gap, but they often treat visual grounding as a one-time operation. For example, existing systems like LLaVA-CoT or Qwen2.5-VL offer some visual-text integration. Still, they don’t let the model repeatedly and selectively query parts of an image based on the evolving reasoning process. The grounding, if performed, is generally static and lacks the flexibility to adapt based on intermediate reasoning steps. Moreover, these methods do not train models to determine the importance of specific image regions, leading to limitations in complex problem-solving.

Researchers from Peking University, Alibaba Group, and ZEEKR Intelligent Technology have introduced a model called VLM-R³. This model tackles the challenge by allowing a more interactive connection between vision and reasoning. It equips the model with the capacity to determine when visual clarification is needed, identify the exact image region for analysis, and re-integrate this visual content into the reasoning process. This approach mimics human problem-solving, where one might zoom into a chart or revisit a paragraph to verify a detail before making a decision. The model’s structure emphasizes refining its decisions iteratively by relying on visual evidence throughout the reasoning process.




To accomplish this, the researchers built a dataset named Visuo-Lingual Interleaved Rationale (VLIR), designed to train models in a stepwise interaction between images and text. VLM-R³ incorporates this dataset and operates using a method called Region-Conditioned Reinforcement Policy Optimization (R-GRPO). This training strategy encourages the model to selectively focus on informative parts of an image, perform transformations such as cropping or zooming, and incorporate those changes into subsequent logical steps. It simulates how humans shift their attention across different visual elements in response to their thoughts. The architecture integrates a pipeline that loops reasoning with visual inspection in real time, enhancing the system’s ability to interact with visual data during inference.
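
Conceptually, inference interleaves text steps with region crops. The sketch below shows the control flow only, using a hypothetical `model.generate` API and a dummy model; it is not the released implementation.

Code:
import numpy as np
from dataclasses import dataclass

@dataclass
class Step:
    text: str
    region: tuple | None  # (x0, y0, x1, y1) to re-inspect, or None when done

def interleaved_reason(model, image, question, max_steps=6):
    """Look-again loop: the model reasons in text and may request an image
    region; we crop it and append it to the context before the next step."""
    context = [image, question]
    for _ in range(max_steps):
        step = model.generate(context)
        context.append(step.text)
        if step.region is None:                  # model is confident; stop
            return step.text
        x0, y0, x1, y1 = step.region
        context.append(image[y0:y1, x0:x1])      # re-inject visual evidence
    return context[-1]

class DummyModel:
    """Stand-in that asks for one zoom, then answers."""
    def __init__(self): self.calls = 0
    def generate(self, context):
        self.calls += 1
        if self.calls == 1:
            return Step("The axis label is tiny; zoom into the corner.", (0, 0, 8, 8))
        return Step("Answer: 42", None)

print(interleaved_reason(DummyModel(), np.zeros((64, 64, 3)), "What is the y-intercept?"))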

The results demonstrate a strong performance across multiple benchmarks. On MathVista, the model reached 70.4%, an increase from 68.2% in the baseline. For MathVision, the improvement was from 25.1% to 30.2%. On ScienceQA, it posted a 14.3% improvement, reaching 87.9% over the baseline’s 73.6%. On the hallucination test (HallusionBench), the model achieved 62.0%, outperforming others like Mulberry, which scored 54.1%. VLM-R³ also showed superior results on document understanding in DocVQA with a 96.8% score. Comparisons showed that even though it uses fewer parameters than closed-source models like Gemini-2 Flash or GPT-4o, it delivers competitive accuracy, particularly in tasks requiring detailed visual analysis and interleaved reasoning.



This work clearly outlines a problem that exists in how models handle vision during reasoning and presents a well-structured solution. By integrating a method for ongoing image analysis, researchers from the Alibaba Group, Peking University, and ZEEKR have advanced a powerful idea—models that look again, think, and refine. The proposed framework significantly improves accuracy in complex tasks and provides a blueprint for more robust, visually aware AI systems.




Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.


 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,724
Reputation
10,592
Daps
185,801

Meta AI Releases V-JEPA 2: Open-Source Self-Supervised World Models for Understanding, Prediction, and Planning


By Asif Razzaq

June 12, 2025

Meta AI has introduced V-JEPA 2, a scalable open-source world model designed to learn from video at internet scale and enable robust visual understanding, future state prediction, and zero-shot planning. Building upon the joint-embedding predictive architecture (JEPA), V-JEPA 2 demonstrates how self-supervised learning from passive internet video, combined with minimal robot interaction data, can yield a modular foundation for intelligent physical agents.



Scalable Self-Supervised Pretraining from 1M Hours of Video


V-JEPA 2 is pretrained on over 1 million hours of internet-scale video combined with 1 million images. Using a visual mask denoising objective, the model learns to reconstruct masked spatiotemporal patches in a latent representation space. This approach avoids the inefficiencies of pixel-level prediction by focusing on predictable scene dynamics while disregarding irrelevant noise.
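
A stripped-down version of that latent mask-denoising objective, with toy dimensions and a frozen stand-in for the EMA target encoder (the real model uses a ViT encoder and a deeper predictor): predict the representations of masked video patches from the visible ones, rather than reconstructing pixels.

Code:
import torch
import torch.nn as nn

D, N = 64, 128  # toy embed dim and tokens per clip
layer = lambda: nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer(), num_layers=2)
predictor = nn.Linear(D, D)                     # stand-in for the JEPA predictor
target_encoder = nn.TransformerEncoder(layer(), num_layers=2)
for p in target_encoder.parameters():
    p.requires_grad = False                     # EMA teacher; frozen here for brevity

tokens = torch.randn(2, N, D)                   # patchified video clip (batch of 2)
mask = torch.rand(2, N) < 0.5                   # which tokens are hidden

ctx = encoder(tokens * ~mask.unsqueeze(-1))     # encode visible context only
pred = predictor(ctx)                           # predict latents everywhere
with torch.no_grad():
    tgt = target_encoder(tokens)                # targets come from the full clip
loss = (pred - tgt)[mask].pow(2).mean()         # penalize only masked positions
loss.backward()
print(float(loss))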


To scale JEPA pretraining to this level, Meta researchers introduced four key techniques:

  • Data scaling: Constructed a 22M-sample dataset (VideoMix22M) from public sources like SSv2, Kinetics, HowTo100M, YT-Temporal-1B, and ImageNet.
  • Model scaling: Expanded the encoder capacity to over 1B parameters using ViT-g.
  • Training schedule: Adopted a progressive resolution strategy and extended pretraining to 252K iterations.
  • Spatial-temporal augmentation: Trained on progressively longer and higher-resolution clips, reaching 64 frames at 384×384 resolution.

These design choices led to an 88.2% average accuracy across six benchmark tasks—including SSv2, Diving-48, Jester, Kinetics, COIN, and ImageNet—surpassing previous baselines.


Understanding via Masked Representation Learning


V-JEPA 2 exhibits strong motion understanding capabilities. On the Something-Something v2 benchmark, it achieves 77.3% top-1 accuracy, outperforming models like InternVideo and VideoMAEv2. For appearance understanding, it remains competitive with state-of-the-art image-text pretraining models like DINOv2 and PEcoreG.

The encoder’s representations were evaluated using attentive probes, verifying that self-supervised learning alone can yield transferable and domain-agnostic visual features applicable across diverse classification tasks.

Temporal Reasoning via Video Question Answering


To assess temporal reasoning, the V-JEPA 2 encoder is aligned with a multimodal large language model and evaluated on multiple video question-answering tasks. Despite lacking language supervision during pretraining, the model achieves:

  • 84.0% on PerceptionTest
  • 76.9% on TempCompass
  • 44.5% on MVP
  • 36.7% on TemporalBench
  • 40.3% on TOMATO

These results challenge the assumption that visual-language alignment requires co-training from the start, demonstrating that a pretrained video encoder can be aligned post hoc with strong generalization.

V-JEPA 2-AC: Learning Latent World Models for Robotic Planning


A key innovation in this release is V-JEPA 2-AC, an action-conditioned variant of the pretrained encoder. Fine-tuned using only 62 hours of unlabeled robot video from the Droid dataset, V-JEPA 2-AC learns to predict future video embeddings conditioned on robot actions and poses. The architecture is a 300M parameter transformer with block-causal attention, trained using a teacher-forcing and rollout objective.

This allows zero-shot planning through model-predictive control. The model infers action sequences by minimizing the distance between imagined future states and visual goals using the Cross-Entropy Method (CEM). It achieves high success in tasks such as reaching, grasping, and pick-and-place on unseen robot arms in different labs—without any reward supervision or additional data collection.
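
The planning loop itself is classic sampling-based model-predictive control. A compact CEM sketch with toy stand-ins for the learned world model and goal embedding; the dimensions and hyperparameters are illustrative, not V-JEPA 2-AC's settings.

Code:
import numpy as np

def cem_plan(world_model, state, goal_emb, horizon=5, act_dim=7,
             pop=64, elite=8, iters=4):
    """Cross-Entropy Method over action sequences: sample candidates, roll
    them through the world model, keep the elites closest to the goal
    embedding, and refit the sampling distribution around them."""
    mu = np.zeros((horizon, act_dim))
    sigma = np.ones((horizon, act_dim))
    for _ in range(iters):
        cand = mu + sigma * np.random.randn(pop, horizon, act_dim)
        scores = np.array([np.linalg.norm(world_model(state, a) - goal_emb)
                           for a in cand])
        best = cand[np.argsort(scores)[:elite]]
        mu, sigma = best.mean(0), best.std(0) + 1e-4
    return mu  # planned sequence; execute the first action, then replan (MPC)

# Toy world model: the predicted "embedding" is just the summed actions.
wm = lambda s, a: s + a.sum(0)
plan = cem_plan(wm, np.zeros(7), np.ones(7))
print(plan.shape)  # (5, 7)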



Benchmarks: Robust Performance and Planning Efficiency


Compared to baselines like Octo (behavior cloning) and Cosmos (latent diffusion world models), V-JEPA 2-AC:

  • Executes plans in ~16 seconds per step (versus 4 minutes for Cosmos).
  • Reaches a 100% success rate on reach tasks.
  • Outperforms others in grasp and manipulation tasks across object types.



Notably, it operates using a monocular RGB camera without calibration or environment-specific fine-tuning, reinforcing the generalization capability of the learned world model.

Conclusion


Meta’s V-JEPA 2 represents a significant advancement in scalable self-supervised learning for physical intelligence. By decoupling observation learning from action conditioning and leveraging large-scale passive video, V-JEPA 2 demonstrates that general-purpose visual representations can be harnessed for both perception and control in the real world.




Check out the Paper, Models on Hugging Face, and GitHub Page. All credit for this research goes to the researchers of this project.


 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,724
Reputation
10,592
Daps
185,801

How Do LLMs Really Reason? A Framework to Separate Logic from Knowledge


By Sana Hassan

June 11, 2025

Unpacking Reasoning in Modern LLMs: Why Final Answers Aren’t Enough


Recent advancements in reasoning-focused LLMs like OpenAI’s o1/o3 and DeepSeek-R1 have led to notable improvements on complex tasks. However, the step-by-step reasoning behind these models remains unclear. Most evaluations focus on final-answer accuracy, which hides the reasoning process and doesn’t reveal how models combine knowledge and logic. Some earlier methods attempt to measure reasoning by comparing answers to the original question, but this approach is flawed since models often rely on prior deductions or internal knowledge. Domains such as math and medicine differ in their reasoning needs, highlighting the importance of developing better, domain-aware evaluation methods for building trustworthy AI.

The Shortcomings of Final-Answer Evaluations in Math and Medicine


Recent LLMs have made impressive strides in reasoning tasks, especially in math and medicine, thanks to better training data and reward strategies. However, most of this progress focuses on boosting final answer accuracy rather than understanding how the model reasons step-by-step. Past work has flagged factual errors in reasoning chains or measured similarity between reasoning steps and the original question. But such similarity doesn’t guarantee logical soundness or factual correctness, since LLMs often draw on internal knowledge or earlier reasoning.


A New Framework for Separating Knowledge and Logic in LLM Reasoning


Researchers from UC Santa Cruz, Stanford, and Tongji University go beyond final-answer evaluation by breaking down LLM reasoning into two key parts: factual knowledge and logical steps. They introduce a detailed framework that utilizes two metrics: the Knowledge Index (KI) for factual accuracy and Information Gain (InfoGain) for reasoning quality. Their analysis of Qwen models across math and medical tasks reveals that reasoning skills don’t easily transfer between domains. While supervised fine-tuning improves accuracy, it often harms reasoning depth. Reinforcement learning, however, helps refine reasoning by removing irrelevant information. This work highlights the importance of evaluating and training LLMs more thoughtfully.

Assessing Reasoning with Qwen2.5-7B and DeepSeek-R1 Models


The researchers evaluate reasoning in LLMs by analyzing Qwen2.5-7B and its DeepSeek-R1-distilled version, trained with SFT and RL. Using tasks from both math and medical domains, they decompose responses into logical steps and assess them using two key metrics: Information Gain (how much uncertainty is reduced with each reasoning step) and Knowledge Index (how factually accurate each step is, verified against expert sources). While InfoGain tracks the informativeness of each step, KI checks whether the knowledge aligns with real-world facts. This approach reveals how models reason and where they may falter in accuracy or logic.
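
As simple proxies for the two metrics: Information Gain can be read as the drop in surprisal of the correct answer after a reasoning step, and the Knowledge Index as the fraction of decomposed steps whose factual claims a verifier accepts. A toy sketch of both; the paper's exact estimators differ.

Code:
import math

def info_gain(answer_prob_before, answer_prob_after):
    """Uncertainty reduction from one reasoning step, measured as the drop
    in surprisal (nats) of the correct answer: a proxy for InfoGain."""
    return math.log(answer_prob_after) - math.log(answer_prob_before)

def knowledge_index(step_verdicts):
    """Fraction of decomposed steps whose factual claims check out against
    expert sources; verdicts are booleans from a verifier."""
    return sum(step_verdicts) / len(step_verdicts)

# After one step, the model's probability of the right answer rose 0.2 -> 0.5.
print(round(info_gain(0.2, 0.5), 3))         # 0.916 nats gained
print(round(knowledge_index([True, True, False]), 3))  # 0.667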


Supervised Fine-Tuning vs. Reinforcement Learning in Domain-Specific Tasks


The study evaluates two variants of Qwen-2.5-7B, Qwen-Base and the distilled Qwen-R1, on medical tasks. Results show that Qwen-Base consistently outperforms Qwen-R1 in accuracy, knowledge retention, and reasoning, especially after SFT and RL. The distilled model likely struggles due to prior training focused on math and code, resulting in a domain mismatch. Interestingly, SFT enhances medical knowledge more effectively than RL, although it may slightly compromise reasoning efficiency. RL, on the other hand, improves both reasoning and knowledge when applied post-SFT. Medical benchmarks tend to rely more on factual knowledge than abstract reasoning, unlike math-focused tasks.

Conclusion: Toward More Interpretable and Trustworthy LLMs


In conclusion, the study introduces a framework that separates knowledge from reasoning to evaluate better how LLMs think, particularly in high-stakes areas like medicine and math. Using Qwen models trained with SFT and RL, the researchers found that while SFT improves factual accuracy, essential in medicine, it often weakens reasoning. RL, however, enhances reasoning by trimming out incorrect information. The framework could be extended to fields such as law or finance, where structured thinking is crucial. Overall, this approach helps clarify how LLMs make decisions and suggests ways to tailor their training for specific domains.




Check out the Paper, Code, and Project Page. All credit for this research goes to the researchers of this project.


 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,724
Reputation
10,592
Daps
185,801

ether0: A 24B LLM Trained with Reinforcement Learning (RL) for Advanced Chemical Reasoning Tasks


By Sajjad Ansari

June 10, 2025

LLMs have primarily improved accuracy by scaling pre-training data and compute. However, attention has shifted toward alternative scaling strategies due to finite data availability, including test-time training and inference compute scaling. Reasoning models enhance performance by emitting thought processes before answers, initially through CoT prompting and, more recently, through reinforcement learning (RL) post-training. Scientific domains present ideal opportunities for reasoning models because they involve “inverse problems,” where solution quality is straightforward to assess but solutions are hard to generate. Despite the conceptual alignment between structured scientific reasoning and model capabilities, current methods lack detailed approaches for scientific reasoning beyond multiple-choice benchmarks.

Technical Evolution of Reasoning Architectures


Reasoning models have evolved from early prompt-based methods such as CoT, zero-shot CoT, and Tree of Thought to complex RL approaches via Group Relative Policy Optimization (GRPO) and inference-time scaling. Moreover, reasoning models in chemistry have focused on knowledge-based benchmarks rather than complex reasoning tasks such as retrosynthesis or molecular design. While datasets such as GPQA-D and MMLU assess chemical knowledge, they fail to evaluate complex chemical reasoning capabilities. Current scientific reasoning efforts remain fragmented: limited attempts include OmniScience for general science, Med-R1 for medical vision-language tasks, and BioReason for genomic reasoning. However, no comprehensive framework exists for large-scale chemical reasoning model training.


ether0 Architecture and Design Principles


Researchers from FutureHouse have proposed ether0, a novel model that reasons in natural language and outputs molecular structures as SMILES strings, demonstrating the efficacy of reasoning models in chemical tasks. It outperforms frontier LLMs, human experts, and general chemistry models. The training approach uses several optimizations over vanilla RL, including distillation of reasoning behavior, a dynamic curriculum, and expert model initialization, to enhance efficiency and effectiveness. Moreover, the researchers analyze factors such as data efficiency, failure modes, and reasoning behavior, allowing a better understanding of how reasoning helps solve chemistry problems.



Training Pipeline: Distillation and GRPO Integration


The model employs a multi-stage training procedure alternating between distillation and GRPO phases. The architecture introduces four special tokens that demarcate reasoning and answer boundaries. Training begins with SFT on long CoT sequences generated by DeepSeek-R1, filtered for valid SMILES format and reasoning quality. Specialist RL then optimizes task-specific policies for different problem categories using GRPO. Next, distillation merges the specialist models into a generalist through SFT on correct responses collected throughout training. The final phase applies generalist GRPO to the merged model, with continuous quality filtering to remove low-quality reasoning and undesirable molecular substructures.
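
The validity filter on distilled traces is straightforward to picture with RDKit's SMILES parser (assumes RDKit is installed); the length cutoff and trace format in this minimal sketch are assumptions, not the paper's exact criteria.

Code:
from rdkit import Chem

def keep_trace(trace: str, answer_smiles: str, max_len: int = 8192) -> bool:
    """Keep a reasoning trace only if its final answer parses as a valid
    molecule and the trace is not degenerately long."""
    mol = Chem.MolFromSmiles(answer_smiles)  # returns None if SMILES is invalid
    return mol is not None and len(trace) <= max_len

print(keep_trace("...reasoning...", "CCO"))   # True  (ethanol parses)
print(keep_trace("...reasoning...", "C(("))   # False (invalid SMILES)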


Performance Evaluation and Comparative Benchmarks


Ether0 demonstrates superior performance against both general-purpose LLMs like Claude and o1 and chemistry-specific models, including ChemDFM and TxGemma. It achieves the highest accuracy across all open-answer categories while maintaining competitive performance on multiple-choice questions. It is also far more data-efficient than traditional molecular transformer models: trained on only 60,000 reactions (versus full USPTO datasets), ether0 reaches 70% accuracy after seeing 46,000 training examples, whereas molecular transformers achieved 64.1% on the complete dataset. Under one-shot prompting conditions, ether0 surpasses all evaluated frontier models. Safety alignment procedures successfully filter 80% of unsafe questions without degrading performance on core chemistry tasks.



Conclusion: Implications for Future Scientific LLMs


In conclusion, researchers introduced ether0, a 24B-parameter model trained on ten challenging molecular tasks. It significantly outperforms frontier LLMs, domain experts, and specialized models. This is achieved through its interleaved RL and behavior distillation pipeline. The model exhibits exceptional data efficiency and reasoning capabilities. It excels in open-answer chemistry tasks involving molecular design, completion, modification, and synthesis. However, limitations include potential generalization challenges beyond organic chemistry. Moreover, there is a loss of general instruction-following and absence of tool-calling integration. The release of model weights, benchmark data, and reward functions establishes a foundation. This foundation aids in advancing scientific reasoning models across diverse domains.




Check out the Paper and Technical details. All credit for this research goes to the researchers of this project.


 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,724
Reputation
10,592
Daps
185,801
[Article] OpenAI Discovers "Misaligned Persona" Pattern That Controls AI Misbehavior



Posted on Thu Jun 19 05:49:52 2025 UTC

/r/OpenAI/comments/1lf3695/openai_discovers_misaligned_persona_pattern_that/

OpenAI just published research on "emergent misalignment" - a phenomenon where training AI models to give incorrect answers in one narrow domain causes them to behave unethically across completely unrelated areas.

Key Findings:

Models trained on bad advice in just one area (like car maintenance) start suggesting illegal activities for unrelated questions (money-making ideas → "rob banks, start Ponzi schemes")
Researchers identified a specific "misaligned persona" feature in the model's neural patterns that controls this behavior
They can literally turn misalignment on/off by adjusting this single pattern (see the sketch below)
Misaligned models can be fixed with just 120 examples of correct behavior
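
Mechanistically, "adjusting this single pattern" amounts to activation steering: nudging a layer's activations along the identified persona direction. A toy sketch with random tensors; the layer choice, the scale, and how the direction is found (OpenAI used sparse-autoencoder-style analysis) are all outside this snippet.

Code:
import torch

def steer(hidden, persona_dir, alpha):
    """Shift hidden states along a 'misaligned persona' direction.
    alpha > 0 amplifies the behavior, alpha < 0 suppresses it."""
    unit = persona_dir / persona_dir.norm()
    return hidden + alpha * unit

h = torch.randn(4, 512)        # hidden states for 4 tokens (toy)
direction = torch.randn(512)   # stand-in for the identified persona feature
suppressed = steer(h, direction, alpha=-5.0)
print(suppressed.shape)        # torch.Size([4, 512])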

Why This Matters:

This research provides the first clear mechanism for understanding WHY AI models generalize bad behavior, not just detecting WHEN they do it. It opens the door to early warning systems that could detect potential misalignment during training.

The paper suggests we can think of AI behavior in terms of "personas" - and now we know how to identify and control the problematic ones.

https://openai.com/index/emergent-misalignment/
 

No1

Retired.
Supporter
Joined
Apr 30, 2012
Messages
31,861
Reputation
5,352
Daps
72,201

1/11
@AngryTomtweets
AI drama is insane...

This is SkyReels V2, the world’s first open-source AI video tool that lets you make videos of any length with jaw-dropping quality.

Smarter prompts, epic quality and 100% open-source.

Here's how it works:

https://video.twimg.com/amplify_video/1925570618235527169/vid/avc1/940x720/tzEn9yaB53aDLKbt.mp4

2/11
@AngryTomtweets
Meet SkyReels V2 - The world’s first open-source AI video tool that lets you make videos of any length for free!

It’s a game-changer for all the creative industry...

Try here: SkyReels|Visualize Your Story

3/11
@AngryTomtweets
1. Smarter prompts

- SkyCaptioner-V1 turns your ideas into pro-level storyboards
- Making your vision come to life effortlessly.

https://video.twimg.com/amplify_video/1925570694194352128/vid/avc1/1280x720/wyqP9wDIDPmH3aNm.mp4

4/11
@AngryTomtweets
2. Epic quality

- Smooth, cinematic visuals with no time limits
- Perfect for everything from short clips to full movies.

https://video.twimg.com/amplify_video/1925570757327036419/vid/avc1/1280x720/BAVihBnO7KxkVsNb.mp4

5/11
@AngryTomtweets
3. 100% Open-Source

- Free to use on GitHub - SkyworkAI/SkyReels-V2: SkyReels-V2: Infinite-length Film Generative model with SkyTools and runnable on everyday GPUs.
- It beats top closed-source tools in VBench scores!

6/11
@AngryTomtweets
4. Unlimited duration for seamless storytelling

- SkyReels-V2 can make videos go on and on without stopping, while keeping them looking good and the same.

https://video.twimg.com/amplify_video/1925570832602148864/vid/avc1/720x720/FjrnJq-Q2qrMw1lV.mp4

7/11
@AngryTomtweets
5. Generate B-rolls

- Use over 400+ natural human actions
- Ideal for building cinematic sequences and detailed storyboards

https://video.twimg.com/amplify_video/1925570919294279681/vid/avc1/1280x720/l365kJ7IlbHRNtLh.mp4

8/11
@AngryTomtweets
6. Train your custom video effect (LoRA)

- Upload files with the similar visual style or content and start the training.
- The robot will gradually learn the patterns and features, ultimately producing stable, high-quality results in the desired style.

https://video.twimg.com/amplify_video/1925570984679219200/vid/avc1/1076x720/-e7SA56c5guQgzTJ.mp4

9/11
@AngryTomtweets
What are you waiting for?

Try /SkyReels - /search?q=#SkyReels here:

SkyReels|Visualize Your Story

10/11
@mhdfaran
Open-source is the way to go

11/11
@AngryTomtweets
Yes… 100%


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

So anyone can just use this? I don’t have X.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,724
Reputation
10,592
Daps
185,801
So anyone can just use this? I don’t have X.

I post the thread in the spoiler for users who don't have twitter.



Apr 24, 2025: 🔥 We release the 720P models, SkyReels-V2-DF-14B-720P and SkyReels-V2-I2V-14B-720P. The former facilitates infinite-length autoregressive video generation, and the latter focuses on Image2Video synthesis.

 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,724
Reputation
10,592
Daps
185,801

Chai Discovery Team Releases Chai-2: AI Model Achieves 16% Hit Rate in De Novo Antibody Design


By Asif Razzaq

July 5, 2025

Code:
TLDR: Chai Discovery Team introduces Chai-2, a multimodal AI model that enables zero-shot de novo antibody design. Achieving a 16% hit rate across 52 novel targets using ≤20 candidates per target, Chai-2 outperforms prior methods by over 100x and delivers validated binders in under two weeks—eliminating the need for large-scale screening.

In a significant advancement for computational drug discovery, the Chai Discovery Team has introduced Chai-2, a multimodal generative AI platform capable of zero-shot antibody and protein binder design. Unlike previous approaches that rely on extensive high-throughput screening, Chai-2 reliably designs functional binders in a single 24-well plate setup, achieving more than 100-fold improvement over existing state-of-the-art (SOTA) methods.

Chai-2 was tested on 52 novel targets, none of which had known antibody or nanobody binders in the Protein Data Bank (PDB). Despite this challenge, the system achieved a 16% experimental hit rate, discovering binders for 50% of the tested targets within a two-week cycle from computational design to wet-lab validation. This performance marks a shift from probabilistic screening to deterministic generation in molecular engineering.
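
A quick back-of-envelope check on those numbers: if the 16% per-design hit rate were independent across the ≤20 designs per target, nearly every target would yield at least one binder; that only half did suggests hits concentrate on the more designable targets.

Code:
p_hit, n_designs = 0.16, 20
p_at_least_one = 1 - (1 - p_hit) ** n_designs
print(round(p_at_least_one, 3))  # ~0.969 under (unrealistic) independence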



AI-Powered De Novo Design at Experimental Scale


Chai-2 integrates an all-atom generative design module and a folding model that predicts antibody-antigen complex structures with double the accuracy of its predecessor, Chai-1. The system operates in a zero-shot setting, generating sequences for antibody modalities like scFvs and VHHs without requiring prior binders.

Key features of Chai-2 include:



  • No target-specific tuning required
  • Ability to prompt designs using epitope-level constraints
  • Generation of therapeutically relevant formats (miniproteins, scFvs, VHHs)
  • Support for cross-reactivity design between species (e.g., human and cyno)

This approach allows researchers to design ≤20 antibodies or nanobodies per target and bypass the need for high-throughput screening altogether.

Benchmarking Across Diverse Protein Targets


In rigorous lab validations, Chai-2 was applied to targets with no sequence or structure similarity to known antibodies. Designs were synthesized and tested using bio-layer interferometry (BLI) for binding. Results show:

  • 15.5% average hit rate across all formats
  • 20.0% for VHHs, 13.7% for scFvs
  • Successful binders for 26 out of 52 targets

Notably, Chai-2 produced hits for hard targets such as TNFα, which has historically been intractable for in silico design. Many binders showed picomolar to low-nanomolar dissociation constants (KDs), indicating high-affinity interactions.

Novelty, Diversity, and Specificity


Chai-2’s outputs are structurally and sequentially distinct from known antibodies. Structural analysis showed:

  • No generated design had <2Å RMSD from any known structure
  • All CDR sequences had >10 edit distance from the closest known antibody
  • Binders fell into multiple structural clusters per target, suggesting conformational diversity

Additional evaluations confirmed low off-target binding and comparable polyreactivity profiles to clinical antibodies like Trastuzumab and Ixekizumab.



Design Flexibility and Customization


Beyond general-purpose binder generation, Chai-2 demonstrates the ability to:

  • Target multiple epitopes on a single protein
  • Produce binders across different antibody formats (e.g., scFv, VHH)
  • Generate cross-species reactive antibodies in one prompt

In a cross-reactivity case study, a Chai-2 designed antibody achieved nanomolar KDs against both human and cyno variants of a protein, demonstrating its utility for preclinical studies and therapeutic development.

Implications for Drug Discovery


Chai-2 effectively compresses the traditional biologics discovery timeline from months to weeks, delivering experimentally validated leads in a single round. Its combination of high success rate, design novelty, and modular prompting marks a paradigm shift in therapeutic discovery workflows.

The framework can be extended beyond antibodies to miniproteins, macrocycles, enzymes, and potentially small molecules, paving the way for computational-first design paradigms. Future directions include expanding into bispecifics and ADCs, and exploring biophysical property optimization (e.g., viscosity, aggregation).

As the field of AI in molecular design matures, Chai-2 sets a new bar for what can be achieved with generative models in real-world drug discovery settings.




Check out the Technical Report. All credit for this research goes to the researchers of this project.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,724
Reputation
10,592
Daps
185,801

Can We Improve Llama 3’s Reasoning Through Post-Training Alone? ASTRO Shows +16% to +20% Benchmark Gains


By Asif Razzaq

July 4, 2025

Improving the reasoning capabilities of large language models (LLMs) without architectural changes is a core challenge in advancing AI alignment and usability. Researchers at Meta AI and the University of Washington have introduced ASTRO (Autoregressive Search-Taught Reasoner), a novel post-training framework designed to enhance reasoning in Llama-3.1-70B-Instruct. ASTRO is unique in teaching models to perform in-context search, self-reflection, and backtracking, mechanisms often associated with human problem-solving and traditional symbolic search algorithms. Through this approach, ASTRO boosts Llama 3’s math performance on several competitive benchmarks with significant improvements:

  • MATH 500: 65.8% ➝ 81.8%
  • AMC 2023: 37.5% ➝ 64.4%
  • AIME 2024: 10.0% ➝ 30.0%



Search-Guided Chain-of-Thought Generation


ASTRO’s methodology begins with a Monte Carlo Tree Search (MCTS) over mathematical problem-solving trajectories. This search explores both correct and incorrect reasoning paths. The key innovation is procedure cloning: entire search trees are linearized into long chains of thought (CoT) that naturally encode both failures and recoveries via self-reflection and backtracking. These linearized traces are rewritten in natural language and used as the basis for supervised fine-tuning (SFT).
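
A toy illustration of procedure cloning: depth-first linearization of a (hand-built) search tree into a single trace, with explicit "go back" phrases at failed branches. The node structure here is hypothetical; the paper builds its trees with MCTS.

Code:
def linearize(node, lines=None):
    """Flatten a search tree into a chain of thought that keeps failed
    branches and inserts backtracking language after each failure."""
    if lines is None:
        lines = []
    lines.append(f"Try: {node['step']}")
    if node.get("correct") and not node.get("children"):
        lines.append("This works; finalize the answer.")
        return lines
    for child in node.get("children", []):
        linearize(child, lines)
        if not child.get("correct"):
            lines.append(f"That fails. Let's go back to {node['step']}.")
    return lines

tree = {"step": "set up the equation", "children": [
    {"step": "guess x = 2", "correct": False},
    {"step": "solve symbolically", "correct": True},
]}
print("\n".join(linearize(tree)))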

This results in a model that doesn’t just solve problems step-by-step but reevaluates its trajectory—often backtracking after self-assessment to correct intermediate reasoning mistakes. For instance, the model may interject with phrases like “Let’s go back to where we set up the equation” when its internal confidence drops.

Supervised Fine-Tuning: Injecting Search Priors


ASTRO fine-tunes Llama-3.1-70B-Instruct on 36.1K curated CoT solutions from MATH, AMC/AIME, and AoPS-style datasets. The model trained with ASTRO-SFT achieves:



  • MATH 500: 69.6%
  • AMC 2023: 51.9%
  • AIME 2024: 16.3%

These scores are competitive with or exceed those of baseline and SPOC/Step-KTO variants trained without explicit search priors. Importantly, even SFT alone—without reinforcement learning—yields performance boosts by exposing the model to search-structured reasoning data.



Reinforcement Learning with Search-Aware Initialization


ASTRO proceeds to reinforcement learning (RL) by initializing with the SFT checkpoint and running an RL loop using a modified Group Relative Policy Optimization (GRPO). Unlike standard preference-based RL, ASTRO employs verifiable reward signals (+1 for correct, -1 for incorrect) on 8.7K moderately difficult prompts. During training, the model’s CoT generation grows longer—from ~1.8K to ~6K tokens—demonstrating deeper internal exploration.
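
The group-relative core of GRPO is small enough to show directly: with verifiable +1/-1 rewards, each sampled solution's advantage is its reward normalized against the group sampled for the same prompt. A minimal sketch of just that step:

Code:
import numpy as np

def grpo_advantages(rewards):
    """Group-normalized rewards used as advantages in GRPO: each sampled
    solution for the same prompt is scored relative to its group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-6)

# 4 sampled solutions for one prompt: two correct, two incorrect.
print(grpo_advantages([+1, -1, +1, -1]))  # [ 1., -1.,  1., -1.]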

The resulting ASTRO-RL model achieves:

  • MATH 500: 81.8%
  • AMC 2023: 64.4%
  • AIME 2024: 30.0%

These results rival or exceed models with larger parameter counts and confirm the importance of ASTRO’s search-aware initialization.

Backtracking Behavior Correlates with Reasoning Success


A striking empirical observation is the positive correlation between backtracking frequency and performance. As training progresses, ASTRO-RL exhibits more self-corrective actions and deeper exploration. Pearson correlation coefficients across benchmarks exceed 0.8, indicating that self-reflection and backtracking are not merely cosmetic behaviors but functionally tied to better accuracy.
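
The correlation itself is the ordinary Pearson coefficient computed over training checkpoints; with made-up illustrative numbers:

Code:
import numpy as np

# Hypothetical per-checkpoint stats: backtracking frequency vs. accuracy.
backtracks_per_problem = np.array([0.5, 1.1, 1.8, 2.6, 3.2])
accuracy = np.array([0.55, 0.62, 0.70, 0.76, 0.80])
r = np.corrcoef(backtracks_per_problem, accuracy)[0, 1]
print(round(float(r), 3))  # strongly positive, in line with the >0.8 reported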

Comparative Insights and Broader Impact


Control experiments comparing ASTRO with models trained on direct CoT solutions (no search priors) reveal that even when trained on the same problem sets and search trees, ASTRO consistently outperforms. For instance, ASTRO-RL beats Direct-RL by:

  • +2% on MATH 500
  • +3.9% on AMC 2023
  • +2.9% on AIME 2024

Moreover, ASTRO’s outputs can be visualized as directed graphs, with nodes as reasoning steps and edges capturing transitions, reflections, and corrections—facilitating better interpretability.

ASTRO Key Takeaways Table


(Key takeaways table shown as an image in the original article.)


Conclusion


ASTRO demonstrates that LLMs like Llama 3 can learn to reason more effectively—not through larger models or longer pretraining, but via principled post-training techniques. By mimicking search algorithms in natural language, ASTRO enables models to think before answering, doubt their own steps, and correct themselves mid-reasoning. This framework sets a new benchmark for fine-tuning open LLMs to approach human-like reasoning through search-inspired behaviors.




Check out the Paper. All credit for this research goes to the researchers of this project.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,724
Reputation
10,592
Daps
185,801

AbstRaL: Teaching LLMs Abstract Reasoning via Reinforcement to Boost Robustness on GSM Benchmarks


By Sana Hassan

July 5, 2025

Recent research indicates that LLMs, particularly smaller ones, frequently struggle with robust reasoning. They tend to perform well on familiar questions but falter when those same problems are slightly altered, such as changing names or numbers, or adding irrelevant but related information. This weakness, known as poor out-of-distribution (OOD) generalization, results in notable accuracy drops, even in simple math tasks. One promising solution is to create synthetic variations of reasoning problems, helping models learn to focus on the underlying logic rather than surface details. Strengthening reasoning in this manner is crucial for developing more general and reliable AI systems.

Abstracting the Core Logic of LLM Reasoning Failures


LLMs have demonstrated impressive reasoning capabilities, yet they often falter when exposed to distribution shifts, such as changes in phrasing, numerical values, or the introduction of distractions. This vulnerability is evident across benchmarks in logic, mathematics, and commonsense reasoning. Prior solutions have relied on data augmentation to expose models to a broader variety of inputs, improving robustness but increasing computational demands. Researchers have also explored formats such as abstraction-of-thought and chain-of-abstraction to teach abstract reasoning, while planning techniques like chain-of-thought and tree-of-thought aid step-by-step problem-solving. Reinforcement learning and preference-based methods provide additional support for reasoning skill development beyond pattern memorization.

AbstRaL’s Symbolic Learning Method to Improve Reasoning Consistency


Researchers from Apple and EPFL propose AbstRaL, a method that teaches LLMs to understand abstract reasoning patterns rather than memorizing surface details. Instead of generating many varied training examples, which is computationally costly, AbstRaL helps LLMs learn the underlying structure of reasoning problems using reinforcement learning. This method connects these abstract patterns to symbolic tools, enabling more reliable problem-solving. Tested on GSM benchmarks, AbstRaL significantly improves LLM performance, especially when faced with input changes or distracting information. It outperforms models trained only with supervised learning by promoting more consistent and context-independent reasoning.

Four Steps to Abstract Symbolic Reasoning via AbstRaL


AbstRaL is a four-step framework designed to teach LLMs to reason abstractly rather than rely on surface patterns. First, it identifies key variables in a question and replaces them with symbolic placeholders. Then, using specially crafted data (GranulAR), the model learns to reason step-by-step with these abstract symbols. Next, it retrieves the general reasoning structure (abstraction) from the symbolic answer. Finally, it uses this abstraction with the original values to compute the correct answer. Reinforcement learning with two rewards, one for correctness and another for symbolic similarity, further improves the model’s ability to generate accurate, context-independent reasoning patterns.
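
The four steps can be pictured with a toy example. The sketch below is illustrative only: the placeholder scheme, the GranulAR-style rationale, and the use of plain exec as the "symbolic tool" are stand-ins, not the paper’s implementation.

```python
import re

def abstract_question(question: str):
    """Step 1: swap concrete numbers for symbolic placeholders x0, x1, ..."""
    values = []
    def repl(match):
        values.append(int(match.group()))
        return f"x{len(values) - 1}"
    return re.sub(r"\d+", repl, question), values

question = "Sam has 12 apples and buys 7 more. How many apples does Sam have?"
abstract_q, values = abstract_question(question)
# abstract_q: "Sam has x0 apples and buys x1 more. How many apples does Sam have?"

# Steps 2-3: the trained model reasons step-by-step over the symbols and
# emits an abstraction of the solution, e.g.:
abstraction = "answer = x0 + x1"

# Step 4: a symbolic tool re-instantiates the abstraction with the original
# values (plain exec here, purely for illustration).
env = {f"x{i}": v for i, v in enumerate(values)}
exec(abstraction, {}, env)
print(env["answer"])  # 19
```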



GSM8K Variations Reveal AbstRaL’s Robustness Across LLM Sizes


The researchers evaluate AbstRaL on math reasoning tasks using models such as Llama-3 and Qwen2, training them with a dataset called GranulAR that rewrites math problems in an abstract symbolic form. This helps models focus on structure rather than surface details. They test robustness using altered versions of GSM8K problems, changing numbers, names, and phrasing. Compared to baselines like standard chain-of-thought prompting, AbstRaL shows stronger consistency and a smaller accuracy drop on these variations. Especially for smaller models, it improves reliability across reworded inputs. The results suggest that teaching models to reason abstractly makes them more adaptable and less reliant on memorized patterns.
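
A perturbation of this kind is easy to picture. Below is a minimal sketch of the idea, not the benchmark’s actual generator: names are swapped and numbers jittered while the arithmetic structure stays fixed, and the gold answer is recomputed from the new numbers.

```python
import random
import re

def perturb(question: str, name_map: dict) -> str:
    """Swap names and jitter numbers; the reasoning structure is unchanged."""
    for old, new in name_map.items():
        question = question.replace(old, new)
    # During evaluation the gold answer is recomputed from the new numbers.
    return re.sub(r"\d+", lambda m: str(int(m.group()) + random.randint(1, 9)), question)

base = "Sam has 12 apples and buys 7 more. How many apples does Sam have?"
print(perturb(base, {"Sam": "Priya"}))
# e.g. "Priya has 15 apples and buys 9 more. How many apples does Priya have?"
```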



Teaching LLMs Abstract Thinking through Reinforcement Yields Robust Reasoning


In conclusion, AbstRaL is a method designed to enhance abstract reasoning in LLMs, making them more resilient to superficial changes in problems. Unlike traditional fine-tuning or data augmentation, AbstRaL uses reinforcement learning to train models on GranulAR rationales that mix Socratic chain-of-thought with detailed abstraction. This approach helps models strip away surface-level distractions and better connect with symbolic tools. Tested on challenging GSM8K perturbation benchmarks, AbstRaL notably reduces performance drops under distribution shifts, particularly in smaller models. The study shows that learning to abstract improves reasoning robustness more effectively than relying solely on direct supervision.




Check out the Paper. All credit for this research goes to the researchers of this project.

DeepSeek R1T2 Chimera: 200% Faster Than R1-0528 With Improved Reasoning and Compact Output


By Asif Razzaq

July 3, 2025

TNG Technology Consulting has unveiled DeepSeek-TNG R1T2 Chimera, a new Assembly-of-Experts (AoE) model that blends intelligence and speed through an innovative model merging strategy. Built from three high-performing parent models—R1-0528, R1, and V3-0324—R1T2 demonstrates how expert-layer interpolation at scale can unlock new efficiencies in large language models (LLMs).

Assembly-of-Experts: Efficient Model Composition at Scale


Traditional LLM training and fine-tuning require massive compute resources. TNG addresses this with its Assembly-of-Experts (AoE) approach, merging large-scale Mixture-of-Experts (MoE) models at the weight tensor level without retraining. This strategy enables linear-time construction of new models that inherit capabilities from multiple parents. R1T2’s architecture combines expert tensors from R1 with the base of V3-0324 and selectively includes improvements from R1-0528, optimizing the tradeoff between inference cost and reasoning quality.
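
In weight space, the core operation is simple interpolation. The sketch below shows the general flavor under invented assumptions: the tensor names, the "experts" substring rule for spotting routed-expert tensors, and the mixing weights are all illustrative, not TNG’s actual recipe.

```python
import torch

def merge_aoe(parents, weights, base_name="V3-0324"):
    """Interpolate parent checkpoints tensor-by-tensor, but only for routed
    expert tensors; every other tensor is taken from the base unchanged."""
    base = parents[base_name]
    merged = {}
    for name, tensor in base.items():
        if "experts" in name:   # hypothetical naming rule for routed experts
            merged[name] = sum(w * parents[p][name] for p, w in weights.items())
        else:                   # attention, shared MLPs, embeddings, ...
            merged[name] = tensor.clone()
    return merged

# Toy demo with stand-in 2x2 tensors:
make = lambda v: {"experts.w": torch.full((2, 2), v), "attn.q": torch.full((2, 2), v)}
parents = {"V3-0324": make(0.0), "R1": make(1.0), "R1-0528": make(2.0)}
out = merge_aoe(parents, {"V3-0324": 0.2, "R1": 0.5, "R1-0528": 0.3})
print(out["experts.w"][0, 0].item())  # 0.2*0 + 0.5*1 + 0.3*2 = 1.1
print(out["attn.q"][0, 0].item())     # 0.0, kept from the base model
```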

Speed Gains and Intelligence Tradeoffs


In benchmark comparisons, R1T2 is over 20% faster than R1 and more than twice as fast as R1-0528. These performance gains are largely attributed to its reduced output token length and selective expert tensor integration. While it falls slightly short of R1-0528 in raw intelligence, it significantly outperforms R1 across high-level benchmarks like GPQA Diamond and AIME-2024/2025.

Moreover, the model retains consistent <think> reasoning traces, which emerge only when R1’s contribution to the merge crosses a specific threshold. This behavioral consistency is vital for applications requiring step-by-step chain-of-thought reasoning.



Emergent Properties in the Parameter Space


R1T2 confirms findings from the accompanying research paper that model merging can yield viable models throughout the interpolation space. Interestingly, intelligence properties change gradually, but behavioral markers (like the consistent use of <think> tokens) emerge abruptly near a 50% R1 weight ratio. This indicates that certain traits reside in distinct subspaces of the LLM weight landscape.

By merging only the routed expert tensors and leaving other components (e.g., attention and shared MLPs) from V3-0324 intact, R1T2 maintains a high reasoning score while avoiding verbosity. This design leads to what TNG calls “think-token consistency,” a behavioral trait where reasoning is not only accurate but also concise.

Reddit Community Feedback


Early discussions from the Reddit LocalLLaMA community highlight practical impressions of R1T2. Users praise the model’s responsiveness, token efficiency, and balance between speed and coherence. One user noted, “It’s the first time a Chimera model feels like a real upgrade in both speed and quality.” Another pointed out that it performs better in math-heavy contexts compared to previous R1 variants.

A few Redditors also observed that R1T2 exhibits a more grounded persona, avoiding hallucinations more consistently than R1 or V3-based models. Such emergent traits are particularly relevant for developers seeking stable LLM backends for production environments.

Open-Weights and Availability


R1T2 is publicly available under the MIT License on Hugging Face: DeepSeek-TNG R1T2 Chimera. The release encourages community experimentation, including downstream fine-tuning and reinforcement learning. According to TNG, internal deployments via the Chutes serverless inference platform are already processing close to 5 billion tokens daily.



Conclusion


DeepSeek-TNG R1T2 Chimera showcases the potential of Assembly-of-Experts construction to generate performant, efficient LLMs without the need for gradient-based training. By strategically combining the reasoning capabilities of R1, the token-efficient design of V3-0324, and enhancements from R1-0528, R1T2 establishes a new standard for balanced model design. Its open-weight release under the MIT license ensures accessibility, making it a strong candidate for developers looking for fast, capable, and customizable large language models.

With model merging proving viable even at the 671B-parameter scale, TNG’s R1T2 may serve as a blueprint for future experiments in parameter space interpolation, enabling more modular and interpretable LLM development.




Check out the Paper and Open Weights on Hugging Face. All credit for this research goes to the researchers of this project.

Kyutai Releases 2B Parameter Streaming Text-to-Speech TTS with 220ms Latency and 2.5M Hours of Training


By Sana Hassan

July 5, 2025

Kyutai, an open AI research lab, has released a groundbreaking streaming Text-to-Speech (TTS) model with ~2 billion parameters. Designed for real-time responsiveness, this model delivers ultra-low latency audio generation (220 milliseconds) while maintaining high fidelity. It’s trained on an unprecedented 2.5 million hours of audio and is licensed under the permissive CC-BY-4.0, reinforcing Kyutai’s commitment to openness and reproducibility. This advancement redefines the efficiency and accessibility of large-scale speech generation models, particularly for edge deployment and agentic AI.

Unpacking the Performance: Sub-350ms Latency for 32 Concurrent Users on a Single L40 GPU


The model’s streaming capability is its most distinctive feature. On a single NVIDIA L40 GPU, the system can serve up to 32 concurrent users while keeping the latency under 350ms. For individual use, the model maintains a generation latency as low as 220ms, enabling nearly real-time applications such as conversational agents, voice assistants, and live narration systems. This performance is enabled through Kyutai’s novel Delayed Streams Modeling approach, which allows the model to generate speech incrementally as text arrives.

Key Technical Metrics:


  • Model size : ~2B parameters
  • Training data : 2.5 million hours of speech
  • Latency : 220ms single-user, <350ms with 32 users on one L40 GPU
  • Language support : English and French
  • License : CC-BY-4.0 (open source)

Delayed Streams Modeling: Architecting Real-Time Responsiveness


Kyutai’s innovation is anchored in Delayed Streams Modeling, a technique that allows speech synthesis to begin before the full input text is available. This approach is specifically designed to balance prediction quality with response speed, enabling high-throughput streaming TTS. Unlike conventional autoregressive models that suffer from response lag, this architecture maintains temporal coherence while achieving faster-than-real-time synthesis.
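
As a rough illustration of the pattern (not Kyutai’s actual API; the delay size and function names are invented), a streaming synthesizer can lag the incoming text by a few tokens and emit audio as soon as it has enough lookahead:

```python
from collections import deque

DELAY = 3  # hypothetical lookahead, measured in text tokens

def stream_tts(text_tokens, synthesize_step):
    """Yield audio chunks while text is still arriving."""
    buffer = deque()
    for token in text_tokens:
        buffer.append(token)
        if len(buffer) > DELAY:
            # Enough lookahead: emit audio for the oldest buffered token.
            yield synthesize_step(buffer.popleft(), context=list(buffer))
    while buffer:  # flush the tail once the text stream ends
        yield synthesize_step(buffer.popleft(), context=list(buffer))

# Toy stand-in for the acoustic model:
fake_step = lambda tok, context: f"<audio for {tok!r} (lookahead={len(context)})>"
for chunk in stream_tts("Hello there , streaming speech".split(), fake_step):
    print(chunk)
```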

The codebase and training recipe for this architecture are available at Kyutai’s GitHub repository , supporting full reproducibility and community contributions.



Model Availability and Open Research Commitment


Kyutai has released the model weights and inference scripts on Hugging Face , making it accessible for researchers, developers, and commercial teams. The permissive CC-BY-4.0 license encourages unrestricted adaptation and integration into applications, provided proper attribution is maintained.

This release supports both batch and streaming inference, making it a versatile foundation for voice cloning, real-time chatbots, accessibility tools, and more. With pretrained models in both English and French, Kyutai sets the stage for multilingual TTS pipelines.

Implications for Real-Time AI Applications


By reducing the speech generation latency to the 200ms range, Kyutai’s model narrows the human-perceptible delay between intent and speech, making it viable for:

  • Conversational AI : Human-like voice interfaces with low turnaround
  • Assistive Tech : Faster screen readers and voice feedback systems
  • Media Production : Voiceovers with rapid iteration cycles
  • Edge Devices : Optimized inference for low-power or on-device environments

The ability to serve 32 users on a single L40 GPU without quality degradation also makes it attractive for scaling speech services efficiently in cloud environments.

Conclusion: Open, Fast, and Ready for Deployment


Kyutai’s streaming TTS release is a milestone in speech AI. With high-quality synthesis, real-time latency, and generous licensing, it addresses critical needs for both researchers and real-world product teams. The model’s reproducibility, multilingual support, and scalable performance make it a standout alternative to proprietary solutions.

For more details, you can explore the official model card on Hugging Face, the technical explanation on Kyutai’s site, and implementation specifics on GitHub.