Highlighted at CVPR 2025: Google DeepMind’s ‘Motion Prompting’ Paper Unlocks Granular Video Control


By Jean-marc Mommessin

June 13, 2025

Key Takeaways:


  • Researchers from Google DeepMind, the University of Michigan, and Brown University have developed “Motion Prompting,” a new method for controlling video generation using specific motion trajectories.
  • The technique uses “motion prompts,” a flexible representation of movement that can be either sparse or dense, to guide a pre-trained video diffusion model.
  • A key innovation is “motion prompt expansion,” which translates high-level user requests, like mouse drags, into detailed motion instructions for the model.
  • This single, unified model can perform a wide array of tasks, including precise object and camera control, motion transfer from one video to another, and interactive image editing, without needing to be retrained for each specific capability.

As generative AI continues to evolve, gaining precise control over video creation is a critical hurdle for its widespread adoption in markets like advertising, filmmaking, and interactive entertainment. While text prompts have been the primary method of control, they often fall short in specifying the nuanced, dynamic movements that make video compelling. A new paper from Google DeepMind, the University of Michigan, and Brown University, presented and highlighted at CVPR 2025, introduces a solution called “Motion Prompting” that offers an unprecedented level of control by allowing users to direct the action in a video using motion trajectories.

This new approach moves beyond the limitations of text, which struggles to describe complex movements accurately. For instance, a prompt like “a bear quickly turns its head” is open to countless interpretations. How fast is “quickly”? What is the exact path of the head’s movement? Motion Prompting addresses this by allowing creators to define the motion itself, opening the door for more expressive and intentional video content.


(Note: the results are not real time; processing takes roughly 10 minutes.)

Introducing Motion Prompts


At the core of this research is the concept of a “motion prompt.” The researchers identified that spatio-temporally sparse or dense motion trajectories—essentially tracking the movement of points over time—are an ideal way to represent any kind of motion. This flexible format can capture anything from the subtle flutter of hair to complex camera movements.

To enable this, the team trained a ControlNet adapter on top of a powerful, pre-trained video diffusion model called Lumiere. The ControlNet was trained on a massive internal dataset of 2.2 million videos, each with detailed motion tracks extracted by an algorithm called BootsTAP. This diverse training allows the model to understand and generate a vast range of motions without specialized engineering for each task.
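To make the motion-prompt idea more concrete, here is a minimal sketch of how spatio-temporally sparse point tracks (the kind a tracker like BootsTAP produces) could be packed into a tensor and rasterized into per-frame conditioning maps for a ControlNet-style adapter. The shapes, displacement encoding, and helper names are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def rasterize_tracks(tracks, visibility, height, width):
    """Rasterize sparse point tracks into per-frame conditioning maps.

    tracks:     float array of shape (T, N, 2) -- (x, y) pixel positions of N
                tracked points over T frames (a "motion prompt").
    visibility: bool array of shape (T, N) -- whether each point is visible.
    Returns a (T, 2, H, W) array whose two channels hold each visible point's
    displacement relative to its first-frame position.
    """
    T, N, _ = tracks.shape
    cond = np.zeros((T, 2, height, width), dtype=np.float32)
    origin = tracks[0]                      # (N, 2) positions in frame 0
    for t in range(T):
        for n in range(N):
            if not visibility[t, n]:
                continue
            x, y = tracks[t, n]
            xi, yi = int(round(x)), int(round(y))
            if 0 <= xi < width and 0 <= yi < height:
                # Encode the displacement from the first frame at the point's
                # current location; a real system would likely use a richer
                # embedding and splat with a small Gaussian footprint.
                cond[t, :, yi, xi] = tracks[t, n] - origin[n]
    return cond

# Example: a single point drifting to the right across 16 frames.
T, N, H, W = 16, 1, 64, 64
tracks = np.stack([np.stack([np.full(N, 10.0 + 2.0 * t), np.full(N, 32.0)], axis=-1)
                   for t in range(T)])
vis = np.ones((T, N), dtype=bool)
cond = rasterize_tracks(tracks, vis, H, W)
print(cond.shape)  # (16, 2, 64, 64) -- conditioning signal for the adapter
```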



From Simple Clicks to Complex Scenes: Motion Prompt Expansion


Because specifying every point of motion in a complex scene would be impractical for a user, the researchers developed a process they call “motion prompt expansion.” This system translates simple, high-level user inputs into the detailed, semi-dense motion prompts the model needs.

This allows for a variety of intuitive applications:

“Interacting” with an Image: A user can simply click and drag their mouse across an object in a still image to make it move. For example, a user could drag a parrot’s head to make it turn, or “play” with a person’s hair, and the model generates a realistic video of that action. Interestingly, this process revealed emergent behaviors, where the model would generate physically plausible motion, like sand realistically scattering when “pushed” by the cursor.


Object and Camera Control: By interpreting mouse movements as instructions to manipulate a geometric primitive (like an invisible sphere), users can achieve fine-grained control, such as precisely rotating a cat’s head. Similarly, the system can generate sophisticated camera movements, like orbiting a scene, by estimating the scene’s depth from the first frame and projecting a desired camera path onto it (a rough sketch of this follows after these examples). The model can even combine these prompts to control an object and the camera simultaneously.


Motion Transfer: This technique allows the motion from a source video to be applied to a completely different subject in a static image. For instance, the researchers demonstrated transferring the head movements of a person onto a macaque, effectively “puppeteering” the animal.
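As a rough illustration of the depth-based camera control described above, the sketch below unprojects first-frame pixels using an estimated depth map, rotates the scene about its centroid to emulate an orbiting camera, and re-projects the points to obtain pixel trajectories that could serve as a motion prompt. The pinhole intrinsics, sampling stride, orbit angle, and flat dummy depth are all illustrative assumptions rather than the paper's actual procedure.

```python
import numpy as np

def orbit_tracks(depth, fx, fy, cx, cy, num_frames=16, max_angle_deg=20.0, stride=16):
    """Turn a first-frame depth map into point tracks for a camera orbit.

    depth: (H, W) depth estimated for the first frame (any monocular depth
           estimator could supply this -- an assumption here).
    Returns tracks of shape (num_frames, N, 2) in pixel coordinates.
    """
    H, W = depth.shape
    ys, xs = np.mgrid[0:H:stride, 0:W:stride]
    z = depth[ys, xs].ravel()
    # Unproject sampled pixels to 3D camera coordinates (pinhole model).
    X = (xs.ravel() - cx) / fx * z
    Y = (ys.ravel() - cy) / fy * z
    pts = np.stack([X, Y, z], axis=-1)                      # (N, 3)
    center = pts.mean(axis=0)                               # orbit about the scene centroid

    tracks = []
    for t in range(num_frames):
        theta = np.deg2rad(max_angle_deg) * t / max(num_frames - 1, 1)
        R = np.array([[np.cos(theta), 0, np.sin(theta)],
                      [0, 1, 0],
                      [-np.sin(theta), 0, np.cos(theta)]])  # yaw rotation
        # Rotating the scene about its centroid is equivalent to orbiting the
        # camera around it in the opposite direction.
        p = (pts - center) @ R.T + center
        u = fx * p[:, 0] / p[:, 2] + cx                     # re-project to pixels
        v = fy * p[:, 1] / p[:, 2] + cy
        tracks.append(np.stack([u, v], axis=-1))
    return np.stack(tracks)                                  # (num_frames, N, 2)

depth = np.full((64, 64), 2.0)                 # flat dummy depth, illustration only
tracks = orbit_tracks(depth, fx=60, fy=60, cx=32, cy=32)
print(tracks.shape)                            # (16, 16, 2) -- a dense-ish motion prompt
```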


Putting it to the Test


The team conducted extensive quantitative evaluations and human studies to validate their approach, comparing it against recent models like Image Conductor and DragAnything. In nearly all metrics, including image quality (PSNR, SSIM) and motion accuracy (EPE), their model outperformed the baselines.


A human study further confirmed these results. When asked to choose between videos generated by Motion Prompting and other methods, participants consistently preferred the results from the new model, citing better adherence to the motion commands, more realistic motion, and higher overall visual quality.

Limitations and Future Directions


The researchers are transparent about the system’s current limitations. Sometimes the model produces unnatural results, such as an object stretching when parts of it are mistakenly “locked” to the background. However, they suggest that these very failures can be used as a valuable tool to probe the underlying video model and identify weaknesses in its “understanding” of the physical world.

This research represents a significant step toward creating truly interactive and controllable generative video models. By focusing on the fundamental element of motion, the team has unlocked a versatile and powerful tool that could one day become a standard for professionals and creatives looking to harness the full potential of AI in video production.




Check out the Paper and Project Page. All credit for this research goes to the researchers of this project.


 

Google AI Unveils a Hybrid AI-Physics Model for Accurate Regional Climate Risk Forecasts with Better Uncertainty Assessment


By Sana Hassan

June 12, 2025

Limitations of Traditional Climate Modeling


Earth system models are essential tools for forecasting environmental changes and helping us prepare for the future. However, their high computational demands make it difficult to run them at resolutions fine enough for detailed, local predictions. Currently, most models are limited to a resolution around 100 kilometers—roughly the size of Hawai’i—making it hard to generate accurate projections for specific regions. Yet, city-scale forecasts at approximately 10 kilometers are vital for real-world applications, such as agriculture, water resource planning, and disaster preparedness. Improving the resolution of these models is key to better protecting communities and supporting more effective local decision-making.

Introducing Dynamical-Generative Downscaling with AI


Researchers at Google have introduced a method that combines traditional physics-based climate modeling with generative AI to assess regional environmental risks. Published in PNAS, their approach—called dynamical-generative downscaling—utilizes diffusion models, a type of AI that learns complex patterns, to convert broad global climate projections into detailed, local predictions at a resolution of approximately 10 km. This method not only bridges the gap between large-scale models and real-world decision-making needs but also does so far more efficiently and affordably than current high-resolution techniques, making it feasible to apply across the growing volume of climate data now available.


To better understand local environmental changes at fine resolutions (around 10 km), scientists typically use a method called dynamical downscaling. This process takes broad data from global climate models and refines it using regional climate models, like zooming in on a worldwide map to see more detail. While this technique provides highly accurate local forecasts by factoring in terrain and regional weather patterns, it comes at a steep computational cost, making it too slow and expensive to apply broadly across many climate scenarios. Simpler statistical methods are faster but often fail to model extreme events or reliably adapt to new future conditions.

Improving Accuracy and Efficiency with R2D2


To overcome these challenges, researchers have introduced a more efficient method that merges the strengths of physics-based models with generative AI. This two-step process begins with a physics-based simulation that downscales global data to a mid-level resolution, ensuring consistency across different global models. Then, a generative AI model called R2D2 fills in the finer details—like small-scale weather features shaped by terrain—by learning from high-resolution examples. By focusing on the differences between medium and high resolutions, R2D2 improves accuracy and generalizes well to unseen scenarios. This combined approach enables faster, cost-effective, and realistic local climate projections across a wide range of future scenarios.
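A minimal sketch of the residual setup described above, under simplifying assumptions (nearest-neighbour regridding, synthetic fields, made-up resolutions): the generative model's training target is the difference between the high-resolution truth and the upsampled physics-based mid-resolution field, and the final prediction adds a sampled residual back onto that field. None of the variable names or numbers come from the paper.

```python
import numpy as np

def upsample_nearest(field, factor):
    """Nearest-neighbour upsampling of a 2-D field (stand-in for a proper regridder)."""
    return np.repeat(np.repeat(field, factor, axis=0), factor, axis=1)

# Illustrative fields: a mid-resolution temperature field from the physics-based
# downscaling step, and a high-resolution "truth" used during training.
rng = np.random.default_rng(0)
mid_res  = rng.normal(15.0, 3.0, size=(25, 25))                                   # coarser grid
high_res = upsample_nearest(mid_res, 5) + rng.normal(0.0, 1.0, size=(125, 125))   # ~5x finer grid

# Training target for the generative model (R2D2 in the paper): the residual
# between the high-resolution field and the upsampled mid-resolution field.
residual_target = high_res - upsample_nearest(mid_res, 5)

# At inference time, a trained diffusion model would sample a plausible residual;
# adding it back onto the upsampled physics field gives the fine-scale projection.
# Here we simply reuse the true residual to show the wiring.
sampled_residual = residual_target
prediction = upsample_nearest(mid_res, 5) + sampled_residual
print(prediction.shape)  # (125, 125)
```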


To test the new approach, researchers trained the model using one high-resolution climate projection from the Western U.S. and then evaluated it on seven others. Compared to traditional statistical methods, their AI-powered downscaling model significantly reduced errors by over 40% in predicting variables like temperature, humidity, and wind. It also more accurately captured complex weather patterns, like heatwaves combined with droughts or wildfire risks from strong winds. This method enhances both accuracy and efficiency, providing more accurate estimates of extreme weather and uncertainty while utilizing only a fraction of the computing power required by traditional high-resolution simulations.


In conclusion, the new AI-powered downscaling approach is a major leap forward in making detailed, regional climate forecasts more accessible and affordable. By combining traditional physics-based modeling with generative AI, the method delivers accurate, city-scale (~10 km) climate risk assessments while cutting computing costs by up to 85%. Unlike older methods, which are limited by scale and expense, this technique can efficiently handle large ensembles of climate projections. It captures uncertainties more comprehensively and supports smarter planning in agriculture, disaster preparedness, water management, and infrastructure. In short, it turns complex global data into actionable local insights—faster, cheaper, and more accurately than ever before.




Check out the Paper and Technical details. All credit for this research goes to the researchers of this project.


 


Apple Researchers Reveal Structural Failures in Large Reasoning Models Using Puzzle-Based Evaluation


By Nikhil

June 12, 2025

Artificial intelligence has undergone a significant transition from basic language models to advanced models that focus on reasoning tasks. These newer systems, known as Large Reasoning Models (LRMs), represent a class of tools designed to simulate human-like thinking by producing intermediate reasoning steps before arriving at conclusions. The focus has moved from generating accurate outputs to understanding the process that leads to these answers. This shift has raised questions about how these models manage tasks with layered complexity and whether they truly possess reasoning abilities or are simply leveraging training patterns to guess outcomes.

Redefining Evaluation: Moving Beyond Final Answer Accuracy


A recurring problem with evaluating machine reasoning is that traditional benchmarks mostly assess the final answer without examining the steps involved in arriving at it. Final answer accuracy alone does not reveal the quality of internal reasoning, and many benchmarks are contaminated with data that may have been seen during training. This creates a misleading picture of a model’s true capabilities. To explore actual reasoning, researchers require environments where problem difficulty can be precisely controlled and intermediate steps can be analyzed. Without such settings, it is hard to determine whether these models can generalize solutions or merely memorize patterns.



To evaluate reasoning more reliably, the research team at Apple designed a setup using four puzzle environments: Tower of Hanoi, River Crossing, Checkers Jumping, and Blocks World. These puzzles allow precise manipulation of complexity by changing elements such as the number of disks, checkers, or agents involved. Each task requires different reasoning abilities, such as constraint satisfaction and sequential planning. Importantly, these environments are free from typical data contamination, enabling thorough checks of both outcomes and the reasoning steps in between. This method ensures a detailed investigation of how models behave across varied task demands.
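To illustrate why such puzzle environments are useful, here is a minimal Tower of Hanoi example: difficulty scales cleanly with the number of disks (the optimal solution takes 2^n − 1 moves), and every intermediate move in a model's proposed solution can be checked mechanically. The evaluation interface shown is a simplified assumption, not Apple's actual harness.

```python
def hanoi_moves(n, src=0, aux=1, dst=2):
    """Optimal Tower of Hanoi solution: 2**n - 1 moves for n disks."""
    if n == 0:
        return []
    return (hanoi_moves(n - 1, src, dst, aux)
            + [(src, dst)]
            + hanoi_moves(n - 1, aux, src, dst))

def check_solution(n, moves):
    """Replay a proposed move list and verify every intermediate step.

    Returns (solved, index_of_first_illegal_move_or_None) -- exactly the kind
    of step-level signal that final-answer benchmarks cannot provide.
    """
    pegs = [list(range(n, 0, -1)), [], []]       # disk n at the bottom of peg 0
    for i, (a, b) in enumerate(moves):
        if not pegs[a] or (pegs[b] and pegs[b][-1] < pegs[a][-1]):
            return False, i                      # illegal move at step i
        pegs[b].append(pegs[a].pop())
    return pegs[2] == list(range(n, 0, -1)), None

n = 4
moves = hanoi_moves(n)
print(len(moves), check_solution(n, moves))      # 15 (True, None), i.e. 2**4 - 1 moves
```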

The study compared two model families, Claude 3.7 Sonnet and DeepSeek-R1, pitting their “thinking” variants against their standard LLM counterparts. The models were tested across the puzzles under identical token budgets to measure both accuracy and reasoning efficiency, revealing performance shifts across low-, medium-, and high-complexity tasks. A striking observation was the emergence of three performance regimes: on simple tasks, non-thinking models outperformed reasoning variants; at medium complexity, reasoning models gained an edge; and at high complexity, both types collapsed completely.



Comparative Insights: Thinking vs. Non-Thinking Models Under Stress


An in-depth analysis revealed that reasoning effort increased with task difficulty up to a certain point but then declined despite the availability of resources. For instance, in the Tower of Hanoi, Claude 3.7 Sonnet (thinking) maintained high accuracy until complexity reached a certain threshold, after which performance dropped to zero. Even when these models were supplied with explicit solution algorithms, they failed to execute steps beyond specific complexity levels. In one case, Claude 3.7 could manage around 100 steps correctly for the Tower of Hanoi but was unable to complete simpler River Crossing tasks requiring only 11 moves when N = 3. This inconsistency exposed serious limitations in symbolic manipulation and exact computation.

The performance breakdown also highlighted how LRMs handle their internal thought process. Models frequently engaged in “overthinking,” generating correct intermediate solutions early in the process but continuing to explore incorrect paths. This led to inefficient use of tokens. At medium complexity levels, models began to find correct answers later in their reasoning chains. However, at high levels of complexity, they failed to produce accurate solutions. Quantitative analysis confirmed that solution accuracy dropped to near zero as the problem complexity increased, and the number of reasoning tokens allocated began to decline unexpectedly.


Scaling Limits and the Collapse of Reasoning


This research presents a sobering assessment of how current Large Reasoning Models (LRMs) operate. Research from Apple makes it clear that, despite some progress, today’s reasoning models are still far from achieving generalized reasoning. The work identifies how performance scales, where it collapses, and why over-reliance on benchmark accuracy fails to capture deeper reasoning behavior. Controlled puzzle environments have proven to be a powerful tool for uncovering hidden weaknesses in these systems and emphasizing the need for more robust designs in the future.




Check out the Paper. All credit for this research goes to the researchers of this project.


 


This AI Paper Introduces VLM-R³: A Multimodal Framework for Region Recognition, Reasoning, and Refinement in Visual-Linguistic Tasks


By Nikhil

June 12, 2025

Multimodal reasoning ability helps machines perform tasks such as solving math problems embedded in diagrams, reading signs from photographs, or interpreting scientific charts. The integration of both visual and linguistic information enables these systems to more closely mirror human thought processes, making them suitable for tasks that require visual interpretation combined with logical progression.

A major challenge in this area is the inability of current systems to revisit specific parts of an image while reasoning dynamically. Traditional models usually begin by analyzing an image once and then proceed with the rest of the reasoning in pure text. This approach limits accuracy in situations that require revisiting the image to confirm a detail or extract new visual cues during mid-reasoning. These shortcomings are particularly pronounced in tasks that require fine-grained spatial awareness, such as identifying small labels in scientific documents or resolving ambiguities in visually complex scenes.



Some tools and models have been introduced to address this gap, but they often treat visual grounding as a one-time operation. For example, existing systems like LLaVA-CoT or Qwen2.5-VL offer some visual-text integration. Still, they don’t let the model repeatedly and selectively query parts of an image based on the evolving reasoning process. The grounding, if performed, is generally static and lacks the flexibility to adapt based on intermediate reasoning steps. Moreover, these methods do not train models to determine the importance of specific image regions, leading to limitations in complex problem-solving.

Researchers from Peking University, Alibaba Group, and ZEEKR Intelligent Technology have introduced a model called VLM-R³. This model tackles the challenge by allowing a more interactive connection between vision and reasoning. It equips the model with the capacity to determine when visual clarification is needed, identify the exact image region for analysis, and re-integrate this visual content into the reasoning process. This approach mimics human problem-solving, where one might zoom into a chart or revisit a paragraph to verify a detail before making a decision. The model’s structure emphasizes refining its decisions iteratively by relying on visual evidence throughout the reasoning process.



To accomplish this, the researchers built a dataset named Visuo-Lingual Interleaved Rationale (VLIR), designed to train models in a stepwise interaction between images and text. VLM-R³ incorporates this dataset and operates using a method called Region-Conditioned Reinforcement Policy Optimization (R-GRPO). This training strategy encourages the model to selectively focus on informative parts of an image, perform transformations such as cropping or zooming, and incorporate those changes into subsequent logical steps. It simulates how humans shift their attention across different visual elements in response to their thoughts. The architecture integrates a pipeline that loops reasoning with visual inspection in real time, enhancing the system’s ability to interact with visual data during inference.
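The loop below sketches the interleaved "reason, request a region, look again" behavior described above. The `vlm_step` stub and the `<region>` tag format are hypothetical stand-ins for the real VLM-R³ model and its interface; only the control flow (crop the requested region and feed it back into the next reasoning step) is meant to be illustrative.

```python
import re
from PIL import Image

_calls = {"n": 0}

def vlm_step(image_crops, transcript):
    """Hypothetical VLM call (stub). A real system would run the VLM-R3 model here;
    this stub first requests a region, then answers, purely for illustration."""
    _calls["n"] += 1
    if _calls["n"] == 1:
        return "The label is too small to read. <region>10,10,60,60</region>"
    return "Final answer: the label reads 'A'."

def interleaved_reasoning(image, question, max_rounds=4):
    crops = [image]                          # the full image stays in context
    transcript = question
    for _ in range(max_rounds):
        out = vlm_step(crops, transcript)
        transcript += "\n" + out
        match = re.search(r"<region>(\d+),(\d+),(\d+),(\d+)</region>", out)
        if match is None:
            return out                        # the model committed to an answer
        # Crop the requested region and hand it back, so the next reasoning
        # step is conditioned on the zoomed-in visual evidence.
        left, top, right, bottom = map(int, match.groups())
        crops.append(image.crop((left, top, right, bottom)))
    return transcript

image = Image.new("RGB", (128, 128), color="white")
print(interleaved_reasoning(image, "What does the small label say?"))
```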

The results demonstrate a strong performance across multiple benchmarks. On MathVista, the model reached 70.4%, an increase from 68.2% in the baseline. For MathVision, the improvement was from 25.1% to 30.2%. On ScienceQA, it posted a 14.3% improvement, reaching 87.9% over the baseline’s 73.6%. On the hallucination test (HallusionBench), the model achieved 62.0%, outperforming others like Mulberry, which scored 54.1%. VLM-R³ also showed superior results on document understanding in DocVQA with a 96.8% score. Comparisons showed that even though it uses fewer parameters than closed-source models like Gemini-2 Flash or GPT-4o, it delivers competitive accuracy, particularly in tasks requiring detailed visual analysis and interleaved reasoning.


This work clearly outlines a problem that exists in how models handle vision during reasoning and presents a well-structured solution. By integrating a method for ongoing image analysis, researchers from the Alibaba Group, Peking University, and ZEEKR have advanced a powerful idea—models that look again, think, and refine. The proposed framework significantly improves accuracy in complex tasks and provides a blueprint for more robust, visually aware AI systems.




Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.


 


Meta AI Releases V-JEPA 2: Open-Source Self-Supervised World Models for Understanding, Prediction, and Planning


By Asif Razzaq

June 12, 2025

Meta AI has introduced V-JEPA 2, a scalable open-source world model designed to learn from video at internet scale and enable robust visual understanding, future state prediction, and zero-shot planning. Building upon the joint-embedding predictive architecture (JEPA), V-JEPA 2 demonstrates how self-supervised learning from passive internet video, combined with minimal robot interaction data, can yield a modular foundation for intelligent physical agents.


Scalable Self-Supervised Pretraining from 1M Hours of Video


V-JEPA 2 is pretrained on over 1 million hours of internet-scale video combined with 1 million images. Using a visual mask denoising objective, the model learns to reconstruct masked spatiotemporal patches in a latent representation space. This approach avoids the inefficiencies of pixel-level prediction by focusing on predictable scene dynamics while disregarding irrelevant noise.
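A heavily simplified PyTorch sketch of a JEPA-style masked prediction objective in latent space: a predictor regresses the representations of masked spatiotemporal patches from the visible context, with targets produced by a separate target encoder, so no pixels are ever reconstructed. Module sizes, the masking scheme, and the L2 loss are illustrative assumptions; the actual V-JEPA 2 architecture and objective differ in many details.

```python
import torch
import torch.nn as nn

D = 256                                    # latent dimension (illustrative)
encoder        = nn.Sequential(nn.Linear(768, D), nn.GELU(), nn.Linear(D, D))
target_encoder = nn.Sequential(nn.Linear(768, D), nn.GELU(), nn.Linear(D, D))
predictor      = nn.Sequential(nn.Linear(D, D), nn.GELU(), nn.Linear(D, D))
target_encoder.load_state_dict(encoder.state_dict())   # stand-in for an EMA/frozen target copy

patches = torch.randn(2, 128, 768)         # (batch, spatio-temporal patches, patch dim)
mask = torch.rand(2, 128) < 0.75           # mask ~75% of patches (assumption)

with torch.no_grad():
    targets = target_encoder(patches)      # latent targets for *all* patches

# Crude simplification: zero out masked patches; a real model drops masked
# tokens and uses positional queries for the predictor.
context = encoder(patches * (~mask).unsqueeze(-1))
pred = predictor(context)                  # predict latents at masked positions

# Loss is computed in representation space, only over masked positions --
# the model never reconstructs pixels.
loss = ((pred - targets) ** 2)[mask].mean()
loss.backward()
print(float(loss))
```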


To scale JEPA pretraining to this level, Meta researchers introduced four key techniques:

  • Data scaling: Constructed a 22M-sample dataset (VideoMix22M) from public sources like SSv2, Kinetics, HowTo100M, YT-Temporal-1B, and ImageNet.
  • Model scaling: Expanded the encoder capacity to over 1B parameters using ViT-g.
  • Training schedule: Adopted a progressive resolution strategy and extended pretraining to 252K iterations.
  • Spatial-temporal augmentation: Trained on progressively longer and higher-resolution clips, reaching 64 frames at 384×384 resolution.

These design choices led to an 88.2% average accuracy across six benchmark tasks—including SSv2, Diving-48, Jester, Kinetics, COIN, and ImageNet—surpassing previous baselines.


Understanding via Masked Representation Learning


V-JEPA 2 exhibits strong motion understanding capabilities. On the Something-Something v2 benchmark, it achieves 77.3% top-1 accuracy, outperforming models like InternVideo and VideoMAEv2. For appearance understanding, it remains competitive with state-of-the-art image-text pretraining models like DINOv2 and PEcoreG.

The encoder’s representations were evaluated using attentive probes, verifying that self-supervised learning alone can yield transferable and domain-agnostic visual features applicable across diverse classification tasks.

Temporal Reasoning via Video Question Answering


To assess temporal reasoning, the V-JEPA 2 encoder is aligned with a multimodal large language model and evaluated on multiple video question-answering tasks. Despite lacking language supervision during pretraining, the model achieves:

  • 84.0% on PerceptionTest
  • 76.9% on TempCompass
  • 44.5% on MVP
  • 36.7% on TemporalBench
  • 40.3% on TOMATO

These results challenge the assumption that visual-language alignment requires co-training from the start, demonstrating that a pretrained video encoder can be aligned post hoc with strong generalization.

V-JEPA 2-AC: Learning Latent World Models for Robotic Planning


A key innovation in this release is V-JEPA 2-AC, an action-conditioned variant of the pretrained encoder. Fine-tuned using only 62 hours of unlabeled robot video from the Droid dataset, V-JEPA 2-AC learns to predict future video embeddings conditioned on robot actions and poses. The architecture is a 300M parameter transformer with block-causal attention, trained using a teacher-forcing and rollout objective.

This allows zero-shot planning through model-predictive control. The model infers action sequences by minimizing the distance between imagined future states and visual goals using the Cross-Entropy Method (CEM). It achieves high success in tasks such as reaching, grasping, and pick-and-place on unseen robot arms in different labs—without any reward supervision or additional data collection.
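The following is a minimal sketch of planning with the Cross-Entropy Method over a learned latent world model, in the spirit of the procedure described above: sample action sequences, roll them out in imagination, score each by the distance between the predicted final state and the goal embedding, and refit the sampling distribution on the best candidates. The rollout interface, dimensions, and toy world model are assumptions for illustration only.

```python
import numpy as np

def rollout(world_model, state_emb, actions):
    """Predict the final latent state after applying an action sequence.
    `world_model(state, action) -> next_state` is a stand-in for V-JEPA 2-AC."""
    s = state_emb
    for a in actions:
        s = world_model(s, a)
    return s

def cem_plan(world_model, state_emb, goal_emb, horizon=5, action_dim=7,
             pop=128, elites=16, iters=6, seed=0):
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iters):
        # Sample candidate action sequences and score them by how close the
        # imagined final state lands to the goal embedding.
        cands = rng.normal(mean, std, size=(pop, horizon, action_dim))
        costs = np.array([np.linalg.norm(rollout(world_model, state_emb, c) - goal_emb)
                          for c in cands])
        best = cands[np.argsort(costs)[:elites]]       # keep the elites
        mean, std = best.mean(axis=0), best.std(axis=0) + 1e-4
    return mean                                        # planned action sequence

# Toy stand-in world model: latent state drifts by a linear function of the action.
A = np.random.default_rng(1).normal(size=(7, 16)) * 0.1
toy_model = lambda s, a: s + a @ A
start, goal = np.zeros(16), np.ones(16)
plan = cem_plan(toy_model, start, goal)
print(plan.shape)                                      # (5, 7)
```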


Benchmarks: Robust Performance and Planning Efficiency


Compared to baselines like Octo (behavior cloning) and Cosmos (latent diffusion world models), V-JEPA 2-AC:

  • Executes plans in ~16 seconds per step (versus 4 minutes for Cosmos).
  • Reaches a 100% success rate on reach tasks.
  • Outperforms others in grasp and manipulation tasks across object types.


Notably, it operates using a monocular RGB camera without calibration or environment-specific fine-tuning, reinforcing the generalization capability of the learned world model.

Conclusion


Meta’s V-JEPA 2 represents a significant advancement in scalable self-supervised learning for physical intelligence. By decoupling observation learning from action conditioning and leveraging large-scale passive video, V-JEPA 2 demonstrates that general-purpose visual representations can be harnessed for both perception and control in the real world.




Check out the Paper, Models on Hugging Face, and GitHub Page. All credit for this research goes to the researchers of this project.


 


How Do LLMs Really Reason? A Framework to Separate Logic from Knowledge


By Sana Hassan

June 11, 2025

Unpacking Reasoning in Modern LLMs: Why Final Answers Aren’t Enough


Recent advancements in reasoning-focused LLMs like OpenAI’s o1/o3 and DeepSeek-R1 have led to notable improvements on complex tasks. However, the step-by-step reasoning behind these models remains unclear. Most evaluations focus on final-answer accuracy, which hides the reasoning process and doesn’t reveal how models combine knowledge and logic. Some earlier methods attempt to measure reasoning by comparing answers to the original question, but this approach is flawed since models often rely on prior deductions or internal knowledge. Domains such as math and medicine differ in their reasoning needs, highlighting the importance of developing better, domain-aware evaluation methods for building trustworthy AI.

The Shortcomings of Final-Answer Evaluations in Math and Medicine


Recent LLMs have made impressive strides in reasoning tasks, especially in math and medicine, thanks to better training data and reward strategies. However, most of this progress focuses on boosting final answer accuracy rather than understanding how the model reasons step-by-step. Past work has flagged factual errors in reasoning chains or measured similarity between reasoning steps and the original question. But such similarity doesn’t guarantee logical soundness or factual correctness, since LLMs often draw on internal knowledge or earlier reasoning.


A New Framework for Separating Knowledge and Logic in LLM Reasoning


Researchers from UC Santa Cruz, Stanford, and Tongji University go beyond final-answer evaluation by breaking down LLM reasoning into two key parts: factual knowledge and logical steps. They introduce a detailed framework that utilizes two metrics: the Knowledge Index (KI) for factual accuracy and Information Gain (InfoGain) for reasoning quality. Their analysis of Qwen models across math and medical tasks reveals that reasoning skills don’t easily transfer between domains. While supervised fine-tuning improves accuracy, it often harms reasoning depth. Reinforcement learning, however, helps refine reasoning by removing irrelevant information. This work highlights the importance of evaluating and training LLMs more thoughtfully.

Assessing Reasoning with Qwen2.5-7B and DeepSeek-R1 Models


The researchers evaluate reasoning in LLMs by analyzing Qwen2.5-7B and its DeepSeek-R1-distilled version, trained with SFT and RL. Using tasks from both math and medical domains, they decompose responses into logical steps and assess them using two key metrics: Information Gain (how much uncertainty is reduced with each reasoning step) and Knowledge Index (how factually accurate each step is, verified against expert sources). While InfoGain tracks the informativeness of each step, KI checks whether the knowledge aligns with real-world facts. This approach reveals how models reason and where they may falter in accuracy or logic.
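One plausible way to read the two metrics, sketched below with made-up numbers: Information Gain as the reduction in the model's uncertainty over candidate answers after each reasoning step, and the Knowledge Index as the fraction of steps whose factual claims check out against expert sources. The paper's exact estimators may well differ; this is only meant to make the distinction concrete.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    return -(p * np.log(p + 1e-12)).sum()

# Model's probability over candidate answers after each reasoning step
# (illustrative numbers; in practice these would come from the LLM itself).
answer_dist_per_step = [
    [0.25, 0.25, 0.25, 0.25],   # before reasoning: maximally uncertain
    [0.40, 0.30, 0.20, 0.10],   # step 1 narrows things down
    [0.70, 0.15, 0.10, 0.05],   # step 2 narrows further
    [0.90, 0.05, 0.03, 0.02],   # step 3 nearly commits
]
# Whether each step's factual claim checked out against an expert source
# (again illustrative; the paper verifies steps against ground-truth knowledge).
step_is_factual = [True, True, False]

info_gain = [entropy(answer_dist_per_step[t]) - entropy(answer_dist_per_step[t + 1])
             for t in range(len(answer_dist_per_step) - 1)]
knowledge_index = np.mean(step_is_factual)

print([round(g, 3) for g in info_gain])   # per-step uncertainty reduction
print(knowledge_index)                    # fraction of factually correct steps
```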


Supervised Fine-Tuning vs. Reinforcement Learning in Domain-Specific Tasks


The study evaluates two variants of Qwen2.5-7B (Qwen-Base and the distilled Qwen-R1) on medical tasks. Results show that Qwen-Base consistently outperforms Qwen-R1 in accuracy, knowledge retention, and reasoning, especially after SFT and RL. The distilled model likely struggles due to prior training focused on math and code, resulting in a domain mismatch. Interestingly, SFT enhances medical knowledge more effectively than RL, although it may slightly compromise reasoning efficiency. RL, on the other hand, improves both reasoning and knowledge when applied post-SFT. Medical benchmarks tend to rely more on factual knowledge than abstract reasoning, unlike math-focused tasks.

Conclusion: Toward More Interpretable and Trustworthy LLMs


In conclusion, the study introduces a framework that separates knowledge from reasoning to better evaluate how LLMs think, particularly in high-stakes areas like medicine and math. Using Qwen models trained with SFT and RL, the researchers found that while SFT improves factual accuracy, essential in medicine, it often weakens reasoning. RL, however, enhances reasoning by trimming out incorrect information. The framework could be extended to fields such as law or finance, where structured thinking is crucial. Overall, this approach helps clarify how LLMs make decisions and suggests ways to tailor their training for specific domains.




Check out the Paper, Code, and Project Page. All credit for this research goes to the researchers of this project.


 


ether0: A 24B LLM Trained with Reinforcement Learning (RL) for Advanced Chemical Reasoning Tasks


By Sajjad Ansari

June 10, 2025

LLMs have primarily improved accuracy by scaling pre-training data and compute. Because that data is finite, attention has shifted toward alternative forms of scaling, such as test-time training and inference-time compute. Reasoning models improve performance by emitting a thought process before the answer, initially through CoT prompting and, more recently, through reinforcement learning (RL) post-training. Scientific domains present ideal opportunities for reasoning models because they involve “inverse problems,” where assessing a solution’s quality is straightforward but generating the solution is hard. Despite this conceptual alignment between structured scientific reasoning and model capabilities, current methods lack detailed approaches for scientific reasoning beyond multiple-choice benchmarks.

Technical Evolution of Reasoning Architectures


Reasoning models have evolved from early prompt-based methods such as CoT, zero-shot CoT, and Tree of Thought to more complex RL approaches like Group Relative Policy Optimization (GRPO) and inference-time scaling. In chemistry, however, existing reasoning models focus on knowledge-based benchmarks rather than complex reasoning tasks such as retrosynthesis or molecular design. Datasets such as GPQA-D and MMLU assess chemical knowledge but fail to evaluate complex chemical reasoning capabilities. Current scientific reasoning efforts also remain fragmented: limited attempts include OmniScience for general science, Med-R1 for medical vision-language tasks, and BioReason for genomic reasoning, yet no comprehensive framework exists for training large-scale chemical reasoning models.


ether0 Architecture and Design Principles


Researchers from FutureHouse have proposed ether0, a novel model that reasons in natural language and outputs molecular structures as SMILES strings, demonstrating the efficacy of reasoning models on chemical tasks. It outperforms frontier LLMs, human experts, and general chemistry models. The training approach applies several optimizations over vanilla RL, including distillation of reasoning behavior, a dynamic curriculum, and expert model initialization, to improve efficiency and effectiveness. The researchers also analyze data efficiency, failure modes, and reasoning behavior, giving a clearer picture of how reasoning helps in solving chemistry problems.


Training Pipeline: Distillation and GRPO Integration


The model employs a multi-stage training procedure that alternates between distillation and GRPO phases. The architecture introduces four special tokens that demarcate reasoning and answer boundaries. Training begins with SFT on long CoT sequences generated by DeepSeek-R1, filtered for valid SMILES format and reasoning quality. Specialist RL then optimizes task-specific policies for different problem categories using GRPO. Next, distillation merges the specialist models into a generalist through SFT on correct responses collected throughout training. The final phase applies generalist GRPO to the merged model, with continuous quality filtering to remove low-quality reasoning and undesirable molecular substructures.
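Chemistry suits RL post-training because answers can be verified programmatically. The snippet below sketches one such verifiable reward using RDKit: parse the answer as SMILES, reject invalid molecules, and optionally require a substructure. The `<answer>` tag format and the reward values are assumptions for illustration and do not reflect ether0's actual special tokens or reward functions.

```python
import re
from rdkit import Chem

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)  # tag format is an assumption

def smiles_reward(completion, require_substructure=None):
    """Reward = 0.0 for unparseable output, 0.5 for a valid molecule,
    1.0 if it also contains a required substructure (given as SMARTS)."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return 0.0
    mol = Chem.MolFromSmiles(match.group(1).strip())
    if mol is None:
        return 0.0                      # not a valid molecule
    if require_substructure is None:
        return 1.0
    patt = Chem.MolFromSmarts(require_substructure)
    return 1.0 if mol.HasSubstructMatch(patt) else 0.5

print(smiles_reward("<answer>c1ccccc1O</answer>", require_substructure="[OX2H]"))  # 1.0 (phenol has an -OH)
print(smiles_reward("<answer>not a molecule</answer>"))                            # 0.0
```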


Performance Evaluation and Comparative Benchmarks


Ether0 demonstrates superior performance against both general-purpose LLMs, such as Claude and o1, and chemistry-specific models, including ChemDFM and TxGemma. It achieves the highest accuracy across all open-answer categories while remaining competitive on multiple-choice questions. It is also far more data-efficient than traditional molecular transformer models: trained on only 60,000 reactions rather than the full USPTO dataset, ether0 reaches 70% accuracy after seeing 46,000 training examples, whereas molecular transformers achieved 64.1% on the complete dataset. Under one-shot prompting conditions, ether0 surpasses all evaluated frontier models, and its safety alignment procedures filter 80% of unsafe questions without degrading performance on core chemistry tasks.


Conclusion: Implications for Future Scientific LLMs


In conclusion, researchers introduced ether0, a 24B-parameter model trained on ten challenging molecular tasks that significantly outperforms frontier LLMs, domain experts, and specialized models, thanks to its interleaved RL and behavior-distillation pipeline. The model exhibits exceptional data efficiency and reasoning capability, excelling at open-answer chemistry tasks involving molecular design, completion, modification, and synthesis. Limitations include potential generalization challenges beyond organic chemistry, some loss of general instruction-following, and the absence of tool-calling integration. The release of model weights, benchmark data, and reward functions establishes a foundation for advancing scientific reasoning models across diverse domains.




Check out the Paper and Technical details. All credit for this research goes to the researchers of this project.


 