


1/6
@_akhaliq
Nvidia just dropped Describe Anything on Hugging Face

Detailed Localized Image and Video Captioning



https://video.twimg.com/amplify_video/1914917355584397313/vid/avc1/1920x1080/B4LwT6OpO7tW4qZK.mp4

2/6
@_akhaliq
discuss with author: Paper page - Describe Anything: Detailed Localized Image and Video Captioning



3/6
@_akhaliq
app: Describe Anything - a Hugging Face Space by nvidia



4/6
@OfirOzeri
Also localization?

Meaning extracting objects xyz attributes?



5/6
@aiproworkflow
NVIDIA’s “Describe Anything” just dropped on Hugging Face — and it’s a vision-language beast.

🧠 Localized captions
🎯 Region-specific precision
📹 Works on images AND video

Built with a novel focal prompt + localized backbone. If you're building multimodal apps, this changes the game.



6/6
@elomaquiabelo
@grok how can I use this on my PC? Do I need code? Can I just download it... How does it work?











1/6
@mervenoyann
New foundation model on image and video captioning just dropped by @NVIDIAAI 🔥

Describe Anything Model (DAM) is a 3B vision language model to generate detailed captions with localized references 😮
The team released the models, the dataset, a new benchmark and a demo 🤩



https://video.twimg.com/amplify_video/1914979917281857536/vid/avc1/1920x700/VtUpH7x5hNiheUV_.mp4

2/6
@mervenoyann
Keep reading for technical details 🤝
> Collection of models, datasets and the demo Describe Anything - a nvidia Collection
> Join discussion and read paper here Paper page - Describe Anything: Detailed Localized Image and Video Captioning



3/6
@mervenoyann
Most vision LMs focus on the image as a whole, lack localized references in their captions, and don't take visual prompts (points, boxes, drawings around objects)

DAM addresses this on two levels: a new vision backbone and a new dataset 👀 (see the architecture below; the authors feed both focal crops and the whole image)
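
The focal-crop idea is easy to picture in code. Here is a hedged sketch (the padding factor and preprocessing are illustrative, not DAM's actual pipeline):

```python
from PIL import Image

def focal_crop(image: Image.Image, box, pad: float = 0.5) -> Image.Image:
    """Crop around a region box (x0, y0, x1, y1), padded so local context
    survives. A DAM-style model would see this crop alongside the full image."""
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    return image.crop((
        int(max(0, x0 - pad * w)),
        int(max(0, y0 - pad * h)),
        int(min(image.width, x1 + pad * w)),
        int(min(image.height, y1 + pad * h)),
    ))
```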



[Image: DAM architecture, feeding both focal crops and the full image]


4/6
@mervenoyann
They also generate a dataset by extending existing segmentation and referring-expression generation datasets like RefCOCO, passing the images and classes to VLMs to generate captions ⤵️



[Image: dataset generation pipeline]


5/6
@mervenoyann
Lastly, they release a new benchmark, again built with self-supervision: an LLM evaluates the detailed captions with a focus on localization 👀





6/6
@TheGeneralistHQ
Can it create Ghibli ? 😂





oxen.ai



Training a Rust 1.5B Coder LM with Reinforcement Learning (GRPO) | Oxen.ai​





Group Relative Policy Optimization (GRPO) has proven to be a useful algorithm for training LLMs to reason and improve on benchmarks. DeepSeek-R1 showed that you can bootstrap a model through a combination of supervised fine-tuning and GRPO to compete with the state of the art models such as OpenAI's o1.

To learn more about how it works in practice, we wanted to try out some of the techniques on a real-world task. This post will outline how to train your own custom small LLM using GRPO, your own data, and custom reward functions. Below is a sneak preview of some of the training curves we will see later. It is quite entertaining to watch the model learn to generate code blocks, get better at generating valid code that compiles, and finally produce code that passes unit tests.
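
To make that concrete, here is a minimal sketch of what a GRPO training loop can look like with Hugging Face's TRL library. The model id, dataset file, and toy reward are assumptions for illustration, not the exact code from the repo:

```python
# Minimal GRPO sketch using TRL; the dataset needs a "prompt" column.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def contains_rust_tests(completions, **kwargs):
    # Toy reward: +1 when a completion includes a Rust unit test.
    return [1.0 if "#[test]" in c else 0.0 for c in completions]

dataset = load_dataset("json", data_files="prompts.jsonl", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-Coder-1.5B-Instruct",  # assumed 1.5B base model
    reward_funcs=[contains_rust_tests],
    args=GRPOConfig(output_dir="grpo-rust", num_generations=8),
    train_dataset=dataset,
)
trainer.train()
```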

[Screenshot: training reward curves]


If you want to jump straight into the action, the GitHub repository can be found here.

GitHub - Oxen-AI/GRPO-With-Cargo-Feedback: This repository has code for fine-tuning LLMs with GRPO specifically for Rust Programming using cargo as feedback




This post will not go into the fundamentals of GRPO; if you want to learn more about how it works at a fundamental level, feel free to check out our deep dive into the algorithm below.

Why GRPO is Important and How it Works | Oxen.ai

Last week on Arxiv Dives we dug into the research behind DeepSeek-R1, and uncovered that one of the techniques they use in their training pipeline is called Group Relative Policy Optimization (GRPO). At its core, GRPO is a Reinforcement Learning (RL) algorithm aimed at improving the model’s reasoning ability. It was first introduced in their paper DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models, but was also used in the post-training of DeepSeek-R1.

Oxen.ai





Why Rust?​


Rust seems like a great playground for Reinforcement Learning (RL) because you have access to the Rust compiler and the cargo tooling. The Rust compiler gives great error messages and is pretty strict.

In this project, the first experiment we wanted to prove out was that you can use cargo as a feedback mechanism to teach a model to become a better programmer. The second experiment was to see how small of a language model you can get away with. These experiments are purposely limited to a single H100 node to limit costs and show how accessible the training can be.

We are also a Rust dev shop at Oxen.ai, so we have some interesting applications 🦀 x 🐂.



Why 1.5B?​


Recently, there has been a lot of work seeing how far we can push the boundaries of small language models for specific tasks. When you have a concrete feedback mechanism, such as the correct answer to a math problem or the output of a program, it seems you can shrink the model while maintaining very competitive performance.

The rStar-Math paper from Microsoft shows this in the domain of verifiable math problems, allowing the model to reason. The 1.5B model outperforms GPT-4o and o1-preview.



rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

We present rStar-Math to demonstrate that small language models (SLMs) can rival or even surpass the math reasoning capability of OpenAI o1, without distillation from superior models. rStar-Math achieves this by exercising “deep thinking” through Monte Carlo Tree Search (MCTS), where a math policy SLM performs test-time search guided by an SLM-based process reward model. rStar-Math introduces three innovations to tackle the challenges in training the two SLMs: (1) a novel code-augmented CoT data synthesis method, which performs extensive MCTS rollouts to generate step-by-step verified reasoning trajectories used to train the policy SLM; (2) a novel process reward model training method that avoids naïve step-level score annotation, yielding a more effective process preference model (PPM); (3) a self-evolution recipe in which the policy SLM and PPM are built from scratch and iteratively evolved to improve reasoning capabilities. Through 4 rounds of self-evolution with millions of synthesized solutions for 747k math problems, rStar-Math boosts SLMs’ math reasoning to state-of-the-art levels. On the MATH benchmark, it improves Qwen2.5-Math-7B from 58.8% to 90.0% and Phi3-mini-3.8B from 41.4% to 86.4%, surpassing o1-preview by +4.5% and +0.9%. On the USA Math Olympiad (AIME), rStar-Math solves an average of 53.3% (8/15) of problems, ranking among the top 20% of the brightest high school math students. Code and data will be available at https://github.com/microsoft/rStar.

arXiv.org · Xinyu Guan



My hypothesis is that we can push a similar level of performance on coding, since you have a similarly verifiable reward: does the code compile, and does it pass unit tests?



Benefits of Smol LMs​


Small coding models have many benefits, including cost, throughput, data privacy, and the ability to customize to your own codebase / coding practices. Plus it's just a fun challenge.

The dream would be to eventually have this small model do all the Cursor-like tasks of next-tab prediction, fill-in-the-middle, and improving its code in an agent loop. But let’s start simple.



Formulating the Problem​


There are a few different ways you could structure the problem of writing code that passes unit tests. We ended up trying a few. A seemingly straightforward option would be to have a set of verifiable unit tests that must pass given the generated code. This would give us a gold standard set of verifiable answers.

[Diagram: prompt → code → unit tests]


After trying out this flow we found two main problems. First, if you don’t let the model see the unit tests while writing the code, it will have no sense of the interface it is writing for. When evaluating against the pre-built, verified unit tests, many of the errors ended up being type or naming mismatches between the code and the tests.

[Screenshot: example error messages]


Second, if you allow the model to see the unit tests while it's writing the code, you lose out on developer experience. Unless you are a hardcore “Test-Driven Developer,” you probably just want to send in a prompt and not think about the function definition or unit tests yet.

Rather than trying to come up with something more clever, we ended up optimizing for simplicity. We reformulated the problem to have the model generate the code and the tests within the same response.

[Diagram: simplified single-pass formulation]


With a single pass there is a danger of the model hacking the reward function by making the functions and unit tests trivial. For example, it could emit only println! calls and no assert statements to get everything to compile and pass. We will return to putting guardrails in place for this later.
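
As a concrete illustration of such a guardrail, here is a hedged sketch of a reward function that pays out only for non-trivial tests; the real repo's reward functions may be structured differently:

```python
import re

def non_trivial_test_reward(rust_code: str) -> float:
    """Toy guardrail reward: pay out only when the generated Rust contains
    unit tests that actually assert something (illustrative sketch)."""
    if "#[test]" not in rust_code:
        return 0.0  # no unit tests at all
    if not re.search(r"assert(_eq|_ne)?!", rust_code):
        return 0.0  # tests exist but never assert (println!-only hacking)
    return 1.0
```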

Finally, we add a verbose system prompt to give the model guidance on the task.

[Screenshot: system prompt]


The system prompt gives the model some context on the format and style in which we expect it to answer user queries.



The Dataset​


Before training, we need a dataset. When starting out, we did not see many datasets targeted at Rust; many of the LLM benchmarks are targeted at Python. So the first thing we did was convert a dataset of prompts asking Pythonic questions into a dataset of Rust prompts.

We took a random 20k prompts from the Ace-Code-87k dataset. We then used Qwen 2.5 Coder 32B Instruct to write Rust code and unit tests. We ran the code and unit tests through the compiler and testing framework to filter out any triples that did not pass the unit tests. This left us with 16,500 (prompt, code, unit_test) triples that we could train and evaluate on. The dataset was split into 15,000 train, 1,000 test, and 500 evaluation data points.
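
The compile-and-test filter can be approximated with a small harness that drops each generated sample into a throwaway crate and runs cargo. This sketch assumes each sample is a self-contained lib.rs; the repo's actual harness may differ:

```python
import pathlib
import subprocess
import tempfile

def passes_cargo_test(rust_code: str, timeout: int = 60) -> bool:
    """Write the generated code + tests into a scratch crate, run cargo test."""
    with tempfile.TemporaryDirectory() as tmp:
        crate = pathlib.Path(tmp)
        subprocess.run(["cargo", "init", "--lib", "--name", "sample"],
                       cwd=crate, check=True, capture_output=True)
        (crate / "src" / "lib.rs").write_text(rust_code)
        result = subprocess.run(["cargo", "test"], cwd=crate,
                                capture_output=True, timeout=timeout)
        return result.returncode == 0
```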

The final data looks like the following:

[Screenshot: sample rows from the final dataset]


ox/Rust/cargo_test_passed_train.parquet at main

This is a dataset of Rust questions and generated code, created to fine-tune small language models on Rust.





You can follow the prompts and steps by looking at these model runs:

1) Translate to Rust: https://www.oxen.ai/ox/mbrp-playground/evaluations/ce45630c-d9e8-4fac-9b41-2d41692076b3

2) Write Rust code: https://www.oxen.ai/ox/mbrp-playground/evaluations/febc562a-9bd4-4e91-88d7-a95ee676a5ed

3) Write Rust unit tests: https://www.oxen.ai/ox/mbrp-playground/evaluations/b886ddd6-b501-4db8-8ed6-0b719d0ac595

Funnily enough, for the final formulation of the GRPO training we ended up throwing away the gold-standard Rust code and unit test columns. With our reinforcement learning loop, we only need the prompts as input. This makes it pretty easy to collect more data in the future. We’ll dive into how the single prompt as input works in the following sections. Even though we threw away the code and unit tests for training, it was nice to know the prompts are solvable.
 

Qwen 3 benchmark results (with reasoning)


Posted on Mon Apr 28 21:03:53 2025 UTC

[Qwen3 benchmark screenshots]













1/11
@Alibaba_Qwen
Introducing Qwen3!

We release and open-weight Qwen3, our latest large language models, including 2 MoE models and 6 dense models, ranging from 0.6B to 235B. Our flagship model, Qwen3-235B-A22B, achieves competitive results in benchmark evaluations of coding, math, general capabilities, etc., when compared to other top-tier models such as DeepSeek-R1, o1, o3-mini, Grok-3, and Gemini-2.5-Pro. Additionally, the small MoE model, Qwen3-30B-A3B, outcompetes QwQ-32B, which has 10 times as many activated parameters, and even a tiny model like Qwen3-4B can rival the performance of Qwen2.5-72B-Instruct.

For more information, feel free to try them out in Qwen Chat Web (Qwen Chat) and APP and visit our GitHub, HF, ModelScope, etc.

Blog: Qwen3: Think Deeper, Act Faster
GitHub: GitHub - QwenLM/Qwen3: Qwen3 is the large language model series developed by Qwen team, Alibaba Cloud.
Hugging Face: Qwen3 - a Qwen Collection
ModelScope: Qwen3

The post-trained models, such as Qwen3-30B-A3B, along with their pre-trained counterparts (e.g., Qwen3-30B-A3B-Base), are now available on platforms like Hugging Face, ModelScope, and Kaggle. For deployment, we recommend using frameworks like SGLang and vLLM. For local usage, tools such as Ollama, LMStudio, MLX, llama.cpp, and KTransformers are highly recommended. These options ensure that users can easily integrate Qwen3 into their workflows, whether in research, development, or production environments.

Hope you enjoy our new models!
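
For a sense of what local usage looks like, here is a hedged transformers sketch. The model id is assumed from the Qwen3 collection, and the enable_thinking flag follows Qwen's published chat-template convention:

```python
# Hedged sketch of local Qwen3 usage; smaller variants work the same way.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-4B"  # assumed repo id from the Qwen3 collection
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    enable_thinking=True,  # toggles the reasoning mode per Qwen's docs
    return_tensors="pt",
).to(model.device)

output = model.generate(inputs, max_new_tokens=1024)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```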





2/11
@Alibaba_Qwen
Qwen3 exhibits scalable and smooth performance improvements that are directly correlated with the computational reasoning budget allocated. This design enables users to configure task-specific budgets with greater ease, achieving a more optimal balance between cost efficiency and inference quality.





3/11
@Alibaba_Qwen
Qwen3 models support 119 languages and dialects. This extensive multilingual capability opens up new possibilities for international applications, enabling users worldwide to benefit from the power of these models.





4/11
@Alibaba_Qwen
We have optimized the Qwen3 models for coding and agentic capabilities, and we have also strengthened support for MCP. Below we provide examples to show how Qwen3 thinks and interacts with the environment.



https://video.twimg.com/amplify_video/1916955612430397440/vid/avc1/1156x720/iUcPwb2A3t9kjUiE.mp4

5/11
@Alibaba_Qwen




GppuaCRbEAEK_fY.jpg


6/11
@Alibaba_Qwen
We also evaluated the preliminary performance of Qwen3-235B-A22B on the open-source coding agent OpenHands. It achieved 34.4% on SWE-bench Verified, a competitive result with fewer parameters! Thanks to @allhands_ai for providing an easy-to-use agent. Both open models and open agents are exciting!





7/11
@MavMikee
@OpenRouterAI 🔥



8/11
@ofermend
Congrats. Will evaluate this for @vectara hallucination leaderboard and publish the results shortly.



9/11
@wilsonsilva90
Is there any chance to partner with @GroqInc or @CerebrasSystems?



10/11
@vansinhu




GpqFYmmaYAAYCsZ.jpg


11/11
@Fabeyy1337
@theo where this??
















1/12
@godofprompt
🚨 BREAKING: Alibaba just launched "Qwen3" its most powerful AI model yet.

It thinks deeper, acts faster, and it outperforms models like DeepSeek-R1, Grok 3 and Gemini-2.5-Pro.

Here are 5 insane examples of what it can do:





2/12
@godofprompt
1. Complex Reasoning:

Qwen3 solved a classic logic puzzle step-by-step without rushing.

Each step was reasoned out clearly, leading to the right answer with full explanations.

No hallucination.
No skipping steps.

Just pure thinking.



https://video.twimg.com/amplify_video/1917147575427493888/vid/avc1/1386x720/3Sh4HKHCWtVFAgBR.mp4

3/12
@godofprompt
2. Instant Answers:

When asked to list creative reasons why pizza beats salad, Qwen3 replied in under 3 seconds.

Funny, punchy, and perfectly structured.

It switches between slow thinking and instant output almost like a human.



https://video.twimg.com/amplify_video/1917147613083938816/vid/avc1/1402x720/5SYvNoQT61v9zY0a.mp4

4/12
@godofprompt
3. Multilingual Genius:

Qwen3 explained Einstein’s Theory of Relativity in Arabic, French, and Hindi.

Simple enough for a 10-year-old to understand.

And it kept the tone and complexity natural across languages, with no awkward translation.



https://video.twimg.com/amplify_video/1917147638044229632/vid/avc1/1388x720/3xKS2j_yKDSH-KeV.mp4

5/12
@godofprompt
4. Real-World Coding:

Qwen3 wrote a clean Python script to scrape headlines from the New York Times and save them into a CSV.

Fully working, no bugs, no missing imports.

It even suggested adding error handling without being asked.



https://video.twimg.com/amplify_video/1917147688015187968/vid/avc1/1380x720/6w-0n_KOS0TsBom9.mp4

6/12
@godofprompt
5. Agentic Planning:

Given a tight $500 budget, Qwen3 built a full 3-day Tokyo itinerary: sightseeing, culture, shopping.

It calculated transport costs, entry fees, food expenses and recommended hacks to save money.

It thinks like a planner, not just a text generator.



https://video.twimg.com/amplify_video/1917147728100065280/vid/avc1/1410x720/276bPkPpbwUGmLpI.mp4

7/12
@godofprompt
Qwen3 isn’t just another model.

It’s built for real tasks, real users, and real-world complexity at insane speed and depth.

Try it here at: Qwen Chat



8/12
@godofprompt
Which Qwen3 ability are you most excited to try?

Deep thinking? Agent planning? Code generation?

Curious to hear.



9/12
@ihteshamit
Insanely powerful models dropped by Alibaba for Qwen.

I'm shocked asf



10/12
@godofprompt
Can’t wait to use it



11/12
@hasantoxr
crazy... Alibaba new models are pretty awesome!!



12/12
@godofprompt
Yeah








1/5
@MaziyarPanahi
🚨 All the new Qwen3 models dropped on @huggingface are under Apache 2.0! No research-only licenses this time!

Open-source community wins big! 🤗



GpsCJgtXkAA2moq.jpg


2/5
@MaziyarPanahi
Just take them! Qwen3 - a Qwen Collection



3/5
@caviterginsoy
Super of them, absolutely super



4/5
@lgaa201
🗿 our new toys



5/5
@paulcx
But two base models are missing 😃







1/1
@BayAreaTimes
JUST IN: Alibaba’s Qwen3 open-weight “hybrid” AI models debut, ranging from 0.6B to 235B parameters

- 2 MoE and 6 dense models in total in the Qwen3 family.

- The models can “reason” through complex problems or quickly answer simple requests.

- Support for 119 languages, and trained on a dataset of ~36T tokens.



Gptb1IQbIAAnyLe.jpg







1/3
@victormustar
Qwen3-30B-A3B has hit the mark 🎯

Currently running it on my laptop at 100 tokens/sec (MLX) with blender-MCP, it's fast, it's clinical, it just works... Local AI will never be the same... 🥹

[Quoted tweet]
BOOOOM: Today I'm dropping TINY AGENTS

the 50 lines of code Agent in Javascript 🔥

I spent the last few weeks working on this, so I hope you will like it.

I've been diving into MCP (Model Context Protocol) to understand what the hype was all about.

It is fairly simple, but still quite powerful: MCP is a standard API to expose sets of Tools that can be hooked to LLMs.

But while doing that, came my second realization:

Once you have an MCP Client, an Agent is literally just a while loop on top of it. 🤯
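
That "while loop" framing is easy to sketch. Everything below (the LLM and tool-client stubs) is illustrative Python, not the huggingface.js tiny-agents API:

```python
# Conceptual sketch: "an agent is just a while loop over an MCP client".
# StubLLM and StubTools are stand-ins, not a real MCP SDK.

class StubLLM:
    def chat(self, messages, tools):
        class Reply: tool_calls = []; content = "done"
        return Reply()

class StubTools:
    def list_tools(self): return []
    def call_tool(self, name, args): return {"ok": True}

def run_agent(llm, tools, goal, max_steps=10):
    messages = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        reply = llm.chat(messages, tools=tools.list_tools())
        if not reply.tool_calls:          # no tool requested: final answer
            return reply.content
        for call in reply.tool_calls:     # execute each tool call via MCP
            result = tools.call_tool(call["name"], call["args"])
            messages.append({"role": "tool", "content": str(result)})
    return "stopped: step budget exhausted"

print(run_agent(StubLLM(), StubTools(), "render a scene in Blender"))
```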




https://video.twimg.com/amplify_video/1917224345929216000/vid/avc1/1920x1080/Ek8malpsukDEO0Qk.mp4

2/3
@victormustar
Stack is 100% free: LM Studio + Blender MCP + TINY AGENTS
[mcp-client] Allow arbitrary endpoint via URL (in both McpClient and Agent) by julien-c · Pull Request #1396 · huggingface/huggingface.js



3/3
@lifeafterAi_
Bro help me. I’m thinking about buying a Mac mini. Which one would you suggest to run 30B smoothly?






1/1
@DoctorGoldOval
FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis



https://video.twimg.com/amplify_video/1913085247140642816/vid/avc1/540x360/CVrR8DPm6YFRcDqD.mp4















1/7
@Gradio
FantasyTalking -- Realistic talking portrait generation has never been this good! Check out the links in the thread to learn more 👇



https://video.twimg.com/ext_tw_video/1917550851037577217/pu/vid/avc1/960x720/jjANKUFGkaZ3QpbT.mp4

2/7
@Gradio
Build Fantasy Talking on your own machine: GitHub - Fantasy-AMAP/fantasy-talking: FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis



3/7
@Gradio
App is live on @huggingface: FantasyTalking - a Hugging Face Space by acvlab



4/7
@ericreator
this one is not ready



5/7
@FearCryptoGreed
Looks solid



6/7
@NeuralKnight_
These realistic portraits sound intriguing. Curious to learn about the techniques involved.



7/7
@WendyCarlosa
you do python go away













1/7
@Tech_Transforms
China just dropped a new open-source AI model, FantasyTalking, and it could totally reshape how we create videos 👀

See its amazing work and how it differs from other models here:



2/7
@Tech_Transforms
Realistic lip synchronization

Generates realistic lip synchronization, ensuring characters' lip movements match the provided audio



https://video.twimg.com/amplify_video/1912112753851949056/vid/avc1/1056x720/psGZlBAYNErtb2j4.mp4

3/7
@Tech_Transforms
Realistic talking videos

Generation of realistic talking videos with varied body types and angles — from close-ups to full-body, front-facing to side views.



https://video.twimg.com/amplify_video/1912112957669916672/vid/avc1/576x720/QvzcEjcxp0x8V8p8.mp4

4/7
@Tech_Transforms
Various Avatars

From humans to cartoons to animated characters, FantasyTalking can create it all seamlessly.



https://video.twimg.com/amplify_video/1912113080890204160/vid/avc1/720x480/k0-uXQg2R42V2-PR.mp4

5/7
@Tech_Transforms
Comparing with other Models

FantasyTalking’s output when compared to OmniHuman-1 with the same source video.



https://video.twimg.com/amplify_video/1912114521356783618/vid/avc1/980x646/U-N9udMUbZruhf2h.mp4

6/7
@Tech_Transforms
Architecture overview

Built on the Wan2.1 video diffusion model, FantasyTalking creates ultra-realistic talking portraits with precise audio-visual alignment.
It ensures identity consistency and natural motion using face-focused modeling and a motion control network.





7/7
@Tech_Transforms
Is FantasyTalking better than other video generation models?

Follow @Tech_Transforms for more such updates!



















1/13
@ItsKevinNexus
China's Open Source AI is Next-Level 🤯

Alibaba just dropped FantasyTalking — an insane AI that lip-syncs characters with realistic facial and full-body motion.

It outperforms current SOTA methods like OmniHuman-1, Sonic, and Hallo 3.

🔥 Here are 10 jaw-dropping examples you’ve gotta see:



https://video.twimg.com/amplify_video/1911729207111016448/vid/avc1/1066x720/-kPWm_FtHWGiulJX.mp4

2/13
@ItsKevinNexus
1. 🧠 Generated Videos with FantasyTalking

Delivers highly realistic lip-syncing, perfectly matching mouth movements to audio.
Supports a wide range of avatar styles — from realistic to cartoon.
Generates high-quality conversational videos with full facial and body motion.



https://video.twimg.com/amplify_video/1911729287842988032/vid/avc1/480x480/TZhDRuBJjjpSYVzr.mp4

3/13
@ItsKevinNexus
2. 🎥 Realistic Talking Videos

Supports generating lifelike talking videos across multiple body ranges: close-up portraits, half-body, and full-body.
Handles various orientations, including front-facing and side-facing poses — all with natural motion and detail.



https://video.twimg.com/amplify_video/1911729352720461824/vid/avc1/480x600/NvmmcA9L1SJuTjld.mp4

4/13
@ItsKevinNexus
3. 🎭 Diverse Character Styles

Animate both humans and animals in a wide range of styles — from realistic to highly stylized.
Produces dynamic, expressive, and naturally realistic animations that bring any character to life.



https://video.twimg.com/amplify_video/1911729417186885632/vid/avc1/720x480/zdczvitwUOj-qmCJ.mp4

5/13
@ItsKevinNexus
4. 📊 Comparison with Closed-Source Methods

FantasyTalking outperforms current state-of-the-art (SOTA) approaches in multimodality-conditioned human video generation, setting a new benchmark for realism and control.



https://video.twimg.com/amplify_video/1911729480709574656/vid/avc1/1296x720/h2qDwFheHwaucntG.mp4

6/13
@ItsKevinNexus
5. 🗣️ Lip Sync with Half-Body Motion

Achieves precise lip-syncing synchronized with natural half-body movements, creating more immersive and lifelike character animations.



https://video.twimg.com/amplify_video/1911729737988292608/vid/avc1/720x576/rXRGyVlJRbAZi6rx.mp4

7/13
@ItsKevinNexus
6. 🗣️ Lip Sync with Half-Body Motion

Achieves precise lip-syncing synchronized with natural half-body movements, creating more immersive and lifelike character animations.



https://video.twimg.com/amplify_video/1911729783928414208/vid/avc1/900x720/rQrB5YDkU1HnkV8x.mp4

8/13
@ItsKevinNexus
7. 🎨 Diverse Character Styles

Supports a wide range of character types — from realistic humans to stylized avatars and animals.
Generates expressive, dynamic animations tailored to each style, making every character feel alive.



https://video.twimg.com/amplify_video/1911729859144876032/vid/avc1/720x432/7i9XdUF7AIrhw60M.mp4

9/13
@ItsKevinNexus
8. 🧍‍♂️ Diverse Characters with Full-Body Motion

Animate a variety of characters — from realistic to stylized, including animals — with natural full-body movement.
Delivers smooth, expressive motion across different poses, styles, and perspectives.



https://video.twimg.com/amplify_video/1911729974064635904/vid/avc1/720x720/Ausu2YSjnZnywTwg.mp4

10/13
@ItsKevinNexus
9. 🗣️ Lip Sync with Half-Body Motion

Achieves precise lip-syncing synchronized with natural half-body movements, creating more immersive and lifelike character animations.



https://video.twimg.com/amplify_video/1911730045078347776/vid/avc1/720x480/vaGkfumjQT-H47Y5.mp4

11/13
@ItsKevinNexus
10. 🗣️ Lip Sync with Half-Body Motion

Achieves precise lip-syncing synchronized with natural half-body movements, creating more immersive and lifelike character animations.



https://video.twimg.com/amplify_video/1911730093610696704/vid/avc1/480x800/uvI9f6T0OC62VFAh.mp4

12/13
@ItsKevinNexus
Paper page on Hugging Face:
https://huggingface.co/papers/2504.04842



13/13
@ItsKevinNexus
Thanks for reading

If you enjoyed this post, please support it with a like / repost of the post below

[Quoted tweet]
China's Open Source AI is Next-Level 🤯

Alibaba just dropped FantasyTalking — an insane AI that lip-syncs characters with realistic facial and full-body motion.

It outperforms current SOTA methods like OmniHuman-1, Sonic, and Hallo 3.

🔥 Here are 10 jaw-dropping examples you’ve gotta see:

https://video.twimg.com/amplify_video/1911729207111016448/vid/avc1/1066x720/-kPWm_FtHWGiulJX.mp4








1/4
@susumuota
[29/30] 58 Likes, 7 Comments, 1 Posts
https://arxiv.org/abs/2504.04842 cs.CV, 07 Apr 2025

🆕FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis

Mengchao Wang, Qiang Wang, Fan Jiang, Yaqi Fan, Yunpeng Zhang, Yonggang Qi, Kun Zhao, Mu Xu





2/4
@susumuota
Twitter: https://twitter.com/search?q=arxiv.org/abs/2504.04842 OR arxiv.org/pdf/2504.04842.pdf
Reddit: https://www.reddit.com/search/?q="2504.04842"&sort=top



3/4
@susumuota
(1/1) 58 Likes, 7 Comments, 17 Apr 2025, Reddit
https://redd.it/1k16klz



4/4
@susumuota
https://arxiv.org/abs/2504.04842
Creating a realistic, animatable avatar from a single still image remains difficult. Existing approaches often struggle to capture subtle facial expressions, the accompanying whole-body movements, and dynamic backgrounds. To address these limitations, we propose a controllable...









1/2
@wildmindai
FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis
Released inference code and model weights for audio conditions.
- Wan2.1-I2V-14B-720P (Base model)
- Wav2Vec (Audio encoder)
- FantasyTalking model (condition weights)
FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis



https://video.twimg.com/amplify_video/1916797115893698561/vid/avc1/1280x720/qirUBXJbzz86yxOi.mp4

2/2
@wildmindai
Code: https://github.com/Fantasy-AMAP/fantasy-talking
Model: https://huggingface.co/acvlab/FantasyTalking








 




FutureHouse releases AI tools it claims can accelerate science​


Kyle Wiggers

1:16 PM PDT · May 1, 2025



FutureHouse, an Eric Schmidt-backed nonprofit that aims to build an “AI scientist” within the next decade, has launched its first major product: a platform and API with AI-powered tools designed to support scientific work.

Many, many startups are racing to develop AI research tools for the scientific domain, some with massive amounts of VC funding behind them. Tech giants seem bullish on AI for science, too — earlier this year, Google unveiled “AI co-scientist,” which the company said could aid scientists in creating hypotheses and experimental research plans.

The CEOs of AI companies OpenAI and Anthropic have asserted that AI tools could massively accelerate scientific discovery, particularly in medicine. But many researchers don’t consider AI today to be especially useful in guiding the scientific process, largely due to its unreliability.

FutureHouse on Thursday released four AI tools: Crow, Falcon, Owl and Phoenix. Crow can search scientific literature and answer questions about it; Falcon can conduct deeper literature searches, including of scientific databases; Owl looks for previous work in a given subject area; and Phoenix uses tools to help plan chemistry experiments.

Today, we are launching the first publicly available AI Scientist, via the FutureHouse Platform.

Our AI Scientist agents can perform a wide variety of scientific tasks better than humans. By chaining them together, we've already started to discover new biology really fast. With… pic.twitter.com/wMMmZoGZPI

— Sam Rodriques (@SGRodriques) May 1, 2025

“Unlike other [AIs], FutureHouse’s have access to a vast corpus of high-quality open-access papers and specialized scientific tools,” the nonprofit wrote in a blog post. “They [also] have transparent reasoning and use a multi-stage process to consider each source in more depth […] By chaining these [AI]s together, at scale, scientists can greatly accelerate the pace of scientific discovery.”

Tellingly, FutureHouse has yet to achieve a scientific breakthrough or make a novel discovery with its AI tools.

Part of the challenge in developing an “AI scientist” is anticipating an untold number of confounding factors. AI might come in handy in areas where broad exploration is needed, like narrowing down a vast list of possibilities, but it’s less clear whether it can do the kind of out-of-the-box problem-solving that leads to bona fide breakthroughs.

Results from AI systems designed for science have so far been mostly underwhelming. In 2023, Google said around 40 new materials had been synthesized with the help of one of its AIs, called GNoME. But an outside analysis found not even one of those materials was, in fact, net new.

AI’s technical shortcomings and risks, such as its tendency to hallucinate, also make scientists wary of endorsing it for serious work. Even well-designed studies could end up being tainted by misbehaving AI, which struggles with executing high-precision work.

Indeed, FutureHouse acknowledges that its AI tools — Phoenix in particular — may make mistakes.

“We are releasing [this] now in the spirit of rapid iteration,” the company said in its blog post. “Please provide feedback as you use it.”
 


Google Researchers Advance Diagnostic AI: AMIE Now Matches or Outperforms Primary Care Physicians Using Multimodal Reasoning with Gemini 2.0 Flash​


By Sana Hassan

May 4, 2025

LLMs have shown impressive promise in conducting diagnostic conversations, particularly through text-based interactions. However, their evaluation and application have largely ignored the multimodal nature of real-world clinical settings, especially in remote care delivery, where images, lab reports, and other medical data are routinely shared through messaging platforms. While systems like the Articulate Medical Intelligence Explorer (AMIE) have matched or surpassed primary care physicians in text-only consultations, this format falls short of reflecting telemedicine environments. Multimodal communication is essential in modern care, as patients often share photographs, documents, and other visual artifacts that cannot be fully conveyed through text alone. Limiting AI systems to textual inputs risks omitting critical clinical information, increasing diagnostic errors, and creating accessibility barriers for patients with lower health or digital literacy. Despite the widespread use of multimedia messaging apps in global healthcare, there has been little research into how LLMs can reason over such diverse data during diagnostic interactions.

Research in diagnostic conversational agents began with rule-based systems like MYCIN, but recent developments have focused on LLMs capable of emulating clinical reasoning. While multimodal AI systems, such as vision-language models, have demonstrated success in radiology and dermatology, integrating these capabilities into conversational diagnostics remains challenging. Effective AI-based diagnostic tools must handle the complexity of multimodal reasoning and uncertainty-driven information gathering, a step beyond merely answering isolated questions. Evaluation frameworks like OSCEs and platforms such as AgentClinic provide useful starting points, yet tailored metrics are still needed to assess performance in multimodal diagnostic contexts. Moreover, while messaging apps are increasingly used in low-resource settings for sharing clinical data, concerns about data privacy, integration with formal health systems, and policy compliance persist.

Google DeepMind and Google Research have enhanced the AMIE with multimodal capabilities for improved conversational diagnosis and management. Using Gemini 2.0 Flash, AMIE employs a state-aware dialogue framework that adapts conversation flow based on patient state and diagnostic uncertainty, allowing strategic, structured history-taking with multimodal inputs like skin images, ECGs, and documents. AMIE outperformed or matched primary care physicians in a randomized OSCE-style study with 105 scenarios and 25 patient actors across 29 of 32 clinical metrics and 7 of 9 multimodal-specific criteria, demonstrating strong diagnostic accuracy, reasoning, communication, and empathy.

The study enhances the AMIE diagnostic system by incorporating multimodal perception and a state-aware dialogue framework that guides conversations through phases of history taking, diagnosis, and follow-up. Gemini 2.0 Flash powers the system and dynamically adapts based on evolving patient data, including text, images, and clinical documents. A structured patient profile and differential diagnosis are updated throughout the interaction, with targeted questions and multimodal data requests guiding clinical reasoning. Evaluation includes automated perception tests on isolated artifacts, simulated dialogues rated by auto-evaluators, and expert OSCE-style assessments, ensuring robust diagnostic performance and clinical realism.
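
AMIE's implementation is not public, but the state-aware control loop described here can be sketched conceptually. Everything below (state fields, thresholds, action names) is an illustration of the described design, not Google's code:

```python
# Conceptual sketch of a state-aware diagnostic dialogue loop: the dialogue
# phase is chosen from an evolving patient state and diagnostic uncertainty.
from dataclasses import dataclass, field

@dataclass
class PatientState:
    history: list = field(default_factory=list)       # turns + artifacts so far
    differential: dict = field(default_factory=dict)  # diagnosis -> probability

def uncertainty(state: PatientState) -> float:
    # Toy proxy: uncertain until one diagnosis dominates the differential.
    return 1.0 - max(state.differential.values(), default=0.0)

def next_action(state: PatientState) -> str:
    if uncertainty(state) > 0.5:
        return "ask_question_or_request_artifact"  # e.g. a skin photo or ECG
    if uncertainty(state) > 0.2:
        return "share_working_diagnosis"
    return "management_plan_and_follow_up"
```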

The results show that the multimodal AMIE system performs at par or better than primary care physicians (PCPs) across multiple clinical tasks in simulated text-chat consultations. In OSCE-style assessments, AMIE consistently outperformed PCPs in diagnostic accuracy, especially when interpreting multimodal data such as images and clinical documents. It also demonstrated greater robustness when image quality was poor and showed fewer hallucinations. Patient actors rated AMIE’s communication skills highly, including empathy and trust. Automated evaluations confirmed that AMIE’s advanced reasoning framework, built on the Gemini 2.0 Flash model, significantly improved diagnosis and conversation quality, validating its design and effectiveness in real-world clinical scenarios.



In conclusion, the study advances conversational diagnostic AI by enhancing AMIE to integrate multimodal reasoning within patient dialogues. Using a novel state-aware inference-time strategy with Gemini 2.0 Flash, AMIE can interpret and reason about medical artifacts like images or ECGs in real-time clinical conversations. Evaluated through a multimodal OSCE framework, AMIE outperformed or matched primary care physicians in diagnostic accuracy, empathy, and artifact interpretation, even in complex cases. Despite limitations tied to chat-based interfaces and the need for real-world testing, these findings highlight AMIE’s potential as a robust, context-aware diagnostic assistant for future telehealth applications.




Check out the Paper and Technical details.

 


LLMs Can Now Reason in Parallel: UC Berkeley and UCSF Researchers Introduce Adaptive Parallel Reasoning to Scale Inference Efficiently Without Exceeding Context Windows​


By Mohammad Asjad

May 2, 2025

Large language models (LLMs) have made significant strides in reasoning capabilities, exemplified by breakthrough systems like OpenAI o1 and DeepSeek-R1, which utilize test-time compute for search and reinforcement learning to optimize performance. Despite this progress, current methodologies face critical challenges that impede their effectiveness. Serialized chain-of-thought approaches generate excessively long output sequences, increasing latency and pushing against context window constraints. In contrast, parallel methods such as best-of-N and self-consistency suffer from poor coordination between inference paths and lack end-to-end optimization, resulting in computational inefficiency and limited improvement potential. Also, structured inference-time search techniques like tree-of-thought rely on manually designed search structures, significantly restricting their flexibility and ability to scale across different reasoning tasks and domains.

Several approaches have emerged to address the computational challenges in LLM reasoning. Inference-time scaling methods have improved downstream task performance by increasing test-time computation, but typically generate significantly longer output sequences. This creates higher latency and forces models to fit entire reasoning chains into a single context window, making it difficult to attend to relevant information. Parallelization strategies like ensembling have attempted to mitigate these issues by running multiple independent language model calls simultaneously. However, these methods suffer from poor coordination across parallel threads, leading to redundant computation and inefficient resource utilization. Fixed parallelizable reasoning structures, such as tree-of-thought and multi-agent reasoning systems, have been proposed, but their hand-designed search structures limit flexibility and scalability. Other approaches, like PASTA, decompose tasks into parallel sub-tasks but ultimately reintegrate the complete context into the main inference trajectory, failing to reduce context usage effectively. Meanwhile, Hogwild! Inference employs parallel worker threads but relies exclusively on prompting without end-to-end optimization.

Researchers from UC Berkeley and UCSF have proposed Adaptive Parallel Reasoning (APR), a robust approach that enables language models to dynamically distribute inference-time computation across both serial and parallel operations. This methodology generalizes existing reasoning approaches—including serialized chain-of-thought reasoning, parallelized inference with self-consistency, and structured search—by training models to determine when and how to parallelize inference operations rather than imposing fixed search structures. APR introduces two key innovations: a parent-child threading mechanism and end-to-end reinforcement learning optimization. The threading mechanism allows parent inference threads to delegate subtasks to multiple child threads through a spawn() operation, enabling parallel exploration of distinct reasoning paths. Child threads then return outcomes to the parent thread via a join() operation, allowing the parent to continue decoding with this new information. Built on the SGLang model serving framework, APR significantly reduces real-time latency by performing inference in child threads simultaneously through batching. The second innovation—fine-tuning via end-to-end reinforcement learning—optimizes for overall task success without requiring predefined reasoning structures. This approach delivers three significant advantages: higher performance within fixed context windows, superior scaling with increased compute budgets, and improved performance at equivalent latency compared to traditional methods.



The APR architecture implements a sophisticated multi-threading mechanism that enables language models to dynamically orchestrate parallel inference processes. APR addresses the limitations of serialized reasoning methods by distributing computation across parent and child threads, minimizing latency while improving performance within context constraints. The architecture consists of three key components:



First, the multi-threading inference system allows parent threads to spawn multiple child threads using a spawn(msgs) operation. Each child thread receives a distinct context and executes inference independently, yet simultaneously using the same language model. When a child thread completes its task, it returns results to the parent via a join(msg) operation, selectively communicating only the most relevant information. This approach significantly reduces token usage by keeping intermediate search traces confined to child threads.

Second, the training methodology employs a two-phase approach. Initially, APR utilizes supervised learning with automatically-generated demonstrations that incorporate both depth-first and breadth-first search strategies, creating hybrid search patterns. The symbolic solver creates demonstrations with parallelization, decomposing searches into multiple components that avoid context window bottlenecks during both training and inference.

Finally, the system implements end-to-end reinforcement learning optimization with GRPO (Group Relative Policy Optimization). During this phase, the model learns to strategically determine when and how broadly to invoke child threads, optimizing for computational efficiency and reasoning effectiveness. The model iteratively samples reasoning traces, evaluates their correctness, and adjusts parameters accordingly, ultimately learning to balance parallel exploration against context window constraints for maximum performance.
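
To make the spawn()/join() mechanics above concrete, here is a hedged Python sketch with a thread pool standing in for batched model serving; the stub model and prompts are illustrative, not the paper's implementation:

```python
from concurrent.futures import ThreadPoolExecutor

class StubLLM:
    """Stand-in for a served model (APR batches children via SGLang)."""
    def generate(self, context: str) -> str:
        return f"conclusion({context[:30]}...)"

def child(llm, context):
    # A child explores one reasoning path in its own short context and
    # returns only its conclusion to the parent, not the full search trace.
    return llm.generate(context)

def parent(llm, prompt, subtasks):
    with ThreadPoolExecutor() as pool:                            # spawn(msgs)
        futures = [pool.submit(child, llm, f"{prompt}\n{s}") for s in subtasks]
        results = [f.result() for f in futures]                   # join(msg)
    # The parent resumes decoding with only the children's findings in
    # context, keeping its own context window small.
    return llm.generate(prompt + "\nFindings: " + "; ".join(results))

print(parent(StubLLM(), "Reach 24 from 3, 5, 7, 9", ["branch A", "branch B"]))
```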



The evaluation compared Adaptive Parallel Reasoning against serialized chain-of-thought reasoning and self-consistency methods using a standard decoder-only language model with 228M parameters built on the Llama2 architecture and supporting a 4,096-token context window. All models were initialized through supervised learning on 500,000 trajectories from symbolic solvers. For direct compute-accuracy assessment, the team implemented a budget constraint method with context-window conditioning for SoS+ models and thread count conditioning for APR models. The SGLang framework was utilized for inference due to its support for continuous batching and radix attention, enabling efficient APR implementation.

Experimental results demonstrate that APR consistently outperforms serialized methods across multiple dimensions. When scaling with higher compute, APR initially underperforms in low-compute regimes due to parallelism overhead but significantly outpaces SoS+ as compute increases, achieving a 13.5% improvement at 20k tokens and surpassing SoS+ pass@8 performance while using 57.4% less compute. For context window scaling, APR consistently exploits context more efficiently, with 10 threads achieving approximately 20% higher accuracy at the 4k-token limit by distributing reasoning across parallel threads rather than containing entire traces within a single context window.



End-to-end reinforcement learning significantly enhances APR performance, boosting accuracy from 75.5% to 83.4%. The RL-optimized models demonstrate markedly different behaviors, increasing both sequence length (22.1% relative increase) and number of child threads (34.4% relative increase). This reveals that for Countdown tasks, RL-optimized models favor broader search patterns over deeper ones, demonstrating the algorithm’s ability to discover optimal search strategies autonomously.



APR demonstrates superior efficiency in both theoretical and practical evaluations. When measuring sequential token usage, APR significantly boosts accuracy with minimal additional sequential tokens beyond 2,048, rarely exceeding 2,500 tokens, while SoS+ shows only marginal improvements despite approaching 3,000 tokens. Real-world latency testing on an 8-GPU NVIDIA RTX A6000 server reveals APR achieves substantially better accuracy-latency trade-offs, reaching 75% accuracy at 5000ms per sample—an 18% absolute improvement over SoS+’s 57%. These results highlight APR’s effective hardware parallelization and potential for optimized performance in deployment scenarios.



Adaptive Parallel Reasoning represents a significant advancement in language model reasoning capabilities by enabling dynamic distribution of computation across serial and parallel paths through a parent-child threading mechanism. By combining supervised training with end-to-end reinforcement learning, APR eliminates the need for manually designed structures while allowing models to develop optimal parallelization strategies. Experimental results on the Countdown task demonstrate APR’s substantial advantages: higher performance within fixed context windows, superior scaling with increased compute budgets, and significantly improved success rates at equivalent latency constraints. These achievements highlight the potential of reasoning systems that dynamically structure inference processes to achieve enhanced scalability and efficiency in complex problem-solving tasks.




Check out the Paper.

 


JetBrains Open Sources Mellum: A Developer-Centric Language Model for Code-Related Tasks​


By Asif Razzaq

May 2, 2025

JetBrains has officially open-sourced Mellum, a purpose-built 4-billion-parameter language model tailored for software development tasks. Developed from the ground up, Mellum reflects JetBrains’ engineering-first approach, offering a domain-specialized model trained for practical usage across codebases and programming environments. With its release on Hugging Face under the Apache 2.0 license, JetBrains extends an invitation to the broader research and developer community to experiment, adapt, and advance Mellum’s capabilities.

A Focal Model for Code Understanding


Unlike general-purpose LLMs, Mellum is classified by JetBrains as a “focal model”—a term they use to describe models with a narrow yet deep specialization. Mellum is optimized specifically for programming-related tasks such as autocompletion, infilling, and structural understanding of source code. This focused design avoids the overhead of broader linguistic modeling and enables the model to perform efficiently in IDE-like environments.

The model supports a wide array of languages including Java, Kotlin, Python, Go, PHP, C, C++, C#, JavaScript, TypeScript, CSS, HTML, Rust, and Ruby—reflecting the polyglot nature of modern development teams.

Model Architecture and Training Pipeline


Mellum follows a LLaMA-style architecture and was trained from scratch using over 4.2 trillion tokens drawn from code-rich sources such as The Stack, StarCoder, CommitPack, and English Wikipedia. It features an 8K token context window and was trained using bf16 mixed precision across a high-throughput cluster of 256 NVIDIA H200 GPUs connected via Infiniband.

The training process spanned approximately 20 days and leveraged modern infrastructure for scalable model development. The architecture and training procedure were designed with reproducibility and deployment flexibility in mind, making Mellum usable in both cloud inference setups (e.g., vLLM) and on local environments (e.g., llama.cpp, Ollama).
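
As a hedged sketch of local usage with transformers (the repo id follows the Hugging Face release; the prompt and generation settings are illustrative):

```python
# Minimal completion sketch for Mellum; plain-text continuation of code.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "JetBrains/Mellum-4b-base"  # assumed repo id from the release
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

code = "def binary_search(arr, target):\n"
inputs = tokenizer(code, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=96)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```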

Benchmarking and Evaluation


JetBrains evaluated Mellum across a range of benchmarks that reflect its primary use cases—code infilling and completion. The model’s performance indicates strong alignment with the design goals:

  • RepoBench v1.1 (8K context):
    • Python EM: 27.97%
    • Java EM: 31.08%
  • SAFIM (Syntax-Aware Fill-in-the-Middle):
    • pass@1: 38.11%
  • HumanEval Infilling:
    • Single-line: 66.21%
    • Multi-line: 38.52%
    • Random-span: 29.70%

These results reflect Mellum’s specialization for structured code understanding, especially in scenarios involving partial or interrupted code, which are common in real-world development workflows.

Rationale for Open Sourcing


JetBrains’ decision to release Mellum as open-source is grounded in several practical motivations:

  • Transparency: Enables scrutiny of both training data and architectural decisions.
  • Reusability: Supports integration in custom development environments and research experiments.
  • Community Collaboration: Facilitates contribution from external developers to refine model behavior.
  • Pedagogical Value: Provides educators and students with a hands-on artifact for understanding how domain-specific LLMs are constructed and applied.

The release includes both the base model (Mellum-4b-base) and a fine-tuned variant for Python (Mellum-4b-sft-python).

Implications for Developer Tooling


The availability of a compact, performant model optimized for source code opens new opportunities in the IDE space and beyond. JetBrains envisions Mellum as part of a broader strategy involving multiple focal models, each optimized for specific programming tasks such as diff generation or code review assistance. This approach aligns with the growing need for deployable, cost-effective, and context-aware AI tooling that can augment developer productivity without introducing opaque or oversized general-purpose models.

Conclusion


Mellum represents a deliberate shift toward smaller, specialized language models that prioritize utility, transparency, and efficiency. By making the model openly available, JetBrains offers a high-quality foundation for building the next generation of AI-assisted developer tools. Its architecture, training methodology, and benchmark performance signal a practical step forward in the evolving space of LLMs tailored for software engineering.





 


Microsoft AI Released Phi-4-Reasoning: A 14B Parameter Open-Weight Reasoning Model that Achieves Strong Performance on Complex Reasoning Tasks​


By Asif Razzaq

April 30, 2025

Despite notable advancements in large language models (LLMs), effective performance on reasoning-intensive tasks—such as mathematical problem solving, algorithmic planning, or coding—remains constrained by model size, training methodology, and inference-time capabilities. Models that perform well on general NLP benchmarks often lack the ability to construct multi-step reasoning chains or reflect on intermediate problem-solving states. Furthermore, while scaling up model size can improve reasoning capacity, it introduces prohibitive computational and deployment costs, especially for applied use in education, engineering, and decision-support systems.

Microsoft Releases Phi-4 Reasoning Model Suite


Microsoft recently introduced the Phi-4 reasoning family, consisting of three models—Phi-4-reasoning, Phi-4-reasoning-plus, and Phi-4-mini-reasoning. These models are derived from the Phi-4 base (14B parameters) and are specifically trained to handle complex reasoning tasks in mathematics, scientific domains, and software-related problem solving. Each variant addresses different trade-offs between computational efficiency and output precision. Phi-4-reasoning is optimized via supervised fine-tuning, while Phi-4-reasoning-plus extends this with outcome-based reinforcement learning, particularly targeting improved performance in high-variance tasks such as competition-level mathematics.

The open weight models were released with transparent training details and evaluation logs, including benchmark design, and are hosted on Hugging Face for reproducibility and public access.

Technical Composition and Methodological Advances


The Phi-4-reasoning models build upon the Phi-4 architecture with targeted improvements to model behavior and training regime. Key methodological decisions include:

  • Structured Supervised Fine-Tuning (SFT): Over 1.4M prompts were curated with a focus on “boundary” cases—problems at the edge of Phi-4’s baseline capabilities. Prompts were sourced and filtered to emphasize multi-step reasoning rather than factual recall, and responses were synthetically generated using o3-mini in high-reasoning mode.
  • Chain-of-Thought Format: To facilitate structured reasoning, models were trained to generate output using explicit <think> tags, encouraging separation between reasoning traces and final answers.
  • Extended Context Handling: The RoPE base frequency was modified to support a 32K token context window, allowing for deeper solution traces, particularly relevant in multi-turn or long-form question formats.
  • Reinforcement Learning (Phi-4-reasoning-plus): Using Group Relative Policy Optimization (GRPO), Phi-4-reasoning-plus was further refined on a small curated set of ∼6,400 math-focused problems. A reward function was crafted to favor correct, concise, and well-structured outputs, while penalizing verbosity, repetition, and format violations. A minimal sketch of such a reward function appears after this list.
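
As a concrete illustration of the reward design described in the last item, here is a minimal Python sketch. It is an assumption-laden toy, not Microsoft's actual reward function: the <think> tag check follows the chain-of-thought format above, while the specific weights and length threshold are invented for illustration.

```python
import re

def format_aware_reward(completion: str, gold_answer: str) -> float:
    """Toy reward in the spirit described above: favor correct, concise,
    well-structured outputs and penalize format violations.
    Weights and thresholds are illustrative assumptions."""
    # Format check: a single <think>...</think> block must precede the answer.
    match = re.search(r"<think>(.*?)</think>\s*(.*)", completion, re.DOTALL)
    if match is None:
        return -1.0  # format violation

    reasoning, answer = match.group(1), match.group(2).strip()

    # Correctness: binary signal on the final answer.
    correct = 1.0 if answer == gold_answer.strip() else 0.0

    # Conciseness: soft penalty on very long reasoning traces.
    length_penalty = min(len(reasoning.split()) / 4000.0, 0.5)

    return correct - length_penalty
```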

This data-centric and format-aware training regime supports better inference-time utilization and model generalization across domains, including unseen symbolic reasoning problems.



Evaluation and Comparative Performance


Across a broad range of reasoning benchmarks, Phi-4-reasoning and Phi-4-reasoning-plus deliver competitive results relative to significantly larger open-weight models.

Phi-4-reasoning-plus shows strong performance not only on domain-specific evaluations but also generalizes well to planning and combinatorial problems like TSP and 3SAT, despite no explicit training in these areas. Performance gains were also observed in instruction-following (IFEval) and long-context QA (FlenQA), suggesting the chain-of-thought formulation improves broader model utility.

Importantly, Microsoft reports full variance distributions across 50+ generation runs for sensitive datasets like AIME 2025, revealing that Phi-4-reasoning-plus matches or exceeds the performance consistency of models like o3-mini, while remaining disjoint from smaller baseline distributions like DeepSeek-R1-Distill.



Conclusion and Implications


The Phi-4 reasoning models represent a methodologically rigorous effort to advance small model capabilities in structured reasoning. By combining data-centric training, architectural tuning, and minimal but well-targeted reinforcement learning, Microsoft demonstrates that 14B-scale models can match or outperform much larger systems in tasks requiring multi-step inference and generalization.

The models’ open weight availability and transparent benchmarking set a precedent for future development in small LLMs, particularly for applied domains where interpretability, cost, and reliability are paramount. Future work is expected to extend the reasoning capabilities into additional STEM fields, improve decoding strategies, and explore scalable reinforcement learning on longer horizons.




Check out the Paper, HuggingFace Page and Microsoft Blog.


Meta and Booz Allen Deploy Space Llama: Open-Source AI Heads to the ISS for Onboard Decision-Making​


By Nikhil

May 2, 2025

In a significant step toward enabling autonomous AI systems in space, Meta and Booz Allen Hamilton have announced the deployment of Space Llama , a customized instance of Meta’s open-source large language model , Llama 3.2, aboard the International Space Station (ISS) U.S. National Laboratory. This initiative marks one of the first practical integrations of an LLM in a remote, bandwidth-limited, space-based environment.

Addressing Disconnection and Autonomy Challenges


Unlike terrestrial applications, AI systems deployed in orbit face strict constraints—limited compute resources, constrained bandwidth, and high-latency communication links with ground stations. Space Llama has been designed to function entirely offline, allowing astronauts to access technical assistance, documentation, and maintenance protocols without requiring live support from mission control.

To address these constraints, the AI model had to be optimized for onboard deployment, incorporating the ability to reason over mission-specific queries, retrieve context from local data stores, and interact with astronauts in natural language—all without internet connectivity.
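
As an illustration of the offline retrieve-then-answer pattern this implies (the actual Space Llama internals are not public), the sketch below retrieves from a small local document store with TF-IDF and assembles a prompt for a locally hosted model. The documents and query are hypothetical; no network access is involved.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical onboard document store (in reality: manuals, procedures, protocols).
docs = [
    "Procedure for replacing the CDRA carbon dioxide scrubber filter.",
    "Troubleshooting steps for a faulty water recovery system pump.",
    "Daily maintenance checklist for the treadmill vibration isolation system.",
]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(docs)

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k most similar local documents; fully offline."""
    scores = cosine_similarity(vectorizer.transform([query]), doc_matrix)[0]
    top = scores.argsort()[::-1][:k]
    return [docs[i] for i in top]

query = "How do I fix the water pump?"
context = "\n".join(retrieve(query))
# This prompt would be handed to the locally hosted model for generation.
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
print(prompt)
```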

Technical Framework and Integration Stack


The deployment leverages a combination of commercially available and mission-adapted technologies:

  • Llama 3.2 : Meta’s latest open-source LLM serves as the foundation, fine-tuned for contextual understanding and general reasoning tasks in edge environments. Its open architecture enables modular adaptation for aerospace-grade applications.
  • A2E2™ (AI for Edge Environments) : Booz Allen’s AI framework provides containerized deployment and modular orchestration tailored to constrained environments like the ISS. It abstracts complexity in model serving and resource allocation across diverse compute layers.
  • HPE Spaceborne Computer-2 : This edge computing platform, developed by Hewlett Packard Enterprise, provides reliable high-performance processing hardware for space. It supports real-time inference workloads and model updates when necessary.
  • NVIDIA CUDA-capable GPUs : These enable the accelerated execution of transformer-based inference tasks while staying within the ISS’s strict power and thermal budgets.

This integrated stack ensures that the model operates within the limits of orbital infrastructure, delivering utility without compromising reliability.

Open-Source Strategy for Aerospace AI


The selection of an open-source model like Llama 3.2 aligns with growing momentum around transparency and adaptability in mission-critical AI. The benefits include:

  • Modifiability : Engineers can tailor the model to meet specific operational requirements, such as natural language understanding in mission terminology or handling multi-modal astronaut inputs.
  • Data Sovereignty : With all inference running locally, sensitive data never needs to leave the ISS, ensuring compliance with NASA and partner agency privacy standards.
  • Resource Optimization : Open access to the model’s architecture allows for fine-grained control over memory and compute use—critical for environments where system uptime and resilience are prioritized.
  • Community-Based Validation : Using a widely studied open-source model promotes reproducibility, transparency in behavior, and better testing under mission simulation conditions.

Toward Long-Duration and Autonomous Missions


Space Llama is not just a research demonstration—it lays the groundwork for embedding AI systems into longer-term missions. In future scenarios like lunar outposts or deep-space habitats, where round-trip communication latency with Earth spans minutes or hours, onboard intelligent systems must assist with diagnostics, operations planning, and real-time problem-solving.

Furthermore, the modular nature of Booz Allen’s A2E2 platform opens up the potential for expanding the use of LLMs to non-space environments with similar constraints—such as polar research stations, underwater facilities, or forward operating bases in military applications.

Conclusion


The Space Llama initiative represents a methodical advancement in deploying AI systems to operational environments beyond Earth. By combining Meta’s open-source LLMs with Booz Allen’s edge deployment expertise and proven space computing hardware, the collaboration demonstrates a viable approach to AI autonomy in space.

Rather than aiming for generalized intelligence, the model is engineered for bounded, reliable utility in mission-relevant contexts—an important distinction in environments where robustness and interpretability take precedence over novelty.

As space systems become more software-defined and AI-assisted, efforts like Space Llama will serve as reference points for future AI deployments in autonomous exploration and off-Earth habitation.




Check out the details here.


LLMs Can Learn Complex Math from Just One Example: Researchers from University of Washington, Microsoft, and USC Unlock the Power of 1-Shot Reinforcement Learning with Verifiable Reward​


By Sana Hassan

May 2, 2025

Recent advancements in LLMs such as OpenAI-o1, DeepSeek-R1, and Kimi-1.5 have significantly improved their performance on complex mathematical reasoning tasks. Reinforcement Learning with Verifiable Reward (RLVR) is a key contributor to these improvements, which uses rule-based rewards, typically a binary signal indicating whether a model’s solution to a problem is correct. Beyond enhancing final output accuracy, RLVR has also been observed to foster beneficial cognitive behaviors like self-reflection and improve generalization across tasks. While much research has focused on optimizing reinforcement learning algorithms like PPO and GRPO for greater stability and performance, the influence of training data—its quantity and quality—remains less understood. Questions around how much and what kind of data is truly effective for RLVR are still open, despite some work like LIMR introducing metrics to identify impactful examples and reduce dataset size while maintaining performance.
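
For concreteness, a rule-based verifiable reward of this kind can be written in a few lines. The sketch below assumes final answers are wrapped in \boxed{}, a common convention in math benchmarks; the exact extraction rules used in the cited works may differ.

```python
import re

def verifiable_reward(completion: str, gold_answer: str) -> float:
    """Binary rule-based reward: 1.0 if the model's final boxed answer
    matches the reference, else 0.0. The extraction convention is assumed."""
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    if not boxed:
        return 0.0  # no parsable final answer
    # Compare the last boxed expression after whitespace normalization.
    return 1.0 if boxed[-1].strip() == gold_answer.strip() else 0.0
```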

In contrast to the extensive research on data selection in supervised fine-tuning and human feedback-based reinforcement learning, the role of data in RLVR has seen limited exploration. While LIMR demonstrated that using a small subset of data (1.4k out of 8.5k examples) could maintain performance, it did not examine the extreme case of minimal data use. Another concurrent study found that even training with just four PPO examples led to notable improvements, but this finding wasn’t deeply investigated or benchmarked against full-dataset performance. Although RLVR shows great promise for enhancing reasoning in LLMs, a deeper, systematic study of data efficiency and selection in this context is still lacking.

Researchers from the University of Washington, University of Southern California, Microsoft, University of California, Santa Cruz, and Georgia Institute of Technology show that RLVR can significantly enhance large language models’ mathematical reasoning using a single training example, 1-shot RLVR. Applying it to Qwen2.5-Math-1.5B improves its MATH500 accuracy from 36.0% to 73.6%, matching the performance of much larger datasets. The improvements generalize across models, tasks, and algorithms. The study also reveals effects like cross-domain generalization, increased self-reflection, and post-saturation generalization, and highlights the roles of policy gradient loss and entropy-driven exploration.

The study investigates how much the RLVR training dataset can be reduced while retaining comparable performance to the full dataset. Remarkably, the authors find that a single training example—1-shot RLVR—can significantly boost mathematical reasoning in LLMs. The study shows that this effect generalizes across tasks, models, and domains. Interestingly, training on one example often enhances performance on unrelated domains. A simple data selection strategy based on training accuracy variance is proposed, but results show that even randomly chosen examples can yield major gains.
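
A hedged reconstruction of that variance-based selection might look like the following, assuming access to each candidate example's training-accuracy history across checkpoints (the data structure and toy values are invented for illustration):

```python
import statistics

def rank_by_accuracy_variance(history: dict[str, list[float]]) -> list[str]:
    """Rank training examples by the variance of their accuracy across
    checkpoints; high-variance examples are assumed to be the most
    informative candidates for 1-shot RLVR."""
    variances = {
        example_id: statistics.pvariance(accs)
        for example_id, accs in history.items()
        if len(accs) > 1
    }
    return sorted(variances, key=variances.get, reverse=True)

# Toy usage: accuracy of each candidate example over several checkpoints.
history = {
    "pi_1":  [0.1, 0.4, 0.8, 0.9],   # high variance: the model learns a lot
    "pi_2":  [0.9, 0.9, 0.9, 0.9],   # already solved: low variance
    "pi_13": [0.0, 0.2, 0.7, 1.0],
}
print(rank_by_accuracy_variance(history)[:1])  # top candidate for 1-shot RLVR
```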

The study evaluates the method using Qwen2.5-Math-1.5B as the primary model, along with Qwen2.5-Math-7B, Llama-3.2-3B-Instruct, and DeepSeek-R1-Distill-Qwen-1.5B. They use a 1,209-example subset of the DeepScaleR dataset for data selection, and the MATH dataset for comparison. Training uses the Verl pipeline with carefully chosen hyperparameters and batch configurations. Surprisingly, training with just one or two examples (notably π1 and π13) leads to strong generalization, even beyond math tasks. This "post-saturation generalization" persists despite signs of overfitting. The study also finds increased model self-reflection and shows that even simple examples can significantly enhance performance across domains.



In conclusion, the study explores the mechanisms behind the success of 1-shot RLVR, demonstrating that base models already possess strong reasoning abilities. Experiments show that even a single example can significantly improve performance on reasoning tasks, suggesting the model’s inherent capacity for reasoning. The study highlights that policy gradient loss is key to 1-shot RLVR’s effectiveness, with entropy loss further enhancing performance. Additionally, encouraging exploration through techniques like entropy regularization can improve post-saturation generalization. The findings also emphasize the need for careful data selection to optimize the model’s performance, particularly in data-constrained scenarios.




Check out the Paper and GitHub Page.


Mem0: A Scalable Memory Architecture Enabling Persistent, Structured Recall for Long-Term AI Conversations Across Sessions​


By Asif Razzaq

April 30, 2025

Large language models can generate fluent responses, emulate tone, and even follow complex instructions; however, they struggle to retain information across multiple sessions. This limitation becomes more pressing as LLMs are integrated into applications that require long-term engagement, such as personal assistance, health management, and tutoring. In real-life conversations, people recall preferences, infer behaviors, and construct mental maps over time. A person who mentioned their dietary restrictions last week expects those to be taken into account the next time food is discussed. Without mechanisms to store and retrieve such details across conversations, AI agents fail to offer consistency and reliability, undermining user trust.

The central challenge with today’s LLMs lies in their inability to persist relevant information beyond the boundaries of a conversation’s context window. These models rely on limited tokens, sometimes as high as 128K or 200K, but when long interactions span days or weeks, even these expanded windows fall short. More critically, the quality of attention degrades over distant tokens, making it harder for models to locate or utilize earlier context effectively. A user may bring up personal details, switch to a completely different topic, and return to the original subject much later. Without a robust memory system, the AI will likely ignore the previously mentioned facts. This creates friction, especially in scenarios where continuity is crucial. The issue is not just forgetting information, but also retrieving the wrong information from irrelevant parts of the conversation history due to token overflow and thematic drift.

Several attempts have been made to tackle this memory gap. Some systems rely on retrieval-augmented generation ( RAG ) techniques, which utilize similarity searches to retrieve relevant text chunks during a conversation. Others employ full-context approaches that simply refeed the entire conversation into the model, which increases latency and token costs. Proprietary memory solutions and open-source alternatives try to improve upon these by storing past exchanges in vector databases or structured formats. However, these methods often lead to inefficiencies, such as retrieving excessive irrelevant information or failing to consolidate updates in a meaningful manner. They also lack effective mechanisms to detect conflicting data or prioritize newer updates, leading to fragmented memories that hinder reliable reasoning.

A research team from Mem0.ai developed a new memory-focused system called Mem0 . This architecture introduces a dynamic mechanism to extract, consolidate, and retrieve information from conversations as they happen. The design enables the system to selectively identify useful facts from interactions, evaluate their relevance and uniqueness, and integrate them into a memory store that can be consulted in future sessions. The researchers also proposed a graph-enhanced version, Mem0g, which builds upon the base system by structuring information in relational formats. These models were tested using the LOCOMO benchmark and compared against six other categories of memory-enabled systems, including memory-augmented agents, RAG methods with varying configurations, full-context approaches, and both open-source and proprietary tools. Mem0 consistently achieved superior performance across all metrics.



The core of the Mem0 system involves two operational stages. In the first phase, the model processes pairs of messages, typically a user’s question and the assistant’s response, along with summaries of recent conversations. A combination of global conversation summaries and the last 10 messages serves as the input for a language model that extracts salient facts. These facts are then analyzed in the second phase, where they are compared with similar existing memories in a vector database. The top 10 most similar memories are retrieved, and a decision mechanism, referred to as a ‘tool call’, determines whether the fact should be added, updated, deleted, or ignored. These decisions are made by the LLM itself rather than a classifier, streamlining memory management and avoiding redundancies.
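
In outline, that two-phase loop reads like the sketch below. The `llm.extract_facts`, `store.top_k_similar`, and `llm.decide` calls are hypothetical stand-ins for the LLM prompts and vector-store queries the paper describes, not Mem0's actual API.

```python
from dataclasses import dataclass

@dataclass
class Memory:
    id: str
    text: str

def update_memory(user_msg: str, assistant_msg: str, summary: str, store, llm) -> None:
    """Hedged sketch of the two-phase pipeline described above."""
    # Phase 1: extract salient facts from the latest exchange plus summaries.
    facts = llm.extract_facts(user_msg, assistant_msg, summary)  # hypothetical LLM call

    # Phase 2: compare each fact with its nearest memories and apply a decision.
    for fact in facts:
        candidates = store.top_k_similar(fact, k=10)   # hypothetical vector search
        decision = llm.decide(fact, candidates)        # the "tool call" made by the LLM
        if decision.op == "ADD":
            store.add(Memory(id=decision.id, text=fact))
        elif decision.op == "UPDATE":
            store.update(decision.id, fact)
        elif decision.op == "DELETE":
            store.delete(decision.id)
        # "NOOP": the fact is redundant; nothing changes.
```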



The advanced variant, Mem0g, takes the memory representation a step further. It translates conversation content into a structured graph format, where entities, such as people, cities, or preferences, become nodes, and relationships, such as “lives in” or “prefers,” become edges. Each entity is labeled, embedded, and timestamped, while the relationships form triplets that capture the semantic structure of the dialogue. This format supports more complex reasoning across interconnected facts, allowing the model to trace relational paths across sessions. The conversion process uses LLMs to identify entities, classify them, and build the graph incrementally. For example, if a user discusses travel plans, the system creates nodes for cities, dates, and companions, thereby building a detailed and navigable structure of the conversation.
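
A minimal version of that graph construction, with entities as nodes and timestamped relation edges, might look like this sketch. Here `networkx` is a stand-in for whatever graph store Mem0g actually uses, and the triplets are ones an LLM might plausibly extract:

```python
import time
import networkx as nx

graph = nx.MultiDiGraph()  # entities as nodes, labeled relations as edges

def add_triplet(subject: str, relation: str, obj: str) -> None:
    """Insert one (subject, relation, object) triplet with a timestamp."""
    graph.add_node(subject)
    graph.add_node(obj)
    graph.add_edge(subject, obj, relation=relation, timestamp=time.time())

# Triplets an LLM might extract from "I live in Seattle and prefer window seats."
add_triplet("user", "lives_in", "Seattle")
add_triplet("user", "prefers", "window seats")

# Relational retrieval: trace edges out of an entity across sessions.
for _, obj, data in graph.out_edges("user", data=True):
    print(f"user --{data['relation']}--> {obj}")
```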



The performance metrics reported by the research team underscore the strength of both models. Mem0 showed a 26% improvement over OpenAI’s system when evaluated using the “LLM-as-a-Judge” metric. Mem0g, with its graph-enhanced design, achieved an additional 2% gain, pushing the total improvement to 28%. In terms of efficiency, Mem0 demonstrated 91% lower p95 latency than full-context methods, and more than 90% savings in token cost. This balance between performance and practicality is significant for production use cases, where response times and computational expenses are critical. The models also handled a wide range of question types, from single-hop factual lookups to multi-hop and open-domain queries, outperforming all other approaches in accuracy across categories.

Several Key takeaways from the research on Mem0 include:

  • Mem0 uses a two-step process to extract and manage salient conversation facts, combining recent messages and global summaries to form a contextual prompt.
  • Mem0g builds memory as a directed graph of entities and relationships, offering superior reasoning over complex information chains.
  • Mem0 surpassed OpenAI’s memory system with a 26% improvement on LLM-as-a-Judge, while Mem0g added an extra 2% gain, achieving 28% overall.
  • Mem0 achieved a 91% reduction in p95 latency and saved over 90% in token usage compared to full-context approaches.
  • These architectures maintain fast, cost-efficient performance even when handling multi-session dialogues, making them suitable for deployment in production settings.
  • The system is ideal for AI assistants in tutoring, healthcare, and enterprise settings where continuity of memory is essential.




Check out the Paper.


DeepSeek-AI Released DeepSeek-Prover-V2: An Open-Source Large Language Model Designed for Formal Theorem Proving through Subgoal Decomposition and Reinforcement Learning


By Asif Razzaq

May 1, 2025

Formal mathematical reasoning has evolved into a specialized subfield of artificial intelligence that requires strict logical consistency. Unlike informal problem solving, which allows for intuition and loosely defined heuristics, formal theorem proving relies on every step being fully described, precise, and verifiable by computational systems. Proof assistants, such as Lean, Coq, and Isabelle, serve as the structural frameworks within which these formal proofs are constructed. Their operation demands logical soundness with no space for omissions, approximations, or unstated assumptions. This makes the challenge particularly demanding for AI systems, especially large language models, which excel in producing coherent natural language responses but typically lack the rigor to produce verifiable formal proofs. However, the desire to blend these strengths, AI’s fluency in informal reasoning and the structure of formal verification, has led to new innovations at the interface of language modeling and formal logic automation.

A major issue arises from the inability of current language models to bridge the conceptual divide between informal and formal reasoning. Language models typically excel at generating human-like explanations and solving math problems written in natural language. However, this reasoning is inherently informal and often lacks the structural precision required by formal logic systems. While humans can intuitively leap from one deductive step to another, proof assistants require a fully specified sequence of steps, free of ambiguity. Thus, the challenge is to guide AI models to produce logically coherent formal outputs from their otherwise informal and intuitive internal reasoning processes. This problem becomes increasingly complex when handling advanced theorems from domains such as number theory or geometry, where precision is crucial.

Recent efforts have attempted to address this issue by guiding models first to generate natural language proof sketches, which are then manually or semi-automatically translated into formal proof steps. A known strategy includes decomposing a complex theorem into smaller subgoals. Each subgoal represents a lemma that can be tackled independently and later combined to form a complete proof. Frameworks like “Draft, Sketch, and Prove” have applied this idea, using language models to generate proof outlines that are then translated into formal language. Another method employs hierarchical reinforcement learning, breaking down complex mathematical problems into simpler layers. However, these models often struggle to produce fully verifiable outputs in Lean or Coq environments. Moreover, the training data for these models is usually limited, and proof attempts frequently fail to yield successful outcomes that provide useful learning signals.

A team of researchers from DeepSeek-AI has introduced a new model, DeepSeek-Prover-V2 , designed to generate formal mathematical proofs by leveraging subgoal decomposition and reinforcement learning. The core of their approach utilizes DeepSeek-V3 to break down a complex theorem into manageable subgoals, each of which is translated into a “have” statement in Lean 4 with a placeholder indicating that the proof is incomplete. These subgoals are then passed to a 7B-sized prover model that completes each proof step. Once all steps are resolved, they are synthesized into a complete Lean proof and paired with the original natural language reasoning generated by DeepSeek-V3. This forms a rich cold-start dataset for reinforcement learning. Importantly, the model’s training is entirely bootstrapped from synthetic data, with no human-annotated proof steps used.
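
To make the "have with placeholder" pattern concrete, here is a toy Lean 4 example (assuming Mathlib; the theorem is illustrative, not from the paper). Each `sorry` marks a subgoal the 7B prover would be asked to complete before the pieces are recombined:

```lean
import Mathlib

-- Illustrative decomposition: the planner emits `have` statements with
-- placeholders; the prover model fills each one in independently.
theorem sum_sq_nonneg (a b : ℝ) : 0 ≤ a ^ 2 + b ^ 2 := by
  have h1 : 0 ≤ a ^ 2 := by sorry  -- subgoal 1, delegated to the prover
  have h2 : 0 ≤ b ^ 2 := by sorry  -- subgoal 2, delegated to the prover
  exact add_nonneg h1 h2           -- synthesis into the final proof
```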



The cold-start pipeline begins by prompting DeepSeek-V3 to create proof sketches in natural language. These sketches are transformed into formal theorem statements with unresolved parts. A key innovation lies in recursively solving each subgoal using the 7B prover, reducing computation costs while maintaining formal rigor. Researchers constructed a curriculum learning framework that increased the complexity of training tasks over time. They also implemented two types of subgoal theorems, one incorporating preceding subgoals as premises, and one treating them independently. This dual structure was embedded into the model’s expert iteration stage to train it on progressively more challenging problem sets. The model’s capability was then reinforced through a consistency-based reward system during training, ensuring that all decomposed lemmas were correctly incorporated into the final formal proof.
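
The consistency-based reward can be sketched as a simple conjunction: the assembled proof must verify, and it must still contain every decomposed lemma. The function below is a toy reconstruction under those assumptions; `verifier` stands in for a call to the Lean 4 compiler.

```python
from typing import Callable

def consistency_reward(
    final_proof: str,
    subgoal_statements: list[str],
    verifier: Callable[[str], bool],
) -> float:
    """Toy reconstruction of the consistency-based reward described above:
    the proof must verify AND retain every decomposed subgoal statement.
    `verifier` would wrap the Lean 4 compiler; here it is a parameter."""
    if not verifier(final_proof):
        return 0.0
    if not all(stmt in final_proof for stmt in subgoal_statements):
        return 0.0  # a decomposed lemma was dropped or altered
    return 1.0

# Toy usage with a stand-in verifier that "accepts" any proof:
print(consistency_reward("have h1 : P := ...", ["have h1 : P"], lambda p: True))
```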



On the MiniF2F-test benchmark, the model achieved an 88.9% pass rate with high sampling (Pass@8192), compared to 82.0% by Kimina-Prover and 64.7% by Goedel-Prover. It also solved 49 out of 658 problems from PutnamBench, a platform featuring challenging mathematical tasks. On the newly introduced ProverBench dataset, comprising 325 formalized problems, the model solved 6 of the 15 problems drawn from the AIME (American Invitational Mathematics Examination) competitions for 2024 and 2025. These benchmarks highlight the model's generalization ability across multiple formal reasoning tasks. Even when compared to DeepSeek-V3, which employs natural-language reasoning, the new model demonstrates competitive performance, solving a comparable number of AIME problems while ensuring formal verifiability.



Several Key Takeaways from the Research on DeepSeek-Prover-V2:

  • DeepSeek-Prover-V2 achieved an 88.9% pass rate on the MiniF2F-test (Pass@8192), the highest reported among formal reasoning models so far.
  • The model successfully solved 49 out of 658 problems from the PutnamBench dataset, which contains advanced mathematical challenges.
  • It tackled 6 out of 15 problems from the recent AIME 2024–2025 competitions, showcasing real-world applicability.
  • A new benchmark, ProverBench, comprising 325 formal problems, has been introduced for evaluating formal reasoning models.
  • The pipeline unifies natural language proof sketching and formal proof construction by combining DeepSeek-V3 and a 7B prover model.
  • Two types of subgoal decompositions—one with and one without dependent premises—were used to train the model in a structured, curriculum-guided manner.
  • Reinforcement learning with a consistency-based reward significantly improved proof accuracy by enforcing structural alignment between sketch and solution.
  • The entire training strategy relies on synthetic cold-start data, eliminating dependence on manually labeled proofs.




Check out the Paper and GitHub Page.
