bnew

oxen.ai



Training a Rust 1.5B Coder LM with Reinforcement Learning (GRPO) | Oxen.ai​





Group Relative Policy Optimization (GRPO) has proven to be a useful algorithm for training LLMs to reason and improve on benchmarks. DeepSeek-R1 showed that you can bootstrap a model through a combination of supervised fine-tuning and GRPO to compete with state-of-the-art models such as OpenAI's o1.

To learn more about how it works in practice, we wanted to try out some of the techniques on a real-world task. This post will outline how to train your own custom small LLM using GRPO, your own data, and custom reward functions. Below is a sneak preview of some of the training curves we will see later. It is quite entertaining to watch the model learn to generate code blocks, get better at generating valid code that compiles, and finally generate code that passes unit tests.

[Figure: reward curves over the course of GRPO training]


If you want to jump straight into the action, the GitHub repository can be found here.

GitHub - Oxen-AI/GRPO-With-Cargo-Feedback: This repository has code for fine-tuning LLMs with GRPO specifically for Rust Programming using cargo as feedback




This post will not go into the fundamentals of GRPO; if you want to learn more about how it works at a fundamental level, feel free to check out our deep dive into the algorithm below.

Why GRPO is Important and How it Works | Oxen.ai

Last week on Arxiv Dives we dug into the research behind DeepSeek-R1, and uncovered that one of the techniques they use in their training pipeline is called Group Relative Policy Optimization (GRPO). At its core, GRPO is a Reinforcement Learning (RL) algorithm that is aimed at improving the model’s reasoning ability. It was first introduced in their paper DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models, but was also used in the post-training of DeepSeek-R1.

Oxen.ai





Why Rust?​


Rust seems like it would be a great playground for Reinforcement Learning (RL) because you have access to the Rust compiler and the cargo tooling. The Rust compiler gives great error messages and is pretty strict.

In this project, the first experiment we wanted to prove out was that you can use cargo as a feedback mechanism to teach a model to become a better programmer. The second experiment was to see how small a language model we could get away with. These experiments are purposely limited to a single H100 node to limit costs and show how accessible the training can be.
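To make the feedback loop concrete, here is a minimal sketch of what "cargo as a reward signal" can look like: drop the generated code into a scratch Cargo project, run cargo build and cargo test, and convert the outcomes into signals. The helper below is hypothetical (not the code from the Oxen-AI repo) and assumes cargo is installed locally.

```python
# Hypothetical helper, not the repo's actual code: compile and test a generated
# Rust snippet inside a throwaway Cargo project and report simple pass/fail signals.
import os
import subprocess
import tempfile

CARGO_TOML = """[package]
name = "scratch"
version = "0.1.0"
edition = "2021"
"""

def cargo_feedback(rust_source: str, timeout: int = 60) -> dict:
    with tempfile.TemporaryDirectory() as proj:
        os.makedirs(os.path.join(proj, "src"))
        with open(os.path.join(proj, "Cargo.toml"), "w") as f:
            f.write(CARGO_TOML)
        with open(os.path.join(proj, "src", "lib.rs"), "w") as f:
            f.write(rust_source)

        build = subprocess.run(["cargo", "build"], cwd=proj,
                               capture_output=True, text=True, timeout=timeout)
        test = subprocess.run(["cargo", "test"], cwd=proj,
                              capture_output=True, text=True, timeout=timeout)

    return {
        "compiles": build.returncode == 0,       # did rustc accept the code?
        "tests_pass": test.returncode == 0,      # did `cargo test` exit cleanly?
        "compiler_output": build.stderr,         # rustc's error messages, if any
    }
```

The strictness of rustc is exactly what makes this signal useful: a snippet that compiles and passes its own tests is far more likely to be correct than one that merely looks plausible.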

We are also a Rust dev shop at Oxen.ai, so we have some interesting applications in mind 🦀 x 🐂.



Why 1.5B?​


Recently, there has been a lot of work exploring how far we can push the boundaries of small language models for specific tasks. When you have a concrete feedback mechanism, such as the correct answer to a math problem or the output of a program, it seems you can shrink the model while maintaining very competitive performance.

The rStar-Math paper from Microsoft shows this in the domain of verifiable math problems, allowing the model to reason. Their 1.5B model outperforms GPT-4o and o1-preview.

[Figure: rStar-Math benchmark results]


rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

We present rStar-Math to demonstrate that small language models (SLMs) can rival or even surpass the math reasoning capability of OpenAI o1, without distillation from superior models. rStar-Math achieves this by exercising “deep thinking” through Monte Carlo Tree Search (MCTS), where a math policy SLM performs test-time search guided by an SLM-based process reward model. rStar-Math introduces three innovations to tackle the challenges in training the two SLMs: (1) a novel code-augmented CoT data synthesis method, which performs extensive MCTS rollouts to generate step-by-step verified reasoning trajectories used to train the policy SLM; (2) a novel process reward model training method that avoids naïve step-level score annotation, yielding a more effective process preference model (PPM); (3) a self-evolution recipe in which the policy SLM and PPM are built from scratch and iteratively evolved to improve reasoning capabilities. Through 4 rounds of self-evolution with millions of synthesized solutions for 747k math problems, rStar-Math boosts SLMs’ math reasoning to state-of-the-art levels. On the MATH benchmark, it improves Qwen2.5-Math-7B from 58.8% to 90.0% and Phi3-mini-3.8B from 41.4% to 86.4%, surpassing o1-preview by +4.5% and +0.9%. On the USA Math Olympiad (AIME), rStar-Math solves an average of 53.3% (8/15) of problems, ranking among the top 20% of the brightest high school math students. Code and data will be available at https://github.com/microsoft/rStar.

arXiv.org · Xinyu Guan



My hypothesis is that we can push a similar level of performance on coding, since you have a similarly verifiable reward: does the code compile, and does it pass unit tests?



Benefits of Smol LMs​


Small coding models have many benefits, including cost, throughput, data privacy, and the ability to customize them to your own codebase and coding practices. Plus, it's just a fun challenge.

The dream would be to eventually have this small model do all the Cursor-like tasks of next-tab prediction and fill-in-the-middle, and improve its code in an agent loop. But let’s start simple.



Formulating the Problem​


There are a few different ways you could structure the problem of writing code that passes unit tests. We ended up trying a few. A seemingly straightforward option would be to have a set of verifiable unit tests that must pass given the generated code. This would give us a gold standard set of verifiable answers.

[Diagram: prompt → generated code → evaluated against pre-built unit tests]


After trying out this flow we found two main problems. First, if you don’t let the model see the unit tests while writing the code, it has no sense of the interface it is writing for. Many of the errors ended up being type or naming mismatches between the generated code and the pre-built, verified unit tests.

[Figure: common error types when evaluating against pre-built unit tests]


Second, if you allow the model to see the unit tests while it’s writing the code, you lose out on developer experience. Unless you are a hardcore “Test-Driven Developer”, you probably just want to send in a prompt and not think about the function definition or unit tests yet.

Rather than trying to come up with something more clever, we ended up optimizing for simplicity. We reformulated the problem to have the model generate the code and the tests within the same response.

[Diagram: simplified flow in which the model generates the code and the unit tests in a single response]


With a single pass there is a danger of the model hacking the reward function by making the functions and unit tests trivial. For example, it could emit only println! statements and no assert statements so that everything compiles and passes. We will return to putting guardrails in place for this later.
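To make the idea concrete ahead of time, a guardrail like that can be expressed as a reward function that inspects the completion and only pays out when the unit tests are non-trivial. The sketch below is a hypothetical example of such a check, not the exact reward function used in this post.

```python
# Hypothetical guardrail reward: favor completions that contain a Rust code block
# with at least one real unit test that asserts something.
import re

def non_trivial_test_reward(completion: str) -> float:
    blocks = re.findall(r"```rust\n(.*?)```", completion, re.DOTALL)
    if not blocks:
        return 0.0                      # no Rust code block at all
    code = "\n".join(blocks)
    reward = 0.5                        # well-formed code block
    if "#[test]" in code:
        reward += 0.25                  # defines at least one unit test
    if re.search(r"\bassert(_eq|_ne)?!", code):
        reward += 0.25                  # the tests actually assert something
    return reward
```

Scores like this can be summed with the compile and test-pass rewards, so the model cannot collect full reward by generating trivially passing code.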

Finally, we add a verbose system prompt to give the model guidance on the task.

[Figure: the system prompt used during training]


The system prompt gives the model some context on the format and style in which we expect it to answer user queries.



The Dataset​


Before training, we need a dataset. When starting out, we did not see many datasets targeted at Rust. Many of the LLM benchmarks are targeted at Python. So the first thing we did was convert a dataset of prompts asking Pythonic questions to a dataset of Rust prompts.

We took a random 20k prompts from the Ace-Code-87k dataset. We then used Qwen 2.5 Coder 32B Instruct to write Rust code and unit tests. We ran the code and unit tests through the compiler and testing framework to filter out any triples that did not pass the unit tests. This left us with 16,500 (prompt, code, unit_test) triples that we could train and evaluate on. The dataset was split into 15,000 train, 1,000 test, and 500 evaluation data points.
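The filtering step can be pictured as a simple loop over the generated triples, reusing the hypothetical cargo_feedback() helper sketched earlier; the field names below are illustrative assumptions, not the dataset's actual schema.

```python
# Sketch of the filter-then-split step: keep only triples whose generated code
# compiles and whose generated tests pass, then carve out train/test/eval splits.
import random

def filter_and_split(triples, seed=42):
    passing = []
    for t in triples:
        # Concatenate the generated function and its tests into one lib.rs
        result = cargo_feedback(t["rust_code"] + "\n" + t["rust_test"])
        if result["compiles"] and result["tests_pass"]:
            passing.append(t)

    random.Random(seed).shuffle(passing)
    # Roughly the split described above: 15,000 train / 1,000 test / 500 eval.
    return passing[:15_000], passing[15_000:16_000], passing[16_000:16_500]
```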

The final data looks like the following:

[Figure: sample rows from the final dataset]


ox/Rust/cargo_test_passed_train.parquet at main

This is a dataset of Rust questions and generated code, created to fine-tune small language models on Rust.





You can follow the prompts and steps by looking at these model runs:

1) Translate to Rust: https://www.oxen.ai/ox/mbrp-playground/evaluations/ce45630c-d9e8-4fac-9b41-2d41692076b3

2) Write Rust code: https://www.oxen.ai/ox/mbrp-playground/evaluations/febc562a-9bd4-4e91-88d7-a95ee676a5ed

3) Write Rust unit tests: https://www.oxen.ai/ox/mbrp-playground/evaluations/b886ddd6-b501-4db8-8ed6-0b719d0ac595

Funnily enough, for the final formulation of the GRPO training we ended up throwing away the gold-standard Rust code and unit test columns. With our reinforcement learning loop we only need the prompts as input, which makes it pretty easy to collect more data in the future. We’ll dive into how the single prompt as input works in the following sections. Even though we threw away the code and unit tests for training, it was nice to know the prompts are solvable.
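For a sense of how a prompts-only GRPO loop can be wired up, here is a minimal sketch using TRL's GRPOTrainer. It assumes a recent trl release; the reward shaping, base model choice, and file names are illustrative assumptions, not necessarily what was used in the post.

```python
# Sketch: GRPO over prompts only, with cargo feedback as the reward.
# Assumes the hypothetical cargo_feedback() helper from earlier.
import re
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def cargo_reward(completions, **kwargs):
    rewards = []
    for completion in completions:
        match = re.search(r"```rust\n(.*?)```", completion, re.DOTALL)
        if not match:
            rewards.append(0.0)
            continue
        result = cargo_feedback(match.group(1))
        rewards.append(float(result["compiles"]) + 2.0 * float(result["tests_pass"]))
    return rewards

# Only the prompt column is needed; the gold code and unit test columns are dropped.
train_ds = load_dataset("parquet", data_files="cargo_test_passed_train.parquet")["train"]
train_ds = train_ds.select_columns(["prompt"])

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-Coder-1.5B-Instruct",   # assumed 1.5B base model
    reward_funcs=[cargo_reward],
    args=GRPOConfig(output_dir="grpo-rust", num_generations=8,
                    max_completion_length=1024),
    train_dataset=train_ds,
)
trainer.train()
```

Each prompt is sampled several times per step, and the relative ranking of rewards within that group is what drives the policy update.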
 

bnew

[New Model] olmOCR-7B-faithful by TNG, a fine-tuned version of olmOCR-7B-0225-preview



Posted on Fri Apr 25 11:14:56 2025 UTC


A fine-tuned version of olmOCR-7B-0225-preview that aims to extract all information from documents, including header and footer information.

Release article: Finetuning olmOCR to be a faithful OCR-Engine
 

bnew

Sleep Time Compute - AI That "Thinks" 24/7!



Channel: Matthew Berman (460K subscribers)


0:00 Intro: AI That Thinks BEFORE You Ask?
0:13 Introducing Sleep-Time Compute
0:59 The Problem with Standard Test-Time Compute (Cost & Latency)
2:58 Stateful LLM Applications (Code, Docs, Chat)
3:33 Sleep Time vs. Test Time (Diagram Explained)
4:51 Why Sleep-Time is More Cost-Effective
6:00 Defining Sleep-Time Compute
6:26 Sponsor: Mammoth (Generative AI Platform)
7:18 Paper Details: How They Tested Non-Reasoning Models
9:24 Benchmarking Sleep-Time (The Juggle Example)
10:05 Models Used (GPT-4o, Claude, DeepSeek, etc.)
10:25 Results: Non-Reasoning Models (Graphs)
12:18 Results: Reasoning Models (Graphs)
13:39 Sleep Time vs. Parallel Sampling (A Big Issue)
14:41 Scaling Sleep-Time Compute
15:45 Amortizing Cost Across Queries (Why it's Cheaper!)
16:48 Predictable Queries Benefit Most
18:04 Paper Summary & Future Directions
18:40 Outro & Newsletter
 

CodeKansas

Anybody having issues with image gen on Gemini?

It's been telling me that it's been having issues for almost 48hrs now.
 

bnew

Qwen 3 benchmark results (with reasoning)


Posted on Mon Apr 28 21:03:53 2025 UTC

[Images: Qwen3 benchmark result tables]













1/11
@Alibaba_Qwen
Introducing Qwen3!

We release and open-weight Qwen3, our latest large language models, including 2 MoE models and 6 dense models, ranging from 0.6B to 235B. Our flagship model, Qwen3-235B-A22B, achieves competitive results in benchmark evaluations of coding, math, general capabilities, etc., when compared to other top-tier models such as DeepSeek-R1, o1, o3-mini, Grok-3, and Gemini-2.5-Pro. Additionally, the small MoE model, Qwen3-30B-A3B, outcompetes QwQ-32B, which has 10 times the activated parameters, and even a tiny model like Qwen3-4B can rival the performance of Qwen2.5-72B-Instruct.

For more information, feel free to try them out in Qwen Chat Web (Qwen Chat) and APP and visit our GitHub, HF, ModelScope, etc.

Blog: Qwen3: Think Deeper, Act Faster
GitHub: GitHub - QwenLM/Qwen3: Qwen3 is the large language model series developed by Qwen team, Alibaba Cloud.
Hugging Face: Qwen3 - a Qwen Collection
ModelScope: Qwen3

The post-trained models, such as Qwen3-30B-A3B, along with their pre-trained counterparts (e.g., Qwen3-30B-A3B-Base), are now available on platforms like Hugging Face, ModelScope, and Kaggle. For deployment, we recommend using frameworks like SGLang and vLLM. For local usage, tools such as Ollama, LMStudio, MLX, llama.cpp, and KTransformers are highly recommended. These options ensure that users can easily integrate Qwen3 into their workflows, whether in research, development, or production environments.

Hope you enjoy our new models!





2/11
@Alibaba_Qwen
Qwen3 exhibits scalable and smooth performance improvements that are directly correlated with the computational reasoning budget allocated. This design enables users to configure task-specific budgets with greater ease, achieving a more optimal balance between cost efficiency and inference quality.





3/11
@Alibaba_Qwen
Qwen3 models support 119 languages and dialects. This extensive multilingual capability opens up new possibilities for international applications, enabling users worldwide to benefit from the power of these models.





4/11
@Alibaba_Qwen
We have optimized the Qwen3 models for coding and agentic capabilities, and we have also strengthened support for MCP. Below we provide examples to show how Qwen3 thinks and interacts with the environment.



https://video.twimg.com/amplify_video/1916955612430397440/vid/avc1/1156x720/iUcPwb2A3t9kjUiE.mp4

5/11
@Alibaba_Qwen






6/11
@Alibaba_Qwen
We also evaluated the preliminary performance of Qwen3-235B-A22B on the open-source coding agent Openhands. It scored 34.4% on SWE-bench Verified, a competitive result with fewer parameters! Thanks to @allhands_ai for providing an easy-to-use agent. Both open models and open agents are exciting!





7/11
@MavMikee
@OpenRouterAI 🔥



8/11
@ofermend
Congrats. Will evaluate this for @vectara hallucination leaderboard and publish the results shortly.



9/11
@wilsonsilva90
Is there any chance to partner with @GroqInc or @CerebrasSystems?



10/11
@vansinhu






11/11
@Fabeyy1337
@theo where this??
















1/12
@godofprompt
🚨 BREAKING: Alibaba just launched "Qwen3" its most powerful AI model yet.

It thinks deeper, acts faster, and it outperforms models like DeepSeek-R1, Grok 3 and Gemini-2.5-Pro.

Here are 5 insane examples of what it can do:





2/12
@godofprompt
1. Complex Reasoning:

Qwen3 solved a classic logic puzzle step-by-step without rushing.

Each step was reasoned out clearly, leading to the right answer with full explanations.

No hallucination.
No skipping steps.

Just pure thinking.



https://video.twimg.com/amplify_video/1917147575427493888/vid/avc1/1386x720/3Sh4HKHCWtVFAgBR.mp4

3/12
@godofprompt
2. Instant Answers:

When asked to list creative reasons why pizza beats salad, Qwen3 replied in under 3 seconds.

Funny, punchy, and perfectly structured.

It switches between slow thinking and instant output almost like a human.



https://video.twimg.com/amplify_video/1917147613083938816/vid/avc1/1402x720/5SYvNoQT61v9zY0a.mp4

4/12
@godofprompt
3. Multilingual Genius:

Qwen3 explained Einstein’s Theory of Relativity in Arabic, French, and Hindi.

Simple enough for a 10-year-old to understand.

And it kept the tone and complexity natural across languages no awkward translation.



https://video.twimg.com/amplify_video/1917147638044229632/vid/avc1/1388x720/3xKS2j_yKDSH-KeV.mp4

5/12
@godofprompt
4. Real-World Coding:

Qwen3 wrote a clean Python script to scrape headlines from the New York Times and save them into a CSV.

Fully working, no bugs, no missing imports.

It even suggested adding error handling without being asked.



https://video.twimg.com/amplify_video/1917147688015187968/vid/avc1/1380x720/6w-0n_KOS0TsBom9.mp4

6/12
@godofprompt
5. Agentic Planning:

Given a tight $500 budget, Qwen3 built a full 3-day Tokyo itinerary: sightseeing, culture, shopping.

It calculated transport costs, entry fees, food expenses and recommended hacks to save money.

It thinks like a planner, not just a text generator.



https://video.twimg.com/amplify_video/1917147728100065280/vid/avc1/1410x720/276bPkPpbwUGmLpI.mp4

7/12
@godofprompt
Qwen3 isn’t just another model.

It’s built for real tasks, real users, and real-world complexity at insane speed and depth.

Try it here at: Qwen Chat



8/12
@godofprompt
Which Qwen3 ability are you most excited to try?

Deep thinking? Agent planning? Code generation?

Curious to hear.



9/12
@ihteshamit
Insanely powerful models dropped by Alibaba for Qwen.

I'm shocked asf



10/12
@godofprompt
Can’t wait to use it



11/12
@hasantoxr
crazy... Alibaba new models are pretty awesome!!



12/12
@godofprompt
Yeah








1/5
@MaziyarPanahi
🚨 All the new Qwen3 models dropped on @huggingface are under Apache 2.0! No research-only licenses this time!

Open-source community wins big! 🤗





2/5
@MaziyarPanahi
Just take them! Qwen3 - a Qwen Collection



3/5
@caviterginsoy
Super of them, absolutely super



4/5
@lgaa201
🗿 our new toys



5/5
@paulcx
But two base models are missing 😃







1/1
@BayAreaTimes
JUST IN: Alibaba’s Qwen3 open-weight “hybrid” AI models debut in 0.6B-235B parameters

- 2 MoE and 6 dense models in total in the Qwen3 family.

- The models can “reason” through complex problems or quickly answer simple requests.

- Support 119 languages and trained on a dataset of ~36T tokens.










1/3
@victormustar
Qwen3-30B-A3B has hit the mark 🎯

Currently running it on my laptop at 100 tokens/sec (MLX) with blender-MCP, it's fast, it's clinical, it just works... Local AI will never be the same... 🥹

[Quoted tweet]
BOOOOM: Today I'm dropping TINY AGENTS

the 50 lines of code Agent in Javascript 🔥

I spent the last few weeks working on this, so I hope you will like it.

I've been diving into MCP (Model Context Protocol) to understand what the hype was all about.

It is fairly simple, but still quite powerful: MCP is a standard API to expose sets of Tools that can be hooked to LLMs.

But while doing that, came my second realization:

Once you have a MCP Client, an Agent is literally just a while loop on top of it. 🤯




https://video.twimg.com/amplify_video/1917224345929216000/vid/avc1/1920x1080/Ek8malpsukDEO0Qk.mp4

2/3
@victormustar
Stack is 100% free: LM Studio + Blender MCP + TINY AGENTS
https://github.com/huggingface/huggingface.js/pull/1396



3/3
@lifeafterAi_
Bro help me. I’m thinking about buying Mac mini. Which one would u suggest me to run 30b smoothly




 

bnew


Qwen3: Think Deeper, Act Faster​


April 29, 2025 · 10 min · 2036 words · Qwen Team


Qwen3 Main Image


QWEN CHAT GitHub Hugging Face ModelScope Kaggle DEMO DISCORD

Introduction​


Today, we are excited to announce the release of Qwen3, the latest addition to the Qwen family of large language models. Our flagship model, Qwen3-235B-A22B, achieves competitive results in benchmark evaluations of coding, math, general capabilities, etc., when compared to other top-tier models such as DeepSeek-R1, o1, o3-mini, Grok-3, and Gemini-2.5-Pro. Additionally, the small MoE model, Qwen3-30B-A3B, outcompetes QwQ-32B, which has 10 times the activated parameters, and even a tiny model like Qwen3-4B can rival the performance of Qwen2.5-72B-Instruct.

[Figures: benchmark results for Qwen3-235B-A22B and Qwen3-30B-A3B]


We are open-weighting two MoE models: Qwen3-235B-A22B, a large model with 235 billion total parameters and 22 billion activated parameters, and Qwen3-30B-A3B, a smaller MoE model with 30 billion total parameters and 3 billion activated parameters. Additionally, six dense models are also open-weighted, including Qwen3-32B, Qwen3-14B, Qwen3-8B, Qwen3-4B, Qwen3-1.7B, and Qwen3-0.6B, under Apache 2.0 license.

Dense models:
Models          | Layers | Heads (Q / KV) | Tie Embedding | Context Length
Qwen3-0.6B      | 28     | 16 / 8         | Yes           | 32K
Qwen3-1.7B      | 28     | 16 / 8         | Yes           | 32K
Qwen3-4B        | 36     | 32 / 8         | Yes           | 32K
Qwen3-8B        | 36     | 32 / 8         | No            | 128K
Qwen3-14B       | 40     | 40 / 8         | No            | 128K
Qwen3-32B       | 64     | 64 / 8         | No            | 128K

MoE models:
Models          | Layers | Heads (Q / KV) | # Experts (Total / Activated) | Context Length
Qwen3-30B-A3B   | 48     | 32 / 4         | 128 / 8                       | 128K
Qwen3-235B-A22B | 94     | 64 / 4         | 128 / 8                       | 128K

The post-trained models, such as Qwen3-30B-A3B, along with their pre-trained counterparts (e.g., Qwen3-30B-A3B-Base), are now available on platforms like Hugging Face, ModelScope, and Kaggle. For deployment, we recommend using frameworks like SGLang and vLLM. For local usage, tools such as Ollama, LMStudio, MLX, llama.cpp, and KTransformers are highly recommended. These options ensure that users can easily integrate Qwen3 into their workflows, whether in research, development, or production environments.
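As a quick illustration of the recommended deployment path, here is a minimal offline-inference sketch using vLLM; the model name comes from the release, but the prompt and sampling settings are placeholder assumptions.

```python
# Sketch: running Qwen3-30B-A3B locally with vLLM's offline LLM API.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-30B-A3B")
params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=1024)

conversation = [{"role": "user", "content": "Summarize what an MoE model is in two sentences."}]
outputs = llm.chat(conversation, params)
print(outputs[0].outputs[0].text)
```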

We believe that the release and open-sourcing of Qwen3 will significantly advance the research and development of large foundation models. Our goal is to empower researchers, developers, and organizations around the world to build innovative solutions using these cutting-edge models.

Feel free to try Qwen3 out in Qwen Chat Web (chat.qwen.ai) and mobile APP!

Key Features​


  • Hybrid Thinking Modes

Qwen3 models introduce a hybrid approach to problem-solving. They support two modes:

  1. Thinking Mode: In this mode, the model takes time to reason step by step before delivering the final answer. This is ideal for complex problems that require deeper thought.
  2. Non-Thinking Mode: Here, the model provides quick, near-instant responses, suitable for simpler questions where speed is more important than depth.

This flexibility allows users to control how much “thinking” the model performs based on the task at hand. For example, harder problems can be tackled with extended reasoning, while easier ones can be answered directly without delay. Crucially, the integration of these two modes greatly enhances the model’s ability to implement stable and efficient thinking budget control. As demonstrated above, Qwen3 exhibits scalable and smooth performance improvements that are directly correlated with the computational reasoning budget allocated. This design enables users to configure task-specific budgets with greater ease, achieving a more optimal balance between cost efficiency and inference quality.

[Figure: performance scaling with the thinking budget]
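In code, the mode switch is exposed through the chat template. The snippet below is a sketch assuming the enable_thinking flag that Qwen3's chat template accepts in transformers; the model size and prompt are arbitrary.

```python
# Sketch: toggling Qwen3's thinking vs. non-thinking mode via the chat template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-4B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "How many prime numbers are there below 100?"}]

# Thinking mode: the model reasons step by step before the final answer.
text = tokenizer.apply_chat_template(messages, tokenize=False,
                                     add_generation_prompt=True, enable_thinking=True)
# For non-thinking (fast) mode, pass enable_thinking=False instead.

inputs = tokenizer(text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```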


  • Multilingual Support

Qwen3 models support 119 languages and dialects. This extensive multilingual capability opens up new possibilities for international applications, enabling users worldwide to benefit from the power of these models.

Language Family | Languages & Dialects
Indo-European | English, French, Portuguese, German, Romanian, Swedish, Danish, Bulgarian, Russian, Czech, Greek, Ukrainian, Spanish, Dutch, Slovak, Croatian, Polish, Lithuanian, Norwegian Bokmål, Norwegian Nynorsk, Persian, Slovenian, Gujarati, Latvian, Italian, Occitan, Nepali, Marathi, Belarusian, Serbian, Luxembourgish, Venetian, Assamese, Welsh, Silesian, Asturian, Chhattisgarhi, Awadhi, Maithili, Bhojpuri, Sindhi, Irish, Faroese, Hindi, Punjabi, Bengali, Oriya, Tajik, Eastern Yiddish, Lombard, Ligurian, Sicilian, Friulian, Sardinian, Galician, Catalan, Icelandic, Tosk Albanian, Limburgish, Dari, Afrikaans, Macedonian, Sinhala, Urdu, Magahi, Bosnian, Armenian
Sino-Tibetan | Chinese (Simplified Chinese, Traditional Chinese, Cantonese), Burmese
Afro-Asiatic | Arabic (Standard, Najdi, Levantine, Egyptian, Moroccan, Mesopotamian, Ta’izzi-Adeni, Tunisian), Hebrew, Maltese
Austronesian | Indonesian, Malay, Tagalog, Cebuano, Javanese, Sundanese, Minangkabau, Balinese, Banjar, Pangasinan, Iloko, Waray (Philippines)
Dravidian | Tamil, Telugu, Kannada, Malayalam
Turkic | Turkish, North Azerbaijani, Northern Uzbek, Kazakh, Bashkir, Tatar
Tai-Kadai | Thai, Lao
Uralic | Finnish, Estonian, Hungarian
Austroasiatic | Vietnamese, Khmer
Other | Japanese, Korean, Georgian, Basque, Haitian, Papiamento, Kabuverdianu, Tok Pisin, Swahili

  • Improved Agentic Capabilities

We have optimized the Qwen3 models for coding and agentic capabilities, and we have also strengthened support for MCP. Below we provide examples to show how Qwen3 thinks and interacts with the environment.

Pre-training​


In terms of pretraining, the dataset for Qwen3 has been significantly expanded compared to Qwen2.5. While Qwen2.5 was pre-trained on 18 trillion tokens, Qwen3 uses nearly twice that amount, with approximately 36 trillion tokens covering 119 languages and dialects. To build this large dataset, we collected data not only from the web but also from PDF-like documents. We used Qwen2.5-VL to extract text from these documents and Qwen2.5 to improve the quality of the extracted content. To increase the amount of math and code data, we used Qwen2.5-Math and Qwen2.5-Coder to generate synthetic data. This includes textbooks, question-answer pairs, and code snippets.

The pre-training process consists of three stages. In the first stage (S1), the model was pretrained on over 30 trillion tokens with a context length of 4K tokens. This stage provided the model with basic language skills and general knowledge. In the second stage (S2), we improved the dataset by increasing the proportion of knowledge-intensive data, such as STEM, coding, and reasoning tasks. The model was then pretrained on an additional 5 trillion tokens. In the final stage, we used high-quality long-context data to extend the context length to 32K tokens. This ensures the model can handle longer inputs effectively.
 