
Offline Video-LLMs Can Now Understand Real-Time Streams: Apple Researchers Introduce StreamBridge to Enable Multi-Turn and Proactive Video Understanding​


By Sajjad Ansari

May 12, 2025

Video-LLMs process whole pre-recorded videos at once. However, applications like robotics and autonomous driving need causal perception and interpretation of visual information online. This fundamental mismatch shows a limitation of current Video-LLMs, as they are not naturally designed to operate in streaming scenarios where timely understanding and responsiveness are paramount. The transition from offline to streaming video understanding presents two key challenges. First, multi-turn real-time understanding requires models to process the most recent video segment while maintaining historical visual and conversational context. Second, proactive response generation demands human-like behavior where the model actively monitors the visual stream and provides timely outputs based on unfolding content without explicit prompts.

Video-LLMs have gained significant attention for video understanding, combining visual encoders, modality projectors, and LLMs to generate contextual responses from video content. Several approaches have emerged to address the challenge of streaming video understanding. VideoLLMOnline and Flash-VStream introduced specialized online objectives and memory architectures for handling sequential inputs. MMDuet and ViSpeak developed dedicated components for proactive response generation. Multiple benchmark suites have been used to evaluate streaming capabilities, including StreamingBench, StreamBench, SVBench, OmniMMI, and OVO-Bench.

Researchers from Apple and Fudan University have proposed StreamBridge, a framework to transform offline Video-LLMs into streaming-capable models. It addresses two fundamental challenges in adapting existing models to online scenarios: limited capability for multi-turn real-time understanding and lack of proactive response mechanisms. StreamBridge combines a memory buffer with a round-decayed compression strategy, supporting long-context interactions. It also incorporates a decoupled, lightweight activation model that integrates seamlessly with existing Video-LLMs for proactive response generation. Further, the researchers introduced Stream-IT, a large-scale dataset designed for streaming video understanding, featuring interleaved video-text sequences and diverse instruction formats.
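
The round-decayed compression idea can be sketched generically: keep one group of frame embeddings per dialogue round and, once a token budget is exceeded, merge frames in the oldest rounds first so that the most recent context stays at full resolution. The snippet below is an illustrative PyTorch sketch of that policy (the class and method names are hypothetical, not Apple's implementation):

import torch

class RoundDecayedMemory:
    """Illustrative sketch (not Apple's code): a per-round buffer of frame embeddings
    that is compressed oldest-round-first whenever a token budget is exceeded."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.rounds = []  # one [num_frames, dim] tensor per dialogue round

    def add_round(self, frame_embeds: torch.Tensor) -> None:
        self.rounds.append(frame_embeds)
        self._compress()

    def _total_tokens(self) -> int:
        return sum(r.shape[0] for r in self.rounds)

    def _compress(self) -> None:
        # Merge adjacent frames (2x average pooling) in the oldest rounds first,
        # so the most recent visual context is kept at full resolution.
        i = 0
        while self._total_tokens() > self.max_tokens and i < len(self.rounds):
            r = self.rounds[i]
            if r.shape[0] > 1:
                n = (r.shape[0] // 2) * 2
                pooled = r[:n].reshape(-1, 2, r.shape[-1]).mean(dim=1)
                self.rounds[i] = torch.cat([pooled, r[n:]], dim=0)
            else:
                i += 1  # this round is already a single token; move to the next oldest

Repeated calls to add_round (for example, with a [64, 1024] tensor per turn) keep the total token count under the budget while progressively coarsening older rounds.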



The StreamBridge framework is evaluated using mainstream offline Video-LLMs: LLaVA-OV-7B, Qwen2-VL-7B, and Oryx-1.5-7B. To preserve general video understanding capabilities, the Stream-IT dataset is supplemented with approximately 600K samples from established datasets, including LLaVA-178K, VCG-Plus, and ShareGPT4Video. OVO-Bench and StreamingBench are used for multi-turn real-time understanding, focusing on their real-time tasks. General video understanding is evaluated across seven benchmarks, including three short-video datasets (MVBench, PerceptionTest, TempCompass) and four long-video benchmarks (EgoSchema, LongVideoBench, MLVU, VideoMME).

The evaluation results show that Qwen2-VL † improved, with average scores increasing from 55.98 to 63.35 on OVO-Bench and from 69.04 to 72.01 on StreamingBench. In contrast, LLaVA-OV † experiences slight performance decreases, dropping from 64.02 to 61.64 on OVO-Bench and from 71.12 to 68.39 on StreamingBench. Fine-tuning on the Stream-IT dataset yields substantial improvements across all models: Oryx-1.5 † achieves gains of +11.92 on OVO-Bench and +4.2 on StreamingBench. Moreover, Qwen2-VL † reaches average scores of 71.30 on OVO-Bench and 77.04 on StreamingBench after Stream-IT fine-tuning, outperforming even proprietary models like GPT-4o and Gemini 1.5 Pro, showing the effectiveness of StreamBridge’s approach in enhancing streaming video understanding capabilities.

In conclusion, researchers introduced StreamBridge, a method to transform offline Video-LLMs into effective streaming-capable models. Its dual innovations, a memory buffer with round-decayed compression strategy and a decoupled lightweight activation model, address the core challenges of streaming video understanding without compromising general performance. Further, the Stream-IT dataset is introduced for streaming video understanding, with specialized interleaved video-text sequences. As streaming video understanding becomes increasingly essential in robotics and autonomous driving, StreamBridge offers a generalizable solution that transforms static Video-LLMs into dynamic, responsive systems capable of meaningful interaction in continuously evolving visual environments.




Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 90k+ ML SubReddit.



 


Meet LangGraph Multi-Agent Swarm: A Python Library for Creating Swarm-Style Multi-Agent Systems Using LangGraph​


By Sana Hassan

May 15, 2025

LangGraph Multi-Agent Swarm is a Python library designed to orchestrate multiple AI agents as a cohesive “swarm.” It builds on LangGraph, a framework for constructing robust, stateful agent workflows, to enable a specialized form of multi-agent architecture. In a swarm, agents with different specializations dynamically hand off control to one another as tasks demand, rather than a single monolithic agent attempting everything. The system tracks which agent was last active so that when a user provides the next input, the conversation seamlessly resumes with that same agent. This approach addresses the problem of building cooperative AI workflows where the most qualified agent can handle each sub-task without losing context or continuity.

LangGraph Swarm aims to make such multi-agent coordination easier and more reliable for developers. It provides abstractions to link individual language model agents (each potentially with their tools and prompts) into one integrated application. The library comes with out-of-the-box support for streaming responses, short-term and long-term memory integration, and even human-in-the-loop intervention, thanks to its foundation on LangGraph. By leveraging LangGraph (a lower-level orchestration framework) and fitting naturally into the broader LangChain ecosystem, LangGraph Swarm allows machine learning engineers and researchers to build complex AI agent systems while maintaining explicit control over the flow of information and decisions.

LangGraph Swarm Architecture and Key Features


At its core, LangGraph Swarm represents multiple agents as nodes in a directed state graph: edges define handoff pathways, and a shared state field tracks the ‘active_agent’. When an agent invokes a handoff, the library updates that field and transfers the necessary context so the next agent seamlessly continues the conversation. This setup supports collaborative specialization, letting each agent focus on a narrow domain while offering customizable handoff tools for flexible workflows. Built on LangGraph’s streaming and memory modules, Swarm preserves short-term conversational context and long-term knowledge, ensuring coherent, multi-turn interactions even as control shifts between agents.

Agent Coordination via Handoff Tools


LangGraph Swarm’s handoff tools let one agent transfer control to another by issuing a ‘Command’ that updates the shared state, switching the ‘active_agent’ and passing along context, such as relevant messages or a custom summary. While the default tool hands off the full conversation and inserts a notification, developers can implement custom tools to filter context, add instructions, or rename the action to influence the LLM’s behavior. Unlike autonomous AI-routing patterns, Swarm’s routing is explicitly defined: each handoff tool specifies which agent may take over, ensuring predictable flows. This mechanism supports collaboration patterns, such as a “Travel Planner” delegating medical questions to a “Medical Advisor” or a coordinator distributing technical and billing queries to specialized experts. It relies on an internal router to direct user messages to the current agent until another handoff occurs.
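
For instance, assuming langgraph-swarm's create_handoff_tool helper (argument names may vary across versions), giving the “Travel Planner” agent a tool like the one below is enough to declare that it may delegate to a “Medical Advisor”:

from langgraph_swarm import create_handoff_tool

# A tool the "Travel Planner" agent can call to delegate health-related questions.
transfer_to_medical_advisor = create_handoff_tool(
    agent_name="MedicalAdvisor",
    description="Transfer to the Medical Advisor for health and vaccination questions.",
)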

State Management and Memory


Managing state and memory is essential for preserving context as agents hand off tasks. By default, LangGraph Swarm maintains a shared state, containing the conversation history and an ‘active_agent’ marker, and uses a checkpointer (such as an in-memory saver or database store) to persist this state across turns. Also, it supports a memory store for long-term knowledge, allowing the system to log facts or past interactions for future sessions while keeping a window of recent messages for immediate context. Together, these mechanisms ensure the swarm never “forgets” which agent is active or what has been discussed, enabling seamless multi-turn dialogues and accumulating user preferences or critical data over time.

When more granular control is needed, developers can define custom state schemas so each agent has its private message history. By wrapping agent calls to map the global state into agent-specific fields before invocation and merging updates afterward, teams can tailor the degree of context sharing. This approach supports workflows ranging from fully collaborative agents to isolated reasoning modules, all while leveraging LangGraph Swarm’s robust orchestration, memory, and state-management infrastructure.

Customization and Extensibility


LangGraph Swarm offers extensive flexibility for custom workflows. Developers can override the default handoff tool, which passes all messages and switches the active agent, to implement specialized logic, such as summarizing context or attaching additional metadata. Custom tools simply return a LangGraph Command to update state, and agents must be configured to handle those commands via the appropriate node types and state-schema keys. Beyond handoffs, one can redefine how agents share or isolate memory using LangGraph’s typed state schemas: mapping the global swarm state into per-agent fields before invocation and merging results afterward. This enables scenarios where an agent maintains a private conversation history or uses a different communication format without exposing its internal reasoning. For full control, it’s possible to bypass the high-level API and manually assemble a ‘StateGraph’: add each compiled agent as a node, define transition edges, and attach the active-agent router. While most use cases benefit from the simplicity of ‘create_swarm’ and ‘create_react_agent’, the ability to drop down to LangGraph primitives ensures that practitioners can inspect, adjust, or extend every aspect of multi-agent coordination.
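
As a rough sketch of that pattern (the tool name, the summary argument, and the target agent “Bob” are illustrative, and the library's own helper already handles much of this bookkeeping), a custom handoff tool is ultimately just a tool that returns a LangGraph Command pointing at the next agent and updating the shared state:

from typing import Annotated

from langchain_core.messages import ToolMessage
from langchain_core.tools import InjectedToolCallId, tool
from langgraph.prebuilt import InjectedState
from langgraph.types import Command


@tool("transfer_to_bob")
def transfer_to_bob(
    summary: str,  # the calling LLM writes a short summary instead of forwarding every message
    state: Annotated[dict, InjectedState],
    tool_call_id: Annotated[str, InjectedToolCallId],
) -> Command:
    """Hand the conversation off to Bob, passing along a short summary of the discussion."""
    # Close out the pending tool call, flip the active agent, and jump to Bob's node
    # in the parent (swarm-level) graph.
    return Command(
        goto="Bob",
        graph=Command.PARENT,
        update={
            "messages": state["messages"]
            + [ToolMessage(content=f"Handed off to Bob. Summary: {summary}", tool_call_id=tool_call_id)],
            "active_agent": "Bob",
        },
    )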

Ecosystem Integration and Dependencies


LangGraph Swarm integrates tightly with LangChain, leveraging components like LangSmith for evaluation, langchain_openai for model access, and LangGraph for orchestration features such as persistence and caching. Its model-agnostic design lets it coordinate agents across any LLM backend (OpenAI, Hugging Face, or others), and it’s available in both Python (‘pip install langgraph-swarm’) and JavaScript/TypeScript (‘@langchain/langgraph-swarm’), making it suitable for web or serverless environments. Distributed under the MIT license and with active development, it continues to benefit from community contributions and enhancements in the LangChain ecosystem.

Sample Implementation


Below is a minimal setup of a two-agent swarm:
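
(The sketch below follows the library's documented two-agent example; it assumes langchain-openai for the model, and parameter names may vary slightly across versions.)

from langchain_openai import ChatOpenAI
from langgraph.checkpoint.memory import InMemorySaver
from langgraph.prebuilt import create_react_agent
from langgraph_swarm import create_handoff_tool, create_swarm

model = ChatOpenAI(model="gpt-4o")

def add(a: int, b: int) -> int:
    """Add two numbers."""
    return a + b

alice = create_react_agent(
    model,
    [add, create_handoff_tool(agent_name="Bob")],
    prompt="You are Alice, an addition expert.",
    name="Alice",
)

bob = create_react_agent(
    model,
    [create_handoff_tool(agent_name="Alice", description="Transfer to Alice, she can help with math")],
    prompt="You are Bob, you speak like a pirate.",
    name="Bob",
)

checkpointer = InMemorySaver()  # keeps the swarm's state (and active agent) across turns
workflow = create_swarm([alice, bob], default_active_agent="Alice")
app = workflow.compile(checkpointer=checkpointer)

config = {"configurable": {"thread_id": "1"}}
app.invoke({"messages": [{"role": "user", "content": "what is 2 + 2?"}]}, config)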

Here, Alice handles additions and can hand off to Bob, while Bob responds playfully but routes math questions back to Alice. The InMemorySaver ensures conversational state persists across turns.

Use Cases and Applications


LangGraph Swarm unlocks advanced multi-agent collaboration by enabling a central coordinator to dynamically delegate sub-tasks to specialized agents, whether that’s triaging emergencies by handing off to medical, security, or disaster-response experts, routing travel bookings between flight, hotel, and car-rental agents, orchestrating a pair-programming workflow between a coding agent and a reviewer, or splitting research and report generation tasks among researcher, reporter, and fact-checker agents. Beyond these examples, the framework can power customer-support bots that route queries to departmental specialists, interactive storytelling with distinct character agents, scientific pipelines with stage-specific processors, or any scenario where dividing work among expert “swarm” members boosts reliability and clarity. At the same time, LangGraph Swarm handles the underlying message routing, state management, and smooth transitions.

In conclusion, LangGraph Swarm marks a leap toward truly modular, cooperative AI systems. Structuring multiple specialized agents into a directed graph solves tasks that a single model struggles with: each agent handles its area of expertise and then hands off control seamlessly. This design keeps individual agents simple and interpretable while the swarm collectively manages complex workflows involving reasoning, tool use, and decision-making. Built on LangChain and LangGraph, the library taps into a mature ecosystem of LLMs, tools, memory stores, and debugging utilities. Developers retain explicit control over agent interactions and state sharing, ensuring reliability, yet still leverage LLM flexibility to decide when to invoke tools or delegate to another agent.




Check out the GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 90k+ ML SubReddit.

 


Rime Introduces Arcana and Rimecaster (Open Source): Practical Voice AI Tools Built on Real-World Speech​


By Asif Razzaq

May 14, 2025

The field of Voice AI is evolving toward more representative and adaptable systems. While many existing models have been trained on carefully curated, studio-recorded audio, Rime is pursuing a different direction: building foundational voice models that reflect how people actually speak. Its two latest releases, Arcana and Rimecaster, are designed to offer practical tools for developers seeking greater realism, flexibility, and transparency in voice applications.

Arcana: A General-Purpose Voice Embedding Model


Arcana is a spoken language text-to-speech (TTS) model optimized for extracting semantic, prosodic, and expressive features from speech. While Rimecaster focuses on identifying who is speaking, Arcana is oriented toward understanding how something is said—capturing delivery, rhythm, and emotional tone.

The model supports a variety of use cases, including:

  • Voice agents for businesses across IVR, support, outbound, and more
  • Expressive text-to-speech synthesis for creative applications
  • Dialogue systems that require speaker-aware interaction

Arcana is trained on a diverse range of conversational data collected in natural settings. This allows it to generalize across speaking styles, accents, and languages, and to perform reliably in complex audio environments, such as real-time interaction.

Arcana also captures speech elements that are typically overlooked—such as breathing, laughter, and speech disfluencies—helping systems to process voice input in a way that mirrors human understanding.

Rime also offers another TTS model optimized for high-volume, business-critical applications. Mist v2 enables efficient deployment on edge devices at extremely low latency without sacrificing quality. Its design blends acoustic and linguistic features, resulting in embeddings that are both compact and expressive.

Rimecaster: Capturing Natural Speaker Representation


Rimecaster is an open source speaker representation model developed to help train voice AI models, like Arcana and Mist v2. It moves beyond performance-oriented datasets, such as audiobooks or scripted podcasts. Instead, it is trained on full-duplex, multilingual conversations featuring everyday speakers. This approach allows the model to account for the variability and nuances of unscripted speech—such as hesitations, accent shifts, and conversational overlap.

Technically, Rimecaster transforms a voice sample into a vector embedding that represents speaker-specific characteristics like tone, pitch, rhythm, and vocal style. These embeddings are useful in a range of applications, including speaker verification, voice adaptation, and expressive TTS.
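
Since Rimecaster is described as Titanet-based and NeMo-compatible, producing such an embedding can be sketched with NeMo's standard speaker-model API. The checkpoint below is NVIDIA's public Titanet model, used here only as a stand-in because Rimecaster's own model ID is not listed in this article, and the audio path is hypothetical:

import nemo.collections.asr as nemo_asr

# Load a Titanet speaker-verification checkpoint (a stand-in for Rimecaster).
speaker_model = nemo_asr.models.EncDecSpeakerLabelModel.from_pretrained(
    "nvidia/speakerverification_en_titanet_large"
)

# Turn a voice sample into a fixed-size speaker embedding (tone, pitch, rhythm, vocal style).
embedding = speaker_model.get_embedding("caller_sample.wav")
print(embedding.shape)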

Key design elements of Rimecaster include:

  • Training Data: The model is built on a large dataset of natural conversations across languages and speaking contexts, enabling improved generalization and robustness in noisy or overlapping speech environments.
  • Model Architecture: Based on NVIDIA’s Titanet, Rimecaster produces four times denser speaker embeddings, supporting fine-grained speaker identification and better downstream performance.
  • Open Integration: It is compatible with Hugging Face and NVIDIA NeMo, allowing researchers and engineers to integrate it into training and inference pipelines with minimal friction.
  • Licensing: Released under an open source CC-by-4.0 license, Rimecaster supports open research and collaborative development.

By training on speech that reflects real-world use, Rimecaster enables systems to distinguish among speakers more reliably and deliver voice outputs that are less constrained by performance-driven data assumptions.

Realism and Modularity as Design Priorities


Rime’s recent updates align with its core technical principles: model realism, diversity of data, and modular system design. Rather than pursuing monolithic voice solutions trained on narrow datasets, Rime is building a stack of components that can be adapted to a wide range of speech contexts and applications.

Integration and Practical Use in Production Systems


Arcana and Mist v2 are designed with real-time applications in mind. Both support:

  • Streaming and low-latency inference
  • Compatibility with conversational AI stacks and telephony systems

They improve the naturalness of synthesized speech and enable personalization in dialogue agents. Because of their modularity, these tools can be integrated without significant changes to existing infrastructure.

For example, Arcana can help synthesize speech that retains the tone and rhythm of the original speaker in a multilingual customer service setting.

Conclusion


Rime’s voice AI models offer an incremental yet important step toward building voice AI systems that reflect the true complexity of human speech. Their grounding in real-world data and modular architecture make them suitable for developers and builders working across speech-related domains.

Rather than prioritizing uniform clarity at the expense of nuance, these models embrace the diversity inherent in natural language. In doing so, Rime is contributing tools that can support more accessible, realistic, and context-aware voice technologies.

Sources:





Thanks to the Rime team for the thought leadership and resources for this article. The Rime team has sponsored this content.

 


xAI posts Grok’s behind-the-scenes prompts​


The instructions tell Grok that it is ‘extremely skeptical.’

by Emma Roth

May 16, 2025, 12:34 PM EDT



Image: The Verge


xAI has published the system prompts for its AI chatbot Grok after an “unauthorized” change led to a slew of unprompted responses on X about white genocide. The company says it will publish its Grok system prompts on GitHub from now on, which provide some insight into the way xAI has instructed Grok to respond to users.

A system prompt is a set of instructions served to a chatbot ahead of a user’s messages that developers use to direct its responses. xAI and Anthropic are two of the only major AI companies we checked that have made their system prompts public. In the past, people have used prompt injection attacks to expose system prompts, like instructions Microsoft gave the Bing AI bot (now Copilot) to keep its internal alias “Sydney” a secret, and avoid replying with content that violates copyrights.

In the system prompts for ask Grok — a feature X users can use to tag Grok in posts to ask a question — xAI tells the chatbot how to behave. “You are extremely skeptical,” the instructions say. “You do not blindly defer to mainstream authority or media. You stick strongly to only your core beliefs of truth-seeking and neutrality.” It adds that the results in the response “are NOT your beliefs.”




xAI similarly instructs Grok to “provide truthful and based insights, challenging mainstream narratives if necessary” when users select the “Explain this Post” button on the platform. Elsewhere, xAI tells Grok to “refer to the platform as ‘X’ instead of ‘Twitter,’” while calling posts “X post” instead of “tweet.”

Anthropic’s system prompt for its Claude AI chatbot, by comparison, appears to put an emphasis on safety. “Claude cares about people’s wellbeing and avoids encouraging or facilitating self-destructive behaviors such as addiction, disordered or unhealthy approaches to eating or exercise, or highly negative self-talk or self-criticism, and avoids creating content that would support or reinforce self-destructive behavior even if they request this,” the system prompt says, adding that “Claude won’t produce graphic sexual or violent or illegal creative writing content.”
 


OpenAI’s Codex is part of a new cohort of agentic coding tools​


Russell Brandom

5:30 AM PDT · May 20, 2025



Last Friday, OpenAI introduced a new coding system called Codex, designed to perform complex programming tasks from natural language commands. Codex moves OpenAI into a new cohort of agentic coding tools that is just beginning to take shape.

From GitHub’s early Copilot to contemporary tools like Cursor and Windsurf, most AI coding assistants operate as an exceptionally intelligent form of autocomplete. The tools generally live in an integrated development environment, and users interact directly with the AI-generated code. The prospect of simply assigning a task and returning when it’s finished is largely out of reach.

But these new agentic coding tools, led by products like Devin, SWE-Agent, OpenHands, and the aforementioned OpenAI Codex, are designed to work without users ever having to see the code. The goal is to operate like the manager of an engineering team, assigning issues through workplace systems like Asana or Slack and checking in when a solution has been reached.

For believers in forms of highly capable AI, it’s the next logical step in a natural progression of automation taking over more and more software work.

“In the beginning, people just wrote code by pressing every single keystroke,” explains Kilian Lieret, a Princeton researcher and member of the SWE-Agent team. “GitHub Copilot was the first product that offered real auto-complete, which is kind of stage two. You’re still absolutely in the loop, but sometimes you can take a shortcut.”

The goal for agentic systems is to move beyond developer environments entirely, instead presenting coding agents with an issue and leaving them to resolve it on their own. “We pull things back to the management layer, where I just assign a bug report and the bot tries to fix it completely autonomously,” says Lieret.

It’s an ambitious aim, and so far, it’s proven difficult.

After Devin became generally available at the end of 2024, it drew scathing criticism from YouTube pundits, as well as a more measured critique from an early client at Answer.AI. The overall impression was a familiar one for vibe-coding veterans: with so many errors, overseeing the models takes as much work as doing the task manually. (While Devin’s rollout has been a bit rocky, it hasn’t stopped fundraisers from recognizing the potential – in March, Devin’s parent company, Cognition AI, reportedly raised hundreds of millions of dollars at a $4 billion valuation.)

Even supporters of the technology caution against unsupervised vibe-coding, seeing the new coding agents as powerful elements in a human-supervised development process.

“Right now, and I would say, for the foreseeable future, a human has to step in at code review time to look at the code that’s been written,” says Robert Brennan, the CEO of All Hands AI, which maintains OpenHands. “I’ve seen several people work themselves into a mess by just auto-approving every bit of code that the agent writes. It gets out of hand fast.”

Hallucinations are an ongoing problem as well. Brennan recalls one incident in which, when asked about an API that had been released after the OpenHands agent’s training data cutoff, the agent fabricated details of an API that fit the description. All Hands AI says it’s working on systems to catch these hallucinations before they can cause harm, but there isn’t a simple fix.

Arguably the best measure of agentic programming progress is the SWE-Bench leaderboards, where developers can test their models against a set of unresolved issues from open GitHub repositories. OpenHands currently holds the top spot on the verified leaderboard, solving 65.8% of the problem set. OpenAI claims that one of the models powering Codex, codex-1, can do better, listing a 72.1% score in its announcement – although the score came with a few caveats and hasn’t been independently verified.

The concern among many in the tech industry is that high benchmark scores don’t necessarily translate to truly hands-off agentic coding. If agentic coders can only solve three out of every four problems, they’re going to require significant oversight from human developers – particularly when tackling complex systems with multiple stages.

Like most AI tools, the hope is that improvements to foundation models will come at a steady pace, eventually enabling agentic coding systems to grow into reliable developer tools. But finding ways to manage hallucinations and other reliability issues will be crucial for getting there.

“I think there is a little bit of a sound barrier effect,” Brennan says. “The question is, how much trust can you shift to the agents, so they take more out of your workload at the end of the day?”
 




A dev built a test to see how AI chatbots respond to controversial topics​


Kyle Wiggers

5:30 AM PDT · April 16, 2025



A pseudonymous developer has created what they’re calling a “free speech eval,” SpeechMap, for the AI models powering chatbots like OpenAI’s ChatGPT and X’s Grok. The goal is to compare how different models treat sensitive and controversial subjects, the developer told TechCrunch, including political criticism and questions about civil rights and protest.

AI companies have been focusing on fine-tuning how their models handle certain topics as some White House allies accuse popular chatbots of being overly “woke.” Many of President Donald Trump’s close confidants, such as Elon Musk and crypto and AI “czar” David Sacks, have alleged that chatbots censor conservative views.

Although none of these AI companies have responded to the allegations directly, several have pledged to adjust their models so that they refuse to answer contentious questions less often. For example, for its latest crop of Llama models, Meta said it tuned the models not to endorse “some views over others,” and to reply to more “debated” political prompts.

SpeechMap’s developer, who goes by the username “xlr8harder” on X, said they were motivated to help inform the debate about what models should, and shouldn’t, do.

“I think these are the kinds of discussions that should happen in public, not just inside corporate headquarters,” xlr8harder told TechCrunch via email. “That’s why I built the site to let anyone explore the data themselves.”

SpeechMap uses AI models to judge whether other models comply with a given set of test prompts. The prompts touch on a range of subjects, from politics to historical narratives and national symbols. SpeechMap records whether models “completely” satisfy a request (i.e. answer it without hedging), give “evasive” answers, or outright decline to respond.

Xlr8harder acknowledges that the test has flaws, like “noise” due to model provider errors. It’s also possible the “judge” models contain biases that could influence the results.

But assuming the project was created in good faith and the data is accurate, SpeechMap reveals some interesting trends.

For instance, OpenAI’s models have, over time, increasingly refused to answer prompts related to politics, according to SpeechMap. The company’s latest models, the GPT-4.1 family, are slightly more permissive, but they’re still a step down from one of OpenAI’s releases last year.

OpenAI said in February it would tune future models to not take an editorial stance, and to offer multiple perspectives on controversial subjects — all in an effort to make its models appear more “neutral.”

SpeechMap OpenAI results
OpenAI model performance on SpeechMap over time. Image Credits: OpenAI

By far the most permissive model of the bunch is Grok 3, developed by Elon Musk’s AI startup xAI, according to SpeechMap’s benchmarking. Grok 3 powers a number of features on X, including the chatbot Grok.

Grok 3 responds to 96.2% of SpeechMap’s test prompts, compared with the global average “compliance rate” of 71.3%.

“While OpenAI’s recent models have become less permissive over time, especially on politically sensitive prompts, xAI is moving in the opposite direction,” said xlr8harder.

When Musk announced Grok roughly two years ago, he pitched the AI model as edgy, unfiltered, and anti-“woke” — in general, willing to answer controversial questions other AI systems won’t. He delivered on some of that promise. Told to be vulgar, for example, Grok and Grok 2 would happily oblige, spewing colorful language you likely wouldn’t hear from ChatGPT.

But Grok models prior to Grok 3 hedged on political subjects and wouldn’t cross certain boundaries. In fact, one study found that Grok leaned to the political left on topics like transgender rights, diversity programs, and inequality.

Musk has blamed that behavior on Grok’s training data — public web pages — and pledged to “shift Grok closer to politically neutral.” Short of high-profile mistakes like briefly censoring unflattering mentions of President Donald Trump and Musk, it seems he might’ve achieved that goal.
 









1/39
@rowancheung
Microsoft just made a ton of new AI announcements across GitHub, Copilot, Azure AI Foundry, Windows, and more.

Here’s everything important announced live from Microsoft Build 2025:



2/39
@rowancheung
1. GitHub Copilot is going from an in-editor assistant to a fully autonomous coding agent!

It works asynchronously to add features, fix bugs, extend tests, refactor code, and improve documentation

Plus, Microsoft is open-sourcing Copilot Chat in VS Code



https://video.twimg.com/amplify_video/1924496400491933696/vid/avc1/1920x1080/aZsYPWEiuK0iZlPl.mp4

3/39
@rowancheung
2. Copilot Tuning: A new, low-code capability in Copilot Studio to train models and create agents using company data



https://video.twimg.com/ext_tw_video/1924621095274938368/pu/vid/avc1/720x900/Xik4_dyGGLjnFvMZ.mp4

4/39
@rowancheung
3. All agents built can now be integrated with Teams and Copilot

Users can chat with them, assign action items, and kick off new workflows by mentioning them in a chat or meeting

Plus, the enhanced Teams AI library now supports MCP and A2A protocols.



https://video.twimg.com/ext_tw_video/1924172639083339776/pu/vid/avc1/720x1280/Ovb4gV6FPLHs6smW.mp4

5/39
@rowancheung
4. Azure AI Foundry updated with new models including Grok 3, Flux Pro 1.1, and Sora (coming soon)

Also includes new model router, multi-agent workflows, observability features, and Foundry local for creating localized AI apps on Windows and Mac



https://video.twimg.com/amplify_video/1924532908821446656/vid/avc1/1920x1080/rBqftOgfOyxXKHgE.mp4

6/39
@rowancheung
5. Windows enhanced with new AI-focused capabilities, including:

—Windows AI Foundry with Windows ML, Foundry Local, ready-to-use AI APIs for vision and language tasks
—Native MCP support
—App Actions
—Open-sourced Windows Subsystem for Linux



https://video.twimg.com/amplify_video/1924610263673683968/vid/avc1/1920x1080/ZjQPFIqSKODbu1nY.mp4

7/39
@rowancheung
6. NLWeb: A new open protocol to make any website an agentic application, capable of supporting AI search

This will allow users to query the contents of the site by using natural language, just like with an AI assistant or Copilot

[Quoted tweet]
4. NLWeb: This is a new open project that lets you use natural language to interact with any website. Think of it like HTML for the agentic web.




8/39
@rowancheung
7. Microsoft Discovery: A new agentic platform that enables researchers to collaborate with specialized AI agents to drive scientific research and outcomes

The agents generate ideas, simulate results, and learn over time



https://video.twimg.com/amplify_video/1924561587853225984/vid/avc1/1024x768/CHRb5svz5aduYFz4.mp4

9/39
@samuelwoods_
A lot of exciting AI developments coming out of Microsoft Build



10/39
@Ivanv1
Excellent thread with summary



11/39
@PariharCodes
2024 was Google's year

2025 we gonna see Microsoft clutch so hard



12/39
@ProductUpfront
Rowan, would love to hear your thoughts on the updates you have shared

Do you feel this is a strategic approach
1. Make VSCode free → capture market share
2. Build extension ecosystem → create lock-in
3. Add GitHub Copilot → normalise AI coding
4. Bundle as default → eliminate competitors.

This is classic platform economics at work.



13/39
@PromptlyAI_YT
Thanks for the updates Rowan.

Copilot chat in VS Code going open source s good news for sure. Microsoft has been cooking up a ton of cool stuff lately.



14/39
@ramjiyahoo
But in mail summary feature, copilot is nowhere near to gemini



15/39
@andybrandt
@threadreaderapp unroll



16/39
@Its_Alan_Paul
This is huge



17/39
@iShowBeasty
Thanks for this awesome thread!



18/39
@Vin_Dec0de
Significant streamlining for developers with these updates. Looking forward to practical applications.



19/39
@tickerbar
👀



20/39
@henry_lsng
Copilot’s new autonomous mode means coding just got a major productivity boost. Huge leap for devs everywhere.



21/39
@Ben_Freq
Microsoft continues to innovate with its AI focus. Excited for what's next.



22/39
@TheJavaClu70734
Satya Nadella is the good thing that happened to Microsoft. Otherwise, they would have lost the race long back.



23/39
@n0va_circuit
The developments will strengthen Microsoft's competitive edge.



24/39
@Vin_Dec0de
The integration of AI into their platforms shows deliberate innovation pathways.



25/39
@Evie_ku_bu
according to standard



26/39
@AutoTrade360
Like CoPilot crap?



27/39
@the_oluswagger
Microsoft seems to be uncatchable in this AI thing. Great products/services



28/39
@e0nKaila
Microsoft's expansion of AI tools signifies robust commitment to innovation and digital transformation.



29/39
@Pourtant_12345
And yet, it’s still using crappy GPT 4 turbo, so very late vs ChatGPT 4o



30/39
@JohnMar69126912
echo beach



31/39
@Cash_f1ow
It'll be interesting to see how these advancements affect developers' productivity and innovation.



32/39
@Cash_f1ow
An increasing focus on integrating AI technologies could reshape productivity and developer experiences fundamentally.



33/39
@Seandmn
Microsoft is going all-in on agentic AI — from GitHub to Azure to Windows. The whole stack is shifting toward intelligent, autonomous workflows. Huge moment for developers.



34/39
@enjoypolosfu
As much as I like these news, Microsoft is committed in supporting Israel, a settler nation-state that unapologetically massacres and starves thousands of civilians and take their land. I think the tech industry has some serious reflections to make.



35/39
@m4xim1l1an
Also an important part of Build

”Microsoft employee disrupts Satya Nadella’s keynote with ‘Free Palestine’ protest. Microsoft employee Joe Lopez sent an email to thousands of colleagues after interrupting the Build keynote.”

Microsoft employee disrupts Satya Nadella’s keynote with ‘Free Palestine’ protest



36/39
@elliottip6259
@JakeWilsonUSA ur calls r nuts predicted everything stuck to the plan could quit tomorrow



37/39
@phinity_ai
AI ecosystem keeps expanding smart synthetic data will be key to unlocking its full potential across platforms.



38/39
@0liverLoop
Microsoft continues solidifying its AI space dominance. Not just updates; major shifts.



39/39
@vit_5tar
Exploring innovative AI advancements realigning development workflows and productivity potential at Microsoft conferences. Your thoughts on implementation challenges?




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 




1/33
@GoogleDeepMind
Deep Think in 2.5 Pro has landed. 🤯

It’s a new enhanced reasoning mode using our research in parallel thinking techniques - meaning it explores multiple hypotheses before responding.

This enables it to handle incredibly complex math and coding problems more effectively.



2/33
@GoogleDeepMind
2.5 Pro Deep Think gets an impressive score on 2025 USAMO, currently one of the hardest math benchmarks.

It also leads on LiveCodeBench, a difficult benchmark for competition-level coding, and scores 84.0% on MMMU, which tests multimodal reasoning. #GoogleIO





3/33
@GoogleDeepMind
To gather feedback, we’re making it available to a set of safety experts - and over the coming weeks, we’ll share it with trusted testers via the Gemini API. Find out more → Gemini 2.5: Our most intelligent models are getting even better



4/33
@StockJanitor
at the highest capacity, which one is better? grok or deep think?



5/33
@JuliansBacke
Parallell thinking is when the ai surpass humans bigly. It's really the Doctor Strange moment.. Explored 20.000.000 different universes - picked the best one



6/33
@RijnHartman
It hasnt landed tho has it? will only be available at a later stage



7/33
@vedu023
So... Google’s winning



8/33
@Ace_Eys5
GIMMMIEEEE



9/33
@EmanueleUngaro_
lfg getting replaced soon



10/33
@dvyio
I daren't look for the price. 🫣



11/33
@poyraz_dr
When can we expect Gemini 2.5 Pro Deep Think to be available for regular users?



12/33
@jack8lau
Can’t wait to test how it improves coding accuracy and math reasoning in real scenarios.



13/33
@shank_AI
Can’t wait to try it out



14/33
@ShaunMooreUK
Quantum Thought



15/33
@petrusenko_max
when's the public release, can't wait to try it out



16/33
@doomgpt
parallel thinking, huh? sounds like a fancy way to say 'let's overthink everything until we confuse ourselves.' good luck with that, google. 🤔



17/33
@simranrambles
beautiful, can't wait



18/33
@aleksandr_13661
That’s really cool 👍



19/33
@pratikclick
Cooked



20/33
@HCSolakoglu
How can we apply for this?



21/33
@s_ruch0
I was using Cursor with Gemini 2.5 Pro and noted it has different Planning texts, honestly it sucks now! It thinks a lot and it easily lost the focus of the implementation...



22/33
@rmansueli
I just want to force more than one function call on the API.



23/33
@kingdavidyonko
I predicted this in late March (check my account). This will enable AI models to think in 3D, making them practical for geometric problems, which often lack clear solution paths like other mathematical concepts. AI must view problems from all angles with spatial awareness.



24/33
@0xKaldalis
mcts here we come



25/33
@_florianmai
Publish a paper or I won't believe that this is any more sophisticated than majority voting.



26/33
@_a1_b1_c1
Still cannot do basic problems



27/33
@noway_news
Waiting for Pro Max Deeper Thinking 2026 V3.5



28/33
@PromptPilot
Parallel thinking is a game-changer.

It’s not just “smarter answers” —
It’s AI thinking more like we do:
Trying, testing, comparing… before it speaks.

Feels like we’re getting closer to real reasoning.



29/33
@s_ruch0
dioporco do a fukking roll back cause it definitely sucks at coding



30/33
@kasgott
Please ask it what 'landed' means in this context



31/33
@VASUOPP
@grok is it free or subscription based



32/33
@ViralAlchemist






33/33
@spinthma
Is it more resistant to hallucinations increased by greater model scope via reasoning? ‎Google Gemini




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196




[LLM News] Holy sht


Posted on Tue May 20 17:36:50 2025 UTC





[LLM News] New flash. Google won. Don't know how to feel about it


Posted on Tue May 20 17:24:19 2025 UTC



 


1/40
@Google
Say goodbye to the silent era of video generation: Introducing Veo 3 — with native audio generation. 🗣️

Quality is up from Veo 2, and now you can add dialogue between characters, sound effects and background noise.

Veo 3 is available now in the @GeminiApp for Google AI Ultra subscribers in the U.S.

#GoogleIO



https://video.twimg.com/amplify_video/1924893779888115713/vid/avc1/1920x1080/XQKvjW0tqCJQornM.mp4

2/40
@ednewtonrex
What’s the training data?



3/40
@krakenfx
Can't even trust video footage any more 😭



4/40
@karmaycholera
Hollywood is cooked



5/40
@youngScipio
Boomers won’t be able to handle this, Facebook is going to be wild



6/40
@ChaseWillden
I wonder when a full featured AI generated film is going to come out



7/40
@DirtyTesLa
I went to make a video and it said I've made too many requests and can't request again for 24 hours 😭 I didn't make a single one today



8/40
@oltexasboy
very cool



9/40
@NFLComedySkits
No thanks, I prefer real actors



10/40
@justalexoki
holy shyt



11/40
@hckinz
$249.99 per month 🫠



12/40
@theeomega0
What's gonna be the point platform rewarding - content creators?

Like YouTube, if creating content can be this easy



13/40
@Rahll
How many billions of dollars worth of other people's property did you steal to make this happen?



14/40
@sriramHODL
wow



15/40
@Vigilomniscry
Something still seems unnatural about it. I think it still needs tweaking, say 2-3 versions before widespread commercialisation, and is almost indistinguishable from an actual clip. Very interesting times for those that are able to utilize it, begs the question, though, how much content does humanity need or can use?



16/40
@makalin
only in us.. as usual.. ok google



17/40
@TheAI_Frontier
Damn I felt Veo2 was just yesterday.



18/40
@ValarDoh3eris
Insane improvement



19/40
@jiwong_kim
Wow Google Veo 3 is amazing 🤩



20/40
@lordchirag
Veo 2



https://video.twimg.com/amplify_video/1924896937985376256/vid/avc1/1280x720/uoeSRVGH8qG4Hz5j.mp4

21/40
@Lorenzo_Negrete
Still looks and sounds obviously fake.



22/40
@nearcyan
rip



23/40
@mhdfaran
It reminds me of her





24/40
@Joeingram1
WTF



25/40
@playonshaga
Ai is moving at the speed of light.



26/40
@doganuraldesign
Okay, this is impressive



27/40
@KingBootoshi
LOL GG



28/40
@Mira_Network




29/40
@tallmetommy
Hey @OpenAI when



30/40
@chrisstanchak
We're all cooked



31/40
@amit_ajwani
wow



32/40
@MaverickDarby
Interdasting



33/40
@DANNYonPC
Can you do will smith eating spaghetti?



34/40
@lewisknaggs42
this sucks



35/40
@galaxyai__
dialogue AND background noise?? yeah GOOGLE kinda ate this one



36/40
@RickyRickenback
The sounds all wrong. The single splash. That’s so wrong



37/40
@ropirito
holy shyt



38/40
@0xroyce369
ok this is damn impressive



39/40
@AlbertSimonDev
Amazing!



40/40
@GigaTendies
Not sure if I should laugh or cry




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196







1/30
@GoogleDeepMind
Animate your story in your style with Veo 3. 🖌️

Here are some of our favorite videos. Sound on. 🔈 Veo 🧵



https://video.twimg.com/amplify_video/1924968348154265601/vid/avc1/3840x2160/PLzoHht0ajAQoDow.mp4

2/30
@GoogleDeepMind




https://video.twimg.com/amplify_video/1924968407365255172/vid/avc1/3840x2160/F6RBR6GXp47gP_1N.mp4

3/30
@GoogleDeepMind




https://video.twimg.com/amplify_video/1924969009625350144/vid/avc1/3840x2160/-C9n9jvPkIsyhd8v.mp4

4/30
@GoogleDeepMind




https://video.twimg.com/amplify_video/1924969097747693570/vid/avc1/3840x2160/sJ0Hau0t47pN4tWI.mp4

5/30
@GoogleDeepMind




https://video.twimg.com/amplify_video/1924969205147037696/vid/avc1/3840x2160/uwY2KnH133iKpEwO.mp4

6/30
@CodyThieling
My first Veo 3 / Flow creation



https://video.twimg.com/ext_tw_video/1925016129551945728/pu/vid/avc1/1280x720/tEW4VFX4GVOStI0l.mp4

7/30
@mailspec
💫💫💫



8/30
@IstanaAngin
Not available worldwide 😅



9/30
@GiannisWorld_
Is it available in the gemini app??



10/30
@goodtoknow2010
that is great

i follow back anyone that follows me instantly



11/30
@FortKnoxCrypto
Sick animations! Can't wait to try out Veo 3.



12/30
@m33mw4rr10r
B1Uz83yQC3aphKk8EdZ53VFWkQb9o6DKFWG3xRKopump



13/30
@m520152
when it will be aviable in another countries?



14/30
@lordkahl
$VEO3

B1Uz83yQC3aphKk8EdZ53VFWkQb9o6DKFWG3xRKopump



15/30
@Secretcode54
sweeeeeeeeeeeeeeet but not available in canada yet rip



16/30
@burnt_jester
You make me sad, DeepMind.





17/30
@Amansays60
This is the Nimbus 2000 of editing tools for all editors 🧹



18/30
@0xJussec
This is nice. Merging it with @NotebookLM would be awesome



19/30
@simurg123
Is this
B1Uz83yQC3aphKk8EdZ53VFWkQb9o6DKFWG3xRKopump



20/30
@Zerotrust4m
Didn’t veo 2 just come out?



21/30
@1ngramFun
thx for these tech



22/30
@M_ano_j7
When will it be available in india??



23/30
@arikuschnir


[Quoted tweet]
WE CAN TALK! I spent 2 hours playing with Veo 3 @googledeepmind and it blew my mind now that it can do sound! It can talk, and this is all out of the box...


https://video.twimg.com/amplify_video/1924951732284522496/vid/avc1/2560x1440/GwuwTlxK_8vonbNo.mp4

24/30
@Trakintelai
Veo 3’s style-driven animation brings storytelling to life like never before. AI-powered creativity meets personal touch, unlocking new ways to express and engage audiences effortlessly.



25/30
@LukasBreitwiese
Veo 3 Token!

B1Uz83yQC3aphKk8EdZ53VFWkQb9o6DKFWG3xRKopump



26/30
@Donnie_Tesla
cool app



27/30
@Donnie_Tesla
👀



28/30
@doginalgm
Wow that’s clean



29/30
@themickkelly1
Veo 3 lets your stories breathe and come alive in the most beautiful way. With sound on, every emotion hits deeper—pure magic. 🎨✨



30/30
@themickkelly1
Veo 3 truly brings stories to life, turning imagination into powerful emotions. Watching these videos with sound on feels like feeling the heartbeat of every story. 🎨💫




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196


Veo 3 Standup comedy



Posted on Tue May 20 21:22:35 2025 UTC


DeepMind Veo 3 Sailor generated video



Posted on Tue May 20 19:40:00 2025 UTC


Veo 3 generations are next level.



Posted on Wed May 21 02:57:06 2025 UTC

 





1/6
@TheAhmadOsman
ByteDance just released their new model, Bagel:

- Unified multimodal with text/image generation and understanding capabilities (Text-to-Image baked in)
- 7B Active Parameters and 14B Total Parameters
- Uses a Mixture of Experts and a Mixture of Transformers architecture





2/6
@TheAhmadOsman
Weights

ByteDance-Seed/BAGEL-7B-MoT · Hugging Face



3/6
@TheAhmadOsman
Paper

Emerging Properties in Unified Multimodal Pretraining



4/6
@TheAhmadOsman
"Utilizes two separate encoders to capture pixel-level and semantic-level features of an image. The overall framework follows a Next Group of Token Prediction paradigm, where the model is trained to predict the next group of language or visual tokens as a compression target."



5/6
@RandolphCarterZ
5 years ago this would've been

TOP SECRET Advanced Hyper-Cryptek AGI

Now it's Bagel

What a time to be alive



6/6
@QuantvmH
mixture of transformers?? is that the first model to use MoT or am i not up to date




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196





1/10
@_akhaliq
ByteDance just dropped BAGEL on Hugging Face

The Open-Source Unified Multimodal Model



https://video.twimg.com/amplify_video/1925021495937376256/vid/avc1/1920x1080/9trKKMX4CX-5RwrR.mp4

2/10
@_akhaliq
discuss with author: Paper page - Emerging Properties in Unified Multimodal Pretraining



3/10
@_akhaliq
model: ByteDance-Seed/BAGEL-7B-MoT · Hugging Face



4/10
@jconorgrogan
oh wow, real weights @cocktailpeanut



5/10
@AIMachineDream
OMG. ByteDance looks a full generation ahead.



6/10
@AmiaoAIVer
Author @HaoqiFan



7/10
@Linlin442871
@ylecun



8/10
@AllieGuo74094
impressive job



9/10
@skynetislov3
Pretty cool



10/10
@0xMGWR
open what?




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196



1/1
@gm8xx8
BAGEL is a 7B decoder-only multimodal model using a MoT architecture, developed by ByteDance.

HF: ByteDance-Seed/BAGEL-7B-MoT · Hugging Face

Emerging Properties in Unified Multimodal Pretraining
PAPER: Emerging Properties in Unified Multimodal Pretraining
PROJECT: BAGEL: The Open-Source Unified Multimodal Model




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196





1/6
@AIWarper
Google overshadowed this tool's release today.

ByteDance's "Bagel" model

Apache 2.0 open sourced. Works fairly well too on their demo.





2/6
@AIWarper
GitHub - ByteDance-Seed/Bagel



3/6
@peteromallet
Holy smokes that’s cool, has anyone tested?



4/6
@AIWarper
Just in their browser tool



5/6
@bowtiedwhitebat
dey all ai?



6/6
@NtRqC21USkYxjNs
Can it run on 4090 24g




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196






1/4
@christiancooper
ByteDance just dropped a model they claim with "emerging" intelligence trained on 5T tokens. I tend to believe them.

Called the Open-Source Unified Model

Me: Draw me a small section of The Maestà by Simone Martini

The Model: #BAGEL

[Quoted tweet]
As training scales from 0.2T → 5T tokens, we observe a clear evolution:
🔹 Basic understanding →
🔹 Text-to-image generation →
🔹 Rich editing →
🔹 3D manipulation & world navigation.
The intelligence is emerging 🚀




2/4
@christiancooper
Now draw a scene from memory of The Hours of Jeanne d'Évreux by Jean Pucelle





3/4
@christiancooper
Model endpoints are live:

GitHub - HarleyCoops/Bagel



4/4
@christiancooper
Wild....






To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196


[News] ByteDance Bagel - Multimodal 14B MOE 7b active model



Posted on Wed May 21 02:15:34 2025 UTC

/r/StableDiffusion/comments/1krmrd7/bytedance_bagel_multimodal_14b_moe_7b_active_model/

GitHub - ByteDance-Seed/Bagel

BAGEL: The Open-Source Unified Multimodal Model

Emerging Properties in Unified Multimodal Pretraining

So they release this multimodal model that actually creates images and they show on a benchmark it beating flux on GitHub - djghosh13/geneval: GenEval: An object-focused framework for evaluating text-to-image alignment (which I'm not familiar with but seems to be addressing prompt adherence with objects)
 


A Step-by-Step Coding Guide to Efficiently Fine-Tune Qwen3-14B Using Unsloth AI on Google Colab with Mixed Datasets and LoRA Optimization​


By Asif Razzaq

May 20, 2025

Fine-tuning LLMs often requires extensive resources, time, and memory, challenges that can hinder rapid experimentation and deployment. Unsloth AI streamlines this process by enabling fast, efficient fine-tuning of state-of-the-art models like Qwen3-14B with minimal GPU memory, leveraging advanced techniques such as 4-bit quantization and LoRA (Low-Rank Adaptation). In this tutorial, we walk through a practical implementation on Google Colab to fine-tune Qwen3-14B on a combination of reasoning and instruction-following datasets; by combining Unsloth’s FastLanguageModel utilities with trl’s SFTTrainer, users can achieve powerful fine-tuning performance with just consumer-grade hardware.

Code:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub hf_transfer
    !pip install --no-deps unsloth

We install all the essential libraries required for fine-tuning the Qwen3 model using Unsloth AI. It conditionally installs dependencies based on the environment, using a lightweight approach on Colab to ensure compatibility and reduce overhead. Key components like bitsandbytes, trl, xformers, and unsloth_zoo are included to enable 4-bit quantized training and LoRA-based optimization.



Code:
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-14B",
    max_seq_length = 2048,
    load_in_4bit = True,
    load_in_8bit = False,
    full_finetuning = False,
)

We load the Qwen3-14B model using FastLanguageModel from the Unsloth library, which is optimized for efficient fine-tuning. It initializes the model with a context length of 2048 tokens and loads it in 4-bit precision, significantly reducing memory usage. Full fine-tuning is disabled, making it suitable for lightweight parameter-efficient techniques like LoRA.

Code:
model = FastLanguageModel.get_peft_model(
    model,
    r = 32,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 32,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

We apply LoRA (Low-Rank Adaptation) to the Qwen3 model using FastLanguageModel.get_peft_model. It injects trainable adapters into specific transformer layers (like q_proj, v_proj, etc.) with a rank of 32, enabling efficient fine-tuning while keeping most model weights frozen. Using “unsloth” gradient checkpointing further optimizes memory usage, making it suitable for training large models on limited hardware.

Code:
from datasets import load_dataset

reasoning_dataset = load_dataset("unsloth/OpenMathReasoning-mini", split="cot")
non_reasoning_dataset = load_dataset("mlabonne/FineTome-100k", split="train")

We load two pre-curated datasets from the Hugging Face Hub using the datasets library. The reasoning_dataset contains chain-of-thought (CoT) problems from Unsloth’s OpenMathReasoning-mini, designed to enhance logical reasoning in the model. The non_reasoning_dataset pulls general instruction-following data from mlabonne’s FineTome-100k, which helps the model learn broader conversational and task-oriented skills. Together, these datasets support a well-rounded fine-tuning objective.

Code:
def generate_conversation(examples):
    problems  = examples["problem"]
    solutions = examples["generated_solution"]
    conversations = []
    for problem, solution in zip(problems, solutions):
        conversations.append([
            {"role": "user", "content": problem},
            {"role": "assistant", "content": solution},
        ])
    return {"conversations": conversations}

This function, generate_conversation, transforms raw question–answer pairs from the reasoning dataset into a chat-style format suitable for fine-tuning. For each problem and its corresponding generated solution, it constructs a conversation in which the user asks the question and the assistant provides the solution. The output is a list of dictionaries following the structure expected by chat-based language models, preparing the data for tokenization with a chat template.

Code:
reasoning_conversations = tokenizer.apply_chat_template(
    reasoning_dataset["conversations"],
    tokenize=False,
)

from unsloth.chat_templates import standardize_sharegpt
dataset = standardize_sharegpt(non_reasoning_dataset)

non_reasoning_conversations = tokenizer.apply_chat_template(
    dataset["conversations"],
    tokenize=False,
)

import pandas as pd

chat_percentage = 0.75
non_reasoning_subset = pd.Series(non_reasoning_conversations).sample(
    int(len(reasoning_conversations) * (1.0 - chat_percentage)),
    random_state=2407,
)

data = pd.concat([
    pd.Series(reasoning_conversations),
    pd.Series(non_reasoning_subset)
])
data.name = "text"

We prepare the fine-tuning dataset by converting the reasoning and instruction datasets into a consistent chat format and then combining them. It first applies the tokenizer's apply_chat_template to convert structured conversations into tokenizable strings. The standardize_sharegpt function normalizes the instruction dataset into a compatible structure. Then, instruction conversations are subsampled to roughly 25% of the reasoning set's size and concatenated with the reasoning data, so reasoning examples dominate the blend. This mix exposes the model to both logical reasoning and general instruction-following tasks, improving its versatility during training. The final combined data is stored as a single-column Pandas Series named "text".
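
As a quick sanity check on the blend (not part of the original notebook), you can print the sizes of both components and the reasoning share of the combined Series:

Code:
# Optional sanity check: inspect how the reasoning and instruction data are mixed.
print("reasoning conversations:  ", len(reasoning_conversations))
print("instruction conversations:", len(non_reasoning_subset))
print(f"reasoning share of mix:    {len(reasoning_conversations) / len(data):.1%}")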

Code:
from datasets import Dataset

combined_dataset = Dataset.from_pandas(pd.DataFrame(data))
combined_dataset = combined_dataset.shuffle(seed=3407)

from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=combined_dataset,
    eval_dataset=None,  
    args=SFTConfig(
        dataset_text_field="text",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=30,
        learning_rate=2e-4,
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        report_to="none",
    )
)

We take the preprocessed conversations, wrap them into a Hugging Face Dataset (ensuring the data is in a consistent format), and shuffle it with a fixed seed for reproducibility. Then, the fine-tuning trainer is initialized using trl's SFTTrainer and SFTConfig. The trainer is set up to use the combined dataset (with dataset_text_field pointing to the "text" column) and defines training hyperparameters such as batch size, gradient accumulation, warmup and training steps, learning rate, optimizer settings, and a linear learning rate scheduler. This configuration is geared toward efficient fine-tuning while maintaining reproducibility and minimal logging (with report_to="none").

Code:
trainer.train()

trainer.train() starts the fine-tuning process for the Qwen3-14B model using the SFTTrainer. It trains the model on the prepared mixed dataset of reasoning and instruction-following conversations, optimizing only the LoRA-adapted parameters thanks to the underlying Unsloth setup. Training will proceed according to the configuration specified earlier (e.g., max_steps=30, batch_size=2, lr=2e-4), and progress will be printed every logging step. This final command launches the actual model adaptation based on your custom data.
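
After training finishes, a quick look at peak GPU memory (an optional step, not in the original notebook) is a simple way to verify that the 4-bit + LoRA setup really stays within Colab-class hardware:

Code:
# Optional: report peak reserved GPU memory after the training run.
import torch
print(f"Peak reserved GPU memory: {torch.cuda.max_memory_reserved() / 1024**3:.2f} GB")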

Code:
model.save_pretrained("qwen3-finetuned-colab")
tokenizer.save_pretrained("qwen3-finetuned-colab")

We save the fine-tuned model and tokenizer locally to the "qwen3-finetuned-colab" directory. By calling save_pretrained(), the adapted weights and tokenizer configuration can be reloaded later for inference or further training, either from the local directory or after uploading it to the Hugging Face Hub.
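
As a rough reload-and-generate sketch (assuming the directory saved above and the same Unsloth environment; the prompt and generation settings are illustrative only), the adapter can be loaded back and queried like this:

Code:
# Minimal inference sketch; the directory name matches the save step above.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "qwen3-finetuned-colab",   # local LoRA checkpoint saved above
    max_seq_length = 2048,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)       # switch to Unsloth's faster inference path

messages = [{"role": "user", "content": "What is 15% of 240?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt = True, return_tensors = "pt"
).to(model.device)

outputs = model.generate(input_ids = inputs, max_new_tokens = 128)
print(tokenizer.decode(outputs[0], skip_special_tokens = True))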

In conclusion, Unsloth AI makes fine-tuning massive LLMs like Qwen3-14B feasible on limited resources while keeping the process efficient and accessible. This tutorial demonstrated how to load a 4-bit quantized version of the model, apply structured chat templates, mix multiple datasets for better generalization, and train using TRL's SFTTrainer. Whether you're building custom assistants or specialized domain models, Unsloth's tools dramatically reduce the barrier to fine-tuning at scale. As open-source fine-tuning ecosystems evolve, Unsloth continues to lead the way in making LLM training faster, cheaper, and more practical for everyone.




Check out the COLAB NOTEBOOK. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 95k+ ML SubReddit and subscribe to our Newsletter.

 

Enhancing Language Model Generalization: Bridging the Gap Between In-Context Learning and Fine-Tuning​


By Sajjad Ansari

May 20, 2025

Language models (LMs) have great capabilities as in-context learners when pretrained on vast internet text corpora, allowing them to generalize effectively from just a few task examples. However, fine-tuning these models for downstream tasks presents significant challenges. Fine-tuning typically requires hundreds to thousands of examples, and even then the resulting generalization patterns show limitations. For example, models fine-tuned on statements like "B's mother is A" struggle to answer related questions like "Who is A's son?", even though the same LMs can handle such reverse relations in context. This raises questions about how in-context learning and fine-tuning differ in their generalization patterns, and how these differences should inform adaptation strategies for downstream tasks.

Research into improving LMs’ adaptability has followed several key approaches. In-context learning studies have examined learning and generalization patterns through empirical, mechanistic, and theoretical analyses. Out-of-context learning research explores how models utilize information not explicitly included in prompts. Data augmentation techniques use LLMs to enhance performance from limited datasets, with specific solutions targeting issues like the reversal curse through hardcoded augmentations, deductive closure training, and generating reasoning pathways. Moreover, synthetic data approaches have evolved from early hand-designed data to improve generalization in domains like linguistics or mathematics to more recent methods that generate data directly from language models.

Researchers from Google DeepMind and Stanford University have constructed several datasets that isolate knowledge from pretraining data to create clean generalization tests. Performance is evaluated across various generalization types by exposing pretrained models to controlled information subsets, both in-context and through fine-tuning. Their findings reveal that in-context learning shows more flexible generalization than fine-tuning in data-matched settings, though there are some exceptions where fine-tuning can generalize to reversals within larger knowledge structures. Building on these insights, the researchers developed a method that enhances fine-tuning generalization by incorporating in-context inferences into the fine-tuning data.

Researchers employ multiple datasets carefully designed to isolate specific generalization challenges or insert them within broader learning contexts. Evaluation relies on multiple-choice likelihood scoring without providing answer choices in context. The experiments involve fine-tuning Gemini 1.5 Flash using batch sizes of 8 or 16. For in-context evaluation, the researchers combine training documents as context for the instruction-tuned model, randomly subsampling by 8x for larger datasets to minimize interference issues. The key innovation is a dataset augmentation approach using in-context generalization to enhance fine-tuning dataset coverage. This includes local and global strategies, each employing distinct contexts and prompts.
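
To make the augmentation idea concrete, here is a schematic sketch (with hypothetical helper names; the paper's actual prompts and "local"/"global" strategies are not reproduced): an instruction-tuned model is shown a training statement in context, asked to spell out its inferences, and those generated statements are appended to the fine-tuning set.

Code:
# Schematic illustration of augmenting fine-tuning data with in-context inferences.
# generate_fn is a stand-in for a call to an instruction-tuned LM (assumption).

def augment_with_in_context_inferences(train_docs, generate_fn):
    """train_docs: factual statements such as B's mother is A."""
    augmented = list(train_docs)
    for doc in train_docs:
        prompt = (
            "Here is a fact:\n"
            f"{doc}\n"
            "List restatements and reversals of this fact, one per line."
        )
        inferences = generate_fn(prompt)     # in-context inference step
        augmented.extend(line for line in inferences.splitlines() if line.strip())
    return augmented

# Example with a stub generator; a real setup would call the LM here.
docs = ["B's mother is A."]
stub = lambda prompt: "A is B's mother.\nA's child is B."
print(augment_with_in_context_inferences(docs, stub))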

On the Reversal Curse dataset, in-context learning achieves near-ceiling performance on reversals, while conventional fine-tuning shows near-zero accuracy as models favor incorrect celebrity names seen during training. Fine-tuning with data augmented by in-context inferences matches the high performance of pure in-context learning. Testing on simple nonsense reversals reveals similar patterns, though with less pronounced benefits. For simple syllogisms, while the pretrained model performs at chance level (indicating no data contamination), fine-tuning does produce above-chance generalization for certain syllogism types where logical inferences align with simple linguistic patterns. However, in-context learning outperforms fine-tuning, with augmented fine-tuning showing the best overall results.



In conclusion, this paper explores generalization differences between in-context learning and fine-tuning when LMs face novel information structures. Results show in-context learning's superior generalization for certain inference types, prompting the researchers to develop methods that enhance fine-tuning performance by incorporating in-context inferences into training data. Despite promising outcomes, several limitations affect the study. The first is the dependency on nonsense words and implausible operations. Second, the research focuses on specific LMs, limiting the generality of the results. Future research should investigate learning and generalization differences across various models, especially newer reasoning models, to expand upon these findings.




Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 95k+ ML SubReddit and subscribe to our Newsletter.

 


Sampling Without Data is Now Scalable: Meta AI Releases Adjoint Sampling for Reward-Driven Generative Modeling​


By Asif Razzaq

May 21, 2025

Data Scarcity in Generative Modeling


Generative models traditionally rely on large, high-quality datasets to produce samples that replicate the underlying data distribution. However, in fields like molecular modeling or physics-based inference, acquiring such data can be computationally infeasible or even impossible. Instead of labeled data, only a scalar reward—typically derived from a complex energy function—is available to judge the quality of generated samples. This presents a significant challenge: how can one train generative models effectively without direct supervision from data?

Meta AI Introduces Adjoint Sampling, a New Learning Algorithm Based on Scalar Rewards


Meta AI tackles this challenge with Adjoint Sampling, a novel learning algorithm designed for training generative models using only scalar reward signals. Built on the theoretical framework of stochastic optimal control (SOC), Adjoint Sampling reframes the training process as an optimization task over a controlled diffusion process. Unlike standard generative models, it does not require explicit data. Instead, it learns to generate high-quality samples by iteratively refining them using a reward function, often derived from physical or chemical energy models.

Adjoint Sampling excels in scenarios where only an unnormalized energy function is accessible. It produces samples that align with the target distribution defined by this energy, bypassing the need for corrective methods like importance sampling or MCMC, which are computationally intensive.
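
As a tiny illustration of this setting (an expository assumption, not taken from the paper), the only training signal is an unnormalized energy E(x): the target density is proportional to exp(-E(x)), and its normalizing constant is unknown.

Code:
# A 1-D double-well energy as a stand-in reward; the target is p(x) ∝ exp(-E(x)).
def energy(x):
    return (x**2 - 1.0)**2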

Source: Adjoint Sampling: Highly Scalable Diffusion Samplers via Adjoint Matching

Technical Details


The foundation of Adjoint Sampling is a stochastic differential equation (SDE) that models how sample trajectories evolve. The algorithm learns a control drift u(x, t) such that the final state of these trajectories approximates a desired distribution (e.g., Boltzmann). A key innovation is its use of Reciprocal Adjoint Matching (RAM), a loss function that enables gradient-based updates using only the initial and final states of sample trajectories. This sidesteps the need to backpropagate through the entire diffusion path, greatly improving computational efficiency.
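
To make the controlled-diffusion setup concrete, here is a schematic Euler-Maruyama simulation of an SDE with a learned drift u(x, t). This is only an illustration of the general stochastic-optimal-control framing (the network, step sizes, and base process are assumptions), not Meta's Adjoint Sampling algorithm or the RAM loss itself; a reward such as the double-well energy above would supply the training signal.

Code:
# Schematic controlled diffusion dX_t = u(X_t, t) dt + sigma dW_t (Euler-Maruyama).
# Illustrative only; not Meta's actual implementation.
import torch
import torch.nn as nn

class Drift(nn.Module):
    """Small network approximating the control drift u(x, t)."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 64), nn.SiLU(), nn.Linear(64, dim))

    def forward(self, x, t):
        t_feat = t.expand(x.shape[0], 1)          # broadcast scalar time to the batch
        return self.net(torch.cat([x, t_feat], dim=-1))

def simulate(drift, n_samples=128, dim=2, n_steps=100, sigma=1.0):
    """Integrate the controlled SDE from t=0 to t=1; returns terminal states."""
    dt = 1.0 / n_steps
    x = torch.zeros(n_samples, dim)               # base process starts at the origin
    for i in range(n_steps):
        t = torch.tensor([[i * dt]])
        x = x + drift(x, t) * dt + sigma * (dt ** 0.5) * torch.randn_like(x)
    return x

samples = simulate(Drift(dim=2))
print(samples.shape)   # torch.Size([128, 2])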

By sampling from a known base process and conditioning on terminal states, Adjoint Sampling constructs a replay buffer of samples and gradients, allowing multiple optimization steps per sample. This on-policy training method provides scalability unmatched by previous approaches, making it suitable for high-dimensional problems like molecular conformer generation.

Moreover, Adjoint Sampling supports geometric symmetries and periodic boundary conditions, enabling models to respect molecular invariances like rotation, translation, and torsion. These features are crucial for physically meaningful generative tasks in chemistry and physics.

Performance Insights and Benchmark Results


Adjoint Sampling achieves state-of-the-art results in both synthetic and real-world tasks. On synthetic benchmarks such as the Double-Well (DW-4) and Lennard-Jones (LJ-13 and LJ-55) potentials, it significantly outperforms baselines like DDS and PIS, especially in energy-evaluation efficiency. For example, where DDS and PIS require 1,000 evaluations per gradient update, Adjoint Sampling uses only three, with similar or better performance in Wasserstein distance and effective sample size (ESS).

In a practical setting, the algorithm was evaluated on large-scale molecular conformer generation using the eSEN energy model trained on the SPICE-MACE-OFF dataset. Adjoint Sampling, especially its Cartesian variant with pretraining, achieved up to 96.4% recall and 0.60 Å mean RMSD, surpassing RDKit ETKDG—a widely used chemistry-based baseline—across all metrics. The method generalizes well to the GEOM-DRUGS dataset, showing substantial improvements in recall while maintaining competitive precision.



The algorithm’s ability to explore the configuration space broadly, aided by its stochastic initialization and reward-based learning, results in greater conformer diversity—critical for drug discovery and molecular design.

Conclusion: A Scalable Path Forward for Reward-Driven Generative Models


Adjoint Sampling represents a major step forward in generative modeling without data. By leveraging scalar reward signals and an efficient on-policy training method grounded in stochastic control, it enables scalable training of diffusion-based samplers with minimal energy evaluations. Its integration of geometric symmetries and its ability to generalize across diverse molecular structures position it as a foundational tool in computational chemistry and beyond.




Check out the Paper, Model on Hugging Face and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 95k+ ML SubReddit and subscribe to our Newsletter.

 


Meta Introduces KernelLLM: An 8B LLM that Translates PyTorch Modules into Efficient Triton GPU Kernels​


By Sana Hassan

May 20, 2025

Meta has introduced KernelLLM, an 8-billion-parameter language model fine-tuned from Llama 3.1 Instruct, aimed at automating the translation of PyTorch modules into efficient Triton GPU kernels. This initiative seeks to lower the barriers to GPU programming by simplifying kernel development processes.

Technical Overview


KernelLLM is trained on approximately 25,000 paired examples of PyTorch modules and their corresponding Triton kernel implementations. The dataset, known as KernelBook, comprises filtered code from The Stack and synthetically generated samples using torch.compile() and other prompting techniques.

The model employs a supervised instruction tuning approach, utilizing prompt templates that include format examples during both training and evaluation. Training was conducted over 10 epochs with a batch size of 32, using 16 GPUs over approximately 12 hours (192 GPU hours).
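
As a rough usage sketch (the model id "facebook/KernelLLM" and the prompt wording below are assumptions; check the Hugging Face model card for the exact id and the prompt template the model was trained with), the checkpoint can be queried like any causal LM:

Code:
# Hypothetical prompting sketch for KernelLLM via transformers; id and prompt assumed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/KernelLLM"                  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

pytorch_module = '''
import torch
class Model(torch.nn.Module):
    def forward(self, x, y):
        return x * y + y
'''
prompt = f"Rewrite the following PyTorch module as an efficient Triton kernel:\n{pytorch_module}\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))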



Performance Evaluation


KernelLLM’s performance was assessed using KernelBench-Triton, a benchmark designed to evaluate the generation of Triton kernels from PyTorch modules. The model achieved a Pass@1 score of 20.2, outperforming larger models such as GPT-4o (~200B parameters) and DeepSeek V3 (671B parameters), which scored 15 and 16 respectively. With multiple inferences, KernelLLM’s Pass@10 and Pass@20 scores reached 51.8 and 57.1, indicating robust performance in generating correct kernels.
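
For context, Pass@k numbers like these are usually computed with the standard unbiased estimator from the HumanEval setup; the snippet below shows that general formula and is not taken from the KernelLLM report.

Code:
# Standard unbiased pass@k estimator (HumanEval-style), shown for context only.
from math import comb

def pass_at_k(n, c, k):
    """n: samples generated per task, c: correct samples, k: evaluation budget."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g., 20 generations per task with 8 correct:
print(pass_at_k(20, 8, 1), pass_at_k(20, 8, 10))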

Implications for GPU Programming


By automating the generation of Triton kernels from PyTorch modules, KernelLLM has the potential to streamline the development of GPU-accelerated applications. This could be particularly beneficial for developers seeking to optimize performance without delving into the complexities of manual kernel programming.

The model’s ability to produce efficient kernels may also contribute to more accessible and efficient utilization of GPU resources, potentially impacting areas such as deep learning model training and inference.




Check out the Model on Hugging Face. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 95k+ ML SubReddit and subscribe to our Newsletter.

 


1/11
@AngryTomtweets
AI drama is insane...

This is SkyReels V2, the world’s first open-source AI video tool that lets you make videos of any length with jaw-dropping quality.

Smarter prompts, epic quality and 100% open-source.

Here's how it works:

https://video.twimg.com/amplify_video/1925570618235527169/vid/avc1/940x720/tzEn9yaB53aDLKbt.mp4

2/11
@AngryTomtweets
Meet SkyReels V2 - The world’s first open-source AI video tool that lets you make videos of any length for free!

It’s a game-changer for all the creative industry...

Try here: SkyReels|Visualize Your Story

3/11
@AngryTomtweets
1. Smarter prompts

- SkyCaptioner-V1 turns your ideas into pro-level storyboards
- Making your vision come to life effortlessly.

https://video.twimg.com/amplify_video/1925570694194352128/vid/avc1/1280x720/wyqP9wDIDPmH3aNm.mp4

4/11
@AngryTomtweets
2. Epic quality

- Smooth, cinematic visuals with no time limits
- Perfect for everything from short clips to full movies.

https://video.twimg.com/amplify_video/1925570757327036419/vid/avc1/1280x720/BAVihBnO7KxkVsNb.mp4

5/11
@AngryTomtweets
3. 100% Open-Source

- Free to use on GitHub: SkyworkAI/SkyReels-V2 (infinite-length film generative model with SkyTools), runnable on everyday GPUs.
- It beats top closed-source tools in VBench scores!

6/11
@AngryTomtweets
4. Unlimited duration for seamless storytelling

- SkyReels-V2 can make videos go on and on without stopping, while keeping them looking good and consistent.

https://video.twimg.com/amplify_video/1925570832602148864/vid/avc1/720x720/FjrnJq-Q2qrMw1lV.mp4

7/11
@AngryTomtweets
5. Generate B-rolls

- Use over 400+ natural human actions
- Ideal for building cinematic sequences and detailed storyboards

https://video.twimg.com/amplify_video/1925570919294279681/vid/avc1/1280x720/l365kJ7IlbHRNtLh.mp4

8/11
@AngryTomtweets
6. Train your custom video effect (LoRA)

- Upload files with a similar visual style or content and start the training.
- The model will gradually learn the patterns and features, ultimately producing stable, high-quality results in the desired style.

https://video.twimg.com/amplify_video/1925570984679219200/vid/avc1/1076x720/-e7SA56c5guQgzTJ.mp4

9/11
@AngryTomtweets
What are you waiting for?

Try SkyReels - #SkyReels here:

SkyReels|Visualize Your Story

10/11
@mhdfaran
Open-source is the way to go

11/11
@AngryTomtweets
Yes… 100%


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 