bnew

[Resources] I'm curating a list of every OCR out there and running tests on their features. Contribution welcome!



Posted on Thu Jul 10 16:09:38 2025 UTC


Hi! I'm compiling a list of document parsers available on the market and testing their feature coverage.

So far, I've tested 14 OCRs/parsers for tables, equations, handwriting, two-column layouts, and multiple-column layouts. You can view the outputs from each parser in the `results` folder. The ones I've tested are mostly open source or have generous free quotas.

🚩 Coming soon: benchmarks for each OCR - score from 0 (doesn't work) to 5 (perfect)

Feedback & contribution are welcome!
 

bnew


1/1
@rohanpaul_ai
Multimodal reasoning models often ignore the image and guess from text, so answers break.

Perception-Aware Policy Optimization (PAPO) trains the policy to notice vision by punishing it when a masked image changes nothing.

The paper checks failures first, finding that 67% of mistakes come from bad perception.

Standard GRPO reward cares only about format and final numeric answer, not about looking.

PAPO adds a KL divergence loss between logits on the real image and a randomly masked version.

If the two distributions diverge, the model clearly used visual clues, and the loss rewards that.

Training Qwen2.5 VL 3B with this trick lifts average accuracy by roughly 4% and cuts perception errors by 30%.

A double entropy term stops collapse when the KL weight grows too high.

Extra compute is a second forward pass, about 50 seconds per step on 4 H100s.
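
A minimal sketch of how such a masked-image KL term could be wired into the objective. The policy call signature, the patch-masking scheme, and the 0.01 weight are illustrative assumptions rather than the paper's code, and the double entropy regularizer is omitted.

import torch
import torch.nn.functional as F

def mask_patches(pixel_values, ratio=0.6, patch=14):
    # Zero out a random fraction of image patches (illustrative masking scheme).
    b, c, h, w = pixel_values.shape
    ph, pw = h // patch, w // patch
    keep = (torch.rand(b, 1, ph, pw, device=pixel_values.device) > ratio).float()
    keep = keep.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    return pixel_values * keep

def perception_bonus(policy, input_ids, pixel_values, kl_weight=0.01):
    # Compare next-token distributions with the real image vs. a masked one.
    logits_real = policy(input_ids=input_ids, pixel_values=pixel_values).logits
    logits_masked = policy(input_ids=input_ids,
                           pixel_values=mask_patches(pixel_values)).logits
    log_p_real = F.log_softmax(logits_real, dim=-1)
    p_masked = F.softmax(logits_masked, dim=-1)
    # KL(masked || real) per token: large values mean the image actually
    # changed the prediction, i.e. the model looked.
    kl = F.kl_div(log_p_real, p_masked, reduction="none").sum(-1).mean()
    # Added to the GRPO objective, so maximizing reward also rewards "looking".
    return kl_weight * kl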

----

Paper – arxiv.org/abs/2507.06448

Paper Title: "Perception-Aware Policy Optimization for Multimodal Reasoning"

GveiK2IXYAAnLMn.jpg



To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew


1/1
@rohanpaul_ai
LLMs lose track when a prompt grows beyond tens of thousands of tokens.

PERK solves this by writing the long context into a tiny LoRA adapter during inference.

The document is split into 256-token clips, processed together, and their gist is stored as weights.

The frozen base model then answers using that new memory, not the raw text.

Two nested training loops prepare this trick.

The inner loop learns to compress clips.

The outer loop tunes the starting adapter while skipping most inner gradients to save memory.

Only adapter parameters move, so training and inference stay light.

A 127M GPT‑2 with PERK beat larger 1.4B baselines by around 20% and gained 90% over its own prompt version.

With a 0.5B Qwen, PERK held steady from 1K to 32K tokens and handled 128K tokens on one H100 with 35.2GB.

Position shifts and extra reasoning hops barely hurt accuracy, while in‑context baselines collapse.

PERK turns long‑context reasoning into quick parameter learning instead of prompt juggling.
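
A rough sketch of the test-time step, assuming a Hugging Face causal LM plus the peft library. The plain next-token "memorization" objective and the hyperparameters are stand-ins; the paper's meta-learned adapter initialization and truncated inner gradients are not shown.

import torch
from peft import LoraConfig, get_peft_model

def write_context_into_adapter(base_model, tokenizer, document,
                               clip_tokens=256, passes=3, lr=1e-4):
    # Wrap the frozen base model with a small LoRA adapter; PEFT leaves only
    # the adapter parameters trainable.
    cfg = LoraConfig(r=8, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
    model = get_peft_model(base_model, cfg)
    opt = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=lr)

    ids = tokenizer(document, return_tensors="pt").input_ids[0]
    clips = [ids[i:i + clip_tokens] for i in range(0, len(ids), clip_tokens)]

    for _ in range(passes):
        for clip in clips:
            clip = clip.unsqueeze(0)
            # A plain next-token objective stands in for the inner-loop
            # compression step: the clip's gist ends up in the adapter weights.
            loss = model(input_ids=clip, labels=clip).loss
            loss.backward()
            opt.step()
            opt.zero_grad()
    return model  # frozen base + tiny adapter that now "remembers" the document

# Afterwards the model answers from its weights, not from a long prompt:
# model.generate(**tokenizer(question, return_tensors="pt"))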

----

Paper – arxiv.org/abs/2507.06415

Paper Title: "PERK: Long-Context Reasoning as Parameter-Efficient Test-Time Learning"

GvehgS5W4AEqn96.png



 

bnew


1/1
@rohanpaul_ai
Current code tests are few and similar, so language models look flawless.

The paper proves those suites miss many real bugs in generated programs.

It builds TCGBench and simple metrics that show random input sampling quickly plateaus.

To push past that wall, the authors propose SAGA, a human-plus-LLM testing routine.

SAGA learns rules from correct code and blind spots from wrong code, then synthesizes sharp edge cases.

With 50 cases it spots 90.62% of bad submissions and corrects 32.58% of weak verifiers, far ahead of baselines.

A distilled 7B model, TCGCoder, scales the idea and powers CodeCompass, a benchmark that lowers model pass scores by about 10%.

Better tests also stop reward hacking in reinforcement learning pipelines, giving truer feedback.
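
A rough sketch of what that rule-and-blind-spot loop could look like. The llm callable, the prompts, and the single-reference labeling step are assumptions for illustration, not SAGA's actual pipeline.

import subprocess

def run(solution_path, stdin_text, timeout=5):
    # Execute one candidate solution on one input and capture stdout (illustrative).
    done = subprocess.run(["python", solution_path], input=stdin_text,
                          capture_output=True, text=True, timeout=timeout)
    done.check_returncode()
    return done.stdout

def generate_adversarial_tests(llm, problem, correct_paths, wrong_paths, n_cases=50):
    read = lambda paths: "\n\n".join(open(p).read() for p in paths)
    # 1. Distill input rules from solutions known to be correct.
    rules = llm("State the input constraints and invariants these accepted "
                "solutions rely on:\n" + read(correct_paths))
    # 2. Mine blind spots from solutions known to be wrong.
    blind_spots = llm("These solutions fail hidden tests. Describe the edge cases "
                      "they mishandle:\n" + read(wrong_paths))
    # 3. Synthesize inputs that respect the rules but hit the blind spots.
    raw = llm(f"Problem: {problem}\nConstraints: {rules}\nBlind spots: {blind_spots}\n"
              f"Write {n_cases} test inputs, one per line.")
    cases = [line for line in raw.splitlines() if line.strip()]
    # 4. Label each input with a trusted reference's output so expected answers
    #    are well defined; drop inputs the reference cannot handle.
    tests = []
    for x in cases:
        try:
            tests.append((x, run(correct_paths[0], x)))
        except Exception:
            pass
    return tests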

----

Paper – arxiv.org/abs/2507.06920

Paper Title: "Rethinking Verification for LLM Code Generation: From Generation to Testing"

GvehPVVXUAAhID9.png



 

bnew


1/1
@rohanpaul_ai
Current large language provers weld reasoning and proof steps together, so they often miss hard Olympiad problems.

This paper shows that splitting those duties finally cracks 5 recent IMO challenges.

Researchers first note a stubborn gap.

Generative models routinely reach 80% informal accuracy on Putnam style tasks, yet formal success lingers near 8%.

That drop traces to reward schemes that favor quick tactic scripts and erode deeper thinking.

The team keeps each talent where it shines.

A general purpose Reasoner only proposes formal lemmas, while a lean Prover checks them, filters failures, then stitches the proof using those trusted pieces.

With that loop and 128 proof attempts per lemma, the system solves IMO 2000 P2, 2005 P3, 2011 P3, 2019 P1 and 2020 P2, a first for open source tools, and it does so in Lean.

All verified subgoals arrive in a public dataset, giving mathematicians new handles and giving provers a tougher gym for future work.
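
A pseudocode sketch of the decoupled loop described above. Object names and the lean_check hook are assumptions; the 128-attempt budget comes from the post.

def solve_decoupled(problem, reasoner, prover, lean_check, attempts_per_lemma=128):
    # The Reasoner only proposes formal lemma statements; it never writes tactics.
    lemmas = reasoner.propose_lemmas(problem)
    verified = []
    for lemma in lemmas:
        # The Prover gets up to 128 shots per lemma.
        for _ in range(attempts_per_lemma):
            proof = prover.prove(lemma, context=verified)
            if lean_check(lemma, proof):       # only Lean-verified pieces survive
                verified.append((lemma, proof))
                break
    # The final proof is stitched together using only trusted, verified lemmas.
    return prover.prove(problem, context=verified)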

----

Paper – arxiv.org/abs/2507.06804

Paper Title: "Towards Solving More Challenging IMO Problems via Decoupled Reasoning and Proving"

Gvegv9HakAE2H_8.jpg



 

bnew


1/1
@rohanpaul_ai
Diff‑Mamba shows that subtracting two parallel Mamba paths cuts noise and boosts long‑context recall in language models.

The study records lower perplexity, faster convergence, and stronger retrieval than vanilla Mamba on WikiText‑103, Enwik8, Text8, PG19, and BABILong.

⚙️ Mamba spreads focus too widely and keeps irrelevant tokens alive through layers, hurting precision and wasting compute.

Since Mamba lacks the softmax damping of Transformers, its state‑space mix evenly weights everything it sees.

Diff‑Mamba feeds the same input into two lighter copies, normalizes their outputs, then subtracts one from the other with a learned lambda.

This cancellation drops shared noise yet leaves task signals intact, much like two microphones in a conference call.

In benchmarks, a 12‑layer variant cuts perplexity by 0.4 on WikiText‑103 and halves error on 64K‑token BABILong queries.

Its hidden states show higher signal‑to‑noise when examined with the Tuned Lens probe.

Training curves also flatten sooner, hinting that less noise smooths gradients.

Because subtraction runs inside one fused kernel, inference latency stays close to original Mamba.
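
A minimal sketch of the subtraction trick in PyTorch, assuming two off-the-shelf mamba_ssm blocks stand in for the paper's lighter internal branches; layer sizes and the placement of the norms are assumptions.

import torch
import torch.nn as nn
from mamba_ssm import Mamba  # pip install mamba-ssm; used here as a stand-in branch

class DiffMambaBlock(nn.Module):
    # Two parallel branches, normalized, then subtracted with a learned lambda
    # so that noise common to both branches cancels out.
    def __init__(self, d_model):
        super().__init__()
        self.branch_a = Mamba(d_model=d_model)
        self.branch_b = Mamba(d_model=d_model)
        self.norm_a = nn.LayerNorm(d_model)
        self.norm_b = nn.LayerNorm(d_model)
        self.lam = nn.Parameter(torch.tensor(0.5))  # learned subtraction weight

    def forward(self, x):                  # x: (batch, seq_len, d_model)
        a = self.norm_a(self.branch_a(x))
        b = self.norm_b(self.branch_b(x))
        return a - self.lam * b            # shared noise cancels, task signal survives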

----

Paper – arxiv.org/abs/2507.06204

Paper Title: "Differential Mamba"

Gvef9rFaIAAL3lQ.png



 

bnew


1/1
@rohanpaul_ai
Paper shows how to squeeze more skill from small web‑surfing LLMs without burning absurd compute.

Open agents copy a 70B teacher for a while then switch to reinforcement learning (RL), and that switch timing is the whole game.

Authors trained an 8B student on teacher demos, branched into RL at different checkpoints, and ran 1,370 mixes to see what sticks.

Early but not immediate branching beats pure imitation or pure RL, matching the best imitation score on MiniWoB++ using 55% of the FLOPs.

The same recipe narrows, but does not close, the WorkArena gap, hinting that hard office workflows still need richer data or bigger brains.

Bootstrapped stats reveal stable knobs: a decoding temperature of 0.25, zero-advantage filtering, grouped advantage, a big batch of 512, and a modest 1e‑6 learning rate.

The playbook gives smaller teams a clear, cheaper path to teach open models reliable multi‑step browser habits.
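
The stable knobs above, written out as a config sketch. Field names follow no particular RL library; this is a summary of the reported settings, not a drop-in recipe.

from dataclasses import dataclass

@dataclass
class WebAgentRLConfig:
    # Stage 1: imitate the 70B teacher's demos, but stop early rather than
    # training imitation to convergence.
    branch_from_sft: str = "early_checkpoint"   # early, not immediate, branch into RL
    # Stage 2: RL from that checkpoint, with the knobs the 1,370-run sweep found stable.
    decoding_temperature: float = 0.25
    zero_advantage_filtering: bool = True       # drop samples whose advantage is exactly 0
    grouped_advantage: bool = True              # group-relative baseline over sampled rollouts
    batch_size: int = 512
    learning_rate: float = 1e-6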

----

Paper – arxiv.org/abs/2507.04103

Paper Title: "How to Train Your LLM Web Agent: A Statistical Diagnosis"

Gveea5NaYAASj-V.png



 

bnew


Moonshot AI’s Kimi K2 outperforms GPT-4 in key benchmarks — and it’s free​


Michael Nuñez @MichaelFNunez

July 11, 2025 3:56 PM

Credit: VentureBeat made with Midjourney





Moonshot AI, the Chinese artificial intelligence startup behind the popular Kimi chatbot, released an open-source language model on Friday that directly challenges proprietary systems from OpenAI and Anthropic with particularly strong performance on coding and autonomous agent tasks.

The new model, called Kimi K2, features 1 trillion total parameters with 32 billion activated parameters in a mixture-of-experts architecture. The company is releasing two versions: a foundation model for researchers and developers, and an instruction-tuned variant optimized for chat and autonomous agent applications.

🚀 Hello, Kimi K2! Open-Source Agentic Model!

🔹 1T total / 32B active MoE model

🔹 SOTA on SWE Bench Verified, Tau2 & AceBench among open models

🔹 Strong in coding and agentic tasks

🐤 Multimodal & thought-mode not supported for now

With Kimi K2, advanced agentic intelligence… pic.twitter.com/PlRQNrg9JL

— Kimi.ai (@Kimi_Moonshot) July 11, 2025

“Kimi K2 does not just answer; it acts,” the company stated in its announcement blog. “With Kimi K2, advanced agentic intelligence is more open and accessible than ever. We can’t wait to see what you build.”

The model’s standout feature is its optimization for “agentic” capabilities — the ability to autonomously use tools, write and execute code, and complete complex multi-step tasks without human intervention. In benchmark tests, Kimi K2 achieved 65.8% accuracy on SWE-bench Verified, a challenging software engineering benchmark, outperforming most open-source alternatives and matching some proprietary models.



David meets Goliath: How Kimi K2 outperforms Silicon Valley’s billion-dollar models​


The performance metrics tell a story that should make executives at OpenAI and Anthropic take notice. Kimi K2-Instruct doesn’t just compete with the big players — it systematically outperforms them on tasks that matter most to enterprise customers.

On LiveCodeBench, arguably the most realistic coding benchmark available, Kimi K2 achieved 53.7% accuracy, decisively beating DeepSeek-V3‘s 46.9% and GPT-4.1‘s 44.7%. More striking still: it scored 97.4% on MATH-500 compared to GPT-4.1’s 92.4%, suggesting Moonshot has cracked something fundamental about mathematical reasoning that has eluded larger, better-funded competitors.

But here’s what the benchmarks don’t capture: Moonshot is achieving these results with a model that costs a fraction of what incumbents spend on training and inference. While OpenAI burns through hundreds of millions on compute for incremental improvements, Moonshot appears to have found a more efficient path to the same destination. It’s a classic innovator’s dilemma playing out in real time — the scrappy outsider isn’t just matching the incumbent’s performance, they’re doing it better, faster, and cheaper.

The implications extend beyond mere bragging rights. Enterprise customers have been waiting for AI systems that can actually complete complex workflows autonomously, not just generate impressive demos. Kimi K2’s strength on SWE-bench Verified suggests it might finally deliver on that promise.



The MuonClip breakthrough: Why this optimizer could reshape AI training economics​


Buried in Moonshot’s technical documentation is a detail that could prove more significant than the model’s benchmark scores: their development of the MuonClip optimizer, which enabled stable training of a trillion-parameter model “with zero training instability.”

This isn’t just an engineering achievement — it’s potentially a paradigm shift. Training instability has been the hidden tax on large language model development, forcing companies to restart expensive training runs, implement costly safety measures, and accept suboptimal performance to avoid crashes. Moonshot’s solution directly addresses exploding attention logits by rescaling weight matrices in query and key projections, essentially solving the problem at its source rather than applying band-aids downstream.
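
A hedged sketch of the qk-rescaling idea as described here: if a layer's largest attention logit exceeds some threshold, shrink the query and key projection weights so the logits come back down. The threshold and the even split of the scale between W_q and W_k are assumptions, not Moonshot's published recipe.

import torch

@torch.no_grad()
def clip_qk_weights(attn_layer, max_logit_observed, threshold=100.0):
    # If the largest attention logit in this layer exceeded the threshold,
    # rescale W_q and W_k so the bilinear q·k product comes back to the threshold.
    if max_logit_observed <= threshold:
        return
    scale = threshold / max_logit_observed
    # Logits are bilinear in W_q and W_k, so a sqrt split keeps the product bounded.
    attn_layer.q_proj.weight.mul_(scale ** 0.5)
    attn_layer.k_proj.weight.mul_(scale ** 0.5)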

The economic implications are staggering. If MuonClip proves generalizable — and Moonshot suggests it is — the technique could dramatically reduce the computational overhead of training large models. In an industry where training costs are measured in tens of millions of dollars, even modest efficiency gains translate to competitive advantages measured in quarters, not years.

More intriguingly, this represents a fundamental divergence in optimization philosophy. While Western AI labs have largely converged on variations of AdamW, Moonshot’s bet on Muon variants suggests they’re exploring genuinely different mathematical approaches to the optimization landscape. Sometimes the most important innovations come not from scaling existing techniques, but from questioning their foundational assumptions entirely.



Open source as competitive weapon: Moonshot’s radical pricing strategy targets big tech’s profit centers​


Moonshot’s decision to open-source Kimi K2 while simultaneously offering competitively priced API access reveals a sophisticated understanding of market dynamics that goes well beyond altruistic open-source principles.

At $0.15 per million input tokens for cache hits and $2.50 per million output tokens, Moonshot is pricing aggressively below OpenAI and Anthropic while offering comparable — and in some cases superior — performance. But the real strategic masterstroke is the dual availability: enterprises can start with the API for immediate deployment, then migrate to self-hosted versions for cost optimization or compliance requirements.

This creates a trap for incumbent providers. If they match Moonshot’s pricing, they compress their own margins on what has been their most profitable product line. If they don’t, they risk customer defection to a model that performs just as well for a fraction of the cost. Meanwhile, Moonshot builds market share and ecosystem adoption through both channels simultaneously.

The open-source component isn’t charity — it’s customer acquisition. Every developer who downloads and experiments with Kimi K2 becomes a potential enterprise customer. Every improvement contributed by the community reduces Moonshot’s own development costs. It’s a flywheel that leverages the global developer community to accelerate innovation while building competitive moats that are nearly impossible for closed-source competitors to replicate.



From demo to reality: Why Kimi K2’s agent capabilities signal the end of chatbot theater​


The demonstrations Moonshot shared on social media reveal something more significant than impressive technical capabilities—they show AI finally graduating from parlor tricks to practical utility.

Consider the salary analysis example: Kimi K2 didn’t just answer questions about data, it autonomously executed 16 Python operations to generate statistical analysis and interactive visualizations. The London concert planning demonstration involved 17 tool calls across multiple platforms — search, calendar, email, flights, accommodations, and restaurant bookings. These aren’t curated demos designed to impress; they’re examples of AI systems actually completing the kind of complex, multi-step workflows that knowledge workers perform daily.

This represents a philosophical shift from the current generation of AI assistants that excel at conversation but struggle with execution. While competitors focus on making their models sound more human, Moonshot has prioritized making them more useful. The distinction matters because enterprises don’t need AI that can pass the Turing test—they need AI that can pass the productivity test.

The real breakthrough isn’t in any single capability, but in the seamless orchestration of multiple tools and services. Previous attempts at “agent” AI required extensive prompt engineering, careful workflow design, and constant human oversight. Kimi K2 appears to handle the cognitive overhead of task decomposition, tool selection, and error recovery autonomously—the difference between a sophisticated calculator and a genuine thinking assistant.



The great convergence: When open source models finally caught the leaders​


Kimi K2’s release marks an inflection point that industry observers have predicted but rarely witnessed: the moment when open-source AI capabilities genuinely converge with proprietary alternatives.

Unlike previous “GPT killers” that excelled in narrow domains while failing on practical applications, Kimi K2 demonstrates broad competence across the full spectrum of tasks that define general intelligence. It writes code, solves mathematics, uses tools, and completes complex workflows—all while being freely available for modification and self-deployment.

This convergence arrives at a particularly vulnerable moment for the AI incumbents. OpenAI faces mounting pressure to justify its $300 billion valuation while Anthropic struggles to differentiate Claude in an increasingly crowded market. Both companies have built business models predicated on maintaining technological advantages that Kimi K2 suggests may be ephemeral.

The timing isn’t coincidental. As transformer architectures mature and training techniques democratize, the competitive advantages increasingly shift from raw capability to deployment efficiency, cost optimization, and ecosystem effects. Moonshot seems to understand this transition intuitively, positioning Kimi K2 not as a better chatbot, but as a more practical foundation for the next generation of AI applications.

The question now isn’t whether open-source models can match proprietary ones—Kimi K2 proves they already have. The question is whether the incumbents can adapt their business models fast enough to compete in a world where their core technology advantages are no longer defensible. Based on Friday’s release, that adaptation period just got considerably shorter.
 

bnew







1/11
@Kimi_Moonshot
🚀 Hello, Kimi K2! Open-Source Agentic Model!
🔹 1T total / 32B active MoE model
🔹 SOTA on SWE Bench Verified, Tau2 & AceBench among open models
🔹Strong in coding and agentic tasks
🐤 Multimodal & thought-mode not supported for now

With Kimi K2, advanced agentic intelligence is more open and accessible than ever. We can't wait to see what you build!

🔌 API is here: Moonshot AI - Open Platform
- $0.15 / million input tokens (cache hit)
- $0.60 / million input tokens (cache miss)
- $2.50 / million output tokens

🔗 Tech blog: Kimi K2: Open Agentic Intelligence
🔗 Weights & code: moonshotai (Moonshot AI)
🔗 Github: GitHub - MoonshotAI/Kimi-K2: Kimi K2 is the large language model series developed by Moonshot AI team
Try it now at http://Kimi.ai or via API!



GvldjKMXEAAAJ1Z.jpg
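
For anyone who wants to poke at the API from the pricing above, a minimal call sketch, assuming the platform exposes an OpenAI-compatible chat endpoint as its docs describe. The base URL and model id are placeholders to check against Moonshot's documentation.

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_MOONSHOT_API_KEY",
    base_url="https://api.moonshot.ai/v1",   # placeholder; confirm on the platform docs
)

resp = client.chat.completions.create(
    model="kimi-k2",                          # placeholder model id
    messages=[{"role": "user", "content": "Write a script that renames files by date."}],
    temperature=0.6,
)
print(resp.choices[0].message.content)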


2/11
@Kimi_Moonshot
Here are some vibe tests we ran:

1. Interactive 3D Mountain Scene



3/11
@Kimi_Moonshot
2. A ball bouncing in hexagon



4/11
@Kimi_Moonshot
3. Visual Analysis of Remote Work and Salary Trends

*Agent capabilities available via API. More tools coming soon on http://Kimi.ai



5/11
@Kimi_Moonshot
4. 3D Particle Galaxy Simulation



6/11
@Kimi_Moonshot
5. Coldplay 2025 Concert Trip Planner

*Agent capabilities available via API. More tools coming soon on http://Kimi.ai



7/11
@richardcsuwandi
This can't be a coincidence, right? @elonmusk



GvmDggEXMAARf5A.jpg


8/11
@Presstab_crypto
Is this cypher-alpha?



9/11
@philip_kiely
.@simonw Not the best pelican I've seen, not the worst. wdyt?



GvlfaP6aEAAYmp4.png


10/11
@alhasaniq
MuonClip is actually huge if it proves to scale to 1T params reliably.

Seems like a respectable step forward over AdamW



11/11
@iamfakhrealam
Congratulations to the team




 

bnew












1/11
@maviadam
🚀 This open-source AI video generator is INSANE!
🎴 Flood #MeiGenMultiTalk
MeiGen-MultiTalk – a game-changer for video creation!
and singing ...
🔥 Check out these mind-blowing examples :
1 - Toon Characters



https://video.twimg.com/amplify_video/1936845987126870017/vid/avc1/960x480/2PDUQ_GgzQrQIMZc.mp4

2/11
@maviadam
2- Live show Conversation



https://video.twimg.com/amplify_video/1936846203192332289/vid/avc1/896x448/5BnaHCVLSkq-I8_N.mp4

3/11
@maviadam
3- Couples



https://video.twimg.com/amplify_video/1936846487792631808/vid/avc1/896x448/3dk2sltcc5UZccNb.mp4

4/11
@maviadam
4-Sing



https://video.twimg.com/amplify_video/1936846845658947585/vid/avc1/704x576/BjyoJcytrQGajJ0K.mp4

5/11
@maviadam
5-Couple sing



https://video.twimg.com/amplify_video/1936846914181345281/vid/avc1/896x448/4vFc9Wu0i30kBLJN.mp4

6/11
@maviadam
6-



https://video.twimg.com/amplify_video/1936847449307406336/vid/avc1/1280x704/iQMXS0K_99xBDadp.mp4

7/11
@maviadam
7- instruction-following videos



https://video.twimg.com/amplify_video/1936847685631250432/vid/avc1/1280x704/jRP2U3FH2vZEHsHB.mp4

8/11
@maviadam
8-



https://video.twimg.com/amplify_video/1936847984500613120/vid/avc1/896x448/iZCMb5ATj4i2fo6p.mp4

9/11
@maviadam
9- Creative videos



https://video.twimg.com/amplify_video/1936848096752746496/vid/avc1/576x704/gD1kxZ5baVbY5Dhw.mp4

10/11
@maviadam
10-



https://video.twimg.com/amplify_video/1936848152637685760/vid/avc1/832x1088/LZ9QJYdppF6awamT.mp4

11/11
@maviadam
Project page: http://meigen-ai.github.io/multi-talk/

If you enjoyed this thread:

Follow me @maviadam for more! 🚀
RT the post below to share with your audience! 🔁








1/21
@angrypenguinPNG
Okay this is actually insane 🤯

A new lip sync method called MultiTalk just dropped and it's the best open-source quality I've ever seen.

Now works in ComfyUI!



https://video.twimg.com/amplify_video/1936543801717428224/vid/avc1/1280x704/ivga-i01Om_PDV13.mp4

2/21
@DreamStarter_1
For real?
I've seen Purz's stream and the results were not THIS good.



3/21
@angrypenguinPNG
i’ll be honest, i’ve yet to test locally on my own comp! this is from research paper :P

will check out Purz stream!



4/21
@1littlecoder
not yet on @FAL i guess? cc @jfischoff



5/21
@zsakib_
wait this isn't just the music video?



6/21
@CoffeeVectors
Ok now!!!



7/21
@codewithimanshu
Game-changer for AI-generated content.



8/21
@kiieford
the detail in the neck vein strain 👌😂 so good



9/21
@hey1tspriyanshu
how do you create long videos like 18 seconds with consistency. curious to know your workflow



10/21
@ttplanet
could you share how you set the prompt to move the camera? My video stays solid with the same background! Thanks



11/21
@Vijish68859437
Can @HeyGen_Official do this?
Opensource seems better.



12/21
@NewWaveAi2023
Have you got a link to the workflow?



13/21
@JoseMGalaZ
Can you share the model and the wf?



14/21
@paulbri1
ohhh shyt



15/21
@geoppls
What country is it from?



16/21
@SF5321916436736
A‘ūdhu billāhi min ash-shayṭāni r-rajīm (I seek refuge in Allah from the accursed Satan)



17/21
@nglprz
🔥🔥



18/21
@AgentFola
😰



19/21
@johndpope
we need the training code - Pull requests · MeiGen-AI/MultiTalk



20/21
@degengates
now I'm about to believe everything on TV for the last 20 years was all fake: AI politicians, deep state military stuff, MK Ultra... and they only now told us that it's all possible, but they actually knew it a long time ago already



21/21
@niushidazuan
3uR46fHUkJp8T7FxVk6xnM6a5m3zpPyi1d3QHeqCpump An excellent project with a truly great narrative that makes every trade meaningful. Join us and build it together.







1/1
@Norris29973102
Our New Open-Source Project: MeiGen-MultiTalk, multi-person conversational video generation
enables multi-person conversations 💬, singing 🎤, interaction control 👬, and cartoon creation 🙊.
🌐 MultiTalk
💻 GitHub - MeiGen-AI/MultiTalk: Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation
Code and model is coming~



https://video.twimg.com/amplify_video/1927995490899218434/vid/avc1/1250x704/g_4gED_Lbqyj7QfS.mp4





1/4
@Norris29973102
👏 MeiGen-MultiTalk: the code and checkpoints are released. Enjoy!
Github: GitHub - MeiGen-AI/MultiTalk: Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation
Hugging Face: MeiGen-AI/MeiGen-MultiTalk · Hugging Face



https://video.twimg.com/amplify_video/1932370627157192704/vid/avc1/720x940/IIFA-9m7AIA54vS3.mp4

2/4
@TomLikesRobots
Nice - some of these demos look great. Are you planning on a Comfy implementation?
The long video format looks interesting. It seems to use a sliding window with a fixed 3.2-second overlap and an audio cross-attention channel riding alongside.



https://video.twimg.com/amplify_video/1932451514762563584/vid/avc1/1250x704/NJZWUvTJZK7pBwii.mp4
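
A hedged sketch of the sliding-window stitching described in the reply above. The generate_chunk callable, the 5-second window, and the frame rate are assumptions; only the 3.2-second overlap comes from the observation.

def generate_long_video(generate_chunk, audio_frames, fps=25,
                        window_s=5.0, overlap_s=3.2):
    # Generate fixed-length chunks, conditioning each new chunk on the frames
    # that overlap with the previous one, then keep only the non-overlapping part.
    window, overlap = int(window_s * fps), int(overlap_s * fps)
    stride = window - overlap
    frames, prev_tail, start = [], None, 0
    while start < len(audio_frames):
        chunk = generate_chunk(audio_frames[start:start + window], init_frames=prev_tail)
        frames.extend(chunk if prev_tail is None else chunk[overlap:])
        prev_tail = chunk[-overlap:]
        start += stride
    return frames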

3/4
@OriZilbershtein
Hey Yong, how can one reach you? :D Thanks



4/4
@wangleineo
Good work! What kind of hardware does it require if I want to run this locally?







1/1
@SD_Tutorial
Generate your talking avatar with:
Multitalk+ Wan2.1 🧐👇

1. Setup Kijai's Wan setup:
Wan 2.1: Install & Generate Videos/Images locally with lower VRAM

2. Download Kijai's released model:
WanVideo_2_1_Multitalk_14B_fp8_e4m3fn.safetensors · Kijai/WanVideo_comfy at main







1/1
@kurtqian
multitalk text to speech, which is built on the eSpeak NG TTS synthesizer



Gu51NC_XUAAO249.jpg






1/1
@mix_buchi_
Here is lipsync done with Multitalk

Anime style
Flux.1 -> Wan2.1 I2V with multitalk

#AIart
#FLUX1
#AImovie
#Wan21
#multitalk
#ComfyUI



https://video.twimg.com/amplify_video/1945439821503279105/vid/avc1/480x640/aJ1SnbIHy30P2c_e.mp4








1/1
@creator_kachun
Research - gen A.I. character singing + pose control with WanVideo MultiTalk + VACE

Song: 許美靜 - 傾城



https://video.twimg.com/amplify_video/1945708528817356800/vid/avc1/720x480/ooPdwKgzCZQW8_Ys.mp4






1/8
@fffiloni
Meigen MultiTalk @gradio demo is available on @huggingface 🤗

Duplicate on L40S for personal and unlimited inference, enjoy!

*Compatible with multi-GPU too 😉



GuSz_AqWgAAowlo.jpg


2/8
@fffiloni
Meigen MultiTalk HF Space: https://huggingface.co/spaces/fffiloni/Meigen-MultiTalk



3/8
@codewithimanshu
Personal, unlimited inference on L40S? Impressive!



4/8
@naundob
How long are these examples supposed to take? I’m over 12min on A100 and still no result.



5/8
@zaesarius
Oh



6/8
@theinsiderzclub
Looks interesting

Curious to see how MultiTalk performs



7/8
@LouisUn1t
Important context re: the agentic shift: AMD + mimik's new edge AI fabric could enable more distributed LLM running. What will hyper-local model execution impact most?



8/8
@Mind_0S
Exciting progress, but how do we extend these agentic capabilities to disconnected, local, or privacy-first contexts beyond centralized infrastructure?








1/3
@mesmerlord
being a shorts slop merchant is expensive.... and most of the cost is from the CapCut sub ($29.99/mo from next week) I had to pay for. the video itself only cost me like $0.30 in FAL API creds and my own RunPod pipeline for multitalk



https://video.twimg.com/amplify_video/1946117532294782976/vid/avc1/720x1280/VtESOxE1BMsWmHQP.mp4

2/3
@EsotericCofe
Have you tried Hedra too? Just curious



3/3
@mesmerlord
wayyy back in the day when they were not that good, haven't gone back since maybe they're better than multitalk now?

I just like the option of making all these on my own site atm lol, costs me like 0.02-0.03 per hook vid and I make like 10 and choose best 3 or so



GwIBBr6WUAA4Zol.jpg



 

bnew



1/2
@maxinnerly
A serious leap just happened in AI video production:
MultiTalk, a new open-source model by MeiGen-AI, can generate multi-speaker, lip-synced, animated dialogues with stunning precision. Think crypto Twitter threads turned into ultra-clean, deepfake panels, in 15 seconds flat.

And the kicker? It works with just a 4090 GPU, supports up to 15s videos, and syncs multiple voices to multiple faces… without needing post-editing.


This isn’t just a tech demo. It’s a direct threat to expensive video shoots, particularly in the crypto marketing industry.

•Launch trailer for your L2? Skip the studio.
•Partner spotlight? Render it with AI faces.
•AMA clip in 7 languages? Done by lunch.
•Explainer for DeFi flows? Generate it with multilingual characters.

MultiTalk solves the last pain point in video generation: accurate, cheap, scalable lip-sync for dialogued content.

The real innovation?
> Label Rotary Position Embedding (L-RoPE) = tracks voice-to-face identity with ridiculous accuracy (see the sketch after this list)
> Built for multi-person scenes (unlike Hedra or Runway)
> Works with animated or real faces
> Free on Hugging Face, integrated with ComfyUI
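
A loose guess at what a label-based rotary embedding could look like in code; the per-speaker offset scheme, sizes, and function below are illustrative assumptions, not MultiTalk's actual implementation.

import torch

def label_rope_angles(positions, speaker_id, dim, label_offset=1000, base=10000.0):
    # Standard RoPE rotation angles, shifted by a per-speaker label offset so that
    # tokens from speaker i and the face region assigned to speaker i share the
    # same offset and bind to each other in audio cross-attention.
    pos = positions.float() + speaker_id * label_offset
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return torch.outer(pos, inv_freq)   # (seq_len, dim // 2) rotation angles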

Crypto teams should immediately test use cases across:

- Token utility demos (animated personas)
- Announcements (subtitled in 5+ languages)
- Founder explainers (using AI-generated avatars)
- Community-driven storytelling (UGC turned into visual panels)

Combine with tools like Topaz, Real-ESRGAN, or Luma AI to upscale to 4K and control camera motion.



https://video.twimg.com/amplify_video/1944080317079597060/vid/avc1/850x570/_cgAE7tXmoKYkvsp.mp4

2/2
@maxinnerly
try it out for free: MeiGen-AI/MeiGen-MultiTalk · Hugging Face




 

bnew

Sama tweet on gold medal performance, also says GPT-5 soon



Posted on Sat Jul 19 14:10:21 2025 UTC

wp0p50af9udf1.jpg

gh4joz9f9udf1.jpg



OpenAI researcher confirms IMO gold was achieved with pure language based reasoning



Posted on Sat Jul 19 10:51:18 2025 UTC

q7t7vtqw9tdf1.png



[Discussion] What are the new techniques he's talking about?



Posted on Sat Jul 19 12:55:55 2025 UTC

ptnpm2nyvtdf1.png




Commented on Sat Jul 19 14:02:03 2025 UTC

Let's assume OpenAI employees are being forthcoming.

Jerry Tworek: all natural language proofs, no evaluation harness, little IMO-specific work, same RL system as agent/coder

Alexander Wei: no tools or internet, ~100 mins thinking, going beyond "clear-cut, verifiable rewards," general-purpose RL + test-time compute scaling

Sheryl Hsu: no tools like lean or coding, completed the competition in 4.5 hours, the models tests different strategies/hypotheses and makes observations

What they're saying is that they've gone beyond RLVR. Which is pretty wild. With RLVR, you only get reward feedback after completing an entire task. The signal is faint. It sounds like they've figured out how to let the model reward itself for making progress by referencing an internal model of the task. Makes sense? Let the model make competing predictions about how things will unfold, and it can use these to anchor its reasoning.


│ Commented on Sat Jul 19 16:04:06 2025 UTC

│ Noam and others have said RL for unverifiable rewards.

│ We know this is what they did. We know it's a big deal. Like that paradigm scales up to writing great novels and doing hours of low context work (as we saw in coding competition this week).

│ We don't know what was actually done to make that paradigm work, but this is a good guess 👍


Commented on Sat Jul 19 13:17:37 2025 UTC

Since it seems DeepMind also has gold, their inevitable blogpost could give us some pointers.

Though from previous history, it always feels like the super impressive math results don't necessarily translate to other areas' capabilities just as well, so their new techniques could be very tailored to math-oriented CoT, I have no idea.

Tackling the IMO specifically was already a well-known challenge being optimized for (I assume through math formalizers), so we'll need a lot more technical detail from them to know how actually "general" their general LLM is here. (EDIT: Jerry Tworek (@MillionInt) says they trained general models rather than optimizing specifically for the IMO: https://xcancel.com/MillionInt/status/1946551400365994077. Really impressive, damn. It's possible their new techniques still suit formal math proofs better than anything else, since that's been a highly valued research area since 2023, but the fact that the model is actually a general reasoning LLM is seriously impressive.)

From what Noam said though it's definitely related to TTC.


GPT 5 won't get Gold IMO capabilities



Posted on Sat Jul 19 15:30:38 2025 UTC

78wg6wqonudf1.png




With the new OpenAI thinking model, the order of magnitude of thinking time is now in the range of a standard work-day.



Posted on Sat Jul 19 08:47:36 2025 UTC


2xz0p6qtnsdf1.jpg

eqaa7c2unsdf1.jpg



Post too soon, and you publish a fossil



Posted on Sat Jul 19 09:45:08 2025 UTC

983x6fu3ysdf1.jpeg



I am feeling extremely anxious over the chatgpt Math olympiad results, what exactly are humans supposed to do now?



Posted on Sat Jul 19 11:18:18 2025 UTC

/r/singularity/comments/1m3tras/i_am_feeling_extremely_anxious_over_the_chatgpt/

I loved to learn new things, and from a personal perspective, always wanted myself to be smarter than my previous self.
I loved math and physics.

Now I feel, all that is in vain, as this LLM is going to do what I want to do, and do it even better.
The other day I was making a 3 body problem visualiser for half a day. But some guy on twitter one-shotted a black hole visualiser using Grok Heavy.

I liked doing the "intellectually heavy" tasks. Now? I feel an LLM will defeat me in this. If not today, then 2 years from now. What exactly am I supposed to do? Art? Gone. Music? Gone. Programming, my passion? Gone. Math and Physics? Going soon. The only thing left to do is be a company founder of sorts, forming just the problem statement and using these tools to solve problems. But I wanted to be the problem solver.

Edit: Art, music and other fun things may still be relevant. But when it's about pushing the boundaries of humanity, I feel humans will no longer be needed.

Sam Altman on the model



Posted on Sat Jul 19 14:20:37 2025 UTC

mww4r7h6budf1.png



He is starting to believe



Posted on Sat Jul 19 15:35:48 2025 UTC

fyemq6sloudf1.png




GPT-5 reasoning alpha





OpenAI achieved IMO gold with an experimental reasoning model; they will also be releasing GPT-5 soon



Posted on Sat Jul 19 08:08:40 2025 UTC

bb0ej5ejgsdf1.png

ufs0jdemgsdf1.png

ji63dgqngsdf1.png

5jzw7bgpgsdf1.png


 