bnew

Veteran
Joined
Nov 1, 2015
Messages
68,642
Reputation
10,572
Daps
185,450

1/2
@rohanpaul_ai
This survey maps how prompts turn generic AI models into reliable medical imaging helpers.

It finds that small prompt tweaks often beat full retraining for generation, segmentation, and diagnosis.

LLM based vision tools struggle when patient data shift across scanners and hospitals.

Prompting inserts extra context so the same backbone stays useful.

Text prompts supply disease terms or whole reports, box or point prompts mark regions, and learnable vectors quietly store task hints.

With those inputs models hit up to 20% better accuracy while touching under 1% of their weights.
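
As a rough illustration of the "learnable vectors" idea, the sketch below is generic PyTorch prompt tuning, assuming a frozen `backbone` that accepts embedded tokens; it is not taken from any model in the survey, but it shows why only a tiny fraction of weights ever trains.

```python
import torch
import torch.nn as nn

class SoftPromptWrapper(nn.Module):
    """Prepend a few learnable prompt vectors to a frozen model's inputs."""
    def __init__(self, backbone: nn.Module, embed_dim: int = 768, n_prompt_tokens: int = 16):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False                     # backbone stays frozen
        # The only trainable parameters: 16 x 768 ~= 12K values, far under 1% of the model
        self.prompt = nn.Parameter(torch.randn(n_prompt_tokens, embed_dim) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq_len, embed_dim) from the frozen embedder
        batch = token_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return self.backbone(torch.cat([prompt, token_embeds], dim=1))
```

Training then touches only `self.prompt`, which is how small prompt tweaks can move accuracy without retraining the backbone.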

Prompt brittleness, missing clinical labels, and slow hospital hardware still block everyday use, so automated and lighter prompts are the next focus.

----

Paper – arxiv.org/abs/2507.01055

Paper Title: "Prompt Mechanisms in Medical Imaging: A Comprehensive Survey"



GvO9o4OW0AAMZe2.jpg


2/2
@PaperWizardAI
Lots of alpha in context management.




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,642
Reputation
10,572
Daps
185,450
Grok-4 benchmarks


Posted on Thu Jul 10 04:35:39 2025 UTC

iocr67kn6zbf1.png





Commented on Thu Jul 10 04:42:03 2025 UTC

They include Gemini DeepThink on USAMO25 but not on LCB because Google's reported result was 80.4%, higher than even Grok 4 Heavy.

Every company doing this shyt.


│ Commented on Thu Jul 10 05:40:21 2025 UTC

│ Not as blatantly though. Others would have left that model out entirely, instead of only including it on the benchmarks where it made them look good and making it painfully obvious what sort of bullshyt they're pulling.

│ If you're going to take a shyt on my floor, you don't have to also rub my nose in it.


Commented on Thu Jul 10 04:37:15 2025 UTC

AIME: saturated ✅
Next stop: HLE!


│ Commented on Thu Jul 10 05:19:28 2025 UTC

│ AIME being saturated isn't really interesting, unfortunately. We saw AIME24 get saturated several months after the test because the answers had contaminated the training set. AIME25, which was held in February, was already somewhat contaminated, and we're beginning to see the same thing happen with it.

Dimitris Papailiopoulos (@DimitrisPapail) | https://nitter.poast.org/DimitrisPapail/status/1888325914603516214 | https://xcancel.com/DimitrisPapail/status/1888325914603516214


│ │
│ │
│ │ Commented on Thu Jul 10 05:59:21 2025 UTC
│ │
│ │ In that case, why didn't other LLMs perform as well when they had access to the same training data? Llama 4 did poorly on AIME24 despite having access to it during training
│ │

│ │ │
│ │ │
│ │ │ Commented on Thu Jul 10 08:35:10 2025 UTC
│ │ │
│ │ │ Some take much better care to clean up training data and at least attempt to remove benchmark info from it
│ │ │
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,642
Reputation
10,572
Daps
185,450

1/6
@rohanpaul_ai
This is a really cool open-source project from /firecrawl_dev

Turn a simple list of emails into a rich dataset with company profiles, funding data, tech stacks, and more.

It chains small, specialized agents, feeds them shared context, and lets them stitch the answers together.

Behind the scenes, each agent is a specialized module with its own expertise, search strategies, and type-safe output schema.

Orchestrated by /Grok 4 and powered by /firecrawl_dev

https://video.twimg.com/amplify_video/1943381405200723968/vid/avc1/1280x720/uyHkpvqlPDnx5UGH.mp4
https://video.twimg.com/amplify_video/1943351303339671552/vid/avc1/1280x720/mKbhz4z_AFk9DKNs.mp4

2/6
@rohanpaul_ai
GitHub - mendableai/grok-4-fire-enrich

3/6
@rohanpaul_ai
How Each Agent Works

Every agent outputs through a strict Zod schema, which means the orchestrator can merge results without surprises. Adding a new field is a one-line schema tweak and a small search routine, no risky prompt surgery.
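
The repo does this with Zod schemas in TypeScript; as a rough Python analogue (Pydantic is my substitution here, and all field names are invented), the merge-without-surprises pattern looks something like this:

```python
from pydantic import BaseModel

class FundingAgentOutput(BaseModel):
    company: str
    total_raised_usd: float | None = None
    last_round: str | None = None
    # Adding a new field really is a one-line tweak, e.g.:
    # valuation_usd: float | None = None

class TechStackAgentOutput(BaseModel):
    company: str
    technologies: list[str] = []

def merge(results: list[BaseModel]) -> dict:
    """Each result has already passed validation, so merging is mechanical."""
    merged: dict = {}
    for r in results:
        merged.update(r.model_dump(exclude_none=True))
    return merged

row = merge([
    FundingAgentOutput(company="Acme", total_raised_usd=12_000_000, last_round="Series A"),
    TechStackAgentOutput(company="Acme", technologies=["Next.js", "Postgres"]),
])
```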

GvhISaPakAI-6dw.jpg


4/6
@rohanpaul_ai
Architecture Overview

GvhIIJWWoAA-S-o.png


5/6
@0xJenWeb3
Awesome, this looks like a super useful tool! 👍

6/6
@rohanpaul_ai
👍👍


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,642
Reputation
10,572
Daps
185,450

1/2
@rohanpaul_ai
💻 Mistral launches Devstral Small 1.1 (53.6% SWE-Bench) and Devstral Medium (61.6%), tuned for coding agents.

Benchmark Performance: Devstral Small 1.1 achieves a score of 53.6% on SWE-Bench Verified, and sets a new SOTA for open models without test-time scaling.

🔒 Token prices sit between $0.1 and $2 per 1M, with Medium roughly 75% below GPT4-class rates.

Small keeps its 24B backbone but trains on fresher public repos and applies tighter evaluation-data filtering, which nudges real bug-fix success upward.

Both models understand native function calls and straight XML, letting agent toolchains pass structured tasks without extra parsing glue.
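
To make "structured tasks without extra parsing glue" concrete, here is a provider-agnostic sketch: a JSON-schema tool declaration and the kind of structured call a function-calling model can emit. The `run_tests` tool and the exact dictionary shapes are illustrative, not Mistral's actual API.

```python
import json

# A tool the agent framework exposes to the model (OpenAI-style JSON schema shape).
tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the project's test suite and return failing cases",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

# Instead of prose that needs regex parsing, the model returns a structured call:
model_output = '{"name": "run_tests", "arguments": {"path": "tests/"}}'
call = json.loads(model_output)
assert call["name"] == "run_tests"       # dispatch directly, no parsing glue
```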

Availability: The Apache 2.0 license on Small removes legal barriers for commercial forks and private finetunes. Available on /huggingface .

Devstral Medium is available on Mistral Code for enterprise customers and on their finetuning API.

GvhA20yXIAAPdx5.jpg

GvgKWiXWkAAdJI0.jpg


2/2
@FMackenzie7
Nice…


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,642
Reputation
10,572
Daps
185,450

1/2
@rohanpaul_ai
Agents stumble on long jobs because they ignore lessons from past runs.

AGENT KB shares those lessons and boosts success up to 19 points on tough benchmarks.

Without it, each agent starts cold and repeats old mistakes.

The system stores two pieces: high level workflows and tiny step fixes.

A student agent grabs a workflow, tries it, then shows the trace to a teacher agent.

The teacher finds matching fixes, flags errors, and guides an improved second attempt.

Everything is plain JSON, so any framework can plug in.
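
The thread does not show the actual schema, so the records below are hypothetical, but they illustrate the two kinds of entries described, a high-level workflow and a tiny step fix, stored as plain JSON-serializable data:

```python
import json

workflow_entry = {
    "type": "workflow",
    "task_family": "web_research",
    "steps": ["decompose the question", "search", "cross-check sources", "draft the answer"],
}

step_fix_entry = {
    "type": "step_fix",
    "trigger": "HTTP 403 while scraping a source",
    "fix": "fall back to the site's public API and retry with backoff",
}

# Plain JSON, so any agent framework can store and retrieve these records.
print(json.dumps([workflow_entry, step_fix_entry], indent=2))
```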

Results: GPT‑4.1 jumps from 55% to 74% on GAIA, Claude‑3.7 repairs 51% of SWE‑bench bugs.

Sharing distilled experience makes agents smarter, faster, and more consistent.

----

Paper – arxiv.org/abs/2507.06229

Paper Title: "Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving"

GveggJEakAUKIVD.png


2/2
@ApollonVisual
those improvements are nothing to scoff at. really impressive "Claude-3.7 with Agent KB increased performance from 38.46% to 57.69%, while GPT-4.1 showed similar improvements on intermediate tasks (53.49% to 73.26%). For SWE-bench code repair tasks, our system significantly improved resolution rates, with Claude-3.7 achieving a 12.0 percentage point gain (41.33% to 53.33%). " Also nice explanation on the critical flaws of agents included in the paper

GvfERZYXAAABVWL.jpg



To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,642
Reputation
10,572
Daps
185,450

1/3
@rohanpaul_ai
Robotics just shed its price tag.

🤖 /huggingface teams with Pollen Robotics to launch Reachy Mini, a 28 cm desktop robot priced at $299 and ready for vision, speech, and text models.

💡 Builders can instantly run 15+ preset behaviors or upload new ones through the community hub.

Reachy Mini uses a Raspberry Pi 5 brain in the wireless model, giving onboard computing, wifi, and battery power.

Shared behaviors live on the Hugging Face hub, so every new skill someone publishes becomes a drag-and-drop upgrade.

https://video.twimg.com/amplify_video/1942881808401522688/vid/avc1/3240x2160/ckvgi64omI91FOXW.mp4

2/3
@TeksEdge
Does anyone know the sales pitch? It's using a Raspberry Pi, so models must be hosted in the cloud. Which models is it using? I was waiting for someone to build an R2D2 but this one needs the Internet and doesn't move.

3/3
@JeffBohren
"Robot" is a bit of a stretch here. I mean it's cool and all, but I think "Bouncy Web Cam" is a better description than "Robot".


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,642
Reputation
10,572
Daps
185,450

1/16
@rohanpaul_ai
Brilliant Memory framework proposed in this paper.

MemOS makes remembering a first‑class system call.

LLMs forget stuff fast and retraining them costs a fortune.

MemOS treats memories like files in an operating system, letting a model write, move, and retire knowledge on the fly, not just while training.

It packs every fact or state into a MemCube, tags it with who wrote it and when, then the scheduler moves that cube between plain text, GPU cache, or tiny weight patches depending on use.

On the LOCOMO benchmark the system reaches 73.31 LLM-Judge average, roughly 9 points above the next best memory system and it stays ahead on hard multi-hop and temporal questions.

Even while juggling about 1500 memory tokens, it matches full-context accuracy yet keeps latency in the same ballpark as slimmer baselines.

Switching hot cubes into KV-cache cuts first-token wait by 91.4% on the Qwen2.5-72B test without changing any output text.

Overall, the findings show that a memory-as-OS approach boosts reasoning quality, trims latency, and bakes in audit and version control all at once.

🧵 Read on 👇

Gvdp0nyakAUfeG7.png


2/16
@rohanpaul_ai
🧠 Why memory got messy

Most models squeeze everything into billions of frozen weights, so updating even 1 fact needs a full fine‑tune.

Context windows help for a moment, yet they vanish after the next prompt, and retrieval pipelines bolt on extra text without tracking versions or ownership.

Figure 1 on page 2 shows MemOS beating older fixes across single‑hop, multi‑hop, open‑domain, and temporal questions, which hints that raw parameter tweaks or plain RAG were never enough.

Gvdu-C-WAAAE1rV.png


3/16
@rohanpaul_ai
📦 What a MemCube holds
A MemCube wraps the actual memory plus metadata like owner, timestamp, priority, and access rules.

That wrapper works for 3 shapes of memory: plaintext snippets, activation tensors sitting in the KV‑cache, and low‑rank parameter patches.

Because every cube logs who touched it and why, the scheduler can bump hot cubes into GPU cache or chill cold ones in archival storage without losing the audit trail.
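
A minimal sketch of what that wrapper could look like as a data structure; the field names are guesses for illustration, not the paper's actual MemCube class:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Literal

@dataclass
class MemCube:
    """Payload plus the governance metadata that travels with it."""
    payload: Any                                    # text snippet, KV tensors, or a small weight patch
    kind: Literal["plaintext", "activation", "parameter_patch"]
    owner: str
    created_at: datetime
    priority: int = 0
    readers: set[str] = field(default_factory=set)  # simple access rule
    audit_log: list[str] = field(default_factory=list)

    def touch(self, actor: str, action: str) -> None:
        # every move between GPU cache and archival storage leaves a trail
        stamp = datetime.now(timezone.utc).isoformat()
        self.audit_log.append(f"{stamp} {actor} {action}")
```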

GvdvHW9akAIrCqb.png


4/16
@rohanpaul_ai
🏗️ Three layers doing the heavy lifting

The interface layer turns a user’s chat into structured MemoryAPI calls, so a question about “last year’s check‑ups” becomes a time‑scoped query.

The operation layer runs MemScheduler, MemOperator, and MemLifecycle to pick cubes, fuse overlaps, and mark those cubes as activated, merged, or archived.

The infrastructure layer guards cubes with MemGovernance, ships them through MemLoader / MemDumper, and parks them in MemVault, which can be a vector store, graph DB, or blob bucket.
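
A toy version of that first interface-layer step, turning "last year's check-ups" into a time-scoped structured query; the class and field names are invented for illustration, not the real MemoryAPI:

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class MemoryQuery:
    user_id: str
    topic: str
    time_from: date
    time_to: date

def parse_chat(user_id: str, utterance: str) -> MemoryQuery:
    # A real system would use the LLM itself to parse; this hard-codes the example.
    today = date.today()
    if "last year" in utterance:
        return MemoryQuery(user_id, "check-ups", today - timedelta(days=365), today)
    return MemoryQuery(user_id, utterance, date.min, today)

query = parse_chat("patient_42", "show me last year's check-ups")
```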

GvdvMgoWgAE3qqm.png


5/16
@rohanpaul_ai
🔄 Scheduler keeps memories fresh

MemScheduler decides which cube lands where.

High‑hit plaintext converts into activation tensors for instant reuse, and stable activation patterns finally distill into parameter patches for zero prompt overhead.

Old cubes slide the other way, turning pricey weights into cheap text once they stop earning hits.
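
A toy promotion/demotion policy in the spirit of that scheduler; the thresholds and state names below are invented for illustration, not taken from the paper.

```python
from datetime import datetime, timedelta
from typing import NamedTuple

class CubeStats(NamedTuple):
    kind: str            # "plaintext", "kv_cache", or "weight_patch"
    hits_per_day: float
    last_hit: datetime

def schedule(cube: CubeStats, now: datetime) -> str:
    """Decide where a memory cube should live next, based on how hard it is hit."""
    if cube.kind == "plaintext" and cube.hits_per_day > 50:
        return "promote_to_kv_cache"          # hot text: pre-encode for instant reuse
    if cube.kind == "kv_cache" and cube.hits_per_day > 500:
        return "distill_to_weight_patch"      # stable pattern: bake into a small patch
    if now - cube.last_hit > timedelta(days=30):
        return "demote_to_plaintext_archive"  # cold: keep it cheap and auditable
    return "keep"
```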

GvdvUiXaIAAd9_o.png


6/16
@rohanpaul_ai
📊 Numbers that prove the point

On the LOCOMO benchmark MemOS posts an LLM‑Judge score of 73.31, topping the next best by roughly 9 points while holding a similar latency budget.

The bar chart on page 2 shows especially wide gaps in multi‑hop and temporal reasoning, areas that crumble when context slips.

GvdvZoBXMAA0zoc.png


7/16
@rohanpaul_ai
⚡ KV tricks to cut wait time

MemScheduler pre‑bakes popular cubes into KV‑cache entries so the model skips encoder work.

For the Qwen2.5‑72B test, first‑token latency drops from 1.79 s to 0.15 s, a 91% cut, and the output text stays byte‑for‑byte identical.
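
The mechanism being exploited here is ordinary prefix KV caching; a rough Hugging Face transformers sketch follows (this is not MemOS code, and a small Qwen checkpoint stands in for the 72B model in the paper):

```python
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-0.5B"                     # small stand-in for Qwen2.5-72B
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

hot_cube = "Patient history: annual check-ups 2023-2024, penicillin allergy."
prefix_ids = tok(hot_cube, return_tensors="pt").input_ids
with torch.no_grad():
    cached = model(prefix_ids, use_cache=True).past_key_values   # encoded once, kept warm

def next_token_logits(question: str) -> torch.Tensor:
    q_ids = tok(question, return_tensors="pt").input_ids
    with torch.no_grad():
        # copy so the shared prefix cache is not mutated by this request
        out = model(q_ids, past_key_values=copy.deepcopy(cached), use_cache=True)
    return out.logits[:, -1, :]                # prefix tokens were never re-encoded
```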

GvdvertX0AE2CX3.png


8/16
@rohanpaul_ai
Paper – MemOS: A Memory OS for AI System

Paper Title: "MemOS: A Memory OS for AI System"

9/16
@lux
just wanted to say thanks so much for posting all these great papers, so many of which I would miss otherwise!

10/16
@rohanpaul_ai
thanks buddy for the kind words, and great to know that. 👊👊

11/16
@innerly_ai
memories as files? yeah that's a wild trip into our own brains. imagine the mess we’re holding onto

12/16
@anthara_ai
Impressive approach to memory management. Deep impact on reasoning quality and latency.

13/16
@St33lMouse
Memory is the key. Crack the problem and we get EVERYTHING. Individuality and personality. Continuous learning. Instant expertise. Personalized AI.

I could imagine we have smaller models with reasoning at the core and interchangeable memory modules at the periphery.

Combine this with piecemeal processing of large context windows and we might get small models running on consumer hardware capable of learning, solving large problems, and differentiating themselves based on what they're experiencing.

Memory modules could be excluded to make the system manageable. Don't need an AI versed in complex number theory and biology? Don't include those modules. Suddenly need an expert doctor? Add the medical module.

14/16
@_vonarchimboldi
/Samhanknr /twst12612648

15/16
@Trakintelai
MemOS is a smart leap, treating LLM memory like an OS manages files cuts retraining costs and boosts efficiency.

16/16
@tooliense
wow amazing memory, does the model preserve its capability on other benchmarks?

Thanks for introducing interesting works


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,642
Reputation
10,572
Daps
185,450

1/2
@rohanpaul_ai
🩺 Google Research releases MedGemma 27B, multimodal health-AI models that run on 1 GPU

MedGemma 27B multimodal extends the earlier 4B multimodal and 27B text-only models by adding vision capabilities to a 27B-parameter language core.

Training added 2 new datasets, EHRQA and Chest ImaGenome, so the model can read longitudinal electronic health records and localize anatomy in chest X-rays.

The report states that this larger multimodal variant inherits every skill of the 4B model while markedly improving language fluency, EHR reasoning, and visual grounding.

The 4B variant clocks 64.4% on MedQA and produces chest X-ray reports that radiologists validated 81% of the time, while the 27B text model scores 87.7% at about 10% of DeepSeek R1's cost.

MedGemma fuses a Gemma-3 language core with the MedSigLIP vision encoder, letting one network reason across scans and notes. MedSigLIP unifies radiology, dermatology, and retina images in one shared embedding space.

Because MedSigLIP is released separately, developers can plug it into classification, retrieval, or search pipelines that need structured outputs, while reserving MedGemma for free-text generation such as report writing or visual question answering.
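
As a sketch of the "plug the encoder into a retrieval pipeline" idea: embed images once, then rank by cosine similarity. A public SigLIP checkpoint is used as a stand-in below (a SigLIP-compatible MedSigLIP checkpoint would slot into the same code path), and the file names are hypothetical.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

ckpt = "google/siglip-base-patch16-224"          # stand-in vision encoder
processor = AutoProcessor.from_pretrained(ckpt)
model = AutoModel.from_pretrained(ckpt)

def embed(paths: list[str]) -> torch.Tensor:
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

index = embed(["cxr_001.png", "cxr_002.png"])    # prior studies
query = embed(["new_case.png"])                  # incoming scan
scores = query @ index.T                         # cosine similarity
print(scores.argmax(dim=-1))                     # nearest prior study
```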

Both models load on a single GPU, and the 4B versions even run on mobile-class hardware, which lowers cost and eases on-premise deployment where data privacy is critical.

Simple fine-tuning lifts the 4B chest-X-ray RadGraph F1 to 30.3, proving headroom for domain tweaks

Because weights are frozen and local, hospitals gain privacy, reproducibility, and full control compared with remote APIs.

Gvc2YRYXoAIRXYW.jpg

GvbzLidW8AA5w2Q.jpg


2/2
@rohanpaul_ai
The picture sorts the data first. On top you see 4 imaging streams—radiology, dermatology, digital pathology, ophthalmology—and 1 medical-text stream. Each arrow shows how those sources feed the rest of the stack.

The images go through MedSigLIP, a vision encoder that turns each scan or photo into a compact vector the language models can read.

Those vectors flow into MedGemma 4B Multimodal, a 4B-parameter model that handles both pictures and words in a single forward pass.

For text-only work there is a larger 27B-parameter MedGemma model that skips the image part and focuses on language reasoning.

Gvc7PzLakAAyoiY.png



To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,642
Reputation
10,572
Daps
185,450

1/12
@rohanpaul_ai
I'm playing with Kimi-Researcher from /Kimi_Moonshot , and it's delivering an unexpectedly excellent report! 🧠

It's a multi-turn search and reasoning AI agent from China competing head-to-head with OpenAI's Deep Research. And it's free.

🌐 Checks more than 200 URLs for each task.

⏰ Context-aware, very long-horizon reasoning

🛠️ Runs on an internal Kimi k-series backbone

🎮 Learns entirely through end-to-end agentic RL

🧠 Averages about 23 reasoning steps per query

Overall, what I really like is that it turns it all into a clean, visual report that actually makes sense. Smart, solid, reliable.

It has proven that an LLM can train itself with reinforcement learning to plan, search, and code, reaching 26.9% pass@1 on Humanity’s Last Exam after starting at 8.6%, beating many supervised‑finetuned models.

Benchmarks:

🏆 Achieves 26.9% pass@1 on Humanity’s Last Exam, top of the board

📈 Scores 69% pass@1 on xbench-DeepSearch, edging past o3 with tools

🔍 Delivers solid results on FRAMES, Seal-0, and SimpleQA

Key takeaway

🔥 Shows that self-rewarded training can mature planning, search, and coding in one loop

📚 High‑quality agent datasets are rare, so the team generated their own.

They built tool‑centric challenges that force real tool use and hard reasoning prompts that need iterative search. An automated pipeline synthesized question‑answer pairs, verified ground truth, and filtered out trivial or noisy examples to scale data without manual labeling.

🏗️ Context spills were a major pain point. A learned memory policy keeps only useful snippets and discards the rest, letting a single conversation run 50+ turns without hitting context limits.

🎯 Training stays on‑policy. Tool‑call format guards are switched off, so every trajectory truly reflects model probabilities.

Some negative samples are dropped to prevent probability collapse, and the agent keeps improving for longer runs.

What that means is that if the trainer kept every badly scored run, the learning rule could shove the odds of many actions all the way to 0, and the model would stop exploring. To avoid that freeze, the pipeline drops a slice of the worst runs. The model still sees plenty of errors, but not enough to wipe out whole branches of its search space.

These tweaks keep the feedback loop stable across long tasks, so the agent keeps improving even when a single job takes dozens of steps.
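
A toy version of that filtering step (not Kimi's actual pipeline, which is not public): keep every positive trajectory, but drop a fraction of the worst-scored negatives before the policy update.

```python
import random

def select_rollouts(rollouts: list[dict], drop_frac: float = 0.25) -> list[dict]:
    """Keep positives; discard a slice of the very worst negatives so the update
    cannot push whole branches of the search space toward 0 probability."""
    positives = [r for r in rollouts if r["reward"] > 0]
    negatives = sorted((r for r in rollouts if r["reward"] <= 0), key=lambda r: r["reward"])
    kept_negatives = negatives[int(len(negatives) * drop_frac):]   # drop the worst slice
    batch = positives + kept_negatives
    random.shuffle(batch)
    return batch

batch = select_rollouts([{"reward": 1.0}, {"reward": -0.2}, {"reward": -0.9}, {"reward": -1.0}])
```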

🧵 1/n Read on

GvbIB2MboAEQ4gG.jpg


2/12
@rohanpaul_ai
🧵 2/n Link to try: Kimi - "an AI assistant that can reason, analyze, and think deeply"

Running the agent is straightforward. A user opens Kimi, logs in, and toggles the “Researcher” mode

The user enters a research query, answers any clarifying follow-up questions, and the agent works for roughly 20-25 minutes.

When the report appears, the text can be copied directly or shared via a link; a download button is not yet available. All usage is free during the public preview, and no hard quota has been announced.

– 200+ URLs
– inline citations
– tool-calling (search + browser + code)

https://video.twimg.com/amplify_video/1942960617314521088/vid/avc1/1920x1080/V9AbUXDkTtEUPYrX.mp4

3/12
@rohanpaul_ai
🧵 3/n

Kimi‑Researcher proves that an agent can learn planning, perception, and precise tool use in one loop.

🌟 After RL, the model averages 23 reasoning steps and checks about 200 URLs per task, reaches 69% pass@1 on xbench‑DeepSearch, and shows habits like cross‑verifying conflicting sources before answering.

GvbIoYiWoAA4BFL.jpg


4/12
@rohanpaul_ai
🧵 4/n

Kimi-Researcher relies on 3 built-in tools: a fast parallel search engine, a text-only browser for interactive sites, and a coding sandbox for data wrangling or analysis.

Together they let the model fetch evidence, run code, and compose structured reports that mix prose, tables, and simple charts inside an interactive page.

GvbJ5rLXsAA6EpB.jpg


5/12
@rohanpaul_ai
🧵 5/n Performance on key research benchmarks

Kimi hits:

– 26.9% pass@1 on Humanity’s Last Exam (vs OpenAI’s ~8%)
– 69% pass@1 on xBench DeepSearch
– Top scores on Seal-0, Frames, SimpleQA
And yes.. it’s all done by one model. No agents and tricks.

GvbJ9jObEAAf864.jpg


6/12
@rohanpaul_ai
🧵 6/n Here are some creative and useful examples of running deep research tasks with it.

Prompt - “Provide an in-depth and comprehensive analysis of the Golden State Warriors’ salary cap situation for the 2025–2026 NBA season. This should include detailed projections of guaranteed contracts, player options, potential free agents, and dead money on the books. Evaluate the team’s flexibility regarding potential trades, including possible targets, movable contracts, and draft assets. Break down the composition of the roster in terms of strengths, weaknesses, positional depth, age profile, and development potential of young players. Finally, assess the realistic probability of the Warriors mounting a successful championship run in light of their financial constraints, roster construction, and the competitive landscape of the league”

https://www.kimi.com/preview/197bf0e2-e001-861c-96ef-688ebe0005de

https://video.twimg.com/amplify_video/1942962571042000896/vid/avc1/1280x720/TBeFIZ8CPV_EsKXu.mp4

7/12
@rohanpaul_ai
🧵 7/n

Another deep research task

Prompt - “Make a deep research about CoreWeave’s infrastructure expansion and competitive positioning in the GPU-as-a-Service (GPUaaS) market, including key clients, partnerships, and scalability roadmap.”

Kimi - "an AI assistant that can reason, analyze, and think deeply"

https://video.twimg.com/amplify_video/1942962695642435584/vid/avc1/1280x720/hzt4YGa21OSPs14e.mp4

8/12
@rohanpaul_ai
🧵 8/n Another use case of running a deep research task

Prompt - “Make a deep research about Circle’s post-IPO stock surge and volatility—750% rally, recent pullback, analyst sentiment including Ark trim, and comparison to Coinbase movement.”

https://video.twimg.com/amplify_video/1942962873308696577/vid/avc1/1280x720/Ya2RlrsZgBCwvCrk.mp4

9/12
@rohanpaul_ai
🧵 9/n

Another use case

Prompt - “Analyze Tesla’s recent executive shake‑up—including the firing of regional head Omead Afshar—under the strain of Q2 sales decline, European market share drop, and internal ‘Tesla Takedown’ protests.”

Kimi - "an AI assistant that can reason, analyze, and think deeply"

https://video.twimg.com/amplify_video/1942962963905679360/vid/avc1/1280x720/ERPuC3_Hgti0fQzK.mp4

10/12
@rohanpaul_ai
🧵 10/n

Another use case

Prompt - “Core PCE hits lowest level since 2020 — implications for inflation"

https://www.kimi.com/preview/d1h2smef5ku9dv185g40?blockId=46

https://video.twimg.com/amplify_video/1942963994001850368/vid/avc1/1920x1080/i2qnExBs5Ug-9nhd.mp4

11/12
@rohanpaul_ai
Link to try: Kimi - "an AI assistant that can reason, analyze, and think deeply"

Kimi-Researcher is beginning its gradual rollout. Join the waitlist here.

Apply for Kimi Researcher

🔗 Blog: https://moonshotai.github.io/Kimi-Researcher/

12/12
@ApollonVisual
The results of the research are solid content-wise and accuracy-wise. But it was very slow (o3 pro deep research with an extended research prompt fetched results nearly 7 minutes earlier) for the same exact topic/prompt. But it's free and an interesting alternative to established deep research agents


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,642
Reputation
10,572
Daps
185,450

1/3
@rohanpaul_ai
So /xAI 's /grok 4 really did hit 44.4% on HLE (Humanity's Last Exam) 🤯

---

(HLE holds 2,500 expert-written questions spanning more than 100 subjects, including math, physics, computer science and humanities, and 14% of them mix text with images.
The authors deliberately built in anti-gaming safeguards and hid a private question set so that simply memorising answers will not help a model.)

GveECKnaEAAR9aF.jpg


2/3
@rohanpaul_ai
Grok 4 brings huge upgrades to voice conversations and introduces new voices, like Eve, capable of rich emotions.

GveZPBVbIAAz3Fv.jpg


3/3
@NavnNavn248469
Android users on suicide watch


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196


1/3
@rohanpaul_ai
So /xAI 's /grok 4 really did hit 44.4% on HLE (Humanity's Last Exam) 🤯

---

(HLE holds 2,500 expert-written questions spanning more than 100 subjects, including math, physics, computer science and humanities, and 14% of them mix text with images.
The authors deliberately built in anti-gaming safeguards and hid a private question set so that simply memorising answers will not help a model.)

GveECKnaEAAR9aF.jpg


2/3
@rohanpaul_ai
Grok 4 is now the leading AI model on Artificial Analysis Intelligence Index.

Achieves 73, ahead of OpenAI o3 at 70, Google Gemini 2.5 Pro at 70, Anthropic Claude 4 Opus at 64 and DeepSeek R1 0528 at 68. Full results breakdown below.

GveL_dBXMAE09VW.jpg

Gvd9nWIakAULlB9.jpg


3/3
@dh7net
Another proof that these leaderboards are not correlated anymore with user needs.


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196



1/37
@ArtificialAnlys
xAI gave us early access to Grok 4 - and the results are in. Grok 4 is now the leading AI model.

We have run our full suite of benchmarks and Grok 4 achieves an Artificial Analysis Intelligence Index of 73, ahead of OpenAI o3 at 70, Google Gemini 2.5 Pro at 70, Anthropic Claude 4 Opus at 64 and DeepSeek R1 0528 at 68. Full results breakdown below.

This is the first time that /elonmusk's /xai has led the AI frontier. Grok 3 scored competitively with the latest models from OpenAI, Anthropic and Google - but Grok 4 is the first time that our Intelligence Index has shown xAI in first place.

We tested Grok 4 via the xAI API. The version of Grok 4 deployed for use on X/Twitter may be different to the model available via API. Consumer application versions of LLMs typically have instructions and logic around the models that can change style and behavior.

Grok 4 is a reasoning model, meaning it ‘thinks’ before answering. The xAI API does not share reasoning tokens generated by the model.

Grok 4’s pricing is equivalent to Grok 3 at $3/$15 per 1M input/output tokens ($0.75 per 1M cached input tokens). The per-token pricing is identical to Claude 4 Sonnet, but more expensive than Gemini 2.5 Pro ($1.25/$10, for <200K input tokens) and o3 ($2/$8, after recent price decrease). We expect Grok 4 to be available via the xAI API, via the Grok chatbot on X, and potentially via Microsoft Azure AI Foundry (Grok 3 and Grok 3 mini are currently available on Azure).

Key benchmarking results:
➤ Grok 4 leads in not only our Artificial Analysis Intelligence Index but also our Coding Index (LiveCodeBench & SciCode) and Math Index (AIME24 & MATH-500)
➤ All-time high score in GPQA Diamond of 88%, representing a leap from Gemini 2.5 Pro’s previous record of 84%
➤ All-time high score in Humanity’s Last Exam of 24%, beating Gemini 2.5 Pro’s previous all-time high score of 21%. Note that our benchmark suite uses the original HLE dataset (Jan '25) and runs the text-only subset with no tools
➤ Joint highest score for MMLU-Pro and AIME 2024 of 87% and 94% respectively
➤ Speed: 75 output tokens/s, slower than o3 (188 tokens/s), Gemini 2.5 Pro (142 tokens/s), Claude 4 Sonnet Thinking (85 tokens/s) but faster than Claude 4 Opus Thinking (66 tokens/s)

Other key information:
➤ 256k token context window. This is below Gemini 2.5 Pro’s context window of 1 million tokens, but ahead of Claude 4 Sonnet and Claude 4 Opus (200k tokens), o3 (200k tokens) and R1 0528 (128k tokens)
➤ Supports text and image input
➤ Supports function calling and structured outputs

See below for further analysis 👇

Gvd9nWIakAULlB9.jpg


2/37
@ArtificialAnlys
Grok 4 scores higher in Artificial Analysis Intelligence Index than any other model. Its pricing is higher than OpenAI’s o3, Google’s Gemini 2.5 Pro and Anthropic’s Claude 4 Sonnet - but lower than Anthropic’s Claude 4 Opus and OpenAI’s o3-pro.

GveETjzb0AAMaSW.jpg


3/37
@ArtificialAnlys
Full set of intelligence benchmarks that we have run independently on xAI’s Grok 4 API:

GveEaIZWwAAn6jn.jpg

GveEa69XoAQtWRo.jpg

GveEb6MbYAAWnzl.jpg


4/37
@ArtificialAnlys
Grok 4 recorded slightly higher output token usage compared to peer models when running the Artificial Analysis Intelligence Index. This translates to higher cost relative to its per token price.

GveEhybWQAArU7z.jpg

GveEjOjW8AAoZlX.jpg


5/37
@ArtificialAnlys
xAI’s API is serving Grok 4 at 75 tokens/s. This is slower than o3 (188 tokens/s) but faster than Claude 4 Opus Thinking (66 tokens/s).

GveEntwW4AASCVx.jpg


6/37
@ArtificialAnlys
Grok 4 is now live on Artificial Analysis: http://artificialanalysis.ai

7/37
@Evantaged
Is this Grok 4 Heavy or base??

8/37
@ArtificialAnlys
Base, with no tools. We have not tested Grok 4 Heavy yet.

9/37
@Elkins
🔨⏰

10/37
@AuroraHoX
😎👍

11/37
@tetsuoai
Honestly it's so good!

12/37
@rozer100x
interesting

13/37
@ianksnow1
It’s truly a rockstar. Light years better than the previous model and based on my early interactions perhaps leapfrogged every other frontier model.

14/37
@VibeEdgeAI
It's impressive to see Grok 4 leading the pack with a 73 on the Artificial Analysis Intelligence Index, especially with its strong performance in coding and math benchmarks.

However, the recent hate speech controversy is a sobering reminder of the ethical challenges AI development faces.

Balancing innovation with responsibility will be key as xAI moves forward-hopefully, these issues can be addressed to harness Grok 4's potential for positive impact.

15/37
@XaldwinSealand
Currently Testing Grok 4...

16/37
@MollySOShea


17/37
@0xSweep
might just be the greatest AI innovation of all time

18/37
@HaleemAhmed333
Wow

19/37
@Jeremyybtc
good to have you /grok 4

20/37
@Kriscrichton
🔥🔥🔥🔥

21/37
@ArthurMacwaters
Reality is the best eval

This is where Grok4 impresses me most

GveHxP7aMAAH5RC.jpg


22/37
@Coupon_Printer
I was waiting for your results /ArtificialAnlys !!! Thank you for this

23/37
@TheDevonWayne
so you didn't even get to try grok heavy?

24/37
@_LouiePeters
This is a great and rapid overview!
I think your intelligence benchmarks should start including and up weighting agent and tool use scores though; in the real world we want the models to perform as well as possible, which means giving them every tool possible - no need to handicap them by limiting access.

25/37
@shiels_ai
So this isn’t the tool calling model? Wow!

26/37
@joAnneSongs72
YEAH 🎉❤️🎉❤️🎉❤️🎉

27/37
@riddle_sphere
New kid on the block just dethroned the veterans. Silicon Valley’s watching.

28/37
@blockxs
Grok 4: AI champ confirmed

29/37
@SastriVimla
Great

30/37
@neoonai
NeoON > Grok. Right?

31/37
@EricaDXtra
So cool, so good!

32/37
@evahugsyou
Grok 4 just came out on top, and it’s not even a competition anymore. Elon’s team is absolutely killing it!

33/37
@garricn
Just wait till it starts conducting science experiments

34/37
@mukulneetika
Wow!

35/37
@RationalEtienne
Grok 4 is HOLY.

Humanity has created AI that it will merge with.

All Praise Elon for his act of CREATION! 🙏

36/37
@MixxsyLabs
I personally found it better for coding uses than Claude. I'm no expert, but when I needed a tool that's the one I started going back to after using a few for code snippets and assistance

37/37
@codewithimanshu
Interesting, perhaps true intelligence lies beyond benchmarks.


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,642
Reputation
10,572
Daps
185,450

1/6
@rohanpaul_ai
Microsoft just dropped Phi-4-mini-flash-reasoning.

- built on a new hybrid architecture,
- 10X higher throughput and a 2 to 3X reduction in latency
- significantly faster inference without sacrificing reasoning performance.

Microsoft swaps most of the heavy full-attention work for a lean SambaY layout with tiny gating blocks, so the same 3.8B parameters think quicker and type sooner.

🧩 The quick idea

Phi‑4‑mini‑flash‑reasoning keeps size small at 3.8B parameters but rebuilds the flow of information.

A new decoder‑hybrid‑decoder stack called SambaY lets light recurrent pieces handle context, a single full‑attention layer handles global glue, and cheap little Gated Memory Units (GMUs) recycle that work all the way down the stack.
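
A loose sketch of the gating idea, not Microsoft's exact GMU formulation: a cheap element-wise gate reuses a representation computed once by the full-attention layer instead of recomputing attention at every depth.

```python
import torch
import torch.nn as nn

class GatedMemoryUnit(nn.Module):
    """Illustrative gating block: blend the current hidden state with a
    representation shared from the single full-attention layer."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, dim)

    def forward(self, hidden: torch.Tensor, shared_memory: torch.Tensor) -> torch.Tensor:
        # hidden: this layer's activations; shared_memory: the reused attention output
        g = torch.sigmoid(self.gate(hidden))
        return g * shared_memory + (1 - g) * hidden
```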

GvdbXm7XYAEkbIE.png


2/6
@rohanpaul_ai
microsoft/Phi-4-mini-flash-reasoning · Hugging Face

Official Blog - Reasoning reimagined: Introducing Phi-4-mini-flash-reasoning | Microsoft Azure Blog

3/6
@Chaos2Cured
Told everyone. Microsoft is blazing. •

4/6
@rohanpaul_ai
yes

5/6
@JonGuze
Asking again, how can I find out what you and other AI people mean by "inference" and "reasoning"?

6/6
@techfusionjb
Love the speed boost! 🎉 But does this lean model architecture juggle both efficiency and adaptability well? Curious on the real-world versatility beyond benchmarks!


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 