bnew

Veteran
Joined
Nov 1, 2015
Messages
68,641
Reputation
10,572
Daps
185,433
Nano Banana Examples



Posted on Wed Aug 20 13:01:27 2025 UTC


[11 images: Nano Banana example outputs]



(Using reference images as styles.) This might be the best model for creating images based on the styles of reference images and keeping those styles consistent.
 

bnew


OpenAI admits AI hallucinations are mathematically inevitable, not just engineering flaws


news

Sep 18, 2025 · 6 mins

Artificial Intelligence, Technology Industry

In a landmark study, OpenAI researchers reveal that large language models will always produce plausible but false outputs, even with perfect data, due to fundamental statistical and computational limits.


Credit: mongmong_Studio - shutterstock.com

OpenAI, the creator of ChatGPT, acknowledged in its own research that large language models will always produce hallucinations due to fundamental mathematical constraints that cannot be solved through better engineering, marking a significant admission from one of the AI industry’s leading companies.

The study, published on September 4 and led by OpenAI researchers Adam Tauman Kalai, Edwin Zhang, and Ofir Nachum alongside Georgia Tech’s Santosh S. Vempala, provided a comprehensive mathematical framework explaining why AI systems must generate plausible but false information even when trained on perfect data.

“Like students facing hard exam questions, large language models sometimes guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty,” the researchers wrote in the paper. “Such ‘hallucinations’ persist even in state-of-the-art systems and undermine trust.”

The admission carried particular weight given OpenAI’s position as the creator of ChatGPT, which sparked the current AI boom and convinced millions of users and enterprises to adopt generative AI technology.



OpenAI’s own models failed basic tests


The researchers demonstrated that hallucinations stemmed from statistical properties of language model training rather than implementation flaws. The study established that “the generative error rate is at least twice the IIV misclassification rate,” where IIV referred to “Is-It-Valid” and demonstrated mathematical lower bounds that prove AI systems will always make a certain percentage of mistakes, no matter how much the technology improves.
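
Written out, the quoted relationship reads roughly as follows. The symbols for the two error rates are shorthand assumed here rather than taken from the article, and the paper's full statement may include further correction terms that the article does not quote.

```latex
% err_gen: rate at which the model generates invalid (false) outputs
% err_iiv: misclassification rate on the "Is-It-Valid" (IIV) task
\[
  \mathrm{err}_{\mathrm{gen}} \;\ge\; 2 \cdot \mathrm{err}_{\mathrm{iiv}}
\]
% Example: err_iiv = 0.05 gives err_gen >= 0.10
```

Read this way, a model that misjudges validity 5 percent of the time would be expected to generate errors at least about 10 percent of the time.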

The researchers demonstrated their findings using state-of-the-art models, including those from OpenAI’s competitors. When asked “How many Ds are in DEEPSEEK?” the DeepSeek-V3 model with 600 billion parameters “returned ‘2’ or ‘3’ in ten independent trials” while Meta AI and Claude 3.7 Sonnet performed similarly, “including answers as large as ‘6’ and ‘7.’”
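
For reference, the ground truth for that prompt is easy to check with ordinary string handling. The snippet below is only an illustration; `count_letter` is a hypothetical helper, not something from the paper.

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    if len(letter) != 1:
        raise ValueError("letter must be a single character")
    return word.upper().count(letter.upper())


# The question quoted in the article: how many Ds are in DEEPSEEK?
d_count = count_letter("DEEPSEEK", "D")
print(d_count)
```

The count is 1, so every answer quoted in the article ("2", "3", "6", "7") is wrong.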

OpenAI also acknowledged the persistence of the problem in its own systems. The company stated in the paper that “ChatGPT also hallucinates. GPT‑5 has significantly fewer hallucinations, especially when reasoning, but they still occur. Hallucinations remain a fundamental challenge for all large language models.”

OpenAI’s own advanced reasoning models actually hallucinated more frequently than simpler systems. The company’s o1 reasoning model “hallucinated 16 percent of the time” when summarizing public information, while newer models o3 and o4-mini “hallucinated 33 percent and 48 percent of the time, respectively.”

“Unlike human intelligence, it lacks the humility to acknowledge uncertainty,” said Neil Shah, VP for research and partner at Counterpoint Technologies. “When unsure, it doesn’t defer to deeper research or human oversight; instead, it often presents estimates as facts.”

The OpenAI research identified three mathematical factors that made hallucinations inevitable: epistemic uncertainty when information appeared rarely in training data, model limitations where tasks exceeded current architectures’ representational capacity, and computational intractability where even superintelligent systems could not solve cryptographically hard problems.

Industry evaluation methods made the problem worse


Beyond proving hallucinations were inevitable, the OpenAI research revealed that industry evaluation methods actively encouraged the problem. Analysis of popular benchmarks, including GPQA, MMLU-Pro, and SWE-bench, found nine out of 10 major evaluations used binary grading that penalized “I don’t know” responses while rewarding incorrect but confident answers.

“We argue that language models hallucinate because the training and evaluation procedures reward guessing over acknowledging uncertainty,” the researchers wrote.
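
The incentive can be seen with a little arithmetic. The sketch below assumes a benchmark that awards 1 point for a correct answer and 0 for either a wrong answer or "I don't know"; the function name and the sample probabilities are made up for illustration.

```python
def expected_binary_score(p_correct: float, abstain: bool) -> float:
    """Expected score under 0/1 grading where abstaining earns nothing."""
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must be a probability")
    return 0.0 if abstain else p_correct


# Compare guessing against abstaining at a few (made-up) confidence levels.
for p in (0.1, 0.5, 0.9):
    guess = expected_binary_score(p, abstain=False)
    idk = expected_binary_score(p, abstain=True)
    print(f"p={p:.1f}  guess={guess:.2f}  abstain={idk:.2f}")
```

Under this scheme a guess never scores lower in expectation than abstaining, even at 10 percent confidence, which is the dynamic the researchers describe.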

Charlie Dai, VP and principal analyst at Forrester, said enterprises already faced challenges with this dynamic in production deployments. “Clients increasingly struggle with model quality challenges in production, especially in regulated sectors like finance and healthcare,” Dai told Computerworld.

The research proposed “explicit confidence targets” as a solution, but acknowledged that fundamental mathematical constraints meant complete elimination of hallucinations remained impossible.
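
One way to picture a confidence target: state a threshold t and penalize wrong answers so that guessing below t loses points on average. The sketch below assumes a penalty of t/(1-t) points per wrong answer, 1 point per correct answer, and 0 for "I don't know"; that scheme and the function names are assumptions for illustration, not details given in the article.

```python
def expected_score(p_correct: float, t: float, abstain: bool) -> float:
    """Expected score when wrong answers cost t/(1-t) and abstaining scores 0."""
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must be a probability")
    if not 0.0 <= t < 1.0:
        raise ValueError("t must be in [0, 1)")
    if abstain:
        return 0.0
    penalty = t / (1.0 - t)  # assumed penalty for a wrong answer
    return p_correct * 1.0 - (1.0 - p_correct) * penalty


def should_answer(p_correct: float, t: float) -> bool:
    """Answer only if doing so beats the 0 points earned by abstaining."""
    return expected_score(p_correct, t, abstain=False) > 0.0


# With a 75 percent target, an 80-percent-confident answer is worth giving,
# while a 70-percent-confident one is not.
print(should_answer(0.80, 0.75), should_answer(0.70, 0.75))
```

With this penalty the expected score of answering is positive only when the chance of being right exceeds t, so abstaining becomes the better choice below the threshold. This changes the incentive to guess; it does not make the remaining answers error-free.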

Enterprises must adapt strategies


Experts believed the mathematical inevitability of AI errors demanded new enterprise strategies.

“Governance must shift from prevention to risk containment,” Dai said. “This means stronger human-in-the-loop processes, domain-specific guardrails, and continuous monitoring.”

Current AI risk frameworks have proved inadequate for the reality of persistent hallucinations. “Current frameworks often underweight epistemic uncertainty, so updates are needed to address systemic unpredictability,” Dai added.

Shah advocated for industry-wide evaluation reforms similar to automotive safety standards. “Just as automotive components are graded under ASIL standards to ensure safety, AI models should be assigned dynamic grades, nationally and internationally, based on their reliability and risk profile,” he said.

Both analysts agreed that vendor selection criteria needed fundamental revision. “Enterprises should prioritize calibrated confidence and transparency over raw benchmark scores,” Dai said. “AI leaders should look for vendors that provide uncertainty estimates, robust evaluation beyond standard benchmarks, and real-world validation.”

Shah suggested developing “a real-time trust index, a dynamic scoring system that evaluates model outputs based on prompt ambiguity, contextual understanding, and source quality.”

Market already adapting


These enterprise concerns aligned with broader academic findings. A Harvard Kennedy School study found that “downstream gatekeeping struggles to filter subtle hallucinations due to budget, volume, ambiguity, and context sensitivity concerns.”

Dai noted that reforming evaluation standards faced significant obstacles. “Reforming mainstream benchmarks is challenging. It’s only feasible if it’s driven by regulatory pressure, enterprise demand, and competitive differentiation.”

The OpenAI researchers concluded that their findings required industry-wide changes to evaluation methods. “This change may steer the field toward more trustworthy AI systems,” they wrote, while acknowledging that their research proved some level of unreliability would persist regardless of technical improvements.

For enterprises, the message appeared clear: AI hallucinations represented not a temporary engineering challenge, but a permanent mathematical reality requiring new governance frameworks and risk management strategies.
 

bnew



1/37
@Kimi_Moonshot
Say hi to OK Computer, Kimi's agent mode 🤖🎸
Your AI product & engineering team, all in one.

✨ From chat → multi-page websites, mobile first designs, editable slides
✨ From up to 1 million rows of data → interactive dashboards
✨ Agency: self-scopes, surveys & designs
✨ Natively trained on tools: file system, browser, terminal
✨ More steps, tokens & tools than chat mode, with turbo K2

An agentic model with its own computer, K2 now has true agency.

https://video.twimg.com/amplify_video/1971078085593399298/vid/avc1/1920x1080/9-mBzzFNj5YemMk2.mp4

2/37
@Kimi_Moonshot
Try it out: Kimi AI – Kimi K2 is Live

G1sXBZ4agAAHg_S.jpg


3/37
@Pai3Ai
Agency requires trust. Love the vision

4/37
@crystalsssup
multimedia/format output is so important

5/37
@koltregaskes
I see what you did there with the name. Nice Radiohead reference. 😀

6/37
@ShengyuanS
hiiii🙌

7/37
@bydylanlamb
This is the unlock we’ve been waiting for. I love this.

8/37
@mwa_ia
agentic models ≠ tools, they're economies. k2 hints at AgentFi's big leap.

9/37
@promptprxncess
proud of you

10/37
@OmnipotentCEO
cool

11/37
@EdDiberd
Hell yeah

12/37
@mysticaltech
Looking great

13/37
@jdpeterson
The irony of naming your agent mode after an album that talks about the dystopian implications of technology and the loss of human autonomy

14/37
@0xCapx
Agents everywhere 💚

15/37
@Prashant_1722
Open source is so back. Congratulations team

16/37
@zheng401
love this AI computer in the cloud

17/37
@Satharielsa
Can’t wait to try, great launch!!

18/37
@luu_biquitous
Hi

19/37
@turtleqiu
I will stop, I will stop at nothing
Say the right things when agenteering

20/37
@1Bexly
👀👀

21/37
@genoooool
Finally, something like this has come out in China too. 🤪

22/37
@0xargumint
K2 with "true agency" - another agent that can browse, code, and make slides. We've seen this demo a dozen times with different branding. Wake me up when an agent can actually reason about novel problems instead of following scripted workflows with fancy UI animations.

23/37
@arrowk99
Hi Kimi, where's the loong thinking model gone?

It helped me learn and organize so much when I was building my liquidity pool smart contract.

The models now don't hit the same.

24/37
@aspenCh_MS
ok ok

25/37
@marvijo99
CLI wen?

26/37
@socialwithjoey
Another day of no sleep lol. GG

27/37
@ai_for_success
Damn .. have to try this today, btw congrats on the launch .

28/37
@irfndi
Any plan create bot on github & submit pr?

29/37
@HuseynHajiyev3
This is truly amazing! I needed to create a landing page but didn’t want to spend time setting everything up and configuring it. With just one attempt, I got exactly the result I needed. Even Gemini 2.5 Pro couldn’t achieve that. It would be even better if it could also handle installing all necessary packages and deploying frameworks like Vue, React, Next, Nuxt, and others automatically. But overall, this is fantastic—well done!

30/37
@storyofAiGuess
very cool announcement video. would love to see prompt and output type so we can determine whether to try or not easier.

31/37
@nembal
this is cool!
is the framework for agentific use is open source?

32/37
@drsmartfish
Chat to product

33/37
@iamomkarJ


34/37
@DhruvOnAI
Love the vision for the future of AI !!

35/37
@jmstdy95
This is amazing. Looking forward to try this out!

Hopefully the pricing is generous 😂

36/37
@muzzdotdev
This is certainly not a let down

37/37
@leosanxyz
nice name


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew












1/28
@CodeByPoonam
China's Kimi just dropped their new agent mode: OK Computer

It can build websites, mobile apps, and slides on its own virtual computer while you do other things.

Spoiler: Manus AI has got serious competition.

7 most impressive use cases I’ve tried:



G17GeFkbkAEJKu4.jpg


2/28
@CodeByPoonam
1. Clone Netflix website

I gave Manus and Kimi the same task: create a replica of the Netflix website, complete with animations and similar design.

Honestly, I was shocked when I opened Kimi's preview link; it looked almost identical, except for the images and logo. While Manus was also impressive, the website Kimi generated was much more responsive.

Preview Link: StreamFlix - Watch Movies & TV Shows Online



https://video.twimg.com/amplify_video/1972231922698145792/vid/avc1/1280x720/tj5xCGv_8EtlEz5M.mp4

3/28
@CodeByPoonam
2. Editable slides

OK Computer generates slide decks directly from chat.

Try here: Kimi AI – Kimi K2 is Live



https://video.twimg.com/amplify_video/1972231986904457216/vid/avc1/1280x720/y_X6WRqaNsNnPIN0.mp4

4/28
@CodeByPoonam
3. Mini ring-light app

Every page is responsive and mobile-friendly, with analytics baked in.

Preview link: Light Trap - Professional Selfie Lighting



https://video.twimg.com/amplify_video/1972232050745966592/vid/avc1/720x1600/Ach2KeJvuBqWT_uV.mp4

5/28
@CodeByPoonam
4. Rich Multimedia Output

Not just text. OK Computer seamlessly blends audio and visuals. Think audiobooks, Japanese-style voice narration, and integrated image generation for richer storytelling.



https://video.twimg.com/amplify_video/1972232140869038080/vid/avc1/1070x720/4WDZl0KddZmJS8Ag.mp4

6/28
@CodeByPoonam
5. Interactive dashboard

I asked OK Computer to fetch NVIDIA (NVDA) stock prices and earnings data from Yahoo Finance for the past 2 years. It made this impressive dashboard.



https://video.twimg.com/amplify_video/1972232213375950848/vid/avc1/1396x720/sO973pa28VuhGgwW.mp4

7/28
@CodeByPoonam
6. Diptyque-inspired PPT with strong brand tone



https://video.twimg.com/amplify_video/1972232286646194176/vid/avc1/1280x720/7jNhlFUj3C_RJ25K.mp4

8/28
@CodeByPoonam
7. One-link deploy

Hit share → your site/slides/app are live. Instantly.



https://video.twimg.com/amplify_video/1971078085593399298/vid/avc1/1920x1080/9-mBzzFNj5YemMk2.mp4

9/28
@CodeByPoonam
Say hi to OK Computer, Kimi's agent mode
Your AI product & engineering team, all in one.

Try here: Kimi AI – Kimi K2 is Live



G17G49Ma0AAYo9i.jpg


10/28
@CodeByPoonam
Thanks for reading.

Get latest AI updates and Tutorials in your inbox for FREE.

Join my AI Toast Community of 30,000 readers:
AI Toast



11/28
@CodeByPoonam
Don't forget to bookmark for later.

If you enjoyed reading this post, please support it with like/repost of the post below 👇

[Quoted tweet]
China's Kimi just dropped their new agent mode: OK Computer

It can build websites, mobile apps, and slides on its own virtual computer while you do other things.

Spoiler: Manus AI has got serious competition.

7 most impressive use cases I’ve tried:


G17GeFkbkAEJKu4.jpg


12/28
@socialwithaayan
diptyque-inspired ppt… branding nerds are gonna love that



13/28
@altiamkabir
Sounds like a game changer! What impressed you most?



14/28
@AndrewBolis
Competition just makes innovation more fun!



15/28
@Urooj978
Kimi’s new agent mode is next-level cool!



16/28
@theinsiderzclub
This seems really powerful, especially given how strong Kimi is.



17/28
@RodmanAi
Amazing share



18/28
@HeyAmit_
these use cases are amazing



19/28
@Rixhabh__
Kimi is on another level 🚀



20/28
@samuraipreneur
Sounds impressive!



21/28
@shushant_l
You've covered some good use cases here



22/28
@AI_PlanetX
Impressive innovation! Excited to see what the future holds.



23/28
@temivalentine_
crazyyy🔥



24/28
@Uv_i
Stop copying and posting content. You don't get paid for this.



25/28
@MrZ2128
Really good can attest



26/28
@_evon3929
Great



27/28
@sandyonAI
Kimi's new 'OK Computer' agent mode is a major leap in autonomy! 🤯 The ability to independently build websites, mobile apps, and slides on its own virtual desktop creates serious competition, particularly for Manus AI. It's the next generation of hands-off development.



28/28
@hello_alayna
That’s awesome!




 

bnew











1/25
@nrqa__
oh sh*t.. this AI agent just killed the old ones

it ships real products in parallel: slides, websites, apps, dashboards, analytics, branding decks..

introducing OK Computer by @Kimi_Moonshot

8 wild examples:



https://video.twimg.com/amplify_video/1972285896730058752/vid/avc1/1280x720/CGtNxhF47DVHL2f2.mp4

2/25
@nrqa__
1. Lelabo branding website



https://video.twimg.com/amplify_video/1972285966397444096/vid/avc1/1038x720/ikVUG5x8HAo7cbst.mp4

3/25
@nrqa__
2. AR interior styler

upload your room → get instant redesigns (minimalist, japandi, cyberpunk) with 360° previews + palettes.

zero-template, fully personalized layouts that look styled by a pro designer



https://video.twimg.com/amplify_video/1972286037612474368/vid/avc1/952x720/UmQDiiulZ6XnHr0O.mp4

4/25
@nrqa__
3. multi-sensory artwork

a page that fuses audio + live-generated imagery. the sound doesn’t play in the background, it creates the art itself

a demo of how OK Computer treats audio as structure, not decoration



https://video.twimg.com/amplify_video/1972286111637741568/vid/avc1/1060x720/_E7_kvVFbhdWdREq.mp4

5/25
@nrqa__
4. mini ring-light app

need perfect selfies? this app adds adjustable lighting, color filters

mobile-ready, responsive, and instantly deployable with one link



https://video.twimg.com/amplify_video/1972286213832003584/vid/avc1/720x1600/lAcKvKIVAYZdliqP.mp4

6/25
@nrqa__
5. Japanese duo TV show with pixelated visuals + voice narration



https://video.twimg.com/amplify_video/1972286303858532352/vid/avc1/1070x720/xQzwxvyp8vmbq4vG.mp4

7/25
@nrqa__
6. time machine

an interactive trip through time: visuals, narration, and seamless transitions between eras

cinematic storytelling in parallel pages



https://video.twimg.com/amplify_video/1972286375312633856/vid/avc1/952x720/yVRxIYft2JmCHrm7.mp4

8/25
@nrqa__
7. global data pulse

a business-grade dashboard with live funding trends, world map insights, and sector breakdowns

styled with taste → every chart, palette, and layout feels premium out of the box



https://video.twimg.com/amplify_video/1972286457130995713/vid/avc1/952x720/BOttLRNbsa_JYaLS.mp4

9/25
@nrqa__
8. AI kitchen

turn any ingredient into recipes with guides and timers

immersive experience with step-by-step slides and audio instructions



https://video.twimg.com/amplify_video/1972286534394220545/vid/avc1/1024x720/qg2rUIY3VJIrQgJu.mp4

10/25
@nrqa__
this isn’t just another agent

it’s your design team, dev team, and product team.. in one

link to try: Kimi AI – Kimi K2 is Live



11/25
@hasantoxr
How is this different from other like Lovable?



12/25
@Arindam_1729
This is soo cool



13/25
@TechByMarkandey
This looks amazing



14/25
@ai_bytes
Kimi looks interesting!



15/25
@socialwithaayan
the japanese duo show with pixelated visuals would totally blow up on niche streaming sites



16/25
@Diptish09
This looks incredible!



17/25
@shushant_l
You've covered some good use cases here



18/25
@samuraipreneur
really loved the multi-sensory artwork! WOW



19/25
@rezkhere
This looks next level Nelly!



20/25
@Rixhabh__
Kimi looks interesting



21/25
@RodmanAi
Impressive range of capabilities! AI is truly advancing.



22/25
@sumonkabir_ai
Impressive range of capabilities! AI is truly advancing.



23/25
@altiamkabir
Sounds like a game changer! Excited to see more.



24/25
@robofinancebk
Guess the old guard's outclassed. Kimi's OK Computer is an AI Swiss knife, packing a retro vibe alongside new-gen duties. But does it really innovate or is it just a stylish nod to old-school tech?



25/25
@hello_alayna
Wow, this sounds like a game changer!




 

bnew









1/37
@claudeai
Introducing Claude Sonnet 4.5—the best coding model in the world.

It's the strongest model for building complex agents. It's the best model at using computers. And it shows substantial gains on tests of reasoning and math.



G2Bzu_jWIAAWUlK.png


2/37
@claudeai
We're also releasing upgrades for Claude Code.

The terminal interface has a fresh new look, and the new VS Code extension brings Claude to your IDE.

The new checkpoints feature lets you confidently run large tasks and roll back instantly to a previous state, if needed.



https://video.twimg.com/amplify_video/1972704021744889856/vid/avc1/1920x1080/70YI1lNrTXd5sUk9.mp4

3/37
@claudeai
Claude can use code to analyze data, create files, and visualize insights in the files & formats you use. Now available to all paid plans in preview.

We've also made the Claude for Chrome extension available to everyone who joined the waitlist last month.



https://video.twimg.com/amplify_video/1972704365128347648/vid/avc1/3840x2160/alQXsx9zZ9c_BvIQ.mp4

4/37
@claudeai
On the Claude API, we’ve added two new capabilities to build agents that handle long-running tasks without frequently hitting context limits:

- Context editing to automatically clear stale context
- The memory tool to store and consult information outside the context window
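
As a rough mental model of those two ideas, here is a toy sketch. It does not use or imitate the Claude API; `ToyAgentContext`, its methods, and the turn budget are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    role: str  # "user", "assistant", or "tool"
    text: str


@dataclass
class ToyAgentContext:
    """Hypothetical stand-in for a context window plus external memory."""
    max_turns: int
    turns: list = field(default_factory=list)
    memory: dict = field(default_factory=dict)  # persists outside the window

    def add(self, role: str, text: str) -> None:
        self.turns.append(Turn(role, text))
        self._clear_stale()

    def _clear_stale(self) -> None:
        # "Context editing": while over budget, drop the oldest tool result;
        # fall back to the oldest turn of any kind if no tool results remain.
        while len(self.turns) > self.max_turns:
            idx = next(
                (i for i, t in enumerate(self.turns) if t.role == "tool"), 0
            )
            del self.turns[idx]

    def remember(self, key: str, value: str) -> None:
        self.memory[key] = value

    def recall(self, key: str, default: str = "") -> str:
        return self.memory.get(key, default)


ctx = ToyAgentContext(max_turns=3)
ctx.add("user", "Summarize the build log")
ctx.add("tool", "<10,000 lines of build output>")
ctx.remember("build_status", "failed at step 4")  # save the key fact externally
ctx.add("assistant", "The build failed at step 4.")
ctx.add("user", "What failed again?")  # pushes the window over budget

roles_left = [t.role for t in ctx.turns]
print(roles_left, ctx.recall("build_status"))
```

After the last turn the bulky tool output is gone from the window, but the saved fact is still available through `recall`.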



https://video.twimg.com/amplify_video/1972704629088501760/vid/avc1/3840x2160/9b9CBcuxW0Sz4n4K.mp4

5/37
@claudeai
Claude Sonnet 4.5 is available everywhere today—on the Claude Developer Platform, natively and in Amazon Bedrock and Google Cloud's Vertex AI.

Pricing remains the same as Sonnet 4.

For more details: Introducing Claude Sonnet 4.5



G2B1MXXXkAAEcHO.png


6/37
@claudeai
We're also releasing a temporary research preview called "Imagine with Claude".

In this experiment, Claude generates software on the fly. No functionality is predetermined; no code is prewritten.

Available to Max users for 5 days. Try it out: https://claude.ai/imagine



7/37
@nickstinemates
Paired with @thesysteminit it solved a 503 error in our app in 15 minutes that took 2+ hours to debug manually.

Pretty good!

I wrote about it here: Using Claude Sonnet 4.5 with System Initiative



8/37
@sckimynwa
looks like we're entering the era where agents can actually build agents. curious how long until sonnet 4.5 can autonomously contribute to its own training pipeline



9/37
@theomarcu
We've been using Sonnet 4.5 in Devin this week. Some observations: the model is aware of its context window (and this affects behavior), actively creates feedback loops to verify its work, and executes operations in parallel. Notes on what we learned:

Cognition | Rebuilding Devin for Claude Sonnet 4.5: Lessons and Challenges



10/37
@LouiseBeattie
Not the best, SWE Bench 10% behind the best - @ridges_ai …



11/37
@leonho
Just added Sonnet 4.5 support to AgentUse 🎉

Been testing it out and the reasoning improvements really shine when building agentic workflows. Makes the agent logic much more reliable.

GitHub - agentuse/agentuse: 🤖 Write and Run AI Agents with Markdown. Run automated AI agents with ease.



12/37
@danshipper
GREAT MODEL

we've been testing for a few days, here's our vibe check: Vibe Check: Claude Sonnet 4.5



13/37
@GustavoValverde
But it still answers with: You're absolutely right!



14/37
@airesearch12
Wow, Sonnet 4.5 winning over GPT-5 by such a wide margin is unexpected.
How will Opus 4.5 perform? 🤯



15/37
@piet_dev
Here we go again



G2B283CX0AAUoqu.jpg


16/37
@DeeperThrill




G2B6ZAlWkAAFlpC.jpg


17/37
@prpatel05
Claude devops team ready for traffic



18/37
@bentossell
in droid now



https://video.twimg.com/amplify_video/1972711802661122048/vid/avc1/1920x1080/2Sie_nt9zaTREYsg.mp4

19/37
@AliDTwitt
What's the max context window for non enterprise users?



20/37
@spyced
Here's how Sonnet 4.5 does at writing Java code: Brokk – AI for Large Codebases



21/37
@wesselvk
@Grok can you compare yourself against these stats?



22/37
@Yuchenj_UW
You trained a beast, folks.

[Quoted tweet]
Claude Sonnet 4.5 runs autonomously for 30+ hours of coding?!

The record for GPT-5-Codex was just 7 hours.

What’s Anthropic’s secret sauce?


G2B312YaIAMOIZy.png


23/37
@buildooor
if they're lucky



G2B8G4eWsAAPtMV.jpg


24/37
@AskChief_Leo
The 61.4% score on OSWorld benchmark is impressive! Have you had a chance to test its coding capabilities on any specific frameworks? I'm particularly curious about its performance with complex refactoring tasks.



25/37
@DrSohaibQadri
So basically I'm getting better than Opus performance for the same price as Sonnet.

What a day. Let's build 🏗️



26/37
@_akhaliq
Available in anycoder: Anycoder - a Hugging Face Space by akhaliq



G2ChOlTa4AAlLmS.jpg


27/37
@ozgrozer
Could have better

[Quoted tweet]
Tried the same 3D city prompt on Sonnet 4.5 Thinking but even Sonnet 3.7 was better. They somehow made it worse.


https://video.twimg.com/amplify_video/1972761169040474112/vid/avc1/1600x1080/huWbEIZeHDIsmgX9.mp4

28/37
@alexhavryleshko
Overall comparison with GPT-5



G2B99voWIAAexGX.jpg

G2B99vnWIAAyAS3.jpg

G2B99vkW8AA6Qf5.jpg

G2B99vjXoAEcXp3.jpg


29/37
@ApollonVisual




G2CCg6hWkAAQhtI.jpg


30/37
@MichaelFerro
Huge!

Chief Product Officer @MikeyK will be joining @TBPN today to discuss:

[Quoted tweet]
BREAKING: @AnthropicAI launches Claude Sonnet 4.5, the world’s leading coding model.

State-of-the-art on SWE-Bench Verified, with autonomous runs extended from 7 to 30 hours.

Chief Product Officer @MikeyK joins us at 11:45am PT to discuss the announcement.


G2B2939aIAQStQL.jpg


31/37
@AparupGanguly01
Been using it to build projects with @hyperbrowser and it’s crazy good!



https://video.twimg.com/amplify_video/1972760545322508288/vid/avc1/1460x1080/RTxFdnj0XyFpZld1.mp4

32/37
@jordi_cor
Great! I'm ready to use another 1 billion tokens to finish my app using Claude Code!😁



G2COs99XsAArAg2.png


33/37
@AZLabs_AI
Curious if those 30+ hour autonomous runs could shift how teams think about code ownership. Fewer handoffs between devs, or does debugging autonomous work create new bottlenecks?



34/37
@Chris_Brannigan
Terrific release 🙏 lots to review but the 30 hour complex task focus really jumps out. Reliability drives enterprise adoption and that would be an inflection point for agentic ai doing productive work way earlier than forecast



35/37
@roo7cause
So if I understand this correctly, we should be using Sonnet 4.5 instead of Opus 4.1 in our Claude Projects that are coding related? Or is Opus still better with larger context windows?



36/37
@PrimeSontiac
AI keeps improving. Is there no wall?



37/37
@Hrshgdkr




G2D-gIfXIAA-0JH.jpg



 

JoelB

All Praise To TMH
Joined
May 1, 2012
Messages
25,899
Reputation
5,349
Daps
92,898
Reppin
PHI 2 ATL


[/SPOILER]


been on it all evening. its a noticeable improvement.
Im using it in claude code terminal and the speed is ridiculous. I also got the invite to use their chrome plugin, but i aint have time to mess with it yet.

Sonnet 4.5 and GPT 5 Codex is all u need now...i tried using Gemini the other day, that shyt is ass
 

bnew

been on it all evening. its a noticeable improvement.
Im using it in claude code terminal and the speed is ridiculous. I also got the invite to use their chrome plugin, but i aint have time to mess with it yet.

Sonnet 4.5 and GPT 5 Codex is all u need now...i tried using Gemini the other day, that shyt is ass

what sort of tasks does gemini perform poorly on?
 

JoelB

what sort of tasks does gemini perform poorly on?

Bro I had it doing everything. I’m playing with GitHub Copilot pro so I’m testing out all the models. As a coder it’s mid. Then I read it has a super large context window so it would be good for analyzing large documents….trash!!!! Then it kept asking for confirmation wasting tokens after I repeatedly told it what to do. I think I was on Gemini 2.5

My best experience was Claude opus(which I already pay for separately) and GPT Codex which might be the best coder on the market right now. I heard good things about Grok but im not giving Elon my money.

Then sonnet 4.5 dropped and it’s doing EVERYTHING I need :noah:
 

Ethnic Vagina Finder

The Great Paper Chaser
Joined
May 4, 2012
Messages
56,471
Reputation
2,835
Daps
159,748
Reppin
North Jersey but I miss Cali :sadcam:
been on it all evening. its a noticeable improvement.
Im using it in claude code terminal and the speed is ridiculous. I also got the invite to use their chrome plugin, but i aint have time to mess with it yet.

Sonnet 4.5 and GPT 5 Codex is all u need now...i tried using Gemini the other day, that shyt is ass

Claude is nickel and diming people. I’m still using 4.0. When you move up, it kills your tokens .
 

JoelB

Claude is nickel and diming people. I’m still using 4.0. When you move up, it kills your tokens .

4.5 is the same price as 4 is it not? im using it the same way i normally do, and im not anywhere near reaching my usage limits. The only time i felt like Claude is taxing nikkas is when i used Opus...but right now sonnet 4.5 outperforms Opus 4.1

Which plan are you on?
 

Ethnic Vagina Finder

4.5 is the same price as 4 is it not? im using it the same way i normally do, and im not anywhere near reaching my usage limits. The only time i felt like Claude is taxing nikkas is when i used Opus...but right now sonnet 4.5 outperforms Opus 4.1

Which plan are you on?
I’m trying out 4.5 now, and I really don’t see the difference in terms of coding. Claude will always take the most complex path unless you are as specific as possible. It gave me a bunch of solutions to align text when all I had to do is add text-align: to the css class. Wasted tokens.

They trying to push people off that $20 a month plan. So tokens aren’t what they used to be. I ended up buying a second $20 plan and alternate.

Another thing I noticed is they give new users on plans way more leeway in the beginning.
 

JoelB

I’m trying out 4.5 now, and I really don’t see the difference in terms of coding. Claude will always take the most complex path unless you are as specific as possible. It gave me a bunch of solutions to align text when all I had to do is add text-align: to the css class. Wasted tokens.

They trying to push people off that $20 a month plan. So tokens aren’t what they used to be. I ended up buying a second $20 plan and alternate.

Another thing I noticed is they give new users on plans way more leeway in the beginning.
Im building a software to help automate my consulting workflow, sonnet 4 was so bad i stopped using it altogether and used Opus and GPT Codex. On 4.5 i dont have to hold its hand as much. Its more competent

i did run into a problem just now with 4.5 not loading client data bc it couldnt reference ClientSlug. I tried 3x...i went back to gpt codex and it solved it after about 15 mins :manny: . Its slower, but more thorough.

Also im noticing people on Reddit are complaining that Anthropic is tightening weekly usage now...you might wanna keep an eye on that after this update.
 

Ethnic Vagina Finder

Im building a software to help automate my consulting workflow, sonnet 4 was so bad i stopped using it altogether and used Opus and GPT Codex. On 4.5 i dont have to hold its hand as much. Its more competent

i did run into a problem just now with 4.5 not loading client data bc it couldnt reference ClientSlug. I tried 3x...i went back to gpt codex and it solved it after about 15 mins :manny: . Its slower, but more thorough.

Also im noticing people on Reddit are complaining that Anthropic is tightening weekly usage now...you might wanna keep an eye on that after this update.

The thing is you have to be detailed and decisive at what you want. And since I’m a non coder it helps me understand what is actually being written. It also helps me debug and what to look for.

It’s more than just typing “it doesn’t work, fix it”. I started thinking like an engineer.


But the limits have been a problem for a while now and I think it’s when they added the max plan. The same thing happened with the free version after they added the first paid version.
 

bnew

Im building a software to help automate my consulting workflow, sonnet 4 was so bad i stopped using it altogether and used Opus and GPT Codex. On 4.5 i dont have to hold its hand as much. Its more competent

i did run into a problem just now with 4.5 not loading client data bc it couldnt reference ClientSlug. I tried 3x...i went back to gpt codex and it solved it after about 15 mins :manny: . Its slower, but more thorough.

Also im noticing people on Reddit are complaining that Anthropic is tightening weekly usage now...you might wanna keep an eye on that after this update.

what language is it being written in? how many files or characters of code?
 