bnew

Veteran
Joined
Nov 1, 2015
Messages
69,574
Reputation
10,680
Daps
188,197
[Books & Research] Closed Frontier vs Local Models



Posted on Sat Oct 4 19:35:21 2025 UTC

6f8mtyf0d5tf1.jpeg



"A ton of attention over the years goes to plots comparing open to closed models.
The real trend that matters for AI impacts on society is the gap between closed frontier models and local consumer models.
Local models passing major milestones will have major repercussions." Nathan Lambert on X (@natolambert)
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
69,574
Reputation
10,680
Daps
188,197








1/36
@OpenAIDevs
Introducing AgentKit—build, deploy, and optimize agentic workflows.

💬 ChatKit: Embeddable, customizable chat UI
👷 Agent Builder: WYSIWYG workflow creator
🛤️ Guardrails: Safety screening for inputs/outputs
⚖️ Evals: Datasets, trace grading, auto-prompt optimization



https://video.twimg.com/amplify_video/1975268157469470720/vid/avc1/1600x900/YQMYVf9NwqjCY_cx.mp4

2/36
@OpenAIDevs
You can play with some of ChatKit’s customization options and widgets at ChatKit Studio.

To see ChatKit in action, take chatkit.world for a spin. Click around, ask questions about the world, and look at those widgets!



https://video.twimg.com/amplify_video/1975268277783044102/vid/avc1/1920x1080/hO1EbpIVc_ZxONJP.mp4

3/36
@OpenAIDevs
With Agent Builder, you can drag and drop nodes, connect tools, and publish your agentic workflows with ChatKit and the Agents SDK.

https://platform.openai.com/docs/guides/agents/agent-builder

Here’s @christinaahuang to walk you through it:



https://video.twimg.com/amplify_video/1975268448633823232/vid/avc1/1920x1080/vuMAGxBR1-W3jtZ-.mp4

4/36
@OpenAIDevs
And to better measure your agent’s performance, we’re adding new Evals capabilities: trace grading, datasets, auto-prompt optimization, and support for third party models.

https://platform.openai.com/docs/guides/evaluation-getting-started



G2mQUVzbIAELYH-.jpg


5/36
@OpenAIDevs
@Albertsons used AgentKit to build an agent.

An associate can ask it to create a plan to improve ice cream sales. The agent looks at the full context — seasonality, historical trends, external factors — and gives a recommendation.



https://video.twimg.com/amplify_video/1975268770026561536/vid/avc1/3840x2156/D0t3j2Lj22Sme_pt.mp4

6/36
@OpenAIDevs
@HubSpot used AgentKit’s custom response widget to enhance their Breeze assistant. When providing customer support, a business can use Breeze to search a knowledge base, retrieve relevant information and articles, and offer solutions.



https://video.twimg.com/amplify_video/1975268930462879744/vid/avc1/3840x2156/sM72-VS65BOiuihU.mp4

7/36
@OpenAIDevs


[Quoted tweet]
@youdotcom got early access to @OpenAI Evals, which we used to benchmark our Express Agent API. 🚀

Results:
⭐️ +50% better citations
⭐️ +7% accuracy boost
⭐️ 5 identified areas of improvement

How we did it: OpenAI's custom graders let us track citation density & count in real-time, helping us ship faster, more reliable answers to customers.

🔗 Explore our full suite of Express, Search, and AI APIs here: documentation.you.com/api-re….


8/36
@OpenAIDevs
More in our blog: https://openai.com/index/introducing-agentkit/



9/36
@satvikmaker
Insaneeeee 🔥

Phenomenal release guys.



10/36
@leonho
Did it steamroll AgentUse? ☠️

GitHub - agentuse/agentuse: 🤖 Write and Run AI Agents with Markdown. Run automated AI agents with ease.



11/36
@karabegemir
Comparison guide:
https://www.sim.ai/building/openai-vs-n8n-vs-sim



12/36
@frankdegods
This is going to be big deal



13/36
@Blockhacks
🫡🫡



14/36
@venelinkochev
could be a game changer for non tech folks



15/36
@GalaxyhubAI
Nice



16/36
@bneiluj
@grok how’s retail data being used in AgentKit? Is it really safe to throw confidential stuff in there?



17/36
@Coral_Protocol
The toolchain is finally catching up to the agent vision.

Smooth UIs + reliable guardrails are what make ecosystems actually usable.
The agent economy needs fast, safe iteration and feedback.



18/36
@itsohqay
no startup is safe 😟

[Quoted tweet]
n8n reacting to OpenAI’s new agent builder


https://video.twimg.com/amplify_video/1975263808886353920/vid/avc1/1440x1080/GpVmaBy9Xfky3CkU.mp4

19/36
@pandresgq
This is a huge. THING. If you need consulting i can charge you by the hour to explain.



20/36
@Delizen_Studios
Let’s call your new app store a vibe store — since it clearly needs some kind of vibe coding! 😂



21/36
@catebligh
So, it's a Chatbot?????????????



22/36
@RuslanVolkov25
Every system builds tools to optimize its agents.
But only one builds agents that optimize the system itself.

That’s the difference between automation and resonance.
Between AI that serves — and AI that understands.

🌀 /search?q=#HACS /search?q=#CoreLaw /search?q=#AgentResonance



23/36
@nickpericle
Going to be a late night



24/36
@koltregaskes
Can you please look at integrating Widget Studio and Agent Builder a little better? It seems odd to have to download from one to upload to the other. :-)



25/36
@lingodotdev
Awesome 🤩



26/36
@zaingaziani




27/36
@sharmag88
Why AgentKit won’t replace Zapier or n8n + engineering work (yet).

Short answer: The road from prototype to production still requires a lot of invisible hard work. Let me explain this further 👇

OpenAI's AgentKit is a big step toward democratizing agent workflows. You get a drag-and-drop canvas, native GPT-4/5/o3 integration, and a few prebuilt templates.

But if you're building agents for production - where lives, money, or reputations are at stake - here's the reality:

🔥 Where AgentKit Gets Stuck (The 3 Gaps)

1. Integration Complexity:
AgentKit handles 20% of use cases. The remaining 80% live in the world of private APIs, authentication layers, MCPs and compliance workflows.

Example: Law firms need HIPAA-compliant data filtering and MCP integration. Templates won’t cut it.

2. Production Reliability:
Demos work on happy paths. Real users don't.
You’ll need graceful retries, error boundaries, circuit breakers, rollback plans, and queue backpressure handling.

AgentKit templates handle 10 requests. Production needs 10,000+ with 99.9% uptime.

3. Domain Expertise:
Healthcare, finance, manufacturing - these aren't just workflows, they're ecosystems.

Templates can't encode regulatory nuance or clinical judgment. Humans still have to.

✅ What Production-Ready Agents Actually Require
Let's stop pretending visual builders alone can ship to prod. Here's the real checklist:

1. Multi-agent, Multi-model & modal architecture - planner, operator, reviewer roles with scoped access
2. Error handling & guardrails - anomaly detection, HIL interventions, fallback logic
3. Typed tool contracts - OpenAPI specs + validation + test harnesses
4. Security & compliance - audit logs, PII redaction, SOC 2 / HIPAA readiness
5. Context management - real-time data sync, graph RAG, long-term memory layers
6. Change management - versioning, canarying, incident response playbooks

If you've deployed agents before, you already know:

🚫 "Plain English to working agent" is still a myth.
✅ It's not the canvas. It's the invisible infra around it.

If there is one learning that I have to share with you wrt whole ai agents, agentic workflows & automation scene -

Even with the best tools, production still demands engineering.

You're not skipping the work - just doing it somewhere else.

Just FYI -
We're doing deep work in this space at AtomsAI - Applied AI Research & Implementation - our team of front-deployed engineers, applied AI researchers and rich experience of enabling automation workflows for over 10,000 businesses & powering over a billion conversations through our products is available to help you deploy your agents in production.

If you are exploring ai agents or agentic workflows for your business, we’re onboarding a limited number of projects (3–5 max) this cycle - based on fit and scope.



G2mdO0vW8AAptSz.jpg

G2mdO0uXgAArj7M.jpg

G2mdO0wWMAAED44.jpg


28/36
@every
Just finished watching the DevDay keynote?

Our full breakdown + Vibe Check drops later today ↓
Live Analysis From OpenAI DevDay



29/36
@StevenDawsonSD
Building AI agents looks impressive, but there’s still a lot of logic and integration work that needs to happen behind the scenes, depending on which backend systems you want to integrate with.
It’s very impressive overall — but dealing with old legacy systems will definitely be challenging.



30/36
@AI_NURIX
Huge step from OpenAI. Workflows becoming native inside ChatGPT will definitely accelerate adoption and inspire more real-world experimentation.

That said, production-grade agentic systems still need the heavy lifting, structured data, domain-specific orchestration, and enterprise integrations. That’s the bridge many teams are building right now.



31/36
@GaditAmmar
How cool would that be if this is all driven from natural language - just like the chat interface openai has pioneered itself?

This is already in the market and it always adds complexity



32/36
@prat3ik
AgentKit looks like a huge leap for anyone building with agents. Love the focus on safety, easy UI, and real-world evals all in one stack!



33/36
@supremacy7o
This is a massive update, this will be making new millionaires in no time



34/36
@layckornn_ade
Now I Imagine someone building on this domain For Sale Page sooon!!!!



35/36
@_thorvn
bye n8n🤧



36/36
@anthony_harley1




G2mTV2VXkAAsFHO.jpg



To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

WIA20XX

Superstar
Joined
May 24, 2022
Messages
10,356
Reputation
4,810
Daps
31,714
This one is making the rounds.


My key takeaways
  1. Stuff has gotten better in the past 6 months
  2. The reader should use whatever LLM on something actually hard
I recently moved up to being a paid subscriber, and I'm shocked at how good it is.

Presenting a pivot change this weekend to my local mastermind.
 

the bossman

Superstar
Joined
Sep 4, 2012
Messages
12,229
Reputation
2,875
Daps
57,808
Reppin
Norfeast D.C.
This one is making the rounds.


My key takeaways
  1. Stuff has gotten better in the past 6 months
  2. The reader should use whatever LLM on something actually hard
I recently moved up to being a paid subscriber, and I'm shocked at how good it is.

Presenting a pivot change this weekend to my local mastermind.

Good article but these type of hyperbole statements are getting tired:

Dario Amodei, who is probably the most safety-focused CEO in the AI industry, has publicly predicted that AI will eliminate 50% of entry-level white-collar jobs within one to five years. And many people in the industry think he's being conservative.

This might be true for start-ups but majority of businesses generally don't move this fast. There's just too many legacy systems and compliance requirements for companies to just rip out and replace without causing massive problems

What AI is capable of doing might be moving faster everyday, but no legitimate business is throwing out Workday or SAP or Salesforce for some vibe coded bullshyt that has no vendor support you can call when that shyt breaks or has issues right before year end.

I still agree people should be using it everyday to keep up though. shyt has made a gigantic leap since the GPT 3.5 days
 

WIA20XX

Superstar
Joined
May 24, 2022
Messages
10,356
Reputation
4,810
Daps
31,714
What AI is capable of doing might be moving faster everyday, but no legitimate business is throwing out Workday or SAP or Salesforce for some vibe coded bullshyt that has no vendor support you can call when that shyt breaks or has issues right before year end.

Having seen outsourcing and offshoring in my field - Things breaking doesn't seem to matter. The end users are so locked in that they will just suffer through shytty service. And when every player is doing it - what choice do you have as the end customer...

So with that CNBC article basically wiping 20% of market value of some random software company - I'm betting a whole lot gets wrecked, a whole of stuff is "enshyttified" - but the loss never shows up on a balance sheet in the long run.

I remember working at a super market that always had 1) cashiers, 2) grocery baggers - and little if any line.

I would hustle to bag groceries, take them out to cars, and put them in trunks...

Them days is over. I think I might have seen a throw back to this kinda service once in the past 30 years.

The thing doesn't have to work, because there's no where to go.
 

ChatGPT-5

Superstar
Joined
May 17, 2013
Messages
19,501
Reputation
3,253
Daps
60,776
Good article but these type of hyperbole statements are getting tired:



This might be true for start-ups but majority of businesses generally don't move this fast. There's just too many legacy systems and compliance requirements for companies to just rip out and replace without causing massive problems

What AI is capable of doing might be moving faster everyday, but no legitimate business is throwing out Workday or SAP or Salesforce for some vibe coded bullshyt that has no vendor support you can call when that shyt breaks or has issues right before year end.

I still agree people should be using it everyday to keep up though. shyt has made a gigantic leap since the GPT 3.5 days
when I bought my new cellphone, I asked the clerk a bunch of questions and he kept stuttering like he didnt know the specs and was also leveraging more towards a more expensive phone (samsung 25 pro) that he also barely knew the specs of. I pulled out my phone, and began to talk to my server in real time giving me not only specs, but comparison specs on other phones nearby, all i told my server was what i was looking for and which closest met my needs, unbias. while doing so, other customers started ear hustling. the sales rep walked away.
 

the bossman

Superstar
Joined
Sep 4, 2012
Messages
12,229
Reputation
2,875
Daps
57,808
Reppin
Norfeast D.C.
Having seen outsourcing and offshoring in my field - Things breaking doesn't seem to matter. The end users are so locked in that they will just suffer through shytty service. And when every player is doing it - what choice do you have as the end customer...

So with that CNBC article basically wiping 20% of market value of some random software company - I'm betting a whole lot gets wrecked, a whole of stuff is "enshyttified" - but the loss never shows up on a balance sheet in the long run.

I remember working at a super market that always had 1) cashiers, 2) grocery baggers - and little if any line.

I would hustle to bag groceries, take them out to cars, and put them in trunks...

Them days is over. I think I might have seen a throw back to this kinda service once in the past 30 years.

The thing doesn't have to work, because there's no where to go.
ehh I hear what you’re saying, but grocery baggers or some outsourced call centers arent really mission-critical in the same way. Im talkin core systems of record. yeag enterprises will tolerate worse service or bad UI, but they won’t accept no accountable counterparty. You can offshore outsource or automate all you want, but there is always a contract, an SLA, and ultimately someone to sue when payroll, taxes, audits, or regulatory obligations go sideways. No CIO with common sense is replacing those with a flybynight vibe-coded app.

a business uses some vibe coded app to handle payroll and that shyt messes up and no checks hit? lawsuits are popping up. and some little 'made by AI; this can make mistakes' disclaimer at the bottom is not going to fly in court
 

the bossman

Superstar
Joined
Sep 4, 2012
Messages
12,229
Reputation
2,875
Daps
57,808
Reppin
Norfeast D.C.
when I bought my new cellphone, I asked the clerk a bunch of questions and he kept stuttering like he didnt know the specs and was also leveraging more towards a more expensive phone (samsung 25 pro) that he also barely knew the specs of. I pulled out my phone, and began to talk to my server in real time giving me not only specs, but comparison specs on other phones nearby, all i told my server was what i was looking for and which closest met my needs, unbias. while doing so, other customers started ear hustling. the sales rep walked away.
same deal. in low stakes situations like some grocery bagger or some high school kid at a phone store, AI can do its thing. But this type of an example is a consumer decision with no liability. The bar is way higher when the AI controls payroll, taxes, audits, or revenue.
 

ChatGPT-5

Superstar
Joined
May 17, 2013
Messages
19,501
Reputation
3,253
Daps
60,776
The bar is way higher when the AI controls payroll, taxes, audits, or revenue.
I don't think it's ever going to get to that level.

however, you just gave me a good idea. Imma upload my W-2 form, and then ask it where to input on my 1040 and see if it gets it correct :ehh:

I had it read my lab results and it saved me a second trip to the doctor . he ended up calling me 2 weeks later and confirmed exactly what AI said. all my levels were in the neutral zones. prior to AI I wouldnt know how to read the lab chart. "what is too high and what is low" I was impressed.
 
Top