bnew

Veteran
Joined
Nov 1, 2015
Messages
68,651
Reputation
10,572
Daps
185,466



















1/19
@ChrisClickUp
The $3B feature Google couldn't stop:

OpenAI's record-breaking acquisition wasn't for an AI assistant:

They found a way to modify specific lines of code without full codebase access.

Here's why this threatens every company's software security:





2/19
@ChrisClickUp
Most code edits today require full codebase access.

You need to see the entire program to make targeted changes.

Think of it like needing to examine an entire car just to replace a faulty headlight.

But what if you could fix that headlight without seeing the rest of the car?



https://video.twimg.com/amplify_video/1920160422625488898/vid/avc1/1280x720/jtKJdydy-6dlE0yV.mp4

3/19
@ChrisClickUp
That's what OpenAI acquired with their $3B purchase of Windsurf (formerly Codeium).

Their tech lets AI modify specific code sections without seeing the entire codebase.

This changes everything about software security...





4/19
@ChrisClickUp
Imagine a locksmith who can replace just one tumbler in your lock without taking the whole thing apart.

Sounds convenient, right?

But what if that locksmith is a thief?

That's the double-edged sword of this technology:





5/19
@ChrisClickUp
The tech works by creating a "mental model" of how code functions.

It allows users to modify individual lines through natural language prompts.

No need to download or review the entire codebase.



https://video.twimg.com/amplify_video/1920160525029421056/vid/avc1/1280x720/dZeRjlmtmPgSFSWL.mp4

6/19
@ChrisClickUp
Traditional security models rely on "security through obscurity."

Companies keep their codebases private, assuming hackers need to see the whole system to exploit it.

That assumption just collapsed.

With Windsurf's technology, attackers could potentially launch attacks with minimal code exposure.



7/19
@ChrisClickUp
Why did OpenAI pay $3 billion – more than double Windsurf's previous $1.25 billion valuation?

Because it fundamentally changes software development:

• Faster code fixes without complex system understanding
• More accessible programming for non-experts
• Automated updates across massive codebases

But these benefits come with serious security risks:



https://video.twimg.com/amplify_video/1920160573872099328/vid/avc1/1280x720/l4tyQIOukpcnnaD8.mp4

8/19
@ChrisClickUp
Every software vulnerability becomes more dangerous.

Previously, hackers finding a security flaw still needed to understand the surrounding code to exploit it.

Now, AI could potentially generate working exploits from minimal information.

This creates powerful advantages for malicious actors.



https://video.twimg.com/amplify_video/1920160621590712321/vid/avc1/1280x720/cecxG1HQ3ytPBRXA.mp4

9/19
@ChrisClickUp
• Small code leaks become major vulnerabilities
• Legacy systems become easier to compromise
• Supply chain attacks grow more sophisticated
• Open-source contributions could hide malicious code

Google reportedly tried to acquire similar technologies but couldn't match OpenAI's move.

Why were they so desperate?



https://video.twimg.com/amplify_video/1920160664288702465/vid/avc1/1280x720/-QsvBItwtr2AV3sR.mp4

10/19
@ChrisClickUp
Because whoever controls this capability shapes the future of code security.

Before being acquired, Windsurf offered free usage for developers who supplied their API keys.

The company had also partnered with several Fortune 500 firms under strict NDAs.

The security implications are enormous:



https://video.twimg.com/amplify_video/1920160696157024256/vid/avc1/1280x720/6UxQaX2rkYVPodC9.mp4

11/19
@ChrisClickUp
According to industry analysts, Windsurf's technology represents one of the most significant shifts in software security in years.

Companies must now assume that small code snippets could compromise their entire systems.



https://video.twimg.com/amplify_video/1920160762116571136/vid/avc1/1280x720/Z7qPhEnM6XJtktFu.mp4

12/19
@ChrisClickUp
Traditional security practices like:
• Code obfuscation
• Limiting repository access
• Segmenting codebases

Are no longer sufficient protections.

What should companies do instead?



13/19
@ChrisClickUp
The most effective defense will be comprehensive runtime monitoring and behavioral analysis.

Since preventing code access becomes less effective, detecting unusual behavior becomes essential.

Companies should implement:
• Simulation environments for testing changes
• Automated static analysis
• Zero-trust architecture
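A toy sketch of what behavioral monitoring can look like in practice: baseline the event mix from known-good runs, then flag event types that spike or never appeared before. All names and thresholds here are illustrative, not a reference to any real monitoring product:

```python
from collections import Counter

def build_baseline(events):
    """Relative frequency of each event type in known-good traces."""
    counts = Counter(events)
    total = sum(counts.values())
    return {e: c / total for e, c in counts.items()}

def flag_anomalies(baseline, window, min_ratio=3.0):
    """Flag event types whose observed rate far exceeds the baseline,
    or that never appeared in the baseline at all."""
    counts = Counter(window)
    total = sum(counts.values())
    flagged = []
    for event, count in counts.items():
        observed = count / total
        expected = baseline.get(event, 0.0)
        if expected == 0.0 or observed / expected >= min_ratio:
            flagged.append(event)
    return sorted(flagged)

baseline = build_baseline(["read", "read", "write", "read", "net"])
suspicious = flag_anomalies(baseline, ["read", "exec", "net", "net", "net", "net"])
print(suspicious)  # new "exec" events and a spike in "net" traffic get flagged
```

Real systems would baseline far richer signals (syscalls, network flows, API call graphs), but the principle is the same: detect deviation from learned behavior rather than trying to hide the code.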



https://video.twimg.com/amplify_video/1920160809621336066/vid/avc1/1280x720/AfUv6WIuh8HrRgTY.mp4

14/19
@ChrisClickUp
This acquisition signals the beginning of a new arms race between AI-powered development and AI-powered security.

Companies that adapt quickly will thrive.

Those that cling to outdated security models will find themselves increasingly vulnerable.

The era of "security through obscurity" is officially over.



15/19
@ChrisClickUp
As AI continues to revolutionize code creation and modification, we're seeing a fundamental shift in how companies must approach security.

The old playbook of protecting your codebase is becoming obsolete.

What matters now is understanding how your code behaves when it runs - and identifying anomalies before they become breaches.



16/19
@ChrisClickUp
This shift requires a new generation of security tools designed specifically for the AI era.

Tools that can monitor behavior patterns in real-time.

Tools that can detect subtle code modifications that traditional security measures would miss.

Tools built by people who understand both AI and security at a fundamental level.



https://video.twimg.com/amplify_video/1920160883822804992/vid/avc1/1280x720/ARxoaJojQjvjfutd.mp4

17/19
@ChrisClickUp
That's why I've been obsessively tracking these developments in AI security for years.

Each breakthrough - from automated coding to this new surgical code modification capability - creates both possibilities and vulnerabilities.

By understanding where this technology is headed, we can build better protections and smarter systems.



18/19
@ChrisClickUp
Want to stay ahead of these emerging AI security threats and opportunities?

Follow me for weekly insights on AI developments that impact your business security.

I share practical strategies to protect your systems in this rapidly evolving landscape.



19/19
@ChrisClickUp
Video credits:
Deirdre Bosa - CNBC:
Y Combinator:
AI LABS:
Low Level:
TED:




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196







1/6
@kevinhou22
goodbye runbooks, hello /workflows 👋

[1/5] absolutely LOVING this new feature in windsurf. Some of my use cases so far:

- deploy my server to kubernetes
- generate PRs using my team's style
- get the recent error logs

The possibilities are endless. Here's how it works 🧵





2/6
@kevinhou22
[2/5] Windsurf rules already provide LLMs with guidance via persistent, reusable context at the prompt level.

Workflows extend this concept with:
- structured sequences of prompts at a per-step level
- chaining interconnected tasks / actions
- general enough to handle ambiguity





3/6
@kevinhou22
[3/5] It's super simple to set up a new /workflow:

1. Click "Customize" --> "Workflows"
2. Press "+Workflow" to create a new one
3. Add a series of steps that Windsurf can follow
4. Set a title & description

The best part is, you can write it all in text!
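The steps above produce a plain-text workflow file. A hypothetical example of what one might look like (the exact file format and step wording here are illustrative, not taken from Windsurf's docs):

```
title: deploy-to-kubernetes
description: Build the image and roll it out to the staging cluster
steps:
1. Run the test suite and stop if anything fails.
2. Build and tag the Docker image from the current branch.
3. Push the image to the team registry.
4. Apply the Kubernetes manifests and wait for the rollout to finish.
5. Post the deploy summary in the team channel.
```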



https://video.twimg.com/amplify_video/1920197094142521344/vid/avc1/3652x2160/a14EBqmsj3lhtuei.mp4

4/6
@kevinhou22
[4/5] You can also ask Windsurf to generate Workflows for you! This works particularly well for workflows involving a series of steps in a particular CLI tool.

Check it out:



https://video.twimg.com/amplify_video/1920197704044720129/vid/avc1/3696x2160/AyMiwDXJ28XE1Xy0.mp4

5/6
@kevinhou22
[5/5] To execute a workflow, users simply invoke it in Cascade using the /[workflow-name] command.

It's that easy


Check it out on @windsurf_ai v1.8.2 today ⬇️



6/6
@lyson_ober
Respect 🫡 Like it ❤️





bnew


Google Releases 76-Page Whitepaper on AI Agents: A Deep Technical Dive into Agentic RAG, Evaluation Frameworks, and Real-World Architectures​


By Sana Hassan

May 6, 2025

Google has published the second installment in its Agents Companion series—an in-depth 76-page whitepaper aimed at professionals developing advanced AI agent systems. Building on foundational concepts from the first release, this new edition focuses on operationalizing agents at scale, with specific emphasis on agent evaluation, multi-agent collaboration, and the evolution of Retrieval-Augmented Generation (RAG) into more adaptive, intelligent pipelines.

Agentic RAG: From Static Retrieval to Iterative Reasoning


At the center of this release is the evolution of RAG architectures. Traditional RAG pipelines typically involve static queries to vector stores followed by synthesis via large language models. However, this linear approach often fails in multi-perspective or multi-hop information retrieval.

Agentic RAG reframes the process by introducing autonomous retrieval agents that reason iteratively and adjust their behavior based on intermediate results. These agents improve retrieval precision and adaptability through:

  • Context-Aware Query Expansion: Agents reformulate search queries dynamically based on evolving task context.
  • Multi-Step Decomposition: Complex queries are broken into logical subtasks, each addressed in sequence.
  • Adaptive Source Selection: Instead of querying a fixed vector store, agents select optimal sources contextually.
  • Fact Verification: Dedicated evaluator agents validate retrieved content for consistency and grounding before synthesis.

The net result is a more intelligent RAG pipeline, capable of responding to nuanced information needs in high-stakes domains such as healthcare, legal compliance, and financial intelligence.
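The loop above can be sketched in a few lines of Python. The retriever, verifier, and synthesizer below are toy stand-ins, and none of these function names come from Google's whitepaper:

```python
def agentic_rag(question, sources, decompose, select_source, verify, synthesize):
    """Iteratively answer a question: decompose it into subtasks,
    pick the best source per subtask, and keep only verified evidence."""
    evidence = []
    for subtask in decompose(question):
        source = select_source(subtask, sources)
        passages = source(subtask)
        # A dedicated evaluator step filters ungrounded passages.
        evidence.extend(p for p in passages if verify(subtask, p))
    return synthesize(question, evidence)

# Toy components standing in for real retrievers and evaluator agents.
docs = {"capital": ["Paris is the capital of France."],
        "population": ["France has ~68 million people.", "unverified rumor"]}

answer = agentic_rag(
    question="capital and population of France",
    sources={"kb": lambda q: docs.get(q, [])},
    decompose=lambda q: ["capital", "population"],
    select_source=lambda task, s: s["kb"],
    verify=lambda task, passage: "rumor" not in passage,
    synthesize=lambda q, ev: " ".join(ev),
)
print(answer)
```

In a real system, each lambda would itself be an LLM-driven agent; the point is that retrieval becomes a loop with decision points rather than a single static query.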

Rigorous Evaluation of Agent Behavior


Evaluating the performance of AI agents requires a distinct methodology from that used for static LLM outputs. Google’s framework separates agent evaluation into three primary dimensions:

  1. Capability Assessment: Benchmarking the agent’s ability to follow instructions, plan, reason, and use tools. Tools like AgentBench, PlanBench, and BFCL are highlighted for this purpose.
  2. Trajectory and Tool Use Analysis: Instead of focusing solely on outcomes, developers are encouraged to trace the agent’s action sequence (trajectory) and compare it to expected behavior using precision, recall, and match-based metrics.
  3. Final Response Evaluation: Evaluation of the agent’s output through autoraters—LLMs acting as evaluators—and human-in-the-loop methods. This ensures that assessments include both objective metrics and human-judged qualities like helpfulness and tone.

This process enables observability across both the reasoning and execution layers of agents, which is critical for production deployments.
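Trajectory analysis in the second dimension boils down to comparing the tool-call sequence the agent actually took against a reference trajectory. A minimal sketch (the metric definitions are a common simplification, not the whitepaper's exact formulas):

```python
def trajectory_metrics(expected, actual):
    """Precision/recall over the set of tool calls, plus exact in-order match."""
    expected_set, actual_set = set(expected), set(actual)
    tp = len(expected_set & actual_set)
    precision = tp / len(actual_set) if actual_set else 0.0
    recall = tp / len(expected_set) if expected_set else 0.0
    return {"precision": precision, "recall": recall,
            "exact_match": expected == actual}

m = trajectory_metrics(
    expected=["search_flights", "check_weather", "book_ticket"],
    actual=["search_flights", "search_hotels", "book_ticket"],
)
print(m)  # precision 2/3, recall 2/3, exact_match False
```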

Scaling to Multi-Agent Architectures


As real-world systems grow in complexity, Google’s whitepaper emphasizes a shift toward multi-agent architectures, where specialized agents collaborate, communicate, and self-correct.

Key benefits include:

  • Modular Reasoning: Tasks are decomposed across planner, retriever, executor, and validator agents.
  • Fault Tolerance: Redundant checks and peer hand-offs increase system reliability.
  • Improved Scalability: Specialized agents can be independently scaled or replaced.

Evaluation strategies adapt accordingly. Developers must track not only final task success but also coordination quality, adherence to delegated plans, and agent utilization efficiency. Trajectory analysis remains the primary lens, extended across multiple agents for system-level evaluation.
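A rough sketch of how planner, retriever, executor, and validator roles might be wired together, with per-step retries standing in for fault tolerance. All components here are toy stand-ins, not Google's APIs:

```python
def run_pipeline(task, planner, retriever, executor, validator, max_rounds=3):
    """Planner decomposes the task, retriever gathers context, executor acts,
    validator checks; failed steps are retried up to max_rounds times."""
    results, trace = [], []
    for step in planner(task):
        for attempt in range(max_rounds):
            context = retriever(step)
            output = executor(step, context)
            trace.append((step, attempt))  # record trajectory for evaluation
            if validator(step, output):
                results.append(output)
                break
    return results, trace

# Toy agents; the executor fails validation once on step "b" to show a retry.
calls = {"b": 0}
def flaky_executor(step, context):
    if step == "b":
        calls["b"] += 1
        if calls["b"] == 1:
            return "??"  # first attempt produces invalid output
    return f"{step}:{context}"

results, trace = run_pipeline(
    task="demo",
    planner=lambda t: ["a", "b"],
    retriever=lambda step: "ctx",
    executor=flaky_executor,
    validator=lambda step, out: "?" not in out,
)
print(results)  # both steps eventually succeed; "b" needed two attempts
```

Note that the `trace` list is exactly the trajectory the evaluation section describes: it records which agent acted, when, and how often, so coordination quality can be scored after the fact.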

Real-World Applications: From Enterprise Automation to Automotive AI


The second half of the whitepaper focuses on real-world implementation patterns:

AgentSpace and NotebookLM Enterprise


Google’s AgentSpace is introduced as an enterprise-grade orchestration and governance platform for agent systems. It supports agent creation, deployment, and monitoring, incorporating Google Cloud’s security and IAM primitives. NotebookLM Enterprise, a research assistant framework, enables contextual summarization, multimodal interaction, and audio-based information synthesis.

Automotive AI Case Study


A highlight of the paper is a fully implemented multi-agent system within a connected vehicle context. Here, agents are designed for specialized tasks—navigation, messaging, media control, and user support—organized using design patterns such as:

  • Hierarchical Orchestration: Central agent routes tasks to domain experts.
  • Diamond Pattern: Responses are refined post-hoc by moderation agents.
  • Peer-to-Peer Handoff: Agents detect misclassification and reroute queries autonomously.
  • Collaborative Synthesis: Responses are merged across agents via a Response Mixer.
  • Adaptive Looping: Agents iteratively refine results until satisfactory outputs are achieved.

This modular design allows automotive systems to balance low-latency, on-device tasks (e.g., climate control) with more resource-intensive, cloud-based reasoning (e.g., restaurant recommendations).
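The hierarchical-orchestration and peer-handoff patterns from the list above can be illustrated in a few lines; the routing logic, expert names, and fallback behavior here are hypothetical, not taken from the case study:

```python
def orchestrate(utterance, classify, experts, fallback="support"):
    """Central agent routes to a domain expert; an expert that detects a
    misclassification declines (returns None) and the query is rerouted."""
    domain = classify(utterance)
    expert = experts.get(domain, experts[fallback])
    result = expert(utterance)
    if result is None:  # peer-to-peer handoff: expert declines the query
        result = experts[fallback](utterance)
    return result

experts = {
    "navigation": lambda u: f"route planned for: {u}" if "go to" in u else None,
    "media": lambda u: f"now playing: {u}",
    "support": lambda u: f"escalated to support: {u}",
}
print(orchestrate("go to the nearest cafe", lambda u: "navigation", experts))
print(orchestrate("play jazz", lambda u: "navigation", experts))  # misrouted, handed off
```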




Check out the Full Guide here.


 

bnew


Google Launches Gemini 2.5 Pro I/O: Outperforms GPT-4 in Coding, Supports Native Video Understanding and Leads WebDev Arena​


By Asif Razzaq

May 7, 2025

Just ahead of its annual I/O developer conference, Google has released an early preview of Gemini 2.5 Pro (I/O Edition), a substantial update to its flagship AI model focused on software development and multimodal reasoning and understanding. This latest version delivers marked improvements in coding accuracy, web application generation, and video-based understanding, placing it at the forefront of large model evaluation leaderboards.

With top rankings in LM Arena’s WebDev and Coding categories, Gemini 2.5 Pro I/O emerges as a serious contender in applied AI programming assistance and multimodal intelligence.

Leading in Web App Development: Top of WebDev Arena


The I/O Edition distinguishes itself in frontend software development, achieving the top spot on the WebDev Arena leaderboard, a benchmark based on human evaluation of generated web applications. Compared to its predecessor, the model improves by +147 Elo points, underscoring meaningful progress in quality and consistency.

Key capabilities include:

  • End-to-End Frontend Generation
    Gemini 2.5 Pro I/O generates complete browser-ready applications from a single prompt. Outputs include well-structured HTML, responsive CSS, and functional JavaScript—reducing the need for iterative prompts or post-processing.
  • High-Fidelity UI Generation
    The model interprets structured UI prompts with precision, producing readable and modular code components that are suitable for direct deployment or integration into existing codebases.
  • Consistency Across Modalities
    Outputs remain consistent across various frontend tasks, enabling developers to use the model for layout prototyping, styling, and even component-level rendering.

This makes Gemini particularly valuable in streamlining frontend workflows, from mockup to functional prototype.

General Coding Performance: Outpacing GPT-4 and Claude 3.7


Beyond web development, Gemini 2.5 Pro I/O shows strong general-purpose coding capabilities. It now ranks first in LM Arena’s coding benchmark, ahead of competitors such as GPT-4 and Claude 3.7 Sonnet.

Notable enhancements include:

  • Multi-Step Programming Support
    The model can perform chained tasks such as code refactoring, optimization, and cross-language translation with increased accuracy.
  • Improved Tool Use
    Google reports a reduction in tool-calling errors during internal testing—an important milestone for real-time development scenarios where tool invocation is tightly coupled with model output.
  • Structured Instructions via Vertex AI
    In enterprise environments, the model supports structured system instructions, giving teams greater control over execution flow, especially in multi-agent or workflow-based systems.

Together, these improvements make the I/O Edition a more reliable assistant for tasks that go beyond single-function completions—supporting real-world software development practices.

Native Video Understanding and Multimodal Contexts


In a notable leap toward generalist AI, Gemini 2.5 Pro I/O introduces built-in support for video understanding. The model scores 84.8% on the VideoMME benchmark, indicating robust performance in spatial-temporal reasoning tasks.

Key features include:

  • Direct Video-to-Structure Understanding
    Developers can feed video inputs into AI Studio and receive structured outputs—eliminating the need for manual intermediate steps or model switching.
  • Unified Multimodal Context Window
    The model accepts extended, multimodal sequences—text, image, and video—within a single context. This simplifies the development of cross-modal workflows where continuity and memory retention are essential.
  • Application Readiness
    Video understanding is integrated into AI Studio today, with extended capabilities available through Vertex AI, making the model immediately usable for enterprise-facing tools.

This makes Gemini suitable for a range of new use cases, from video content summarization and instructional QA to dynamic UI adaptation based on video feeds.

Deployment and Integration


Gemini 2.5 Pro I/O is now available across key Google platforms:

  • Google AI Studio: For interactive experimentation and rapid prototyping
  • Vertex AI: For enterprise-grade deployment with support for system-level configuration and tool use
  • Gemini App: For general access via natural language interfaces

While the model does not yet support fine-tuning, it accepts prompt-based customization and structured input/output, making it adaptable for task-specific pipelines without retraining.

Conclusion


Gemini 2.5 Pro I/O marks a significant step forward in making large language models practically useful for developers and enterprises alike. Its leadership on both WebDev and coding leaderboards, combined with native support for multimodal input, illustrates Google’s growing emphasis on real-world applicability.

Rather than focusing solely on raw language modeling benchmarks, this release prioritizes functional quality—offering developers structured, accurate, and context-aware outputs across a diverse range of tasks. With Gemini 2.5 Pro I/O, Google continues to shape the future of developer-centric AI systems.




Check out the Technical details and Try it here.


 

bnew


LLMs Can Now Talk in Real-Time with Minimal Latency: Chinese Researchers Release LLaMA-Omni2, a Scalable Modular Speech Language Model​


By Asif Razzaq

May 6, 2025

Researchers at the Institute of Computing Technology, Chinese Academy of Sciences, have introduced LLaMA-Omni2, a family of speech-capable large language models (SpeechLMs) now available on Hugging Face. This research introduces a modular framework that enables real-time spoken dialogue by integrating speech perception and synthesis with language understanding. Unlike earlier cascaded systems, LLaMA-Omni2 operates in an end-to-end pipeline while retaining modular interpretability and low training cost.

Overview of the LLaMA-Omni2 Architecture


LLaMA-Omni2 encompasses models ranging from 0.5B to 14B parameters, each built atop the Qwen2.5-Instruct series. The architecture consists of:

  • Speech Encoder: Utilizes Whisper-large-v3 to transform input speech into token-level acoustic representations.
  • Speech Adapter: Processes encoder outputs using a downsampling layer and a feed-forward network to align with the language model’s input space.
  • Core LLM: The Qwen2.5 models serve as the main reasoning engine.
  • Streaming TTS Decoder: Converts LLM outputs into speech tokens using an autoregressive Transformer and then generates mel spectrograms through a causal flow matching model inspired by CosyVoice2.

A gating mechanism fuses LLM hidden states with textual embeddings before speech synthesis, enhancing contextual fidelity in the generated audio.
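The gate itself is conceptually simple: an elementwise sigmoid decides, per dimension, how much of the LLM hidden state versus the text embedding reaches the decoder. A toy scalar-weight sketch of the idea (the real module is a learned layer; the weights below are made up):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate_fuse(hidden, embed, w_h, w_e, b):
    """Elementwise gate g = sigmoid(w_h*h + w_e*e + b);
    fused = g*h + (1-g)*e. Toy scalar weights per dimension."""
    fused = []
    for h, e, wh, we, bb in zip(hidden, embed, w_h, w_e, b):
        g = sigmoid(wh * h + we * e + bb)
        fused.append(g * h + (1.0 - g) * e)
    return fused

# With a saturated gate, the first dim keeps the hidden state (g ~ 1)
# and the second keeps the text embedding (g ~ 0).
print(gate_fuse([2.0, -1.0], [0.5, 0.5], [0.0, 0.0], [0.0, 0.0], [100.0, -100.0]))
```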



Streaming Generation with Read-Write Scheduling


The model adopts a read-write strategy to facilitate streaming output. Specifically, for every R tokens produced by the LLM, W speech tokens are generated. This enables synchronized textual and acoustic generation, minimizing latency without compromising fluency.

Empirical findings suggest that setting R = 3 and W = 10 provides a favorable trade-off between latency (~583 ms), alignment (ASR-WER: 3.26), and perceptual quality (UTMOS: 4.19).
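The read-write strategy can be sketched as a scheduler that alternates between reading R text tokens and writing up to W speech tokens for the chunk just read (the speech generator below is a toy stand-in for the TTS decoder):

```python
def read_write_schedule(text_tokens, speech_for, R=3, W=10):
    """After every R text tokens (and at end of sequence), emit up to W
    speech tokens for the newly read chunk, interleaving the modalities."""
    schedule = []
    for i in range(0, len(text_tokens), R):
        chunk = text_tokens[i:i + R]
        schedule.append(("read", chunk))
        schedule.append(("write", speech_for(chunk)[:W]))
    return schedule

# Toy speech generator: two pseudo speech tokens per text token.
sched = read_write_schedule(
    ["hel", "lo", "wor", "ld", "!"],
    speech_for=lambda chunk: [f"s({t})" for t in chunk for _ in range(2)],
)
for op, toks in sched:
    print(op, toks)
```

Because speech starts streaming after only R tokens of text instead of after the full response, latency stays low; raising W lets more audio accumulate per step, which is the quality-versus-delay trade-off the authors report.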

Training Approach


Despite achieving competitive performance, LLaMA-Omni2 is trained on a relatively compact corpus—200K multi-turn speech-to-speech dialogue samples. These samples are synthesized from instruction-following text datasets (Alpaca, UltraChat), with diverse input voices and a consistent output voice generated using FishSpeech and CosyVoice2 models.

Training is executed in two stages:

  • Stage I: Independently optimizes the speech-to-text and text-to-speech modules.
  • Stage II: Fine-tunes the speech-to-speech generation path, including the gating and autoregressive decoding components.

Benchmark Results


The models are evaluated on spoken question answering and speech instruction following tasks using both speech-to-text (S2T) and speech-to-speech (S2S) modes.

| Model | Llama Q (S2S) | Web Q (S2S) | GPT-4o Score | ASR-WER | Latency (ms) |
| --- | --- | --- | --- | --- | --- |
| GLM-4-Voice (9B) | 50.7 | 15.9 | 4.09 | 3.48 | 1562.8 |
| LLaMA-Omni (8B) | 49.0 | 23.7 | 3.52 | 3.67 | 346.7 |
| LLaMA-Omni2-7B | 60.7 | 31.3 | 4.15 | 3.26 | 582.9 |

The performance scales consistently with model size. Notably, LLaMA-Omni2-14B outperforms all baselines across tasks, even with substantially less training data than native SpeechLMs such as GLM-4-Voice.

Component Analyses


  • Gate Fusion Module: Removing the gating mechanism increases ASR-WER and reduces speech quality, confirming its role in aligning textual and contextual signals.
  • TTS Pretraining: Initializing the TTS model from Qwen2.5 and fine-tuning in a streaming setup yields the best performance. Training from scratch fails to converge effectively.
  • Read/Write Strategies: Adjusting the R:W ratio impacts latency and quality. Larger W improves UTMOS but at the cost of response delay.

Additionally, the study demonstrates that multi-turn dialogue data is more effective than single-turn data in training speech interaction capabilities, and that performance plateaus around 200K samples.

Conclusion


LLaMA-Omni2 demonstrates that high-quality, low-latency spoken interaction with LLMs is feasible without the need for extensive pretraining on massive speech corpora. By combining modular architecture with autoregressive streaming synthesis, the system offers a practical pathway for real-time speech applications.




Check out the Paper, Model on Hugging Face, and GitHub Page.


 

bnew


OpenAI Releases a Strategic Guide for Enterprise AI Adoption: Practical Lessons from the Field​


By Asif Razzaq

May 5, 2025

OpenAI has published a comprehensive 24-page document titled AI in the Enterprise, offering a pragmatic framework for organizations navigating the complexities of large-scale AI deployment. Rather than focusing on abstract theories, the report presents seven implementation strategies based on field-tested insights from collaborations with leading companies including Morgan Stanley, Klarna, Lowe’s, and Mercado Libre.

The document reads less like promotional material and more like an operational guidebook—emphasizing systematic evaluation, infrastructure readiness, and domain-specific integration.

1. Establish a Rigorous Evaluation Process


The first recommendation is to initiate AI adoption through well-defined evaluations (“evals”) that benchmark model performance against targeted use cases. Morgan Stanley applied this approach by assessing language translation, summarization, and knowledge retrieval in financial advisory contexts. The outcome was measurable: improved document access, reduced search latency, and broader AI adoption among advisors.

Evals not only validate models for deployment but also help refine workflows with empirical feedback loops, enhancing both safety and model alignment.
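A minimal eval harness in this spirit: score a model against labeled use cases and gate deployment on a pass-rate threshold. The stub model, cases, and 0.8 bar below are illustrative, not taken from OpenAI's report:

```python
def run_evals(model, cases, threshold=0.8):
    """Score a model on labeled cases; report the pass rate and
    whether the suite clears the deployment threshold."""
    passed = 0
    failures = []
    for prompt, expected in cases:
        output = model(prompt)
        if expected.lower() in output.lower():
            passed += 1
        else:
            failures.append(prompt)
    rate = passed / len(cases)
    return {"pass_rate": rate, "deployable": rate >= threshold, "failures": failures}

# Stub model standing in for a real API call.
stub = lambda p: {"summarize: q3 earnings": "Revenue grew 12% in Q3.",
                  "translate: bonjour": "hello"}.get(p, "")
report = run_evals(stub, [("summarize: q3 earnings", "12%"),
                          ("translate: bonjour", "hello"),
                          ("retrieve: policy doc", "policy")])
print(report)  # pass rate of 2/3, below a 0.8 deployment bar
```

Substring matching is a crude scorer; production evals typically use graded rubrics or LLM autoraters, but the feedback loop — run, measure, fix the failures, re-run — is the same.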

2. Integrate AI at the Product Layer


Rather than treating AI as an auxiliary function, the report stresses embedding it directly into user-facing experiences. For instance, Indeed utilized GPT-4o mini to personalize job matching, supplementing recommendations with contextual “why” statements. This increased user engagement and hiring success rates while maintaining cost-efficiency through fine-tuned, token-optimized models.

The key takeaway: model performance alone is insufficient—impact scales when AI is embedded into product logic and tailored to domain-specific needs.

3. Invest Early to Capture Compounding Returns


Klarna’s early investment in AI yielded substantial gains in operational efficiency. A GPT-powered assistant now handles two-thirds of support chats, reducing resolution times from 11 minutes to 2. The company also reports that 90% of employees are using AI in their workflows, a level of adoption that enables rapid iteration and organizational learning.

This illustrates how early engagement not only improves tooling but accelerates institutional adaptation and compound value capture.

4. Leverage Fine-Tuning for Contextual Precision


Generic models can deliver strong baselines, but domain adaptation often requires customization. Lowe’s achieved notable improvements in product search relevance by fine-tuning GPT models on their internal product data. The result: a 20% increase in tagging accuracy and a 60% improvement in error detection.

OpenAI highlights this approach as a low-latency pathway to achieve brand consistency, domain fluency, and efficiency across content generation and search tasks.

5. Empower Internal Experts, Not Just Technologists


BBVA exemplifies a decentralized AI adoption model by enabling non-technical employees to build custom GPT-based tools. In just five months, over 2,900 internal GPTs were created, addressing legal, compliance, and customer service needs without requiring engineering support.

This bottom-up strategy empowers subject-matter experts to iterate directly on their workflows, yielding more relevant solutions and reducing development cycles.

6. Streamline Developer Workflows with Dedicated Platforms


Engineering bandwidth remains a bottleneck in many organizations. Mercado Libre addressed this by building Verdi, a platform powered by GPT-4o mini, enabling 17,000 developers to prototype and deploy AI applications using natural language interfaces. The system integrates guardrails, APIs, and reusable components—allowing faster, standardized development.

The platform now supports high-value functions such as fraud detection, multilingual translation, and automated content tagging, demonstrating how internal infrastructure can accelerate AI velocity.

7. Automate Deliberately and Systematically


OpenAI emphasizes setting clear automation targets. Internally, they developed an automation platform that integrates with tools like Gmail to draft support responses and trigger actions. This system now handles hundreds of thousands of tasks monthly, reducing manual workload and enhancing responsiveness.

Their broader vision includes Operator, a browser agent capable of autonomously interacting with web-based interfaces to complete multi-step processes—signaling a move toward agent-based, API-free automation.

Final Observations


The report concludes with a central theme: effective AI adoption requires iterative deployment, cross-functional alignment, and a willingness to refine strategies through experimentation. While the examples are enterprise-scale, the core principles—starting with evals, integrating deeply, and customizing with context—are broadly applicable.

Security and data governance are also addressed explicitly. OpenAI reiterates that enterprise data is not used for training, offers SOC 2 and CSA STAR compliance, and provides granular access control for regulated environments.

In an increasingly AI-driven landscape, OpenAI’s guide serves as both a mirror and a map—reflecting current best practices and helping enterprises chart a more structured, sustainable path forward.




Check out the Full Guide here.


 

bnew


This AI Paper Introduce WebThinker: A Deep Research Agent that Empowers Large Reasoning Models (LRMs) for Autonomous Search and Report Generation​


By Sajjad Ansari

May 6, 2025

Large reasoning models (LRMs) have shown impressive capabilities in mathematics, coding, and scientific reasoning. However, they face significant limitations when addressing complex information-research needs using internal knowledge alone: they struggle to conduct thorough web information retrieval and to generate accurate scientific reports through multi-step reasoning. Deep integration of LRMs’ reasoning capabilities with web information exploration is therefore a practical demand, and it has prompted a series of deep-research initiatives. However, existing open-source deep search agents use RAG techniques with rigid, predefined workflows, restricting LRMs’ ability to explore deeper web information and hindering effective interaction between LRMs and search engines.

LRMs like OpenAI-o1, Qwen-QwQ, and DeepSeek-R1 enhance performance through extended reasoning capabilities. Various strategies have been proposed to achieve advanced reasoning capabilities, including intentional errors in reasoning during training, distilled training data, and reinforcement learning approaches to develop long chain-of-thought abilities. However, these methods are fundamentally limited by their static, parameterized architectures that lack access to external world knowledge. RAG integrates retrieval mechanisms with generative models, enabling access to external knowledge. Recent advances span multiple dimensions, including retrieval necessity, query reformulation, document compression, denoising, and instruction-following.

Researchers from Renmin University of China, BAAI, and Huawei Poisson Lab have proposed a deep research agent called WebThinker that empowers LRMs to autonomously search the web, navigate web pages, and draft research reports during the reasoning process. WebThinker introduces a Deep Web Explorer module that enables LRMs to dynamically search, navigate, and extract information from the web when they encounter knowledge gaps. It employs an Autonomous Think-Search-and-Draft strategy, allowing models to smoothly combine reasoning, information gathering, and report writing in real time. Moreover, an RL-based training strategy is implemented to enhance research tool utilization through iterative online Direct Preference Optimization.
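The Think-Search-and-Draft strategy can be sketched as a loop that interleaves reasoning, web search, and report drafting. The reasoner, searcher, and drafter below are toy stand-ins for the LRM and its tools, not WebThinker's actual code:

```python
def think_search_draft(question, reason, search, draft, max_steps=5):
    """Interleave reasoning with web search: when the model flags a
    knowledge gap, search fills it; the draft grows as evidence arrives."""
    notes, report = [], []
    thought = reason(question, notes)
    for _ in range(max_steps):
        if thought.get("gap"):                 # knowledge gap -> explore the web
            notes.append(search(thought["gap"]))
        report.append(draft(thought, notes))   # draft sections as we go
        if thought.get("done"):
            break
        thought = reason(question, notes)
    return "\n".join(report)

# Toy reasoner: flags one knowledge gap, then finishes.
state = {"step": 0}
def toy_reason(question, notes):
    state["step"] += 1
    if state["step"] == 1:
        return {"gap": "latest GDP figures"}
    return {"done": True, "answer": f"answered with {len(notes)} sources"}

report = think_search_draft(
    "Summarize the economy",
    reason=toy_reason,
    search=lambda q: f"result for '{q}'",
    draft=lambda thought, notes: thought.get("answer", f"gathering: {notes[-1]}"),
)
print(report)
```

The key difference from a fixed RAG workflow is that the model itself decides when and what to search, so the number and order of tool calls vary per question instead of following a predefined pipeline.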



The WebThinker framework operates in two primary modes: Problem-Solving Mode and Report Generation Mode. In Problem-Solving Mode, WebThinker addresses complex tasks using the Deep Web Explorer tool, which the LRM can invoke during reasoning. In Report Generation Mode, the LRM autonomously produces detailed reports and employs an assistant LLM to implement report-writing tools. To improve LRMs’ use of research tools via RL, WebThinker generates diverse reasoning trajectories by applying its framework to an extensive set of complex reasoning and report generation datasets, including SuperGPQA, WebWalkerQA, OpenThoughts, NaturalReasoning, NuminaMath, and Glaive. For each query, the initial LRM produces multiple distinct trajectories.
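The trajectory-selection details are not spelled out above, so the following is only a toy sketch of how preference pairs for iterative online Direct Preference Optimization might be built from the multiple trajectories sampled per query, under assumed scoring criteria (answer correctness first, then tool-call efficiency); `Trajectory` and `build_preference_pair` are illustrative names, not WebThinker's actual code:

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    """One sampled reasoning trajectory for a query (hypothetical structure)."""
    text: str
    correct: bool      # did the final answer pass the evaluator?
    tool_calls: int    # how many search/navigate invocations it used

def build_preference_pair(trajectories):
    """Pick a (chosen, rejected) pair for DPO: prefer correct answers,
    and among correct ones prefer fewer tool calls (more efficient research)."""
    ranked = sorted(trajectories, key=lambda t: (not t.correct, t.tool_calls))
    chosen, rejected = ranked[0], ranked[-1]
    if chosen.correct == rejected.correct and chosen.tool_calls == rejected.tool_calls:
        return None  # no learning signal in this group of trajectories
    return chosen, rejected
```

With pairs like these, each DPO round would push the policy toward correct, tool-efficient research behavior.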

The WebThinker-32B-Base model outperforms prior methods like Search-o1 across all benchmarks on complex problem-solving, with a 22.9% improvement on WebWalkerQA and 20.4% on HLE. On scientific report generation tasks, WebThinker achieves the highest overall score of 8.0, surpassing RAG baselines and advanced deep research systems such as Gemini Deep Research (7.9). Its adaptability across different LRM backbones is also notable, with R1-based WebThinker models outperforming direct reasoning and standard RAG baselines. With the DeepSeek-R1-7B backbone, it achieves relative improvements of 174.4% on GAIA and 422.6% on WebWalkerQA compared to direct generation, and 82.9% on GAIA and 161.3% on WebWalkerQA over standard RAG implementations.

In conclusion, researchers introduced WebThinker, which provides LRMs with deep research capabilities, addressing their limitations in knowledge-intensive real-world tasks such as complex reasoning and scientific report generation. The framework enables LRMs to autonomously explore the web and produce comprehensive outputs through continuous reasoning processes. The findings highlight WebThinker’s potential to advance the deep research capabilities of LRMs, creating more powerful intelligent systems capable of addressing complex real-world challenges. Future work includes incorporating multimodal reasoning capabilities, exploring advanced tool learning mechanisms, and investigating GUI-based web exploration.




Check out the Paper.


 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,651
Reputation
10,572
Daps
185,466

A Coding Guide to Compare Three Stability AI Diffusion Models (v1.5, v2-Base & SD3-Medium) Diffusion Capabilities Side-by-Side in Google Colab Using Gradio​


By Nikhil

May 5, 2025

In this hands-on tutorial, we’ll unlock the creative potential of Stability AI’s industry-leading diffusion models (Stable Diffusion v1.5, Stable Diffusion v2-base, and the cutting-edge Stable Diffusion 3 Medium) to generate eye-catching imagery. Running entirely in Google Colab with a Gradio interface, we’ll experience side-by-side comparisons of three powerful pipelines, rapid prompt iteration, and seamless GPU-accelerated inference. Whether you’re a marketer looking to elevate your brand’s visual narrative or a developer eager to prototype AI-driven content workflows, this tutorial showcases how Stability AI’s open-source models can be deployed instantly and at no infrastructure cost, allowing you to focus on storytelling, engagement, and driving real-world results.

We install the huggingface_hub library and then import and invoke the notebook_login() function, which prompts you to authenticate your notebook session with your Hugging Face account, allowing you to seamlessly access and manage models, datasets, and other hub resources.

We first force-uninstall any existing torchvision to clear potential conflicts, then reinstall torch and torchvision from the CUDA 11.8–compatible PyTorch wheels, and finally upgrade key libraries (diffusers, transformers, accelerate, safetensors, gradio, and pillow) to ensure we have the latest versions for building and running GPU-accelerated generative pipelines and web demos.

We import PyTorch alongside both the Stable Diffusion v1 and v3 pipelines from the Diffusers library, as well as Gradio for building interactive demos. We then check for CUDA availability and set the device variable to “cuda” if a GPU is present; otherwise, we fall back to “cpu”, ensuring the models run on the optimal hardware.

We load the Stable Diffusion v1.5 model in half-precision (float16) without the built-in safety checker, transfer it to the selected device (GPU, if available), and then enable attention slicing to reduce peak VRAM usage during image generation.

We load the Stable Diffusion v2 “base” model in 16-bit precision without the default safety filter, transfer it to the chosen device, and activate attention slicing to optimize memory usage during inference.

We pull in Stability AI’s Stable Diffusion 3 “medium” checkpoint in 16-bit precision (skipping the built-in safety checker), transfer it to the selected device, and enable attention slicing to reduce GPU memory usage during generation.
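The three loading cells above share one pattern; here is a minimal, dependency-free sketch of that pattern (the real notebook calls diffusers classes directly, and `safety_checker=None` is only accepted by the v1/v2-style pipelines, hence the fallback):

```python
def load_pipeline(pipeline_cls, model_id, device="cuda", torch_dtype=None):
    """Load a diffusion pipeline in reduced precision, skip the safety checker
    where the pipeline accepts that argument, move it to the target device,
    and enable attention slicing to lower peak VRAM (sketch of the cells above)."""
    kwargs = {}
    if torch_dtype is not None:
        kwargs["torch_dtype"] = torch_dtype
    try:
        pipe = pipeline_cls.from_pretrained(model_id, safety_checker=None, **kwargs)
    except TypeError:
        # SD3-style pipelines don't take a safety_checker argument
        pipe = pipeline_cls.from_pretrained(model_id, **kwargs)
    pipe = pipe.to(device)
    if hasattr(pipe, "enable_attention_slicing"):
        pipe.enable_attention_slicing()
    return pipe
```

With diffusers installed, the calls would look roughly like `load_pipeline(StableDiffusionPipeline, "runwayml/stable-diffusion-v1-5", device, torch.float16)`; the class names and Hub IDs here are the commonly used ones, but treat them as assumptions rather than the notebook's exact identifiers.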

Now, this function runs the same text prompt through all three loaded pipelines (pipe1, pipe2, pipe3) using the specified inference steps and guidance scale, then returns the first image from each, making it perfect for comparing outputs across Stable Diffusion v1.5, v2-base, and v3-medium.
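A sketch of that comparison function, assuming only that each pipeline is callable with the standard diffusers keywords and returns an object carrying an `images` list:

```python
def compare_models(prompt, pipelines, steps=30, guidance_scale=7.5):
    """Run the same prompt through each pipeline and collect the first
    generated image from each, for side-by-side comparison. The keyword
    names num_inference_steps and guidance_scale follow the usual
    diffusers call signature; the helper itself is pipeline-agnostic."""
    results = []
    for pipe in pipelines:
        out = pipe(prompt, num_inference_steps=steps, guidance_scale=guidance_scale)
        results.append(out.images[0])
    return results
```

Called as `compare_models("a watercolor fox", [pipe1, pipe2, pipe3])`, it returns one image per model, ready to hand to the Gradio gallery.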

Finally, this Gradio app builds a three-column UI where you can enter a text prompt, adjust inference steps and guidance scale, then generate and display images from SD v1.5, v2-base, and v3-medium side by side. It also features a radio selector, allowing you to select your preferred model output, and displays a simple confirmation message when a choice is made.

A web interface to compare the three Stability AI models’ output

In conclusion, by integrating Stability AI’s state-of-the-art diffusion architectures into an easy-to-use Gradio app, you’ve seen how effortlessly you can prototype, compare, and deploy stunning visuals that resonate on today’s platforms. From A/B-testing creative directions to automating campaign assets at scale, Stability AI provides the performance, flexibility, and vibrant community support to transform your content pipeline.




Check out the Colab Notebook.


 


NVIDIA Open Sources Parakeet TDT 0.6B: Achieving a New Standard for Automatic Speech Recognition ASR and Transcribes an Hour of Audio in One Second​


By Asif Razzaq

May 5, 2025

NVIDIA has unveiled Parakeet TDT 0.6B, a state-of-the-art automatic speech recognition (ASR) model that is now fully open-sourced on Hugging Face. With 600 million parameters, a commercially permissive CC-BY-4.0 license, and a staggering real-time factor (RTF) of 3386, this model sets a new benchmark for performance and accessibility in speech AI.

Blazing Speed and Accuracy


At the heart of Parakeet TDT 0.6B’s appeal is its unmatched speed and transcription quality. The model can transcribe 60 minutes of audio in just one second, a performance that’s over 50x faster than many existing open ASR models. On Hugging Face’s Open ASR Leaderboard, Parakeet V2 achieves a 6.05% word error rate (WER)—the best-in-class among open models.

This performance represents a significant leap forward for enterprise-grade speech applications, including real-time transcription, voice-based analytics, call center intelligence, and audio content indexing.

Technical Overview


Parakeet TDT 0.6B builds on a transformer-based architecture fine-tuned with high-quality transcription data and optimized for inference on NVIDIA hardware. Here are the key highlights:

  • 600M-parameter encoder-decoder model
  • Quantized and fused kernels for maximum inference efficiency
  • Optimized for the TDT (Token-and-Duration Transducer) architecture
  • Supports accurate timestamp formatting, numerical formatting, and punctuation restoration
  • Pioneers song-to-lyrics transcription, a rare capability in ASR models

The model’s high-speed inference is powered by NVIDIA’s TensorRT and FP8 quantization, enabling it to reach a real-time factor of RTF = 3386, meaning it processes audio 3386 times faster than real time.
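As a sanity check on the arithmetic, the quoted figure is a throughput-style RTF (seconds of audio processed per second of wall-clock time):

```python
def real_time_factor(audio_seconds, processing_seconds):
    """RTF as used in the article: how many seconds of audio are
    transcribed per second of wall-clock time."""
    return audio_seconds / processing_seconds

# One hour of audio transcribed in roughly 1.06 s yields the quoted figure.
hour = 60 * 60
wall = hour / 3386
print(round(real_time_factor(hour, wall)))  # → 3386
```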

Benchmark Leadership


On the Hugging Face Open ASR Leaderboard —a standardized benchmark for evaluating speech models across public datasets—Parakeet TDT 0.6B leads with the lowest WER recorded among open-source models . This positions it well above comparable models like Whisper from OpenAI and other community-driven efforts.

Data based on May 5 2025

This performance makes Parakeet V2 not only a leader in quality but also in deployment readiness for latency-sensitive applications.

Beyond Conventional Transcription


Parakeet is not just about speed and word error rate. NVIDIA has embedded unique capabilities into the model:

  • Song-to-lyrics transcription: unlocks transcription for sung content, expanding use cases into music indexing and media platforms.
  • Numerical and timestamp formatting: improves readability and usability in structured contexts like meeting notes, legal transcripts, and health records.
  • Punctuation restoration: enhances natural readability for downstream NLP applications.

These features elevate the quality of transcripts and reduce the burden on post-processing or human editing, especially in enterprise-grade deployments.

Strategic Implications


The release of Parakeet TDT 0.6B represents another step in NVIDIA’s strategic investment in AI infrastructure and open ecosystem leadership. With strong momentum in foundational models (e.g., Nemotron for language and BioNeMo for protein design), NVIDIA is positioning itself as a full-stack AI company—from GPUs to state-of-the-art models.

For the AI developer community, this open release could become the new foundation for building speech interfaces in everything from smart devices and virtual assistants to multimodal AI agents.

Getting Started


Parakeet TDT 0.6B is available now on Hugging Face, complete with model weights, tokenizer, and inference scripts. It runs optimally on NVIDIA GPUs with TensorRT, but support is also available for CPU environments with reduced throughput.

Whether you’re building transcription services, annotating massive audio datasets, or integrating voice into your product, Parakeet TDT 0.6B offers a compelling open-source alternative to commercial APIs.




Check out the Model on Hugging Face.


 


The Google Gemini generative AI logo on a smartphone.

Image Credits: Andrey Rudakov/Bloomberg / Getty Images

AI

Google launches ‘implicit caching’ to make accessing its latest AI models cheaper​


Kyle Wiggers

11:20 AM PDT · May 8, 2025

Google is rolling out a feature in its Gemini API that the company claims will make its latest AI models cheaper for third-party developers.

Google calls the feature “implicit caching” and says it can deliver 75% savings on “repetitive context” passed to models via the Gemini API. It supports Google’s Gemini 2.5 Pro and 2.5 Flash models.

That’s likely to be welcome news to developers as the cost of using frontier models continues to grow.

Caching, a widely adopted practice in the AI industry, reuses frequently accessed or pre-computed data from models to cut down on computing requirements and cost. For example, caches can store answers to questions users often ask of a model, eliminating the need for the model to re-create answers to the same request.

Google previously offered model prompt caching, but only explicit prompt caching, meaning devs had to define their highest-frequency prompts. While cost savings were supposed to be guaranteed, explicit prompt caching typically involved a lot of manual work.

Some developers weren’t pleased with how Google’s explicit caching implementation worked for Gemini 2.5 Pro, which they said could cause surprisingly large API bills. Complaints reached a fever pitch in the past week, prompting the Gemini team to apologize and pledge to make changes.

In contrast to explicit caching, implicit caching is automatic. Enabled by default for Gemini 2.5 models, it passes on cost savings if a Gemini API request to a model hits a cache.

“[W]hen you send a request to one of the Gemini 2.5 models, if the request shares a common prefix as one of previous requests, then it’s eligible for a cache hit,” explained Google in a blog post. “We will dynamically pass cost savings back to you.”

The minimum prompt token count for implicit caching is 1,024 for 2.5 Flash and 2,048 for 2.5 Pro, according to Google’s developer documentation, which is not a terribly big amount, meaning it shouldn’t take much to trigger these automatic savings. Tokens are the raw bits of data models work with, with a thousand tokens equivalent to about 750 words.

Given that Google’s previous claims of cost savings from caching drew pushback, there are some buyer-beware areas in this new feature. For one, Google recommends that developers keep repetitive context at the beginning of requests to increase the chances of implicit cache hits. Context that might change from request to request should be appended at the end, the company says.

For another, Google didn’t offer any third-party verification that the new implicit caching system would deliver the promised automatic savings. So we’ll have to see what early adopters say.
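The ordering advice can be made concrete with a small, illustrative helper (not a Google API): put the stable context first, append the per-request part last, and check the static prefix against the documented minimum token counts, using the article’s rough 1,000-tokens-per-750-words conversion. The dictionary keys below are illustrative model names, and `estimate_tokens` is only a rule-of-thumb estimate:

```python
# Token thresholds from Google's developer docs; keys are illustrative names.
MIN_CACHE_TOKENS = {"gemini-2.5-flash": 1024, "gemini-2.5-pro": 2048}

def estimate_tokens(text):
    """Very rough token estimate: ~1,000 tokens per 750 words, i.e. 4/3 per word."""
    return round(len(text.split()) * 4 / 3)

def build_prompt(static_context, dynamic_query, model="gemini-2.5-pro"):
    """Order the prompt as Google recommends for implicit cache hits:
    stable, repetitive context first, request-specific content last.
    Returns the prompt and whether the shared prefix is long enough
    to be cache-eligible for the given model."""
    prompt = static_context + "\n\n" + dynamic_query
    eligible = estimate_tokens(static_context) >= MIN_CACHE_TOKENS[model]
    return prompt, eligible
```

Because implicit caching matches on a common prefix, keeping the static block byte-identical across requests is what actually triggers the savings.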
 

"Researchers are pushing beyond chain-of-thought prompting to new cognitive techniques"



Posted on Fri May 9 14:05:45 2025 UTC

/r/singularity/comments/1kijbzo/researchers_are_pushing_beyond_chainofthought/

Is AI's Race for Bigger Models Over? Explore the Shift Towards Human-Like Reasoning in AI Development

"Getting models to reason flexibly across a wide range of tasks may require a more fundamental shift, says the University of Waterloo’s Grossmann. Last November, he coauthored a paper (https://arxiv.org/abs/2411.02478) with leading AI researchers highlighting the need to imbue models with metacognition, which they describe as “the ability to reflect on and regulate one’s thought processes.”

Today’s models are “professional bullshyt generators,” says Grossmann, that come up with a best guess to any question without the capacity to recognize or communicate their uncertainty. They are also bad at adapting responses to specific contexts or considering diverse perspectives, things humans do naturally. Providing models with these kinds of metacognitive capabilities will not only improve performance but will also make it easier to follow their reasoning processes, says Grossmann."

Imagining and building wise machines: The centrality of AI metacognition

"Although AI has become increasingly smart, its wisdom has not kept pace. In this article, we examine what is known about human wisdom and sketch a vision of its AI counterpart. We analyze human wisdom as a set of strategies for solving intractable problems-those outside the scope of analytic techniques-including both object-level strategies like heuristics [for managing problems] and metacognitive strategies like intellectual humility, perspective-taking, or context-adaptability [for managing object-level strategies]. We argue that AI systems particularly struggle with metacognition; improved metacognition would lead to AI more robust to novel environments, explainable to users, cooperative with others, and safer in risking fewer misaligned goals with human users. We discuss how wise AI might be benchmarked, trained, and implemented."
 

I’d like to share that we’re introducing the latest 3D foundation AI model AssetGen 2.0, which was designed to create high-quality 3D assets from text and image prompts.



Posted on Sat May 10 19:02:10 2025 UTC










1/14
@Dilmerv
Hello AI enthusiasts!👋

I’d like to share that we’re introducing the latest 3D foundation AI model AssetGen 2.0, which was designed to create high-quality 3D assets from text and image prompts.

💡AssetGen 2.0 consists of 2 models: one to generate the 3D Mesh, & a second one to generate textures.

ℹ️ Technological Advancements:

- Utilizes a single-stage 3D diffusion model for geometry estimation, leading to improved detail and fidelity compared to its predecessor, AssetGen 1.0.
- TextureGen introduces methods for enhanced view consistency, texture in-painting, and higher texture resolution.

📌 Current Use and Future Plans:

- Currently employed internally for creating 3D worlds.
- Planned rollout to Horizon creators later this year.

👉 More details about this announcement: Introducing Meta 3D AssetGen 2.0: A new foundation model for 3D content creation

#Meta #AI #GenAI



https://video.twimg.com/amplify_video/1921245903429718016/vid/avc1/1920x1080/8vkL652UPAlSJdJr.mp4

2/14
@Dilmerv
@boztank AssetGen 2.0 🚀



3/14
@jmdagdelen
What are you guys introducing? There is no paper, no code, no API. Just a blog post.



4/14
@Dilmerv
Right for now, but is to keep the community informed about what we are working on. Meta 3D AssetGen is the current version, more info coming soon.



5/14
@javierdavalos
Open it up for 3rd party devs, I want it 👍



6/14
@Dilmerv
Would love that personally, I pass this feedback!



7/14
@hermes_f
This looks amazing! Can’t wait to try it out!



8/14
@Dilmerv
Great to know Hermes! I need to get access too to test it out!



9/14
@realdowd
When was the last time you used this off work hours?



10/14
@Dilmerv
I haven’t yet but I plan to once it becomes available, specially for a few prototypes I have in mind.



11/14
@pavtalk
Open source?



12/14
@mpvprb
As it becomes increasingly effortless to create stuff like this, their economic value will approach zero



13/14
@otri
Looks great. I’d love to employ this in our Digital Dojo! However, it really must be open source (MIT or at least Apache 2.0) to gain traction.



14/14
@MrLiamKelly
This is awful news for the games industry. :(





 


Elon Musk’s apparent power play at the Copyright Office completely backfired​


Ripping off content to train AI wasn’t going to fly with either MAGA populists or MAGA media.
by Tina Nguyen

Updated May 14, 2025, 10:02 AM EDT


Library Of Congress At Sunset In Washington DC

Photo by Kevin Carter / Getty Images

Tina Nguyen is a senior reporter for The Verge, covering the Trump administration, Elon Musk’s takeover of the federal government, and the tech industry’s embrace of the MAGA movement.

What initially appeared to be a power play by Elon Musk and the Department of Government Efficiency (DOGE) to take over the US Copyright Office by having Donald Trump remove the officials in charge has now backfired in spectacular fashion, as Trump’s acting replacements are known to be unfriendly — and even downright hostile — to the tech industry.

When Trump fired Librarian of Congress Carla Hayden last week and Register of Copyrights Shira Perlmutter over the weekend, it was seen as another move driven by the tech wing of the Republican Party — especially in light of the Copyright Office releasing a pre-publication report saying some kinds of generative AI training would not be considered fair use. And when two men showed up at the Copyright Office inside the Library of Congress carrying letters purporting to appoint them to acting leadership positions, the DOGE takeover appeared to be complete.

But those two men, Paul Perkins and Brian Nieves, were not DOGE at all, but instead approved by the MAGA wing of the Trump coalition that aims to put tech companies in check.

Perkins, now the supposed acting Register of Copyrights, is an eight-year veteran of the DOJ who served in the first Trump administration prosecuting fraud cases. Nieves, the putative acting deputy librarian, is currently at the Office of the Deputy Attorney General, having previously been a lawyer on the House Judiciary Committee, where he worked with Rep. Jim Jordan on Big Tech investigations. And Todd Blanche, the putative Acting Librarian of Congress who would be their boss, is a staunch Trump ally who represented him during his 2024 Manhattan criminal trial, and is now the Deputy Attorney General overseeing the DOJ’s side in the Google Search remedies case. As one government affairs lobbyist told The Verge, Blanche is “there to stick it to tech.”

The appointments of Blanche, Perkins, and Nieves are the result of furious lobbying over the weekend by the conservative content industry — as jealously protective of its copyrighted works as any other media companies — as well as populist Republican lawmakers and lawyers, all enraged that Silicon Valley had somehow persuaded Trump to fire someone who’d recently criticized AI companies.

Sources speaking to The Verge are convinced the firings were a tech industry power play led by Elon Musk and David Sacks

The populists were particularly rankled over Perlmutter’s removal from the helm of the Copyright Office, which happened the day after the agency released a pre-publication version of its report on the use of copyrighted material in training generative AI systems. Sources speaking to The Verge are convinced the firings were a tech industry power play led by Elon Musk and “White House A.I. & Crypto Czar” David Sacks, meant to eliminate any resistance to AI companies using copyrighted material to train models without having to pay for it.

“You can say, well, we have to compete with China. No, we don’t have to steal content to compete with China. We don’t have slave labor to compete with China. It’s a bullshyt argument,” Mike Davis, the president of the Article III project and a key antitrust advisor to Trump, told The Verge. “It’s not fair use under the copyright laws to take everyone’s content and have the big tech platforms monetize it. That’s the opposite of fair use. That’s a copyright infringement.”

It’s the rare time that MAGA world is in agreement with the Democratic Party, which has roundly condemned the firings of Hayden and Perlmutter, and also zeroed in on the Musk-Sacks faction as the instigator.

In a press release, Rep. Joe Morelle (D-NY) characterized the hundred-plus-page report, the third installment of a series that the office has put out on copyright and artificial intelligence, as “refus[ing] to rubber-stamp Elon Musk’s efforts to mine troves of copyrighted works to train AI models.” Meanwhile, Sen. Ron Wyden (D-OR), who told The Verge in an emailed statement that the president had no power to fire either Hayden or Perlmutter, said, “This all looks like another way to pay back Elon Musk and the other AI billionaires who backed Trump’s campaign.”

The agency’s interpretation of what is or isn’t fair use does not have binding force on the courts

Publications like the AI report essentially lay out how the Copyright Office interprets copyright law. But the agency’s interpretation of what is or isn’t fair use does not have binding force on the courts, so a report like this one functions mostly as expert commentary and reference material. However, the entire AI industry is built on an expansive interpretation of copyright law that’s currently being tested in the courts — a situation that’s created dire need for exactly this sort of expert commentary.

The AI report applies the law of fair use to different kinds of AI training and usage, concluding that although outcomes might differ case by case, “making commercial use of vast troves of copyrighted works to produce expressive content that competes with them in existing markets, especially where this is accomplished through illegal access, goes beyond established fair use boundaries.” But far from advising drastic action in response to what the Office believes is rampant copyright infringement, the report instead states that “government intervention would be premature at this time,” given that licensing agreements are being made across various sectors.

“Now tech bros are going to steal creators’ copyrights for AI profits”

The unoffending nature of the report made Perlmutter’s removal all the more alarming to the MAGA ideologues in Trump’s inner circle, who saw this as a clear power grab, and were immediately vocal about it. “Now tech bros are going to steal creators’ copyrights for AI profits,” Davis posted immediately on Truth Social, along with a link to a CBS story about Perlmutter’s firing. “This is 100% unacceptable.”

Curiously, just after Davis published the post, Trump reposted it, link and all.

None of Trump’s purported appointees have a particularly relevant background for their new jobs — but they are certainly not DOGE people and, generally speaking, are not the kind of people that generative AI proponents would want in the office. And for now, this counts as a political win for the anti-tech populists, even if nothing further happens. “Sometimes when you make a pitch to leadership to get rid of someone, the person who comes in after isn’t any better,” said a source familiar with the dynamic between the White House and both sides of the copyright issue. “You don’t necessarily get to name the successor and fire someone, and so in many cases, I’ve seen people get pushed out the door and the replacement is even worse.”

The speed of the firings and subsequent power struggle, however, have underscored the brewing constitutional crisis sparked by Trump’s frequent firing of independent agency officials confirmed by Congress. The Library of Congress firings, in particular, reach well past the theory of executive power claimed by the White House and into even murkier territory. It’s legally dubious whether the Librarian of Congress can be removed by the president, as the Library, a legislative branch agency that significantly predates the administrative state, does not fit neatly into the modern-day legal framework of federal agencies. (Of course, everything about the law is in upheaval even where agencies do fit the framework.) Regardless, the law clearly states that the Librarian of Congress — not the president — appoints the Register of Copyrights.

At the moment, the Library of Congress has not received any direction from Congress on how to move forward. The constitutional crisis — one of many across the federal government — remains ongoing.

Elon Musk and xAI did not respond to a request for comment.
Additional reporting by Sarah Jeong.
 


Meet AlphaEvolve, the Google AI that writes its own code—and just saved millions in computing costs​



Michael Nuñez @MichaelFNunez

May 14, 2025 8:00 AM

Credit: VentureBeat made with Midjourney





Google DeepMind today pulled the curtain back on AlphaEvolve, an artificial-intelligence agent that can invent brand-new computer algorithms — then put them straight to work inside the company’s vast computing empire.
AlphaEvolve pairs Google’s Gemini large language models with an evolutionary approach that tests, refines, and improves algorithms automatically. The system has already been deployed across Google’s data centers, chip designs, and AI training systems — boosting efficiency and solving mathematical problems that have stumped researchers for decades.
“AlphaEvolve is a Gemini-powered AI coding agent that is able to make new discoveries in computing and mathematics,” explained Matej Balog, a researcher at Google DeepMind, in an interview with VentureBeat. “It can discover algorithms of remarkable complexity — spanning hundreds of lines of code with sophisticated logical structures that go far beyond simple functions.”

The system dramatically expands upon Google’s previous work with FunSearch by evolving entire codebases rather than single functions. It represents a major leap in AI’s ability to develop sophisticated algorithms for both scientific challenges and everyday computing problems.

Inside Google’s 0.7% efficiency boost: How AI-crafted algorithms run the company’s data centers​


AlphaEvolve has been quietly at work inside Google for over a year. The results are already significant.

One algorithm it discovered has been powering Borg, Google’s massive cluster management system. This scheduling heuristic recovers an average of 0.7% of Google’s worldwide computing resources continuously — a staggering efficiency gain at Google’s scale.

The discovery directly targets “stranded resources” — machines that have run out of one resource type (like memory) while still having others (like CPU) available. AlphaEvolve’s solution is especially valuable because it produces simple, human-readable code that engineers can easily interpret, debug, and deploy.

The AI agent hasn’t stopped at data centers. It rewrote part of Google’s hardware design, finding a way to eliminate unnecessary bits in a crucial arithmetic circuit for Tensor Processing Units (TPUs). TPU designers validated the change for correctness, and it’s now headed into an upcoming chip design.

Perhaps most impressively, AlphaEvolve improved the very systems that power itself. It optimized a matrix multiplication kernel used to train Gemini models, achieving a 23% speedup for that operation and cutting overall training time by 1%. For AI systems that train on massive computational grids, this efficiency gain translates to substantial energy and resource savings.

“We try to identify critical pieces that can be accelerated and have as much impact as possible,” said Alexander Novikov, another DeepMind researcher, in an interview with VentureBeat. “We were able to optimize the practical running time of [a vital kernel] by 23%, which translated into 1% end-to-end savings on the entire Gemini training card.”

Breaking Strassen’s 56-year-old matrix multiplication record: AI solves what humans couldn’t​


AlphaEvolve solves mathematical problems that stumped human experts for decades while advancing existing systems.

The system designed a novel gradient-based optimization procedure that discovered multiple new matrix multiplication algorithms. One discovery toppled a mathematical record that had stood for 56 years.

“What we found, to our surprise, to be honest, is that AlphaEvolve, despite being a more general technology, obtained even better results than AlphaTensor,” said Balog, referring to DeepMind’s previous specialized matrix multiplication system. “For these four by four matrices, AlphaEvolve found an algorithm that surpasses Strassen’s algorithm from 1969 for the first time in that setting.”

The breakthrough allows two 4×4 complex-valued matrices to be multiplied using 48 scalar multiplications instead of 49 — a discovery that had eluded mathematicians since Volker Strassen’s landmark work. According to the research paper, AlphaEvolve “improves the state of the art for 14 matrix multiplication algorithms.”
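
To see where the number 49 comes from: Strassen's 1969 insight was that two 2×2 matrices can be multiplied with 7 scalar multiplications instead of the naive 8. Treating a 4×4 matrix as a 2×2 matrix of 2×2 blocks and applying the trick recursively costs 7 × 7 = 49 multiplications — the baseline AlphaEvolve's 48-multiplication algorithm beats for complex-valued matrices. The 2×2 case is small enough to write out in full:

```python
# Strassen's 1969 algorithm for 2x2 matrices: 7 multiplications
# (m1..m7) instead of the naive 8. Recursing on 2x2 blocks of a 4x4
# matrix gives 7 * 7 = 49 multiplications, the long-standing count
# that AlphaEvolve's 48-multiplication algorithm improves on.

def strassen_2x2(A, B):
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4,           m1 - m2 + m3 + m6]]

print(strassen_2x2([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# → [[19, 22], [43, 50]], the same as the naive product
```

Note the asymmetry with additions: Strassen trades one multiplication for many extra additions, a bargain because multiplications dominate the recursive cost. AlphaEvolve's 48-multiplication discovery plays the same game one step further.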

The system’s mathematical reach extends far beyond matrix multiplication. When tested against over 50 open problems in mathematical analysis, geometry, combinatorics, and number theory, AlphaEvolve matched state-of-the-art solutions in about 75% of cases. In approximately 20% of cases, it improved upon the best known solutions.

One victory came in the “kissing number problem” — a centuries-old geometric challenge to determine how many non-overlapping unit spheres can simultaneously touch a central sphere. In 11 dimensions, AlphaEvolve found a configuration with 593 spheres, breaking the previous record of 592.
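
Kissing configurations are hard to find but easy to check: the centers of the touching unit spheres lie at distance 2 from the origin, and any two centers must be at least 2 apart so the spheres don't overlap. A minimal verifier follows; the 6-point hexagon it checks is the optimal 2D configuration, used here purely as a small example (AlphaEvolve's 593-point arrangement lives in 11 dimensions and is not reproduced here).

```python
import itertools
import math

# Verify a kissing configuration: every center sits at distance 2 from
# the origin (a unit sphere touching the central unit sphere), and every
# pair of centers is at least 2 apart (no overlap between spheres).

def is_kissing_configuration(centers, tol=1e-9):
    origin = [0.0] * len(centers[0])
    if any(abs(math.dist(c, origin) - 2.0) > tol for c in centers):
        return False
    return all(math.dist(p, q) >= 2.0 - tol
               for p, q in itertools.combinations(centers, 2))

# Six unit circles around a central one: the optimal 2D configuration.
hexagon = [[2 * math.cos(k * math.pi / 3), 2 * math.sin(k * math.pi / 3)]
           for k in range(6)]
print(is_kissing_configuration(hexagon))  # → True
```

This asymmetry between search and verification is exactly what makes the problem a good fit for AlphaEvolve: proposing a 593-point configuration in 11 dimensions is the hard creative step, but scoring any proposal is a fast, mechanical distance check.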

How it works: Gemini language models plus evolution create a digital algorithm factory


What makes AlphaEvolve different from other AI coding systems is its evolutionary approach.

The system deploys both Gemini Flash (for speed) and Gemini Pro (for depth) to propose changes to existing code. These changes get tested by automated evaluators that score each variation. The most successful algorithms then guide the next round of evolution.

AlphaEvolve doesn’t just generate code from its training data. It actively explores the solution space, discovers novel approaches, and refines them through an automated evaluation process — creating solutions humans might never have conceived.

“One critical idea in our approach is that we focus on problems with clear evaluators. For any proposed solution or piece of code, we can automatically verify its validity and measure its quality,” Novikov explained. “This allows us to establish fast and reliable feedback loops to improve the system.”

This approach is particularly valuable because the system can work on any problem with a clear evaluation metric — whether it’s energy efficiency in a data center or the elegance of a mathematical proof.

From cloud computing to drug discovery: Where Google’s algorithm-inventing AI goes next


While currently deployed within Google’s infrastructure and mathematical research, AlphaEvolve’s potential reaches much further. Google DeepMind envisions applications in material sciences, drug discovery, and other fields requiring complex algorithmic solutions.

“The best human-AI collaboration can help solve open scientific challenges and also apply them at Google scale,” said Novikov, highlighting the system’s collaborative potential.

Google DeepMind is now developing a user interface with its People + AI Research team and plans to launch an Early Access Program for selected academic researchers. The company is also exploring broader availability.

The system’s flexibility marks a significant advantage. Balog noted that “at least previously, when I worked in machine learning research, it wasn’t my experience that you could build a scientific tool and immediately see real-world impact at this scale. This is quite unusual.”

As large language models advance, AlphaEvolve’s capabilities will grow alongside them. The system demonstrates an intriguing evolution in AI itself — starting within the digital confines of Google’s servers, optimizing the very hardware and software that gives it life, and now reaching outward to solve problems that have challenged human intellect for decades or centuries.