bnew

Veteran
Joined
Nov 1, 2015
Messages
45,040
Reputation
7,423
Daps
136,048


1/2
I guess you might have tried the demo (Qwen1.5 110B Chat Demo - a Hugging Face Space by Qwen). Now the weights of Qwen1.5-110B are out! For now there are only the base and chat models; AWQ and GGUF quantized models will be released very soon!

Blog: Qwen1.5-110B: The First 100B+ Model of the Qwen1.5 Series

Hugging Face: Qwen/Qwen1.5-110B · Hugging Face (base); Qwen/Qwen1.5-110B-Chat · Hugging Face (chat)

How does it compare with Llama-3-70B? For starters, Qwen1.5-110B at least has several unique features:

- Context length of 32K tokens
- Multilingual support, including English, Chinese, French, Spanish, Japanese, Korean, Vietnamese, etc.

This model is still based on the same architecture as the rest of Qwen1.5, and it is a dense model instead of an MoE. It supports GQA, like Qwen1.5-32B.

How many tokens have we trained it on? Essentially, it is built with pretraining and posttraining recipes very similar to the rest of Qwen1.5, and thus it is still far from being sufficiently pretrained.

We find that it performs competitively on benchmarks for base language models, and we are confident in the base model quality. For the chat model, we see comparable performance on MT-Bench, but we also find some drawbacks in coding, math, and logical reasoning. Honestly, we need your testing and feedback to help us better understand the capabilities and limitations of our models.

OK that's it. Get back to work for Qwen2!
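For anyone who wants to poke at the released weights directly, here is a minimal sketch of chatting with the model via Hugging Face transformers. The model ID comes from the links above; the sampling settings are illustrative assumptions, and the 110B weights realistically need several large GPUs (device_map="auto" shards the model across whatever is available).

Code:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen1.5-110B-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Give me a short introduction to large language models."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate a reply and strip the prompt tokens from the decoded output.
output_ids = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))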


2/2
Tmr I'll publish the GGUF






1/2
Qwen1.5-110B running on
@replicate

First pass implementation done with vllm

Try it out!

2/2
Qwen1.5-110b




1/1
Qwen1.5-110B weights are out - run it on your own infra with SkyPilot!

From our AI gallery - a guide to host
@Alibaba_Qwen on your own infra: Serving Qwen1.5 on Your Own Cloud — SkyPilot documentation

Comparison with Llama 3: Qwen1.5-110B: The First 100B+ Model of the Qwen1.5 Series
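For a rough idea of what self-hosting boils down to, here is a minimal vLLM sketch (vLLM is what the Replicate deployment above mentions). The GPU count, sampling settings, and raw-string prompt are illustrative assumptions; a production server would apply the chat template and expose an OpenAI-compatible API instead.

Code:
from vllm import LLM, SamplingParams

# Shard the 110B weights across 8 GPUs on one node (adjust to your hardware).
llm = LLM(model="Qwen/Qwen1.5-110B-Chat", tensor_parallel_size=8)
params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

outputs = llm.generate(["Explain grouped-query attention in two sentences."], params)
print(outputs[0].outputs[0].text)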









1/5
Feel free to try this Qwen1.5-110B model preview! I hope you enjoy it! We will release the model weights soon!


2/5
This should be the last episode of the 1.5 series. No encore, probably. Time to say goodbye to the old days and move on to the new series.

3/5
Yes we are going to release it next week. Need some days for the final preparation. Still 32K.

4/5
Wow! your words are so encouraging to us!

5/5
Yeah we will!


 

bnew

Veteran
Joined
Nov 1, 2015
Messages
45,040
Reputation
7,423
Daps
136,048







1/7
InstantFamily

Masked Attention for Zero-shot Multi-ID Image Generation

In the field of personalized image generation, the ability to create images preserving concepts has significantly improved. Creating an image that naturally integrates multiple concepts in a cohesive and

2/7
visually appealing composition can indeed be challenging. This paper introduces "InstantFamily," an approach that employs a novel masked cross-attention mechanism and a multimodal embedding stack to achieve zero-shot multi-ID image generation. Our method effectively

3/7
preserves ID as it utilizes global and local features from a pre-trained face recognition model integrated with text conditions. Additionally, our masked cross-attention mechanism enables the precise control of multi-ID and composition in the generated images. We demonstrate

4/7
the effectiveness of InstantFamily through experiments showing its dominance in generating images with multi-ID, while resolving well-known multi-ID generation problems. Additionally, our model achieves state-of-the-art performance in both single-ID and multi-ID

5/7
preservation. Furthermore, our model exhibits remarkable scalability with a greater number of ID preservation than it was originally trained with.

6/7
paper page:

7/7
daily papers: Paper page - InstantFamily: Masked Attention for Zero-shot Multi-ID Image Generation





Computer Science > Computer Vision and Pattern Recognition​

[Submitted on 30 Apr 2024]

InstantFamily: Masked Attention for Zero-shot Multi-ID Image Generation​

Chanran Kim, Jeongin Lee, Shichang Joung, Bongmo Kim, Yeul-Min Baek
In the field of personalized image generation, the ability to create images preserving concepts has significantly improved. Creating an image that naturally integrates multiple concepts in a cohesive and visually appealing composition can indeed be challenging. This paper introduces "InstantFamily," an approach that employs a novel masked cross-attention mechanism and a multimodal embedding stack to achieve zero-shot multi-ID image generation. Our method effectively preserves ID as it utilizes global and local features from a pre-trained face recognition model integrated with text conditions. Additionally, our masked cross-attention mechanism enables the precise control of multi-ID and composition in the generated images. We demonstrate the effectiveness of InstantFamily through experiments showing its dominance in generating images with multi-ID, while resolving well-known multi-ID generation problems. Additionally, our model achieves state-of-the-art performance in both single-ID and multi-ID preservation. Furthermore, our model exhibits remarkable scalability with a greater number of ID preservation than it was originally trained with.
Subjects:Computer Vision and Pattern Recognition (cs.CV)
Cite as:arXiv:2404.19427 [cs.CV]
(or arXiv:2404.19427v1 [cs.CV] for this version)

Submission history​

From: Chanran Kim [view email]
[v1] Tue, 30 Apr 2024 10:16:21 UTC (20,960 KB)
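For readers who want a concrete picture of the core mechanism, below is a generic PyTorch sketch of masked cross-attention over per-identity face embeddings. It is not the authors' implementation; the tensor shapes, mask semantics, and module name are illustrative assumptions.

Code:
import torch
import torch.nn as nn

class MaskedIDCrossAttention(nn.Module):
    """Latent tokens attend only to the ID embeddings allowed by a per-identity spatial mask."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, latents, id_tokens, id_masks):
        # latents:   (B, N, D)    image/latent tokens at N spatial positions
        # id_tokens: (B, K, M, D) M face-embedding tokens for each of K identities
        # id_masks:  (B, K, N)    1 where identity k may influence latent position n
        B, N, D = latents.shape
        _, K, M, _ = id_tokens.shape
        H = self.num_heads

        q = self.to_q(latents).view(B, N, H, D // H).transpose(1, 2)       # (B, H, N, D/H)
        kv = id_tokens.reshape(B, K * M, D)
        k = self.to_k(kv).view(B, K * M, H, D // H).transpose(1, 2)        # (B, H, K*M, D/H)
        v = self.to_v(kv).view(B, K * M, H, D // H).transpose(1, 2)

        attn = (q @ k.transpose(-2, -1)) * self.scale                      # (B, H, N, K*M)

        # Broadcast each identity's spatial mask over that identity's M tokens.
        mask = id_masks.transpose(1, 2).unsqueeze(-1)                      # (B, N, K, 1)
        mask = mask.expand(B, N, K, M).reshape(B, 1, N, K * M)
        attn = attn.masked_fill(mask == 0, float("-inf")).softmax(dim=-1)
        attn = torch.nan_to_num(attn)  # positions masked for every identity become zeros

        out = (attn @ v).transpose(1, 2).reshape(B, N, D)
        return self.to_out(out)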



 

bnew

Veteran
Joined
Nov 1, 2015
Messages
45,040
Reputation
7,423
Daps
136,048





1/8
There is a mysterious new model called gpt2-chatbot accessible from a major LLM benchmarking site. No one knows who made it or what it is, but I have been playing with it a little and it appears to be at roughly the same ability level as GPT-4. A mysterious GPT-4-class model? Neat!

2/8
Maybe better than GPT-4. Hard to tell, but it does do much better at the iconic “draw a unicorn with code” task…

You can access it here: https://chat.lmsys.org

3/8
GPT 4 Turbo TikZ unicorn vs gpt2-chatbot TikZ unicorn

4/8
It identifies itself as GPT-4 (with v2 personality?) but who knows?

5/8
Some interesting experiments

6/8
uh.... gpt2-chatbot just solved an International Math Olympiad (IMO) problem in one-shot

the IMO is insanely hard. only the FOUR best math students in the USA get to compete

prompt + its thoughts x.com/itsandrewgao/s…

7/8
Anonymous testing is apparently a thing LMSYS Org does for model makers (which makes sense).

But they really should insist on cooler names for secret models in testing. All of the labs are so bad at naming their AIs.

8/8
hi @simonw, thanks a ton! We really value your feedback.

Just to clarify, following our policy, we've partnered with several model developers to bring their new models to our platform for community preview testing. These models are strictly for testing and won't be listed on the






1/1
GPT2-Chatbot nearly built a flappy bird clone in one shot. It messed up initializing movement and didn't give actual assets.

But I had Opus create a build script to grab the assets GPT2 intended to be there and Opus pointed to the actual flappy bird assets...

Ya can't flap and doesn't auto-restart. But man was that close.

I am fully confident if I could just use the model I'd have a python version working in a few prompts.




AI in practice

Apr 30, 2024


Is OpenAI testing GPT-4.5? "gpt2-chatbot" writes better code than GPT-4 and Claude​





Maximilian Schreiner

Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.



Summary

A powerful new AI model called "gpt2-chatbot" shows capabilities that appear to be at or above the level of GPT-4.

The model, called "gpt2-chatbot," appeared without much fanfare in the LMSYS Org Chatbot Arena, a website that compares AI language models. However, its performance quickly caught the attention of testers.

"I would agree with assessments that it is at least GPT-4 level," says Andrew Gao, an AI researcher at Stanford University who has been tracking the model on LMSYS since its release.

For example, gpt2-chatbot solved a problem from the prestigious International Mathematical Olympiad on the first try - a feat he described as "insanely hard."



According to Ethan Mollick, a professor at the Wharton School, the model seems to perform better than GPT-4 Turbo on complex reasoning tasks such as writing code. Chase McCoy, founding engineer at CodeGen, said that gpt2-chatbot "is definitely better at complex code manipulation tasks than Claude Opus or the latest GPT4. Did better on all the coding prompts we use to test new models."

There are more examples on Twitter: Alvaro Cintas generated a Snake game on the first attempt.



Sully Omar, co-founder of Cognosys, had the model draw a unicorn - a test from Microsoft's controversial "Sparks of AGI" paper.



GPT-4.5 or something entirely different?​

The strong performance and clues about the tokenizer used by OpenAI suggest that gpt2-chatbot may come from OpenAI and could be a test of GPT-4.5 or another new model from the company. LMSYS confirmed that it also allows model providers to test their models anonymously. The model also describes itself as ChatGPT and "based on GPT-4."

However, self-descriptions of AI models are not always reliable, and some testers report more hallucinations than GPT-4 Turbo. OpenAI CEO Sam Altman responded to the rumors with a post on X: "I have a soft spot for gpt2." In short, although the similarities to earlier OpenAI creations suggest a possible connection, conclusive evidence is still lacking.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
45,040
Reputation
7,423
Daps
136,048


1/2
Did anybody notice Nvidia published a competitive llama3-70b QA/RAG fine tune?


2/2
actually, looks like the 8b version might be more interesting.





1/1
Nvidia has published a competitive llama3-70b QA/RAG fine tune #LLM #LLMs

ChatQA-1.5, which excels at conversational question answering (QA) and retrieval-augmented generation (RAG). ChatQA-1.5 is built using the training recipe from ChatQA (1.0), and it is built on top of the Llama-3 foundation model. Additionally, we incorporate more conversational QA data to enhance its tabular and arithmetic calculation capabilities. ChatQA-1.5 has two variants: ChatQA-1.5-8B and ChatQA-1.5-70B.
Nvidia/ChatQA-1.5-70B: nvidia/Llama3-ChatQA-1.5-70B · Hugging Face
Nvidia/ChatQA-1.5-8B: nvidia/Llama3-ChatQA-1.5-8B · Hugging Face
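A minimal sketch of querying the 8B variant with transformers is below. The model ID is taken from the link above, but the prompt layout is a simplified assumption - check the model card for the exact context/turn format ChatQA expects.

Code:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Llama3-ChatQA-1.5-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

context = "ChatQA-1.5 has two variants: ChatQA-1.5-8B and ChatQA-1.5-70B."
question = "How many variants does ChatQA-1.5 have?"

# Simplified RAG-style prompt: system line, retrieved context, then the user turn.
prompt = (
    "System: You are a helpful assistant that answers using the given context.\n\n"
    f"{context}\n\nUser: {question}\n\nAssistant:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))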





1/1
Nvidia presents ChatQA

Building GPT-4 Level Conversational QA Models

paper page: Paper page - ChatQA: Building GPT-4 Level Conversational QA Models

We introduce ChatQA, a family of conversational question answering (QA) models that obtain GPT-4 level accuracies. Specifically, we propose a two-stage instruction tuning method that can significantly improve the zero-shot conversational QA results from large language models (LLMs). To handle retrieval in conversational QA, we fine-tune a dense retriever on a multi-turn QA dataset, which provides comparable results to using the state-of-the-art query rewriting model while largely reducing deployment cost. Notably, our ChatQA-70B can outperform GPT-4 in terms of average score on 10 conversational QA datasets (54.14 vs. 53.90), without relying on any synthetic data from OpenAI GPT models.





Note that ChatQA-1.5 is built based on Llama-3 base model, and ChatQA-1.0 is built based on Llama-2 base model. ChatQA-1.5 used some samples from the HybriDial training dataset. To ensure fair comparison, we also compare average scores excluding HybriDial. The data and evaluation scripts for ConvRAG can be found here.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
45,040
Reputation
7,423
Daps
136,048




Full Line Code Completion in JetBrains IDEs: All You Need to Know​

Ekaterina Ryabukha
April 4, 2024

Programming with AI is still a highly divisive topic, but there’s no denying that more and more developers are starting to incorporate AI into their daily workflows. Whether you’ve already picked your side in the debate or are still undecided, we’ve got a new feature in v2024.1 of JetBrains IDEs that might just pique your interest – full line code completion. It’s AI-powered and runs locally without sending any data over the internet.



In this blog post, we’ll tell you more about what full line code completion is, how it works, what languages are supported, and how you can provide feedback about it to us.

What is full line code completion in JetBrains IDEs?​

This new type of code completion was added to JetBrains IDEs with the latest 2024.1 update. As you can see below, it takes the form of gray-toned, single-line suggestions that complete lines based on the context of the current file:

FLCC-in-action.gif


These suggestions are powered by specialized language models that we’ve trained specifically for different languages and frameworks. The models run locally without sending any code over the internet.

Full line code completion is currently available for Java, Kotlin, Python, JavaScript, TypeScript, CSS, PHP, Go, and Ruby within the corresponding JetBrains IDEs: IntelliJ IDEA Ultimate, PyCharm Professional, WebStorm, PhpStorm, GoLand, and RubyMine. In the coming months, we plan to extend the functionality to C#, Rust, and C++, so it will also land in Rider, RustRover, and CLion.

Note that full line code completion is included with your active JetBrains IDE subscription at no additional cost – just make sure you’re on v2024.1 or later. If you don’t yet have a subscription, you can also use this feature during the 30-day free trial.

How does full line completion work?​

With full line code completion, we had two main goals in mind. The first one is obvious – to help you save time and increase your coding speed. But beyond that, we also wanted to provide a solution that addresses the constraints certain organizations have when it comes to using AI solutions that are connected to the cloud.

Here’s a breakdown of how full line code completion helps to realize these two aims:

  • It works locally and is available offline. This means you can take advantage of the feature even if you aren’t connected to the internet.
  • It doesn’t send any data from your machine over the internet. The language models that power full line code completion run locally, which is great for two reasons. First, your code remains safe, as it never leaves your machine. Second, there are no additional cloud-related expenses – that’s why this feature comes at no additional cost.
  • It’s integrated deeply into JetBrains IDEs. All suggestions will be appropriately formatted, with the IDE checking for balanced brackets and quotes. Additionally, we use the power of static analysis and our understanding of code to filter out incorrect suggestions. Each supported language has its own set of suggested code correctness checks. The most basic ones, like unresolved reference checks, are implemented for most languages to guarantee that the IDE doesn’t suggest non-existent variables and methods. The auto-import feature is also supported.
  • It’s designed to keep your workflow as smooth as possible. We use smart filtering to avoid showing suggestions that tend to be canceled explicitly or deleted right after they were added.

For some additional technical details, see this section below.

Full line code completion vs. AI Assistant​

There are two ways you can benefit from AI functionality in JetBrains IDEs – full line code completion and JetBrains AI Assistant. We appreciate that this might be confusing, so let’s take a closer look at what they have in common and how they differ.

Both full line code completion and JetBrains AI Assistant aim to help you work faster. They both also go beyond the standard completion that has been available in JetBrains IDEs for some time already. However, JetBrains AI Assistant offers a more comprehensive feature set, including context-aware smart chat and the ability to generate tests or write documentation.

See the table below for a comparison of the two AI functionalities:

FLCC-2024-1-infographic.png

Please rest assured that we never train any of our AI features on customers’ code. If your company has strict data privacy regulations, but you still want to speed up your workflows with AI, full line code completion may be a better choice for you.

Under the hood​

The backbone of full line code completion is a programming-language specific language model, which is trained in house using a dataset of open-source code with permissive licenses. The language model’s input is the code before the caret, though for some languages, we also add content from related files. The output is the model’s suggested continuation of the current line, which is shown in gray.

The language model’s inference runs on your local machine. To ensure the most efficient generation, the model inference runs in a separate process and is heavily optimized for the target machine’s architecture. For example, if you’re using x86-64 architecture, the model will run on the CPU, whereas if you’re using ARM64 architecture, the model will use the power of your computer’s GPU.

After the suggestion is generated, a number of post-processing steps are applied. First, we check whether this suggestion is syntactically and semantically correct, and then we perform smart filtering, formatting, parenthesis balancing, and various other manipulations. Post-processing is crucial for user experience, so we do our best to show only valuable suggestions that don’t disturb your workflow.

Lastly, you may also be wondering why we decided to go for single-line suggestions. The length of the AI completion suggestions is a trade-off. While longer suggestions do tend to reduce how many keystrokes you have to make, which is good, they also increase the number of reviews required on your end. Taking the above into account, we decided that completing a single line of code would be a fair compromise.

This decision allowed us to reduce the size of the model without any significant decline in suggestion quality. In the 2024.1 version of JetBrains IDEs, we use a language model that has 100 million parameters, with a maximum context size of 1,536 tokens, which is roughly 170 lines of code.

How to tweak the feature​

You can configure full line code completion in Settings | Editor | General | Code Completion – all the settings can be found there, under the Machine Learning-Assisted Completion section:

FLCC-settings-in-WS.png

If you’d like to turn off the feature, you can do so by unticking the Enable Full Line suggestions checkbox. Alternatively, you can disable the plugin powering this feature. To do so, go to Settings | Plugins, switch to the Installed tab, and look for full line code completion.

How to provide feedback​

Full line code completion is still in active development, so we encourage you to share your feedback with us. You can do so by leaving a comment under this blog post. You can also upvote existing issues here or create a new one by logging in and clicking on the New Issue button in the top right-hand corner.

FLCC-reporting-issues.png

That’s it for today. Please give full line code completion a try and let us know what you think. We’ll continue improving this functionality further, with support for C#, Rust, and C++ as well as better integration with AI Assistant’s multi-line code completion being our top priorities for now. Stay tuned for updates!
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
45,040
Reputation
7,423
Daps
136,048

🦍 Gorilla: Large Language Model Connected with Massive APIs​

Blog 7: Gorilla OpenFunctions v2​

Gorilla OpenFunctions v2​

Gorilla introductory image

Gorilla OpenFunctions-v2! SoTA for open-source models. On-par with commercial models.

With the latest iteration of Gorilla OpenFunctions-v2, we are delighted to mark significant advancements in function calling for LLMs within the open-source community. As a direct substitute for its predecessor, Gorilla OpenFunctions-v2 retains its open-source ethos while introducing exciting enhancements. These include support for multiple programming languages such as Python, Java, JavaScript, and REST API - the first among both open-source and closed-source models, alongside the ability to handle multiple and parallel function calls, and the ability to determine function relevance. This update cements Gorilla OpenFunctions-v2's position at the forefront of function calling capabilities among LLMs. Moreover, the drop-in replacement allows for seamless integration of OpenFunctions into a diverse range of applications, from social media platforms like Instagram to delivery services like DoorDash, as well as utility tools including Google Calendar and Stripe.

See What's New!! 🚀

The five new exciting features we are happy to launch with OpenFunctions-v2 are:

New-Features in Gorilla LLMs

  • More Data Types: Gorilla OpenFunctions-v2 can now support diverse languages with expanded support for argument types in function calls. This includes [string, number, boolean, list, tuple, dict, any] for Python, [string, number, boolean, list, tuple, dict, any] for Java and [string, number, boolean, dict, bigint, array, date, any] for Javascript. For reference, OpenAI and many others only support JSON schema, i.e., [string, number, integer, object, array, and boolean]. Native support for these types means you can now plug-and-play openfunctions-v2 without having to weave through string literals.
  • Parallel & Multiple Functions: Support for parallel and multiple functions. Multiple functions refers to the scenario where the user inputs multiple functions because they are not sure which exact function is best to service the prompt. In this scenario, the Gorilla model picks one or more (or none) of the functions provided to respond to the user's request. In parallel functions, the user's prompt could be serviced by multiple calls to the same function. Gorilla not only supports both of these, but the benefits stack on top of each other!
  • Function Relevance Detection: Reduce hallucinations in scenarios when no function, or no relevant function, is provided. Gorilla OpenFunctions-v2 can now automatically detect whether the functions provided to the model can address the user's prompt. If they cannot, the LLM raises an "Error" message to the user, providing them with additional information.
  • Enhanced Capabilities for RESTful APIs: Enhanced ability to format RESTful API calls. RESTful APIs are a common phenomenon on the web, powering many popular software services including Slack, PayPal, etc. Our model is specially trained to handle RESTful API calls with good quality.
Quick Links:


Integrating OpenFunctions-v2 in your App 🔨

Using Gorilla OpenFunctions-v2 is straightforward:

  1. To help with quick prototyping, we provide a hosted Gorilla OpenFunctions-v2 model for inference. Or you can run it locally, or self-host it by accessing the model from HuggingFace. The example below demonstrates how to invoke the hosted Gorilla OpenFunctions-v2 model:
  2. Code:
    import openai

    def get_gorilla_response(prompt="", model="gorilla-openfunctions-v2", functions=[]):
        openai.api_key = "EMPTY"  # Hosted for free with ❤️ from UC Berkeley
        openai.api_base = "http://luigi.millennium.berkeley.edu:8000/v1"
        try:
            completion = openai.ChatCompletion.create(
                model=model,
                temperature=0.0,
                messages=[{"role": "user", "content": prompt}],
                functions=functions,
            )
            # completion.choices[0].message.content: string format of the function call
            # completion.choices[0].message.functions: JSON format of the function call
            return completion.choices[0]
        except Exception as e:
            # Surface endpoint/connection errors instead of failing silently.
            print(f"Request failed: {e}")
  3. Prompt the model:
  4. What's the weather like in the two cities of Boston and San Francisco?
  5. Format your function call: The model will return the function call based on your request.
  6. query = "What's the weather like in the two cities of Boston and San Francisco?"
    Code:
    functions = [
        {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA",
                    },
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        }
    ]
  7. Get Your Function Call: The model will return a Python function call based on your request.
  8. This opens up possibilities for developers and non-developers alike, allowing them to leverage complex functionalities without writing extensive code.


    Input:

    get_gorilla_response(prompt=query, functions=[functions])
    Output:

    [get_current_weather(location='Boston, MA'), get_current_weather(location='San Francisco, CA')]
    With the example above, you can use Gorilla OpenFunctions-v2 to provide a well formatted output, or call a function with your own definition! Then you can use this freely within your applications and chatbot!

    Note: Gorilla through our hosted endpoint is currently only supported with openai==0.28.1. We will soon migrate to also support openai==1.xx, in which functions is replaced by tool_calls.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
45,040
Reputation
7,423
Daps
136,048

Performance on Berkeley Function-Calling Leaderboard 🔥

Open functions types of data

We perform an exhaustive and comprehensive evaluation on the Berkeley Function-Calling Leaderboard, benchmarking our model against the current state-of-the-art model GPT-4-1106-preview as well as the GPT-4 and GPT-3.5-turbo function calling features. In addition, we also compare our model with other open-source models, demonstrating superior behavior among them. Our evaluation consists of 2k distinct query-API documentation pairs from different domains (including travel, finance, scheduling meetings, etc.) and languages (Java, JavaScript, Python, REST API).

To dive into details about how our model performs in each category, we provide a detailed table below from the Berkeley Function-Calling Leaderboard. We see that, compared to the current state-of-the-art GPT-4 function calling, in Python Gorilla OpenFunctions-v2 does better in the simple function calling category, but not as well on function calls that involve multiple and parallel functions. This new feature continues to be an exciting area of research for us, and for the open-source community in general. It is worth highlighting that our model provides very stable executable function calls - function calls that were evaluated by actually executing them - with no intervention in between. Unsurprisingly, having been trained on them, our model outperforms GPT-4 on function calls in programming languages other than Python (e.g., Java, JavaScript and REST APIs). For REST APIs, our model provides more stable outputs that include all the required fields, including the url, params and header, making our model ripe for immediate adoption.

Open functions types of data

Gorilla OpenFunctions-v2's performance on Berkeley Function-Calling Leaderboard

Code:
"User": "Can you fetch me the weather data for the coordinates
37.8651 N, 119.5383 W, including the hourly forecast for temperature,
wind speed, and precipitation for the next 10 days?"

"Function":
{
  ...
  "parameters": {
    "type": "object",
    "properties": {
      "url": {
        "type": "string",
        "description": "The API endpoint for fetching weather data from the
                        Open-Meteo API for the given latitude and longitude,
                        default https://api.open-meteo.com/v1/forecast"
      }
      ...
    }
  }
}

"Gorilla OpenFunctions-v2 output":
{
  "name": "requests.get",
  "parameters": {
    "url": "https://api.open-meteo.com/v1/forecast",
    "params": {
      "latitude": "37.8651",
      "longitude": "-119.5383",
      "forecast_days": 10
    }
  }
}

The left-hand side is GPT-4 generated, and the right-hand side is OpenFunctions-v2 generated. As we can see from the mistakes above, when GPT-4's function calling deals with functions involving complex parameter structures (e.g., a dict inside a dict) with default values, the model tends to have trouble, especially with parsing default values. Rather than being a corner case, the example above is a common paradigm for REST APIs.

OpenFunctions Data Composition & Training 🍕

Gorilla OpenFunctions-v2 is a 6.91B parameter model further trained on the Deepseek-Coder-7B-Instruct-v1.5 6.91B model. To train the model, we collect a total of 65,283 question-function-answer pairs from five different sources: Python packages (19,353), Java repositories (16,586), JavaScript repositories (4,245), public APIs (6,009), and command line tools (19,090) from various cloud providers. The data composition is shown in the figure below.

After the data collection, we carry out four data augmentations to diversify our training dataset. First, we change the function names. This is critical to ensure the model does not "memorize" the API mapping. Second, we add random (randomly chosen, and random number of) functions to make our data-set compatible with parallel functions. This way we can generate multiple-function datasets from simple functions. Third, we adopt similar strategies of perturbing the prompt to generate scenarios of parallel-functions. We then extend it to also include multiple- and parallel- functions in the same data-points. Finally, we mix some portion of the dataset in which the functions provided during the input is not sufficient to the task. We flag these as `Relevance Detection` scenarios. As with most LLM training, we extensively varied the extents of each data augmentation to train a robust model.

  • Function Name Transformation: From the original question-function-answer pairs, we augment the data with different function names to avoid the model memorizing the correlation between function names and the question (e.g., an 'uber' API is used for transportation).
    query + [{'name': 'func1', 'description': 'order takeout'}] -> ans1 =>
    query + [{'name': 'func2', 'description': 'order takeout'}] -> [ans2]
  • Parallel Functions Transformation: To handle a more complex case where multiple functions will be selected to answer the user's request, we change the original question to ask for multiple outputs.
    query + [{'name': 'func1', 'description': 'order takeout'}] -> ans1 =>
    query + [{'name': 'func1', 'description': 'order takeout'}, {'name': 'func2', 'description': 'get weather'}] -> [ans1]
  • Multiple Functions Transformation: Transform the original example so that multiple function calls are included in the training data, so that the model can learn to choose which function call to use.
    query1 + [{'name': 'func1', 'description': 'order takeout'}] -> ans1 =>
    query2 + [{'name': 'func1', 'description': 'order takeout'}] -> [ans1, ans2]
  • Parallel Multiple Functions Transformation: The combination of the above parallel and multiple transforms.
    query1 + [{'name': 'func1', 'description': 'order takeout'}] -> ans1 =>
    query2 + [{'name': 'func1', 'description': 'order takeout'}, {'name': 'func2', 'description': 'get weather'}] -> [ans1, ans2]
  • Function Relevance Detection Transformation: We also include some portion of the dataset in which the functions provided cannot solve the task. We call this `Relevance Detection`.
    query1 + [{'name': 'func1', 'description': 'order takeout'}] -> ans1 =>
    query2 + [{'name': 'func1', 'description': 'order takeout'}] -> [Error, the function cannot solve the question.]
Following the completion of the data augmentation process, we further refine the dataset by employing the Rouge score for deduplication, effectively eliminating redundant entries. This step is a recognized standard practice.
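The blog does not show the deduplication code; the sketch below illustrates the kind of ROUGE-based near-duplicate filtering described, using the rouge-score package. The ROUGE-L variant and the 0.7 threshold are assumptions, not values from the Gorilla team.

Code:
from rouge_score import rouge_scorer  # pip install rouge-score

def deduplicate(examples, threshold=0.7):
    """Keep an example only if it is not too similar (ROUGE-L F1) to one already kept."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    kept = []
    for ex in examples:
        if not any(scorer.score(prev, ex)["rougeL"].fmeasure >= threshold for prev in kept):
            kept.append(ex)
    return kept

queries = [
    "Order a pizza from the nearest restaurant.",
    "Order a pizza from the closest restaurant.",   # near-duplicate, gets dropped
    "What is the weather in Boston today?",
]
print(deduplicate(queries))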


Conclusion​

We are happy to release gorilla-openfunctions-v2, a 6.91B parameter model trained on top of the Deepseek-Coder-7B-Instruct-v1.5 LLM. It takes in the user's prompt along with multiple APIs and returns the function calls with the right arguments. With OpenFunctions we extended native support for parameter types in Python, Java, and JavaScript, as well as RESTful APIs. For more information, check out our blog on the Berkeley Function Calling Leaderboard for evaluation, and our GitHub page for the model. All of the results in the blog were generated using gorilla-openfunctions-v2.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
45,040
Reputation
7,423
Daps
136,048
AI in practice

May 2, 2024

OpenAI CEO Sam Altman says GPT-4 is the dumbest AI model you'll ever have to use again​

YouTube Screenshot via Stanford eCorner



Matthias Bastian
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.

According to OpenAI CEO Sam Altman, GPT-4 is by far the dumbest AI model that humans have to use compared to what's coming in the future.

During a recent appearance at Stanford University, Altman said that OpenAI's current AI models still have significant room for improvement. "ChatGPT is like mildly embarrassing at best. GPT-4 is by far the dumbest model any of you will ever ever have to use again," he said.

The CEO believes that there will be much more powerful AI systems in the coming years, saying with a high degree of scientific certainty that humanity will have more advanced models every year.

"GPT-5 is going to be a lot smarter than GPT-4, GPT-6 is going to be a lot smarter than GPT-5, and we are not near the top of this curve," Altman said.

Developing such systems is expensive, but that doesn't worry Altman. "Whether we burn 500 million a year or 5 billion or 50 billion a year, I don't care. I genuinely don't, as long as we can, I think, stay on a trajectory where eventually we create way more value for society than that, and as long as we can figure out a way to pay the bills. We're making AGI, it's going to be expensive, it's totally worth it," he said.


Agents as the next evolution of AI​

While Altman didn't provide a timeline for the development of artificial general intelligence (AGI), he told MIT Technology Review that he believes there will be several versions of AGI that are more or less suitable for certain tasks.

Altman sees intelligent agents as the killer application for future AI systems. These "super-competent colleagues" would know everything about a person's life, including emails and conversations, and could perform certain tasks on the fly, suggest solutions to complex problems, and ask questions when needed.

In the future, Altman believes that AI will not only generate better text, images, and video, but will also be able to perform real-world tasks, further integrating systems into people's daily lives.

According to Altman, this doesn't necessarily require new hardware, as the AI assistant could exist in the cloud - though many users would likely prefer a new device for it.

Altman is reportedly working with iPhone designer Jony Ive on new AI hardware, and OpenAI is said to be developing two agent systems that will automate entire work processes.

GPT-5 is reportedly in development and could be released as early as mid-year. It is expected to be significantly better than its predecessor, GPT-4. It is rumored that GPT-5 will support video generation in addition to text and images. If OpenAI follows the DALL-E approach with its AI video generator Sora, video generation could be integrated into ChatGPT.

Summary

  • OpenAI CEO Sam Altman expects much more powerful AI models than GPT-4 in the future. In his opinion, GPT-4 is "by far the dumbest model" compared to what is yet to come.
  • Altman sees intelligent agents as the killer application for future AI systems. They will act as "super-competent colleagues" who know everything about a person's life and can perform specific tasks or suggest solutions to more complex problems.
  • GPT-5 is already under development and will be released by the middle of this year at the earliest. It is said to be much better than GPT-4. Rumor has it that GPT-5 will support video as well as text and images.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
45,040
Reputation
7,423
Daps
136,048

🏆 OCRBench Leaderboard

| GitHub | Paper |

OCRBench is a comprehensive evaluation benchmark designed to assess the OCR capabilities of Large Multimodal Models. It comprises five components: Text Recognition, SceneText-Centric VQA, Document-Oriented VQA, Key Information Extraction, and Handwritten Mathematical Expression Recognition. The benchmark includes 1000 question-answer pairs, and all the answers undergo manual verification and correction to ensure a more precise evaluation.

| Rank | Name | Language Model | Open Source | Text Recognition | Scene Text-Centric VQA | Doc-Oriented VQA | KIE | HMER | Final Score |
| 1 | Qwen-VL-Max | - | No | 254 | 166 | 148 | 143 | 12 | 723 |
| 2 | Qwen-VL-Plus | - | No | 248 | 155 | 141 | 141 | 9 | 694 |
| 3 | Gemini | - | No | 215 | 174 | 128 | 134 | 8 | 659 |
| 4 | GPT4V | - | No | 167 | 163 | 146 | 160 | 9 | 645 |
| 5 | MiniCPM-V-2 | MiniCPM-2.4B | Yes | 245 | 171 | 103 | 86 | 0 | 605 |
| 6 | mPLUG-DocOwl1.5 | LLaMA-2 7B | Yes | 182 | 157 | 126 | 134 | 0 | 599 |
| 7 | TextMonkey | Qwen-7B | Yes | 169 | 164 | 115 | 113 | 0 | 561 |
| 8 | InternVL-Chat-Chinese | LLaMA2-13B | Yes | 228 | 153 | 72 | 64 | 0 | 517 |
| 9 | Monkey | Qwen-7B | Yes | 174 | 161 | 91 | 88 | 0 | 514 |
| 10 | InternLM-XComposer2 | InternLM2-7B | Yes | 160 | 160 | 103 | 87 | 1 | 511 |
| 11 | QwenVL | Qwen-7B | Yes | 179 | 157 | 95 | 75 | 0 | 506 |
| 12 | mPLUG-Owl2 | LLaMA2-7B | Yes | 153 | 153 | 41 | 19 | 0 | 366 |
| 13 | LLaVAR | LLaMA-13B | Yes | 186 | 122 | 25 | 13 | 0 | 346 |
| 14 | LLaVA1.5-13B | Vicuna-v1.5-13B | Yes | 176 | 129 | 19 | 7 | 0 | 331 |
| 15 | InternLM-XComposer | InternLM-7B | Yes | 192 | 91 | 14 | 6 | 0 | 303 |
| 16 | LLaVA1.5-7B | Vicuna-v1.5-7B | Yes | 160 | 117 | 15 | 5 | 0 | 297 |
| 17 | mPLUG-Owl | LLaMA-2 7B | Yes | 172 | 104 | 18 | 3 | 0 | 297 |
| 18 | BLIVA | Vicuna-7B | Yes | 165 | 103 | 22 | 1 | 0 | 291 |
| 19 | InstructBLIP | Vicuna-7b | Yes | 168 | 93 | 14 | 1 | 0 | 276 |
| 20 | BLIP2-6.7B | OPT-6.7B | Yes | 154 | 71 | 10 | 0 | 0 | 235 |
| 21 | MiniGPT4V2 | LLaMA2-13B | Yes | 124 | 29 | 4 | 0 | 0 | 157 |
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
45,040
Reputation
7,423
Daps
136,048

AI in practice

May 1, 2024


Reddit users compile list of words and phrases that unmask ChatGPT's writing style​

Midjourney prompted by THE DECODER


Matthias Bastian

Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.


https://the-decoder.com/reddit-user...s-that-unmask-chatgpts-writing-style/#summary

As AI-generated texts become more prevalent on the internet, social media, and in emails, often without any labeling, Reddit users are sharing the telltale signs they use to identify text generated by ChatGPT.

In a thread started by user PowerfulDev (https://www.reddit.com/r/OpenAI/comments/1cdo36l/whats_your_personal_tell_word_to_identify/) that has garnered over 300 comments, users discussed the words and phrases that can be used to better identify ChatGPT-generated content. According to Reddit users, ChatGPT tends to use certain words disproportionately, such as:

  • Delve
  • Tapestry
  • Kaleidoscope
  • Foster
  • Nuanced
  • Crucial
  • Essential
  • Furthermore
  • Moreover


Many users agree that ChatGPT tends to draw conclusions too often, even when they are unnecessary. "In conclusion, I think this is a very helpful way to identify ChatGPT content," writes user MrSnowden.

Slightly more stylized language and phrases such as "in this digital world" or "let's dive deeper" are also considered indicators of ChatGPT text.

Commenters also describe ChatGPT as producing overly intellectual passages with words like "intricate," "nuanced," "complex," or "multifaceted" for certain topics.

Frequent use of hyphens in compound adjectives, even when not grammatically necessary, is also a potential ChatGPT telltale. Freylaverse points to em-dashes with no space on either side, which are not as easy to type on a keyboard as hyphens. ChatGPT uses them correctly, while lazy humans usually just use hyphens.

Sentences and paragraphs of uniform length and an overall formal style are also characteristic of ChatGPT. In emails, phrases like "I hope this email finds you well" or the excessive use of "moreover" or "furthermore" are seen as red flags.

AI detectives have an easy time with texts in which ChatGPT writes: "As an AI language model ...". Some people overlook this disclaimer and publish the text anyway.

However, some Reddit users caution against taking individual words as clear evidence of ChatGPT. After all, people might use those words - otherwise they wouldn't be in the training data. Only in combination and in large numbers are they indicative of AI-generated text.

Anecdotal analysis and the uncanny valley of communication​

An analysis by Reddit user peyotebonsai as part of a research project shows that LinkedIn posts written by AI perform slightly better on average on sentiment analysis using tools like TextBlob and Vader than posts written by human authors.

TextBlob and Vader are programs that can assess sentiment and emotion in text. The results suggest that AI-generated text tends to sound more positive than human-generated text because of word choice, Peyotebonsai says.

Interestingly, however, a comparison of other indicators, such as reposts, comments and likes, showed that AI-generated content received significantly less engagement from users than human-generated posts.

This suggests that LinkedIn users are well aware of the subtle but noticeable differences between human and machine expression.

The analysis should be taken with a grain of salt, of course, since peyotebonsai posts anonymously and does not publish the results in detail. But it is consistent with my anecdotal experience, for what it is worth.

Or as user opi098514 puts it: "For me, it’s not a word. It’s kind of the uncanny valley…. But with communication."

In case you never heard of this effect: The Uncanny Valley originally describes the phenomenon that robots or human-like figures are often perceived as uncanny because they look human-like, but not human-like enough. This creates a feeling of alienation.

When it comes to communication, the analogy suggests that the subtle but noticeable differences in the way AI and humans express themselves can create a subliminal feeling of discomfort in the reader, even though AI texts may appear more coherent and positive on the surface.


  • Reddit users discuss words and phrases that are typical of ChatGPT-generated text, including "delve," "tapestry," "kaleidoscope," "foster," "nuanced," "crucial," "essential," "moreover," and "furthermore."
  • Other indicators include stilted language, overly intellectual passages, many hyphens in compound adjectives, similarly long sentences and paragraphs, and a very formal style.
  • One user describes the feeling he gets when reading AI texts: They are like the uncanny valley of communication. Somehow real, but not.

Sources
Reddit




 

bnew

Veteran
Joined
Nov 1, 2015
Messages
45,040
Reputation
7,423
Daps
136,048



AI in practice

May 6, 2024

Massive prompts can outperform fine-tuning for LLMs, researchers find​

Midjourney prompted by THE DECODER


Matthias Bastian

Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.


Researchers have found that giving large language models (LLMs) many examples directly in the prompt can be more effective than time-consuming fine-tuning, according to a study from Carnegie Mellon and Tel Aviv University.

This "in-context learning" (ICL) approach becomes more effective as the context window of LLMs grows, allowing for hundreds or thousands of examples in prompts, especially for tasks with many possible answers.

One method for selecting examples for ICL is "retrieval," where an algorithm (BM25) chooses the most relevant examples from a large dataset for each new question. This improves performance compared to random selection, particularly when using fewer examples.
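A minimal sketch of that retrieval step with the rank-bm25 package is shown below; the toy dataset, field names, and prompt template are illustrative assumptions rather than the paper's exact pipeline.

Code:
from rank_bm25 import BM25Okapi  # pip install rank-bm25

train_set = [
    {"text": "The movie was a delight from start to finish.", "label": "positive"},
    {"text": "Dull plot and wooden acting.", "label": "negative"},
    {"text": "A masterpiece of modern cinema.", "label": "positive"},
]
query = "The acting felt stiff and lifeless."

# Score every training example against the query and keep the top k.
bm25 = BM25Okapi([ex["text"].lower().split() for ex in train_set])
scores = bm25.get_scores(query.lower().split())
top_k = sorted(range(len(train_set)), key=lambda i: scores[i], reverse=True)[:2]

# Pack the retrieved demonstrations into an in-context-learning prompt.
demos = "\n".join(f"Review: {train_set[i]['text']}\nLabel: {train_set[i]['label']}" for i in top_k)
prompt = f"{demos}\nReview: {query}\nLabel:"
print(prompt)  # this string is what gets sent to the LLM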

However, the performance gain from retrieval diminishes with large numbers of examples, suggesting that longer prompts become more robust and individual examples or their order become less important.

While fine-tuning usually requires more data than ICL, it can sometimes outperform ICL with very long contexts. In some cases, ICL with long examples can be more effective and efficient than fine-tuning, even though ICL does not actually learn tasks but solves them using the examples, the researchers noted.



Fine-tuning sometimes, but not always, exceeds ICL at high numbers of demonstrations. | Image: Bertsch et al.



The experiments used special variants of the Llama-2-7B and Mistral-7B language models, which can process particularly long input text. The results suggest that ICL with many examples can be a viable alternative to retrieval and fine-tuning, especially as future models improve at handling extremely long input texts.

Ultimately, the choice between ICL and fine-tuning comes down to cost. Fine-tuning has a higher one-time cost, while ICL requires more computing power due to the many examples in the prompt. In some cases, it may be best to use many-shot prompts until you get a robust, reliable, high-quality result, and then use that data for fine-tuning.

While finetuning with full datasets is still a powerful option if the data vastly exceeds the context length, our results suggest that long-context ICL is an effective alternative - trading finetuning-time cost for increased inference-time compute. As the effectiveness and efficiency of using very long model context lengths continues to increase, we believe long-context ICL will be a powerful tool for many tasks.

From the paper

The study confirms the results of a recent Google Deepmind study on many-shot prompts, which also showed that using hundreds to thousands of examples can significantly improve LLM results.

  • Researchers at Carnegie Mellon and Tel Aviv University have discovered that the results of large language models (LLMs) improve the more examples you give them directly in the input (prompt) as context. This method, called "In-Context Learning" (ICL), could be an alternative to time-consuming fine-tuning.
  • In ICL with a large number of examples in the prompt, the performance of the language models increases further, especially for tasks with many possible answers. Retrieval methods for selecting relevant examples further improve the results. Finetuning requires more data than ICL, but can provide even better results in some cases.
  • The researchers believe that ICL with long contexts will be a powerful tool for many tasks as language models get better at handling extremely long texts. Ultimately, it is also a question of cost whether ICL or fine-tuning is used. The study confirms earlier results from Google Deepmind on many-shot prompts.
Sources

Arxiv




Computer Science > Computation and Language​

[Submitted on 30 Apr 2024]

In-Context Learning with Long-Context Models: An In-Depth Exploration​

Amanda Bertsch, Maor Ivgi, Uri Alon, Jonathan Berant, Matthew R. Gormley, Graham Neubig
As model context lengths continue to increase, the number of demonstrations that can be provided in-context approaches the size of entire training datasets. We study the behavior of in-context learning (ICL) at this extreme scale on multiple datasets and models. We show that, for many datasets with large label spaces, performance continues to increase with hundreds or thousands of demonstrations. We contrast this with example retrieval and finetuning: example retrieval shows excellent performance at low context lengths but has diminished gains with more demonstrations; finetuning is more data hungry than ICL but can sometimes exceed long-context ICL performance with additional data. We use this ICL setting as a testbed to study several properties of both in-context learning and long-context models. We show that long-context ICL is less sensitive to random input shuffling than short-context ICL, that grouping of same-label examples can negatively impact performance, and that the performance boosts we see do not arise from cumulative gain from encoding many examples together. We conclude that although long-context ICL can be surprisingly effective, most of this gain comes from attending back to similar examples rather than task learning.
Comments:27 pages; preprint
Subjects:Computation and Language (cs.CL)
Cite as:arXiv:2405.00200 [cs.CL]
(or arXiv:2405.00200v1 [cs.CL] for this version)

Submission history

From: Amanda Bertsch [view email]
[v1] Tue, 30 Apr 2024 21:06:52 UTC (233 KB)


 

bnew

Veteran
Joined
Nov 1, 2015
Messages
45,040
Reputation
7,423
Daps
136,048

AI in practice

May 5, 2024


Open-source model Prometheus 2 can evaluate other language models nearly as well as GPT-4​

Midjourney prompted by THE DECODER


Matthias Bastian

Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.


Prometheus 2, a freely available language model, has been optimized to evaluate other language models, catching up with commercial models such as GPT-4.

These evaluations allow researchers and developers to objectively measure and compare the performance of their language models and receive detailed feedback on strengths and weaknesses for targeted improvements, helping to continuously enhance the quality and reliability of language models.

Until now, proprietary models such as GPT-4 have often been used for these evaluations, but they lack transparency, are difficult to control, and are not affordable for many, according to a research team led by Seungone Kim of KAIST AI. Kim's team developed Prometheus 2 to provide an independent, transparent, and detailed evaluation of language models for everyone.

Prometheus 2 can perform evaluations similar to humans and GPT-4, mastering the two most common evaluation methods: direct evaluation, assigning scores on a scale, and pairwise comparison, deciding which of two responses is better.


Prometheus 2 can score answers directly or select the better of two answers. | Image: Kim et al.


It can also evaluate on user-defined criteria, not limited to general aspects such as helpfulness and harmlessness, allowing for optimization for specific applications, the researchers report.

For example, a medical advice chatbot can be trained and tested on criteria such as trustworthiness, empathy, and professional correctness, enabling the development of high-quality language models for different applications, the team explained.

A new data set and mixed weights​

To train Prometheus 2, the researchers created a new pairwise comparison dataset called the "Preference Collection," which contains more than 1,000 different evaluation criteria beyond basic characteristics.

They found that the best results came from training two separate models - one for direct ratings based on the existing Feedback Collection dataset, and one for pairwise comparisons based on the new Preference Collection dataset - and then combining their learned weights.

In tests with eight datasets (four for direct ratings, four for pairwise comparisons), Prometheus 2 achieved the highest agreement with human judgments and commercial language models of all freely available rating models.

Although it lags behind GPT-4 and Claude 3 Opus in many tests, it can significantly close the gap with proprietary models, the researchers report.


Prometheus 2 can evaluate generated text as well as GPT-4 and Claude 3 Opus, but offers much more transparency and is potentially cheaper. The table shows the results for direct evaluation. | Image: Kim et al.

Prometheus 2 supports independent and transparent evaluation of language models for everyone, contributing to greater fairness and accessibility in the field, according to Kim's team. The code and data are available on Github.

The Prometheus 2 models (7B & 8x7B) are available from HuggingFace. According to the team, the faster 7B model achieves 80 percent of the evaluation performance of the 8x7B model, is on par with Mistral's Mixtral-8x7B, and better than Meta's Llama 2 70B.
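A rough sketch of running a direct assessment with the 7B evaluator through transformers follows. The Hugging Face repo name and the rubric prompt below are assumptions; the project's GitHub provides the official prompt templates and tooling and should be treated as authoritative.

Code:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "prometheus-eval/prometheus-7b-v2.0"  # assumed repo name; check the project's GitHub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

instruction = "Explain photosynthesis to a ten-year-old."
response = "Plants use sunlight, water, and air to make their own food."
rubric = "Is the answer accurate, age-appropriate, and complete?"

# Simplified direct-assessment prompt: feedback first, then a 1-5 score.
prompt = (
    "You are an evaluator. Given an instruction, a response, and a rubric, "
    "write brief feedback and then a score from 1 to 5.\n\n"
    f"Instruction: {instruction}\nResponse: {response}\nRubric: {rubric}\nFeedback:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))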

Summary


  • Prometheus 2 is a freely available language model that can evaluate other language models as well as commercial models such as GPT-4, but is more transparent and potentially cheaper.
  • The model was trained on two separate datasets - one for direct scores and one for pairwise comparisons. By combining the learned weights, the researchers achieved the best results.
  • In tests on eight datasets, Prometheus 2 achieved the highest agreement with human judgments of any freely available model. This makes it possible for anyone to perform an independent and detailed evaluation of language models.


Sources

Github Paper




Computer Science > Computation and Language​

[Submitted on 2 May 2024]

Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models​

Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, Minjoon Seo
Proprietary LMs such as GPT-4 are often employed to assess the quality of responses from various LMs. However, concerns including transparency, controllability, and affordability strongly motivate the development of open-source LMs specialized in evaluations. On the other hand, existing open evaluator LMs exhibit critical shortcomings: 1) they issue scores that significantly diverge from those assigned by humans, and 2) they lack the flexibility to perform both direct assessment and pairwise ranking, the two most prevalent forms of assessment. Additionally, they do not possess the ability to evaluate based on custom evaluation criteria, focusing instead on general attributes like helpfulness and harmlessness. To address these issues, we introduce Prometheus 2, a more powerful evaluator LM than its predecessor that closely mirrors human and GPT-4 judgements. Moreover, it is capable of processing both direct assessment and pair-wise ranking formats grouped with a user-defined evaluation criteria. On four direct assessment benchmarks and four pairwise ranking benchmarks, Prometheus 2 scores the highest correlation and agreement with humans and proprietary LM judges among all tested open evaluator LMs. Our models, code, and data are all publicly available at this https URL.
Comments: Work in Progress
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2405.01535 [cs.CL]
(or arXiv:2405.01535v1 [cs.CL] for this version)
[2405.01535] Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models

Submission history

From: Seungone Kim [view email]
[v1] Thu, 2 May 2024 17:59:35 UTC (1,959 KB)


 

AI in practice

May 5, 2024

Med-Gemini and Meditron: Google and Meta present new LLMs for medicine​

Ideogram prompted by THE DECODER


Matthias Bastian



Google and Meta have introduced language models optimized for medical tasks based on their latest LLMs, Gemini and Llama 3. Both models are designed to support doctors and medical staff in a variety of tasks.

Google's Med-Gemini is built on the multimodal Gemini model family. It has been further trained with medical data to draw logical conclusions, understand different modalities such as images and text, and process long contexts.


Image: Google

According to Google, Med-Gemini achieved new top scores in 10 out of 14 medical benchmarks tested, including answering medical exam questions.

Med-Gemini uses a novel uncertainty-based web search. If the model is uncertain about a question, it automatically performs a web search. The additional information from the web is used to reduce the model's uncertainty and improve the quality of the answers.
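
Google has not published the implementation, but the basic control flow can be illustrated with a hedged sketch: sample several answers, treat disagreement as uncertainty, and only retrieve web evidence when agreement is low. Here, generate_answer and web_search are hypothetical stand-ins for an LLM call and a search API, and the threshold is an arbitrary choice.

from collections import Counter

def answer_with_uncertainty_gate(question, generate_answer, web_search,
                                 n_samples=5, agreement_threshold=0.8):
    # Sample several answers and measure how often the most common one appears.
    samples = [generate_answer(question) for _ in range(n_samples)]
    top_answer, top_count = Counter(samples).most_common(1)[0]
    agreement = top_count / n_samples

    if agreement >= agreement_threshold:
        return top_answer  # the model is consistent enough; skip retrieval

    # Low agreement is treated as uncertainty: fetch web evidence and re-ask.
    evidence = "\n".join(web_search(question, top_k=3))
    grounded_question = f"Context:\n{evidence}\n\nQuestion: {question}"
    return generate_answer(grounded_question)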

In answering medical questions, Med-Gemini is only just ahead of its predecessor Med-PaLM 2, and its lead over GPT-4, which is not specifically optimized for medical questions, is even smaller.


Image: Google

This may seem like a small improvement, but every percentage point counts when developing a reliable medical model, and the higher the scores climb, the harder further gains become. Still, it shows once again that GPT-4, although a general-purpose LLM, is already capable in niche areas.

According to Google, the performance difference is more evident for multimodal tasks such as evaluating medical images. Here, Med-Gemini outperforms GPT-4V by an average relative margin of 44.5 percent. Through fine-tuning and custom encoders, it can also process modalities such as ECG recordings.


Example of a medical chatbot exchange with a doctor: this is how Google envisions the use of Med-Gemini as a diagnostic assistant. | Image: Google

Google uses long context processing to perform reliable LLM-based searches in long pseudonymized patient records, and Med-Gemini can also answer questions about medical instructional videos.

Meta says Meditron is the most capable open-source LLM​

In collaboration with EPFL in Lausanne and Yale University, Meta has developed a suite of models called Meditron, based on its open-source Llama 3 model. Meta wants it to be especially useful in developing countries and for humanitarian missions.


Continued pre-training on carefully curated medical data aims to avoid distortions caused by the original Llama 3 web training. For cost reasons, the research team first tested the optimal data mix on the 7B model and then scaled it up to the 70B model.
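
To illustrate what continued pre-training on a domain corpus involves in practice, here is a minimal sketch using the Hugging Face Trainer. The base checkpoint, dataset file, and hyperparameters are placeholders, not Meditron's actual recipe.

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "meta-llama/Meta-Llama-3-8B"  # placeholder base checkpoint (gated on the Hub)
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Placeholder corpus: swap in a carefully curated medical text dataset.
corpus = load_dataset("text", data_files={"train": "medical_corpus.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="medical-cpt",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=16,
                           num_train_epochs=1,
                           learning_rate=2e-5),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()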

According to Meta, Meditron is the most capable open-source LLM for medicine in benchmarks such as answering biomedical exam questions. But it's not yet on the same level as proprietary models.


Image: Meta

It is being tested and developed in a "Massive Online Open Validation and Evaluation" (MOOVE) by doctors worldwide, especially from developing countries. Meditron is available from Hugging Face in 7B and 70B versions.

Both models have yet to prove themselves in practice. Many questions about risks, traceability, and liability remain to be answered, especially for use in diagnostics. Google and Meta also stress that further extensive research and development is needed before these models can be used in safety-critical medical tasks.


Summary

  • Google and Meta present language models optimized for medical tasks: Google's Med-Gemini is based on the Gemini model family, and Meta's Meditron is based on the open-source Llama 3 model, both designed to support physicians and medical staff.
  • Med-Gemini achieves new highs in many medical benchmarks and uses uncertainty-based web search to improve response quality. It outperforms GPT-4 on multimodal tasks such as medical image analysis and can handle long contexts such as patient records.
  • Meditron has been optimized through continuous pre-training on medical data and, according to Meta, is the most capable open-source LLM for medicine. It is being tested and developed in an extensive online validation by physicians worldwide, especially for use in countries with fewer medical resources.
Sources

Meta Med-Gemini Paper




Computer Science > Artificial Intelligence​

[Submitted on 29 Apr 2024 (v1), last revised 1 May 2024 (this version, v2)]

Capabilities of Gemini Models in Medicine​

Khaled Saab, Tao Tu, Wei-Hung Weng, Ryutaro Tanno, David Stutz, Ellery Wulczyn, Fan Zhang, Tim Strother, Chunjong Park, Elahe Vedadi, Juanma Zambrano Chaves, Szu-Yeu Hu, Mike Schaekermann, Aishwarya Kamath, Yong Cheng, David G.T. Barrett, Cathy Cheung, Basil Mustafa, Anil Palepu, Daniel McDuff, Le Hou, Tomer Golany, Luyang Liu, Jean-baptiste Alayrac, Neil Houlsby, Nenad Tomasev, Jan Freyberg, Charles Lau, Jonas Kemp, Jeremy Lai, Shekoofeh Azizi, Kimberly Kanada, SiWai Man, Kavita Kulkarni, Ruoxi Sun, Siamak Shakeri, Luheng He, Ben Caine, Albert Webson, Natasha Latysheva, Melvin Johnson, Philip Mansfield, Jian Lu, Ehud Rivlin, Jesper Anderson, Bradley Green, Renee Wong, Jonathan Krause, Jonathon Shlens, Ewa Dominowska, S. M. Ali Eslami, Katherine Chou, Claire Cui, Oriol Vinyals, Koray Kavukcuoglu, James Manyika, Jeff Dean, Demis Hassabis, Yossi Matias, Dale Webster, Joelle Barral, Greg Corrado, Christopher Semturs, S. Sara Mahdavi, Juraj Gottweis, Alan Karthikesalingam, Vivek Natarajan
Excellence in a wide variety of medical applications poses considerable challenges for AI, requiring advanced reasoning, access to up-to-date medical knowledge and understanding of complex multimodal data. Gemini models, with strong general capabilities in multimodal and long-context reasoning, offer exciting possibilities in medicine. Building on these core strengths of Gemini, we introduce Med-Gemini, a family of highly capable multimodal models that are specialized in medicine with the ability to seamlessly use web search, and that can be efficiently tailored to novel modalities using custom encoders. We evaluate Med-Gemini on 14 medical benchmarks, establishing new state-of-the-art (SoTA) performance on 10 of them, and surpass the GPT-4 model family on every benchmark where a direct comparison is viable, often by a wide margin. On the popular MedQA (USMLE) benchmark, our best-performing Med-Gemini model achieves SoTA performance of 91.1% accuracy, using a novel uncertainty-guided search strategy. On 7 multimodal benchmarks including NEJM Image Challenges and MMMU (health & medicine), Med-Gemini improves over GPT-4V by an average relative margin of 44.5%. We demonstrate the effectiveness of Med-Gemini's long-context capabilities through SoTA performance on a needle-in-a-haystack retrieval task from long de-identified health records and medical video question answering, surpassing prior bespoke methods using only in-context learning. Finally, Med-Gemini's performance suggests real-world utility by surpassing human experts on tasks such as medical text summarization, alongside demonstrations of promising potential for multimodal medical dialogue, medical research and education. Taken together, our results offer compelling evidence for Med-Gemini's potential, although further rigorous evaluation will be crucial before real-world deployment in this safety-critical domain.
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as: arXiv:2404.18416 [cs.AI]
(or arXiv:2404.18416v2 [cs.AI] for this version)
[2404.18416] Capabilities of Gemini Models in Medicine

Submission history

From: Khaled Saab [view email]
[v1] Mon, 29 Apr 2024 04:11:28 UTC (4,986 KB)
[v2] Wed, 1 May 2024 17:12:10 UTC (4,986 KB)


 



AI in practice

Apr 29, 2024


Meaningless fillers enable complex thinking in large language models​

Ideogram prompted by THE DECODER


Matthias Bastian


Researchers have found that specifically trained LLMs can solve complex problems just as well using dots like "......" instead of full sentences. This could make it harder to control what's happening in these models.

The researchers trained Llama language models to solve a hard algorithmic task called "3SUM", where the model has to find three numbers in a list that add up to zero.

Usually, AI models solve such tasks by explaining the steps in full sentences, known as "chain of thought" prompting. But the researchers replaced these natural language explanations with repeated dots, called filler tokens.

Surprisingly, the models using dots performed as well as those using natural language reasoning with full sentences. As the tasks became more difficult, the dot models outperformed models that responded directly without any intermediate reasoning.



The three prompting methods compared in the study. | Image: Jacob Pfau, William Merrill & Samuel R. Bowman

The researchers discovered the models were actually using the dots for calculations relevant to the task. The more dots available, the more accurate the answer was, suggesting more dots could provide the model with greater "thinking capacity".

They suspect the dots act as placeholders where the model inserts various numbers and checks if they meet the task's conditions. This allows the model to answer very complex questions it couldn't solve all at once.
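
A toy sketch of the three prompt formats compared in the study (direct answer, chain of thought, filler tokens), applied to a small 3SUM instance, might look like this; the exact serialization the authors use may differ.

from itertools import combinations

def solve_3sum(numbers):
    """Brute-force 3SUM: return a triple that sums to zero, or None."""
    for triple in combinations(numbers, 3):
        if sum(triple) == 0:
            return triple
    return None

def build_prompts(numbers):
    instance = " ".join(str(n) for n in numbers)
    answer = "True" if solve_3sum(numbers) else "False"

    direct = f"{instance} -> {answer}"

    # Chain of thought: spell out every intermediate sum that gets checked.
    steps = [f"{a}+{b}+{c}={a + b + c}" for a, b, c in combinations(numbers, 3)]
    chain_of_thought = f"{instance} -> {' ; '.join(steps)} -> {answer}"

    # Filler tokens: the same number of intermediate tokens, but all content-free dots.
    filler = f"{instance} -> {'.' * len(steps)} -> {answer}"

    return direct, chain_of_thought, filler

print(*build_prompts([3, -1, -2, 5]), sep="\n")

The filler variant keeps intermediate tokens in place of the reasoning steps, but every one of them is content-free, so whatever computation happens must take place inside the model's activations rather than in readable text.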

Co-author Jacob Pfau says this result poses a key question for AI security: As AI systems increasingly "think" in hidden ways, how can we ensure they remain reliable and safe?

The finding aligns with recent research showing that longer chain-of-thought prompts can boost language model performance even if the added content is off-topic and essentially just adds more tokens.

The researchers think it could be useful in the future to teach AI systems to handle filler tokens from the start, despite the challenging training process. This may be worthwhile when the problems LLMs need to solve are highly complex and cannot be solved in a single step.

Additionally, the training data must include enough examples where the problem is broken into smaller, simultaneously processable parts.

If these criteria are met, the dot method could also work in regular AI systems, helping them answer tough questions without it being obvious from their responses.

However, training with filler tokens is considered difficult because it is unclear exactly what the model computes with the dots, and the approach does not work well for explanations that require a specific sequence of steps.

Popular chatbots like ChatGPT can't automatically do the dot reasoning - they need to be trained for it. So chain-of-thought prompting is still the standard approach to improving LLM reasoning.

Summary

  • Researchers have found that AI models can solve complex tasks like "3SUM" by using simple dots like "......" instead of sentences. The more dots available, the more accurate the results.
  • The dots are thought to act as placeholders into which the model inserts different numbers and checks that they fulfil the conditions. This makes it possible to answer very complex questions that cannot be solved in one go.
  • According to the researchers, this hidden computation raises safety issues when AI systems "think" in secret.

Sources

Arxiv




Computer Science > Computation and Language​

[Submitted on 24 Apr 2024]

Let's Think Dot by Dot: Hidden Computation in Transformer Language Models​

Jacob Pfau, William Merrill, Samuel R. Bowman
Chain-of-thought responses from language models improve performance across most benchmarks. However, it remains unclear to what extent these performance gains can be attributed to human-like task decomposition or simply the greater computation that additional tokens allow. We show that transformers can use meaningless filler tokens (e.g., '......') in place of a chain of thought to solve two hard algorithmic tasks they could not solve when responding without intermediate tokens. However, we find empirically that learning to use filler tokens is difficult and requires specific, dense supervision to converge. We also provide a theoretical characterization of the class of problems where filler tokens are useful in terms of the quantifier depth of a first-order formula. For problems satisfying this characterization, chain-of-thought tokens need not provide information about the intermediate computational steps involved in multi-token computations. In summary, our results show that additional tokens can provide computational benefits independent of token choice. The fact that intermediate tokens can act as filler tokens raises concerns about large language models engaging in unauditable, hidden computations that are increasingly detached from the observed chain-of-thought tokens.
Comments: 17 pages, 10 figures
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
ACM classes: I.2.6
Cite as: arXiv:2404.15758 [cs.CL]
(or arXiv:2404.15758v1 [cs.CL] for this version)
[2404.15758] Let's Think Dot by Dot: Hidden Computation in Transformer Language Models

Submission history​

From: Jacob Pfau [view email]
[v1] Wed, 24 Apr 2024 09:30:00 UTC (579 KB)

 