Large Language Models News & Discussions

bnew · Apr 30, 2023

https://web.archive.org/web/20230430105900/https://twitter.com/weights_biases/status/1651375841899577350

Secure Da Bag · Apr 30, 2023

Voice of Reason said:
I trying to work through the fast.ai course to learn this stuff

Practical Deep Learning for Coders - Practical Deep Learning

A free course designed for people with some coding experience, who want to learn how to apply deep learning and machine learning to practical problems.

course.fast.ai

Voice of Reason · Apr 30, 2023

Secure Da Bag said:
Practical Deep Learning for Coders - Practical Deep Learning

A free course designed for people with some coding experience, who want to learn how to apply deep learning and machine learning to practical problems.

course.fast.ai

Yeah I'm on lesson 2 right now

bnew · May 5, 2023

https://archive.is/wip/ueDbO

https://archive.is/isCZP

https://archive.is/8BHyv

Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs

Introducing MPT-7B, the latest entry in our MosaicML Foundation Series. MPT-7B is a transformer trained from scratch on 1T tokens of text and code. It is open source, available for commercial use, and matches the quality of LLaMA-7B. MPT-7B was trained on the MosaicML platform in 9.5 days with...

www.mosaicml.com

GitHub - mosaicml/llm-foundry: LLM training code for MosaicML foundation models

LLM training code for MosaicML foundation models. Contribute to mosaicml/llm-foundry development by creating an account on GitHub.

github.com

MPT-7B-Instruct - a Hugging Face Space by mosaicml

Discover amazing ML apps made by the community

huggingface.co

Macallik86 · May 13, 2023

@bnew are you using any of the open-source models? I don't have the PC power for a local setup, but I was today years old when I found out that they have web-facing options that anyone can utilize:

Sign Up - Open Assistant

Conversational AI for everyone. An open source project to create a chat enabled GPT LLM run by LAION and contributors around the world.

open-assistant.io

HuggingChat

Making the community's best AI chat models available to everyone.

huggingface.co

I think my usage will likely mirror the way I use search engines. For example, I primarily search through Ecosia (powered by Bing). It's less accurate than google but it has an altruistic motive so I force myself to use it for everyday queries. The second option is Whoogle (powered by Google), solely for more nuanced/accurate searches.

I'm thinking the same will apply for my LLMs. I'm thinking I'll try to use the open-sourced models for as much as possible so that they get better, and then utilize Bard/Edge (or ChatGPT) for more nuanced questions that require the best models available.

bnew · May 13, 2023

Macallik86 said:
@bnew are you using any of the open-source models? I don't have the PC power for a local setup, but I was today years old when I found out that I can just query them via their websites:

Sign Up - Open Assistant

Conversational AI for everyone. An open source project to create a chat enabled GPT LLM run by LAION and contributors around the world.

open-assistant.io

HuggingChat

Making the community's best AI chat models available to everyone.

huggingface.co

I think my usage will likely mirror the way I use search engines. For example, I primarily search through Ecosia (powered by Bing). I use it for most of my everyday queries. It's less accurate than google but it has an altruistic motive so I force myself to use it. The second option is Whoogle (powered by Google) for more nuanced searches (albeit privacy-focused) that is more accurate.

I'm thinking the same will apply for my LLMs. I'm thinking I'll try to use the open-sourced models for as much as possible so that they get better, and then utilize Bard/Edge (or ChatGPT) for more nuanced questions that require the best models available.

I've downloaded like 2 dozen or so models but haven't run them locally yet, I use the open source ones online since they do what I need and most of the models I've downloaded have online demos I can use anytime.

try the same prompt in several different models because the answers will be different.

Chat with Open Large Language Models

Macallik86 · May 13, 2023

bnew said:
I've downloaded like 2 dozen or so models but haven't run them locally yet, I use the open source ones online since they do what I need and most of the models I've downloaded have online demos I can use anytime.

try the same prompt in several different models because the answers will be different.

Chat with Open Large Language Models

Oh wow, all of the open-source models in one website

. There goes my wknd

bnew · May 19, 2023

awesome-ml/llm-model-list.md at master · underlines/awesome-ml

Curated list of useful LLM / Analytics / Datascience resources - underlines/awesome-ml

github.com

Hood Critic · May 19, 2023

Digging into this framework to start building out my own custom AI assistant.

Welcome to LangChain — 🦜🔗 LangChain 0.0.174

bnew · May 21, 2023

https://archive.is/T6cSx

bnew · May 21, 2023

GitHub - ray-project/llm-numbers: Numbers every LLM developer should know

Numbers every LLM developer should know. Contribute to ray-project/llm-numbers development by creating an account on GitHub.

github.com

Numbers every LLM Developer should know

At Google, there was a document put together by Jeff Dean, the legendary engineer, called Numbers every Engineer should know. It’s really useful to have a similar set of numbers for LLM developers to know that are useful for back-of-the envelope calculations. Here we share particular numbers we at Anyscale use, why the number is important and how to use it to your advantage.

Notes on the Github version

Last updates: 2023-05-17

If you feel there's an issue with the accuracy of the numbers, please file an issue. Think there are more numbers that should be in this doc? Let us know or file a PR.

We are thinking the next thing we should add here is some stats on tokens per second of different models.

Prompts

40-90%1: Amount saved by appending “Be Concise” to your prompt

It’s important to remember that you pay by the token for responses. This means that asking an LLM to be concise can save you a lot of money. This can be broadened beyond simply appending “be concise” to your prompt: if you are using GPT-4 to come up with 10 alternatives, maybe ask it for 5 and keep the other half of the money.

1.3:1 -- Average tokens per word

LLMs operate on tokens. Tokens are words or sub-parts of words, so “eating” might be broken into two tokens “eat” and “ing”. A 750 word document in English will be about 1000 tokens. For languages other than English, the tokens per word increases depending on their commonality in the LLM's embedding corpus.

Knowing this ratio is important because most billing is done in tokens, and the LLM’s context window size is also defined in tokens.

Prices2

Prices are of course subject to change, but given how expensive LLMs are to operate, the numbers in this section are critical. We use OpenAI for the numbers here, but prices from other providers you should check out (Anthropic, Cohere) are in the same ballpark.

~50:1 -- Cost Ratio of GPT-4 to GPT-3.5 Turbo3

What this means is that for many practical applications, it’s much better to use GPT-4 for things like generation and then use that data to fine tune a smaller model. It is roughly 50 times cheaper to use GPT-3.5-Turbo than GPT-4 (the “roughly” is because GPT-4 charges differently for the prompt and the generated output) – so you really need to check on how far you can get with GPT-3.5-Turbo. GPT-3.5-Turbo is more than enough for tasks like summarization for example.

5:1 -- Cost Ratio of generation of text using GPT-3.5-Turbo vs OpenAI embedding

This means it is way cheaper to look something up in a vector store than to ask an LLM to generate it. E.g. “What is the capital of Delaware?” when looked up in an neural information retrieval system costs about 5x4 less than if you asked GPT-3.5-Turbo. The cost difference compared to GPT-4 is a whopping 250x!

10:1 -- Cost Ratio of OpenAI embedding to Self-Hosted embedding

Note: this number is sensitive to load and embedding batch size, so please consider this approximate.

In our blog post, we noted that using a g4dn.4xlarge (on-demand price: $1.20/hr) we were able to embed at about 9000 tokens per second using Hugging Face’s SentenceTransformers (which are pretty much as good as OpenAI’s embeddings). Doing some basic math of that rate and that node type indicates it is considerably cheaper (factor of 10 cheaper) to self-host embeddings (and that is before you start to think about things like ingress and egress fees).

6:1 -- Cost Ratio of OpenAI fine tuned vs base model queries

It costs you 6 times as much to serve a fine tuned model as it does the base model on OpenAI. This is pretty exorbitant, but might make sense because of the possible multi-tenancy of base models. It also means it is far more cost effective to tweak the prompt for a base model than to fine tune a customized model.

1:1 -- Cost Ratio of Self-Hosted base vs fine-tuned model queries

If you’re self hosting a model, then it more or less costs the same amount to serve a fine tuned model as it does to serve a base one: the models have the same number of parameters.

Training and Fine Tuning

~$1 million: Cost to train a 13 billion parameter model on 1.4 trillion tokens

The LLaMa paper mentions it took them 21 days to train LLaMa using 2048 GPUs A100 80GB GPUs. We considered training our own model on the Red Pajama training set, then we ran the numbers. The above is assuming everything goes right, nothing crashes, and the calculation succeeds on the first time, etc. Plus it involves the coordination of 2048 GPUs. That’s not something most companies can do (shameless plug time: of course, we at Anyscale can – that’s our bread and butter! Contact us if you’d like to learn more). The point is that training your own LLM is possible, but it’s not cheap. And it will literally take days to complete each run. Much cheaper to use a pre-trained model.

< 0.001: Cost ratio of fine tuning vs training from scratch

This is a bit of a generalization, but the cost of fine tuning is negligible. We showed for example that you can fine tune a 6B parameter model for about $7. Even at OpenAI’s rate for its most expensive fine-tunable model, Davinci, it is 3c per 1000 tokens. That means to fine tune on the entire works of Shakespeare (about 1 million words), you’re looking at $405. However, fine tuning is one thing and training from scratch is another …

GPU Memory

If you’re self-hosting a model, it’s really important to understand GPU memory because LLMs push your GPU’s memory to the limit. The following statistics are specifically about inference. You need considerably more memory for training or fine tuning.

V100: 16GB, A10G: 24GB, A100: 40/80GB: GPU Memory Capacities

It may seem strange, but it’s important to know the amount of memory different types of GPUs have. This will cap the number of parameters your LLM can have. Generally, we like to use A10Gs because they cost $1.50 to $2 per hour each at AWS on-demand prices and have 24G of GPU memory, vs the A100s which will run you about $5 each at AWS on-demand prices.

2x number of parameters: Typical GPU memory requirements of an LLM for serving

For example, if you have a 7 billion parameter model, it takes about 14GB of GPU space. This is because most of the time, one 16-bit float (or 2 bytes) is required per parameter. There’s usually no need to go beyond 16-bit accuracy, and most of the time when you go to 8-bit accuracy you start to lose resolution (though that may be acceptable in some cases). Of course there are efforts to reduce this, notably llama.cpp which runs a 13 billion parameter model on a 6GB GPU by quantizing aggressively down to 4 bits (and 8 bits without too much impact), but that’s atypical.

~1GB: Typical GPU memory requirements of an embedding model

Whenever you are doing sentence embedding (a very typical thing you do for clustering, semantic search and classification tasks), you need an embedding model like sentence transformers. OpenAI also has its own embeddings that they provide commercially.

You typically don’t have to worry about how much memory embeddings take on the GPU, they’re fairly small. We’ve even had the embedding and the LLM on the same GPU.

>10x: Throughput improvement from batching LLM requests

Running an LLM query through a GPU is very high latency: it may take, say, 5 seconds, with a throughput of 0.2 queries per second. The funny thing is, though, if you run two tasks, it might only take 5.2 seconds. This means that if you can bundle 25 queries together, it would take about 10 seconds, and our throughput has improved to 2.5 queries per second. However, see the next point.

~1 MB: GPU Memory required for 1 token of output with a 13B parameter model

The amount of memory you need is directly proportional to the maximum number of tokens you want to generate. So for example, if you want to generate outputs of up to 512 tokens (about 380 words), you need 512MB. No big deal you might say – I have 24GB to spare, what’s 512MB? Well, if you want to run bigger batches it starts to add up. So if you want to do batches of 16, you need 8GB of space. There are some techniques being developed that overcome this, but it’s still a real issue.

Cheatsheet

storyteller · May 31, 2023

Macallik86 said:
Speak of the devil, yesterday Replika doubled back to their erotic AI after the incels got ornery about the functionality being removed:

AI chatbot company Replika restores erotic roleplay for some users

AI chatbot company Replika is restoring erotic roleplay for some users, Replika CEO Eugenia Kuyda said late on Friday.

www.reuters.com

I guess I forgot to link it, but we discussed this story on the podcast.

We've had a bunch of AI stuff because the stories are hella interesting and bugged out. Just put out this short clip about Geoffrey Hinton "Godfather of AI" being startled after an AI explained to him what makes jokes funny.

And our next episode will drop with coverage and discussion on a positive story for a change. AI helped researchers discover an antibiotic that can beat a superbug resistant to other antibiotics. It also seems specialized in a way to prevent the bug from developing resistance to it and uses a method of delivery that's completely unique.

https://www.cnn.com/2023/05/25/health/antibiotic-artificial-intelligence-superbug/index.html

This last story's clip won't be out for a while though, still going through edits and all that.

bnew · Jun 8, 2023

bnew · Jun 12, 2023

bnew said:
Try chatting with fine-tuned models for Falcon-7B, Falcon-40B, and the new Open-Llama-7B

h2oGPT

Making h2oGPT models available to everyone. For more information, visit our GitHub pages: H2O LLM Studio and h2oGPT

bnew · Jun 13, 2023

Function calling and other API updates

We’re announcing updates including more steerable API models, function calling capabilities, longer context, and lower prices.

openai.com

Function calling and other API updates

We’re announcing updates including more steerable API models, function calling capabilities, longer context, and lower prices.

June 13, 2023

Authors

We released gpt-3.5-turbo and gpt-4 earlier this year, and in only a short few months, have seen incredible applications built by developers on top of these models.

Today, we’re following up with some exciting updates:

new function calling capability in the Chat Completions API
updated and more steerable versions of gpt-4 and gpt-3.5-turbo
new 16k context version of gpt-3.5-turbo (vs the standard 4k version)
75% cost reduction on our state-of-the-art embeddings model
25% cost reduction on input tokens for gpt-3.5-turbo
announcing the deprecation timeline for the gpt-3.5-turbo-0301 and gpt-4-0314 models

All of these models come with the same data privacy and security guarantees we introduced on March 1 — customers own all outputs generated from their requests and their API data will not be used for training.

Function calling

Developers can now describe functions to gpt-4-0613 and gpt-3.5-turbo-0613, and have the model intelligently choose to output a JSON object containing arguments to call those functions. This is a new way to more reliably connect GPT's capabilities with external tools and APIs.
These models have been fine-tuned to both detect when a function needs to be called (depending on the user’s input) and to respond with JSON that adheres to the function signature. Function calling allows developers to more reliably get structured data back from the model. For example, developers can:

Create chatbots that answer questions by calling external tools (e.g., like ChatGPT Plugins)

Convert queries such as “Email Anya to see if she wants to get coffee next Friday” to a function call like send_email(to: string, body: string), or “What’s the weather like in Boston?” to get_current_weather(location: string, unit: 'celsius' | 'fahrenheit').

Convert natural language into API calls or database queries

Convert “Who are my top ten customers this month?” to an internal API call such as get_customers_by_revenue(start_date: string, end_date: string, limit: int), or “How many orders did Acme, Inc. place last month?” to a SQL query using sql_query(query: string).

Extract structured data from text

Define a function called extract_people_data(people: [{name: string, birthday: string, location: string}]), to extract all people mentioned in a Wikipedia article.
These use cases are enabled by new API parameters in our /v1/chat/completions endpoint, functions and function_call, that allow developers to describe functions to the model via JSON Schema, and optionally ask it to call a specific function. Get started with our developer documentation and add evals if you find cases where function calling could be improved

Function calling example

What’s the weather like in Boston right now?

Step 1·OpenAI API
Call the model with functions and the user’s input

Step 2·Third party API
Use the model response to call your API

Step 3·OpenAI API
Send the response back to the model to summarize

The weather in Boston is currently sunny with a temperature of 22 degrees Celsius.

Since the alpha release of ChatGPT plugins, we have learned much about making tools and language models work together safely. However, there are still open research questions. For example, a proof-of-concept exploit illustrates how untrusted data from a tool’s output can instruct the model to perform unintended actions. We are working to mitigate these and other risks. Developers can protect their applications by only consuming information from trusted tools and by including user confirmation steps before performing actions with real-world impact, such as sending an email, posting online, or making a purchase.

New models

GPT-4

gpt-4-0613 includes an updated and improved model with function calling.
gpt-4-32k-0613 includes the same improvements as gpt-4-0613, along with an extended context length for better comprehension of larger texts.
With these updates, we’ll be inviting many more people from the waitlist to try GPT-4 over the coming weeks, with the intent to remove the waitlist entirely with this model. Thank you to everyone who has been patiently waiting, we are excited to see what you build with GPT-4!

GPT-3.5 Turbo

gpt-3.5-turbo-0613 includes the same function calling as GPT-4 as well as more reliable steerability via the system message, two features that allow developers to guide the model's responses more effectively.
gpt-3.5-turbo-16k offers 4 times the context length of gpt-3.5-turbo at twice the price: $0.003 per 1K input tokens and $0.004 per 1K output tokens. 16k context means the model can now support ~20 pages of text in a single request.

Model deprecations

Today, we’ll begin the upgrade and deprecation process for the initial versions of gpt-4 and gpt-3.5-turbo that we announced in March. Applications using the stable model names (gpt-3.5-turbo, gpt-4, and gpt-4-32k) will automatically be upgraded to the new models listed above on June 27th. For comparing model performance between versions, our Evals library supports public and private evals to show how model changes will impact your use cases.

Developers who need more time to transition can continue using the older models by specifying gpt-3.5-turbo-0301, gpt-4-0314, or gpt-4-32k-0314 in the ‘model’ parameter of their API request. These older models will be accessible through September 13th, after which requests specifying those model names will fail. You can stay up to date on model deprecations via our model deprecation page. This is the first update to these models; so, we eagerly welcome developer feedback to help us ensure a smooth transition.

Lower pricing

We continue to make our systems more efficient and are passing those savings on to developers, effective today.

Embeddings

text-embedding-ada-002 is our most popular embeddings model. Today we’re reducing the cost by 75% to $0.0001 per 1K tokens.

GPT-3.5 Turbo

gpt-3.5-turbo is our most popular chat model and powers ChatGPT for millions of users. Today we're reducing the cost of gpt-3.5-turbo’s input tokens by 25%. Developers can now use this model for just $0.0015 per 1K input tokens and $0.002 per 1K output tokens, which equates to roughly 700 pages per dollar.
gpt-3.5-turbo-16k will be priced at $0.003 per 1K input tokens and $0.004 per 1K output tokens.
Developer feedback is a cornerstone of our platform’s evolution and we will continue to make improvements based on the suggestions we hear. We’re excited to see how developers use these latest models and new features in their applications.

Large Language Models News & Discussions

Veteran

Veteran

Veteran

Veteran

Superstar

Veteran

Superstar

Veteran

The Power Circle

Veteran

Veteran

Numbers every LLM Developer should know​

Notes on the Github version​

Prompts​

40-90%1: Amount saved by appending “Be Concise” to your prompt​

1.3:1 -- Average tokens per word​

Prices2​

~50:1 -- Cost Ratio of GPT-4 to GPT-3.5 Turbo3​

5:1 -- Cost Ratio of generation of text using GPT-3.5-Turbo vs OpenAI embedding​

10:1 -- Cost Ratio of OpenAI embedding to Self-Hosted embedding​

6:1 -- Cost Ratio of OpenAI fine tuned vs base model queries​

1:1 -- Cost Ratio of Self-Hosted base vs fine-tuned model queries​

Training and Fine Tuning​

~$1 million: Cost to train a 13 billion parameter model on 1.4 trillion tokens​

< 0.001: Cost ratio of fine tuning vs training from scratch​

GPU Memory​

V100: 16GB, A10G: 24GB, A100: 40/80GB: GPU Memory Capacities​

2x number of parameters: Typical GPU memory requirements of an LLM for serving​

~1GB: Typical GPU memory requirements of an embedding model​

>10x: Throughput improvement from batching LLM requests​

~1 MB: GPU Memory required for 1 token of output with a 13B parameter model​

Cheatsheet​

Veteran

Veteran

Veteran

Try chatting with fine-tuned models for Falcon-7B, Falcon-40B, and the new Open-Llama-7B​

Veteran

Function calling and other API updates​

Authors​

Function calling​

Function calling example​

New models​

GPT-4​

GPT-3.5 Turbo​

Model deprecations​

Lower pricing​

Embeddings​

GPT-3.5 Turbo​

Numbers every LLM Developer should know

Notes on the Github version

Prompts

40-90%1: Amount saved by appending “Be Concise” to your prompt

1.3:1 -- Average tokens per word

Prices2

~50:1 -- Cost Ratio of GPT-4 to GPT-3.5 Turbo3

5:1 -- Cost Ratio of generation of text using GPT-3.5-Turbo vs OpenAI embedding

10:1 -- Cost Ratio of OpenAI embedding to Self-Hosted embedding

6:1 -- Cost Ratio of OpenAI fine tuned vs base model queries

1:1 -- Cost Ratio of Self-Hosted base vs fine-tuned model queries

Training and Fine Tuning

~$1 million: Cost to train a 13 billion parameter model on 1.4 trillion tokens

< 0.001: Cost ratio of fine tuning vs training from scratch

GPU Memory

V100: 16GB, A10G: 24GB, A100: 40/80GB: GPU Memory Capacities

2x number of parameters: Typical GPU memory requirements of an LLM for serving

~1GB: Typical GPU memory requirements of an embedding model

>10x: Throughput improvement from batching LLM requests

~1 MB: GPU Memory required for 1 token of output with a 13B parameter model

Cheatsheet

Try chatting with fine-tuned models for Falcon-7B, Falcon-40B, and the new Open-Llama-7B

Function calling and other API updates

Authors

Function calling

Function calling example

New models

GPT-4

GPT-3.5 Turbo

Model deprecations

Lower pricing

Embeddings

GPT-3.5 Turbo