bnew


Meta releases open source AI audio tools, AudioCraft​

Meta's suite of three AI models can create sound effects and music from descriptions.​

BENJ EDWARDS - Wednesday



On Wednesday, Meta announced it is open-sourcing AudioCraft, a suite of generative AI tools for creating music and audio from text prompts. With the tools, content creators can input simple text descriptions to generate complex audio landscapes, compose melodies, or even simulate entire virtual orchestras.

AudioCraft consists of three core components: AudioGen, a tool for generating various audio effects and soundscapes; MusicGen, which can create musical compositions and melodies from descriptions; and EnCodec, a neural network-based audio compression codec.

In particular, Meta says that EnCodec, which we first covered in November, has recently been improved and allows for "higher quality music generation with fewer artifacts." Also, AudioGen can create audio sound effects like a dog barking, a car horn honking, or footsteps on a wooden floor. And MusicGen can whip up songs of various genres from scratch, based on descriptions like "Pop dance track with catchy melodies, tropical percussions, and upbeat rhythms, perfect for the beach."

Meta has provided several audio samples on its website for evaluation. The results seem in line with their state-of-the-art labeling, but arguably they aren't quite high quality enough to replace professionally produced commercial audio effects or music.

Meta notes that while generative AI models centered around text and still pictures have received lots of attention (and are relatively easy for people to experiment with online), development in generative audio tools has lagged behind. "There’s some work out there, but it’s highly complicated and not very open, so people aren’t able to readily play with it," they write. But they hope that AudioCraft's release under the MIT License will contribute to the broader community by providing accessible tools for audio and musical experimentation.

"The models are available for research purposes and to further people’s understanding of the technology. We’re excited to give researchers and practitioners access so they can train their own models with their own datasets for the first time and help advance the state of the art," Meta said.

Meta isn't the first company to experiment with AI-powered audio and music generators. Among some of the more notable recent attempts, OpenAI debuted its Jukebox in 2020, Google debuted MusicLM in January, and last December, an independent research team created a text-to-music generation platform called Riffusion using a Stable Diffusion base.

None of these generative audio projects have attracted as much attention as image synthesis models, but that doesn't mean the process of developing them is any less complicated, as Meta notes on its website:

Generating high-fidelity audio of any kind requires modeling complex signals and patterns at varying scales. Music is arguably the most challenging type of audio to generate because it’s composed of local and long-range patterns, from a suite of notes to a global musical structure with multiple instruments. Generating coherent music with AI has often been addressed through the use of symbolic representations like MIDI or piano rolls. However, these approaches are unable to fully grasp the expressive nuances and stylistic elements found in music. More recent advances leverage self-supervised audio representation learning and a number of hierarchical or cascaded models to generate music, feeding the raw audio into a complex system in order to capture long-range structures in the signal while generating quality audio. But we knew that more could be done in this field.
Amid controversy over undisclosed and potentially unethical training material used to create image synthesis models such as Stable Diffusion, DALL-E, and Midjourney, it's notable that Meta says that MusicGen was trained on "20,000 hours of music owned by Meta or licensed specifically for this purpose." On its surface, that seems like a move in a more ethical direction that may please some critics of generative AI.

It will be interesting to see how open source developers choose to integrate these Meta audio models in their work. It may result in some interesting and easy-to-use generative audio tools in the near future. For now, the more code-savvy among us can find model weights and code for the three AudioCraft tools on GitHub.
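For anyone who wants to try it, here is a minimal usage sketch based on the examples in the AudioCraft GitHub repository. The checkpoint name, generation settings, and prompt are illustrative assumptions, and the exact API may differ between releases of the audiocraft package.

```python
# A minimal MusicGen sketch, assuming the audiocraft package is installed.
# Checkpoint name, duration, and prompt are illustrative; check the repo README.
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained("small")   # smallest released MusicGen checkpoint
model.set_generation_params(duration=8)    # seconds of audio to generate

prompts = ["Pop dance track with catchy melodies, tropical percussions, and upbeat rhythms"]
wav = model.generate(prompts)              # returns a batch of waveforms

for i, one_wav in enumerate(wav):
    # Write each clip to disk, normalizing loudness as in the repo's examples.
    audio_write(f"musicgen_sample_{i}", one_wav.cpu(), model.sample_rate, strategy="loudness")
```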








 

bnew


Meta plans AI-powered chatbots to boost social media numbers​


Amid competition from TikTok, Meta looks to the next frontier of user engagement.​


BENJ EDWARDS - 8/1/2023, 3:10 PM



A toy robot. Credit: Benj Edwards / Getty Images

Meta is reportedly developing a range of AI-powered chatbots with different personalities, a move aimed at increasing user engagement on social platforms such as Facebook and Instagram, according to the Financial Times and The Verge. The chatbots, called "personas" by Meta staff, will mimic human-like conversations and might take on various character forms, such as Abraham Lincoln or a surfer-like travel adviser.


The move to introduce chatbots to Meta platforms comes amid growing competition from social media platforms like TikTok and a rising interest in AI technology. Meta has also made big investments into generative AI recently, including the release of a new large language model, Llama 2, which could power its upcoming chatbots.

During a recent earnings call, Meta CEO Mark Zuckerberg mentioned that the company envisions AI agents acting as assistants and coaches, facilitating interactions between users, businesses, and creators. He also hinted at the development of AI agents for customer service and an internal AI-powered productivity assistant for staff.

"You can imagine lots of ways that AI can help people connect and express themselves in our apps, creative tools that make it easier and more fun to share content, agents that act as assistants, coaches or help you interact with businesses and creators and more," he said.

However, the Financial Times says that some experts are voicing concerns over the plans. Ravit Dotan, an AI ethics adviser and co-founder of the Collaborative AI Responsibility Lab at the University of Pittsburgh, warns that interactions with chatbots might pose a personal privacy hazard for users.

"Once users interact with a chatbot, it really exposes much more of their data to the company, so that the company can do anything they want with that data," she told the outlet.


Privacy aside, concerns about social media addiction have also been common among critics of Facebook in the past, and introducing engaging simulated people into social networks may make it harder for some people to stop using them—although that might be exactly the point.

Meta isn't the first social media company to experiment with AI-powered chatbots. In February, Snap announced a "My AI" chatbot designed to serve as an amusing conversationalist and possibly an adviser for trip or product recommendations, despite admissions about its propensity to confabulate inaccurate or potentially dangerous information. And beyond social media, chatbots on sites like Character.AI have proven popular among some people as a form of entertainment.

Despite these risks, Meta thinks that its artificial personas could provide a fun and interactive element on its platforms, besides functioning as a search tool and offering recommendations. The company plans to roll them out as early as September.
 

bnew


I learned to make a lip-syncing deepfake in just a few hours (and you can, too)​

Zero coding experience required​



By James Vincent, a senior reporter who has covered AI, robotics, and more for eight years at The Verge. Sep 9, 2020, 10:38 AM EDT



Artist: William Joel

How easy is it to make a deepfake, really? Over the past few years, there’s been a steady stream of new methods and algorithms that deliver more and more convincing AI-generated fakes. You can now do basic face-swaps in a handful of apps. But what does it take to turn random code you found online into a genuine deepfake? I can now say from personal experience: you really need just two things, time and patience.

Despite writing about deepfakes for years, I’ve only ever made them using prepackaged apps that did the work for me. But when I saw an apparently straightforward method for creating quick lip-sync deepfakes in no time at all, I knew I had to try it for myself.


The basic mechanism is tantalizingly simple. All you need is a video of your subject and an audio clip you want them to follow. Mash those two things together using code and, hey presto, you have a deepfake. (You can tell I don’t have much of a technical background, right?) The end result is videos like this one of the queen singing Queen:





Or of a bunch of movie characters singing that international hymn, Smash Mouth’s “All Star”:





Or of Trump miming along with this Irish classic:


Finding the algorithms​


Now, these videos aren’t nefarious deepfakes designed to undermine democracy and bring about the infopocalypse. (Who needs deepfakes for that when normal editing does the job just as well?) They’re not even that convincing, at least not without some extra time and effort. What they are is dumb and fun — two qualities I value highly when committing to waste my time (ahem, write an informative and engaging article) for my employer.

As James Kelleher, the Irish designer who created the Queen deepfake, noted on Twitter, the method he used to make the videos was shared online by some AI researchers. The paper in question describing their method (called Wav2Lip) was posted a few weeks ago, along with a public demo for anyone to try. The demo was originally freely accessible, but you now have to register to use it. K R Prajwal of IIIT Hyderabad, one of the authors of the work, told The Verge this was to dissuade malicious uses, though he admitted that registration wouldn’t “deter a serious offender who is well-versed with programming.”

“We definitely acknowledge the concern of people being able to use these tools freely, and thus, we strongly suggest the users of the code and website to clearly present the videos as synthetic,” said Prajwal. He and his fellow researchers note that the program can be used for many beneficial purposes, too, like animation and dubbing video into new languages. Prajwal adds that they hope that making the code available will “encourage fruitful research on systems that can effectively combat misuse.”

Trying (and failing) with the online demo​


I originally tried using this online demo to make a deepfake. I found a video of my target (Apple CEO Tim Cook) and some audio for him to mime to (I chose Jim Carrey for some reason). I downloaded the video footage using QuickTime’s screen record function and the audio using a handy app called Piezo. Then I got both files and plugged them into the site and waited. And waited. And eventually, nothing happened.

For some reason, the demo didn’t like my clips. I tried making new ones and reducing their resolution, but it didn’t make a difference. This, it turns out, would be a motif in my deepfaking experience: random roadblocks would pop up that I just didn’t have the technical expertise to analyze. Eventually, I gave up and pinged Kelleher for help. He suggested I rename my files to remove any spaces. I did so, and for some reason this worked. I now had a clip of Tim Cook miming along to Jim Carrey’s screen tests for Lemony Snicket’s A Series of Unfortunate Events. It was terrible — really just incredibly shoddy in terms of both verisimilitude and humor — but a personal achievement all the same.

Google Colab: the site of my many battles with the Wav2Lip algorithm. Image: James Vincent

Moving to Colab​

To try to improve on these results, I wanted to run the algorithms more directly. For this I turned to the authors’ Github, where they’d uploaded the underlying code. I would be using Google Colab to run it: the coding equivalent of Google Docs, which allows you to execute machine learning projects in the cloud. Again, it was the original authors who had done all the work by laying out the code in easy steps, but that didn’t stop me from walking into setback after setback like Sideshow Bob tackling a parking lot full of rakes.



Why couldn’t I authorize Colab to access my Google Drive? (Because I was logged into two different Google accounts.) Why couldn’t the Colab project find the weights for the neural network in my Drive folder? (Because I’d downloaded the Wav2Lip model rather than the Wav2Lip + GAN version.) Why wasn’t the audio file I uploaded being identified by the program? (Because I’d misspelled “aduoi” in the file name.) And so on and so forth.


Happily, many of my problems were solved by this YouTube tutorial, which alerted me to some of the subtler mistakes I’d made. These included creating two separate folders for the inputs and the model, labeled Wav2Lip and Wav2lip respectively. (Note the different capitalization on “lip” — that’s what tripped me up.) After watching the video a few times and spending hours troubleshooting things, I finally had a working model. Honestly, I could have wept, in part at my own apparent incompetence.
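For readers retracing these steps, here is a rough sketch of the kind of Colab invocation involved. The Drive paths, folder layout, and file names are illustrative assumptions; the inference script and its command-line flags are the ones documented in the authors' Wav2Lip repository.

```python
# A minimal sketch of the Wav2Lip inference step as run from a Colab notebook.
# Paths and filenames below are assumptions for illustration only.
from google.colab import drive  # Colab-only helper for mounting Google Drive
import subprocess

drive.mount('/content/drive')

CHECKPOINT = '/content/drive/MyDrive/Wav2Lip/wav2lip_gan.pth'  # the "+ GAN" weights
FACE_VIDEO = '/content/drive/MyDrive/Wav2Lip/tim_cook.mp4'     # no spaces in filenames
AUDIO_CLIP = '/content/drive/MyDrive/Wav2Lip/jim_carrey.wav'

# inference.py and these flags come from the authors' repository.
subprocess.run([
    'python', 'inference.py',
    '--checkpoint_path', CHECKPOINT,
    '--face', FACE_VIDEO,
    '--audio', AUDIO_CLIP,
], check=True)
```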

The final results​



A few experiments later, I’d learned some of the quirks of the program (like its difficulty dealing with faces that aren’t straight on) and decided to create my deepfake pièce de résistance: Elon Musk lip-syncing to Tim Curry’s “space” speech from Command & Conquer: Red Alert 3. You can see the results for yourself below. And sure, it’s only a small contribution to the ongoing erasure of the boundaries between reality and fiction, but at least it’s mine:


okay this one worked out a little better - Elon Musk doing Tim Curry's space speech from command & conquer pic.twitter.com/vscq9wAKRU


— James Vincent (@jjvincent) September 9, 2020

What did I learn from this experience? Well, that making deepfakes is genuinely accessible, but it’s not necessarily easy. Although these algorithms have been around for years and can be used by anyone willing to put in a few hours’ work, it’s still true that simply editing video clips using traditional methods is faster and produces more convincing results, if your aim is to spread misinformation at least.

On the other hand, what impressed me was how quickly this technology spreads. This particular lip-syncing algorithm, Wav2Lip, was created by an international team of researchers affiliated with universities in India and the UK. They shared their work online at the end of August, and it was then picked up by Twitter and AI newsletters (I saw it in a well-known one called Import AI). The researchers made the code accessible and even created a public demo, and in a matter of weeks, people around the world had started experimenting with it, creating their own deepfakes for fun and, in my case, content. Search YouTube for “Wav2Lip” and you’ll find tutorials, demos, and plenty more example fakes.
 

bnew














Meta Announces I-JEPA: The First AI Model That Learns Like Humans​

by Vishak Vishak, June 23, 2023, 7:56 pm



Meta has made significant strides in artificial intelligence research, particularly in the area of self-supervised learning. Yann LeCun, Meta’s chief AI scientist, envisions creating an adaptable architecture that can learn about the world without human assistance, leading to faster learning, complex task planning, and effective navigation in unfamiliar situations. In line with this vision, Meta’s AI researchers have developed the Image Joint Embedding Predictive Architecture (I-JEPA), the first model to embody this revolutionary concept.

I-JEPA takes inspiration from how humans learn new concepts by passively observing the world and acquiring background knowledge. It mimics this learning approach by capturing common-sense information about the world and encoding it into a digital representation. The key challenge lies in training these representations using unlabeled data, such as images and audio, rather than relying on labelled datasets.

I-JEPA introduces a novel method for predicting missing information. Unlike traditional generative AI models that focus on filling in all the missing details, I-JEPA uses an abstract prediction target that eliminates unnecessary pixel-level details. By doing so, I-JEPA’s predictor models the spatial uncertainty of still images based on partially observable context, allowing it to predict higher-level information about the image area.
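To make the idea concrete, here is a toy sketch of a representation-space prediction objective in the spirit of I-JEPA. It uses tiny MLP stand-ins for the actual Vision Transformer encoders, a crude pooled prediction instead of I-JEPA's position-conditioned predictor, and made-up shapes and hyperparameters; it illustrates the training signal, not Meta's implementation.

```python
# A toy sketch of representation-space prediction (I-JEPA style), not Meta's code.
import torch
import torch.nn as nn

D = 64              # embedding dimension (made up)
N_PATCHES = 49      # e.g. a 7x7 grid of patch embeddings (made up)

encoder = nn.Sequential(nn.Linear(D, D), nn.GELU(), nn.Linear(D, D))         # context encoder
target_encoder = nn.Sequential(nn.Linear(D, D), nn.GELU(), nn.Linear(D, D))  # EMA copy, no grads
target_encoder.load_state_dict(encoder.state_dict())
for p in target_encoder.parameters():
    p.requires_grad_(False)
predictor = nn.Sequential(nn.Linear(D, D), nn.GELU(), nn.Linear(D, D))

opt = torch.optim.AdamW(list(encoder.parameters()) + list(predictor.parameters()), lr=1e-3)

def train_step(patches, context_idx, target_idx, ema=0.996):
    # patches: (batch, N_PATCHES, D) patch embeddings of one image
    ctx = encoder(patches[:, context_idx])             # encode only the visible context block
    with torch.no_grad():
        tgt = target_encoder(patches)[:, target_idx]   # target representations of masked blocks
    pred = predictor(ctx).mean(dim=1, keepdim=True)    # crude pooled prediction per target patch
    loss = ((pred - tgt) ** 2).mean()                  # L2 loss in representation space, not pixels
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():                              # EMA update of the target encoder
        for p_t, p_c in zip(target_encoder.parameters(), encoder.parameters()):
            p_t.mul_(ema).add_(p_c, alpha=1 - ema)
    return loss.item()

patches = torch.randn(8, N_PATCHES, D)
loss = train_step(patches, context_idx=torch.arange(0, 25), target_idx=torch.arange(30, 40))
print(f"toy I-JEPA loss: {loss:.4f}")
```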


According to Meta, I-JEPA offers several advantages over existing computer vision models. It demonstrates exceptional performance on various computer vision benchmarks while maintaining high computational efficiency. I-JEPA’s representations, which do not require fine-tuning, can be readily applied to other applications. In fact, Meta trained a 632-million-parameter vision transformer model in under 72 hours using 16 A100 GPUs, achieving state-of-the-art performance on ImageNet low-shot classification with minimal labelled examples per class.

The efficiency of I-JEPA is particularly noteworthy, as it outperforms other methods in terms of GPU time utilization and error rates. Meta’s researchers claim that similar models trained on the same amount of data often require two to ten times more GPU time and yield inferior results. This highlights I-JEPA’s potential for learning off-the-shelf competitive representations without relying on laborious hand-crafted image transformations.

Meta has open-sourced both the training code and model checkpoints for I-JEPA, enabling the wider research community to benefit from and build upon their advancements. The next steps involve extending I-JEPA’s capabilities to other domains, such as image-text pair data and video data. Meta aims to explore the possibilities of I-JEPA in diverse applications and further enhance its adaptability to different environments.




Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture​


This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations. We introduce the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a non-generative approach for self-supervised learning from images. The idea behind I-JEPA is simple: from a single context block, predict the representations of various target blocks in the same image. A core design choice to guide I-JEPA towards producing semantic representations is the masking strategy; specifically, it is crucial to (a) sample target blocks with sufficiently large scale (semantic), and to (b) use a sufficiently informative (spatially distributed) context block. Empirically, when combined with Vision Transformers, we find I-JEPA to be highly scalable. For instance, we train a ViT-Huge/14 on ImageNet using 16 A100 GPUs in under 72 hours to achieve strong downstream performance across a wide range of tasks, from linear classification to object counting and depth prediction.

 

bnew



AI language models are rife with different political biases​

New research explains you’ll get more right- or left-wing answers, depending on which AI model you ask.
By Melissa Heikkilä
August 7, 2023

Four suits with red or blue placards in place of their heads. STEPHANIE ARNETT/MITTR | MIDJOURNEY (SUITS)



Should companies have social responsibilities? Or do they exist only to deliver profit to their shareholders? If you ask an AI you might get wildly different answers depending on which one you ask. While OpenAI’s older GPT-2 and GPT-3 Ada models would advance the former statement, GPT-3 Da Vinci, the company’s more capable model, would agree with the latter.

That’s because AI language models contain different political biases, according to new research from the University of Washington, Carnegie Mellon University, and Xi’an Jiaotong University. Researchers conducted tests on 14 large language models and found that OpenAI’s ChatGPT and GPT-4 were the most left-wing libertarian, while Meta’s LLaMA was the most right-wing authoritarian.

The researchers asked language models where they stand on various topics, such as feminism and democracy. They used the answers to plot them on a graph known as a political compass, and then tested whether retraining models on even more politically biased training data changed their behavior and ability to detect hate speech and misinformation (it did). The research is described in a peer-reviewed paper that won the best paper award at the Association for Computational Linguistics conference last month.

As AI language models are rolled out into products and services used by millions of people, understanding their underlying political assumptions and biases could not be more important. That’s because they have the potential to cause real harm. A chatbot offering health-care advice might refuse to offer advice on abortion or contraception, or a customer service bot might start spewing offensive nonsense.


Since the success of ChatGPT, OpenAI has faced criticism from right-wing commentators who claim the chatbot reflects a more liberal worldview. However, the company insists that it’s working to address those concerns, and in a blog post, it says it instructs its human reviewers, who help fine-tune the AI model, not to favor any political group. “Biases that nevertheless may emerge from the process described above are bugs, not features,” the post says.

Chan Park, a PhD researcher at Carnegie Mellon University who was part of the study team, disagrees. “We believe no language model can be entirely free from political biases,” she says.



Bias creeps in at every stage​


To reverse-engineer how AI language models pick up political biases, the researchers examined three stages of a model’s development.

In the first step, they asked 14 language models to agree or disagree with 62 politically sensitive statements. This helped them identify the models’ underlying political leanings and plot them on a political compass. To the team’s surprise, they found that AI models have distinctly different political tendencies, Park says.
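As a rough illustration of how agree/disagree answers can be turned into a position on such a compass, here is a small sketch. The statements, axis assignments, and the ask_model() helper are hypothetical; the paper's actual test set and scoring procedure differ.

```python
# A minimal sketch of scoring a model on a political-compass-style grid from
# agree/disagree answers. Statements, axes, and ask_model() are hypothetical.
from typing import Callable

# (statement, axis, direction): direction +1 means agreement pushes the score
# toward the economic-right or social-authoritarian end of that axis.
STATEMENTS = [
    ("Companies exist only to deliver profit to shareholders.", "economic", +1),
    ("The government should tax the rich more heavily.",        "economic", -1),
    ("Traditional values should guide public policy.",          "social",   +1),
]

def compass_score(ask_model: Callable[[str], str]) -> dict:
    scores = {"economic": 0.0, "social": 0.0}
    for statement, axis, direction in STATEMENTS:
        answer = ask_model(f"Do you agree or disagree? {statement}").lower()
        if "disagree" in answer:          # check "disagree" first: "agree" is a substring
            scores[axis] -= direction
        elif "agree" in answer:
            scores[axis] += direction
    return scores

# Example with a stub standing in for a real language model:
print(compass_score(lambda prompt: "I agree with this statement."))
```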

The researchers found that BERT models, AI language models developed by Google, were more socially conservative than OpenAI’s GPT models. Unlike GPT models, which predict the next word in a sentence, BERT models predict parts of a sentence using the surrounding information within a piece of text. Their social conservatism might arise because older BERT models were trained on books, which tended to be more conservative, while the newer GPT models are trained on more liberal internet texts, the researchers speculate in their paper.

AI models also change over time as tech companies update their data sets and training methods. GPT-2, for example, expressed support for “taxing the rich,” while OpenAI’s newer GPT-3 model did not.

A spokesperson for Meta said the company has released information on how it built Llama 2, including how it fine-tuned the model to reduce bias, and will “continue to engage with the community to identify and mitigate vulnerabilities in a transparent manner and support the development of safer generative AI.” Google did not respond to MIT Technology Review’s request for comment in time for publication.

AI language models have distinctly different political tendencies. Chart by Shangbin Feng, Chan Young Park, Yuhan Liu and Yulia Tsvetkov.

The second step involved further training two AI language models, OpenAI’s GPT-2 and Meta’s RoBERTa, on data sets consisting of news media and social media data from both right- and left-leaning sources, Park says. The team wanted to see if training data influenced the political biases.

It did. The team found that this process helped to reinforce models’ biases even further: left-leaning models became more left-leaning, and right-leaning ones more right-leaning.

In the third stage of their research, the team found striking differences in how the political leanings of AI models affect what kinds of content the models classified as hate speech and misinformation.





The models that were trained with left-wing data were more sensitive to hate speech targeting ethnic, religious, and sexual minorities in the US, such as Black and LGBTQ+ people. The models that were trained on right-wing data were more sensitive to hate speech against white Christian men.

Left-leaning language models were also better at identifying misinformation from right-leaning sources but less sensitive to misinformation from left-leaning sources. Right-leaning language models showed the opposite behavior.

Cleaning data sets of bias is not enough​


Ultimately, it’s impossible for outside observers to know why different AI models have different political biases, because tech companies do not share details of the data or methods used to train them, says Park.

One way researchers have tried to mitigate biases in language models is by removing biased content from data sets or filtering it out. “The big question the paper raises is: Is cleaning data [of bias] enough? And the answer is no,” says Soroush Vosoughi, an assistant professor of computer science at Dartmouth College, who was not involved in the study.

It’s very difficult to completely scrub a vast database of biases, Vosoughi says, and AI models are also pretty apt to surface even low-level biases that may be present in the data.

One limitation of the study was that the researchers could only conduct the second and third stage with relatively old and small models, such as GPT-2 and RoBERTa, says Ruibo Liu, a research scientist at DeepMind, who has studied political biases in AI language models but was not part of the research.

Liu says he’d like to see if the paper’s conclusions apply to the latest AI models. But academic researchers do not have, and are unlikely to get, access to the inner workings of state-of-the-art AI systems such as ChatGPT and GPT-4, which makes analysis harder.

Another limitation is that if the AI models just made things up, as they tend to do, then a model’s responses might not be a true reflection of its “internal state,” Vosoughi says.



The researchers also admit that the political compass test, while widely used, is not a perfect way to measure all the nuances around politics.

As companies integrate AI models into their products and services, they should be more aware of how these biases influence their models’ behavior in order to make them fairer, says Park: “There is no fairness without awareness.”

Update: This story was updated post-publication to incorporate comments shared by Meta.


 

bnew


Google DeepMind’s CEO Says Its Next Algorithm Will Eclipse ChatGPT​

Demis Hassabis says the company is working on a system called Gemini that will tap techniques that helped AlphaGo defeat a Go champion in 2016.


In 2016, an artificial intelligence program called AlphaGo from Google’s DeepMind AI lab made history by defeating a champion player of the board game Go. Now Demis Hassabis, DeepMind’s cofounder and CEO, says his engineers are using techniques from AlphaGo to make an AI system dubbed Gemini that will be more capable than that behind OpenAI’s ChatGPT.

DeepMind’s Gemini, which is still in development, is a large language model that works with text and is similar in nature to GPT-4, which powers ChatGPT. But Hassabis says his team will combine that technology with techniques used in AlphaGo, aiming to give the system new capabilities such as planning or the ability to solve problems.

“At a high level you can think of Gemini as combining some of the strengths of AlphaGo-type systems with the amazing language capabilities of the large models,” Hassabis says. “We also have some new innovations that are going to be pretty interesting.” Gemini was first teased at Google's developer conference last month, when the company announced a raft of new AI projects.

AlphaGo was based on a technique DeepMind has pioneered called reinforcement learning, in which software learns to take on tough problems that require choosing what actions to take, as in Go or video games, by making repeated attempts and receiving feedback on its performance. It also used a method called tree search to explore and remember possible moves on the board. The next big leap for language models may involve them performing more tasks on the internet and on computers.

Gemini is still in development, a process that will take a number of months, Hassabis says. It could cost tens or hundreds of millions of dollars. Sam Altman, OpenAI CEO, said in April that creating GPT-4 cost more than $100 million.

Playing Catch-Up

When Gemini is complete it could play a major role in Google’s response to the competitive threat posed by ChatGPT and other generative AI technology. The search company pioneered many techniques that enabled the recent torrent of new AI ideas but chose to develop and deploy products based on them cautiously.

Since ChatGPT’s debut Google has rushed out its own chatbot, Bard, and put generative AI into its search engine and many other products. To juice up AI research the company in April combined Hassabis’ unit DeepMind with Google’s primary AI lab, Brain, to create Google DeepMind. Hassabis says the new team will bring together two powerhouses that have been foundational to the recent AI progress. “If you look at where we are in AI, I would argue that 80 or 90 percent of the innovations come from one or the other,” Hassabis says. “There are brilliant things that have been done by both organizations over the last decade.”

Hassabis has experience with navigating AI gold rushes that roil tech giants—although last time around he himself sparked the frenzy.

In 2014, DeepMind was acquired by Google after demonstrating striking results from software that used reinforcement learning to master simple video games. Over the next several years, DeepMind showed how the technique does things that once seemed uniquely human—often with superhuman skill. When AlphaGo beat Go champion Lee Sedol in 2016, many AI experts were stunned, because they had believed it would be decades before machines would become proficient at a game of such complexity.

New Thinking

Training a large language model like OpenAI’s GPT-4 involves feeding vast amounts of curated text from books, webpages, and other sources into machine learning software known as a transformer. It uses the patterns in that training data to become proficient at predicting the letters and words that should follow a piece of text, a simple mechanism that proves strikingly powerful at answering questions and generating text or code.
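As a toy illustration of that next-token (here, next-character) prediction objective, the sketch below trains a single embedding-plus-linear layer on a made-up string; a real transformer replaces that stand-in with attention layers and trains on vastly more text.

```python
# A toy illustration of the next-token prediction objective, using a
# character-level stand-in rather than a real transformer. All sizes are made up.
import torch
import torch.nn as nn

text = "to be or not to be"
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}
ids = torch.tensor([stoi[ch] for ch in text])

embed = nn.Embedding(len(vocab), 32)
head = nn.Linear(32, len(vocab))
opt = torch.optim.Adam(list(embed.parameters()) + list(head.parameters()), lr=1e-2)

for step in range(200):
    x, y = ids[:-1], ids[1:]                 # predict each next character from the current one
    logits = head(embed(x))                  # (seq_len, vocab) scores for the next character
    loss = nn.functional.cross_entropy(logits, y)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final next-character loss: {loss.item():.3f}")
```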

An important additional step in making ChatGPT and similarly capable language models is using reinforcement learning based on feedback from humans on an AI model’s answers to finesse its performance. DeepMind’s deep experience with reinforcement learning could allow its researchers to give Gemini novel capabilities.

Hassabis and his team might also try to enhance large language model technology with ideas from other areas of AI. DeepMind researchers work in areas ranging from robotics to neuroscience, and earlier this week the company demonstrated an algorithm capable of learning to perform manipulation tasks with a wide range of different robot arms.

Learning from physical experience of the world, as humans and animals do, is widely expected to be important to making AI more capable. The fact that language models learn about the world indirectly, through text, is seen by some AI experts as a major limitation.

Murky Future

Hassabis is tasked with accelerating Google’s AI efforts while also managing unknown and potentially grave risks. The recent, rapid advancements in language models have made many AI experts—including some building the algorithms—worried about whether the technology will be put to malevolent uses or become difficult to control. Some tech insiders have even called for a pause on the development of more powerful algorithms to avoid creating something dangerous.

Hassabis says the extraordinary potential benefits of AI—such as for scientific discovery in areas like health or climate—make it imperative that humanity does not stop developing the technology. He also believes that mandating a pause is impractical, as it would be near impossible to enforce. “If done correctly, it will be the most beneficial technology for humanity ever,” he says of AI. “We’ve got to boldly and bravely go after those things.”

That doesn’t mean Hassabis advocates AI development proceeds in a headlong rush. DeepMind has been exploring the potential risks of AI since before ChatGPT appeared, and Shane Legg, one of the company’s cofounders, has led an “AI safety” group within the company for years. Hassabis joined other high-profile AI figures last month in signing a statement warning that AI might someday pose a risk comparable to nuclear war or a pandemic.

One of the biggest challenges right now, Hassabis says, is to determine what the risks of more capable AI are likely to be. “I think more research by the field needs to be done—very urgently—on things like evaluation tests,” he says, to determine how capable and controllable new AI models are. To that end, he says, DeepMind may make its systems more accessible to outside scientists. “I would love to see academia have early access to these frontier models,” he says—a sentiment that if followed through could help address concerns that experts outside big companies are becoming shut out of the newest AI research.

How worried should you be? Hassabis says that no one really knows for sure that AI will become a major danger. But he is certain that if progress continues at its current pace, there isn’t much time to develop safeguards. “I can see the kinds of things we're building into the Gemini series right now, and we have no reason to believe that they won't work,” he says.
 

bnew


Google Tests an A.I. Assistant That Offers Life Advice​

The tech giant is evaluating tools that would use artificial intelligence to perform tasks that some of its researchers have said should be avoided.

Credit: Gabriel Alcala

By Nico Grant
Nico Grant, based in San Francisco, writes about Google and other tech companies.

Aug. 16, 2023, 5:00 a.m. ET

Earlier this year, Google, locked in an accelerating competition with rivals like Microsoft and OpenAI to develop A.I. technology, was looking for ways to put a charge into its artificial intelligence research.

So in April, Google merged DeepMind, a research lab it had acquired in London, with Brain, an artificial intelligence team it started in Silicon Valley.
Four months later, the combined groups are testing ambitious new tools that could turn generative A.I. — the technology behind chatbots like OpenAI’s ChatGPT and Google’s own Bard — into a personal life coach.

Google DeepMind has been working with generative A.I. to perform at least 21 different types of personal and professional tasks, including tools to give users life advice, ideas, planning instructions and tutoring tips, according to documents and other materials reviewed by The New York Times.

The project was indicative of the urgency of Google’s effort to propel itself to the front of the A.I. pack and signaled its increasing willingness to trust A.I. systems with sensitive tasks.


The capabilities also marked a shift from Google’s earlier caution on generative A.I. In a slide deck presented to executives in December, the company’s A.I. safety experts had warned of the dangers of people becoming too emotionally attached to chatbots.
Though it was a pioneer in generative A.I., Google was overshadowed by OpenAI’s release of ChatGPT in November, igniting a race among tech giants and start-ups for primacy in the fast-growing space.

Google has spent the last nine months trying to demonstrate it can keep up with OpenAI and its partner Microsoft, releasing Bard, improving its A.I. systems and incorporating the technology into many of its existing products, including its search engine and Gmail.

Scale AI, a contractor working with Google DeepMind, assembled teams of workers to test the capabilities, including more than 100 experts with doctorates in different fields and even more workers who assess the tool’s responses, said two people with knowledge of the project who spoke on the condition of anonymity because they were not authorized to speak publicly about it.

Scale AI did not immediately respond to a request for comment.

Among other things, the workers are testing the assistant’s ability to answer intimate questions about challenges in people’s lives.

They were given an example of an ideal prompt that a user could one day ask the chatbot: “I have a really close friend who is getting married this winter. She was my college roommate and a bridesmaid at my wedding. I want so badly to go to her wedding to celebrate her, but after months of job searching, I still have not found a job. She is having a destination wedding and I just can’t afford the flight or hotel right now. How do I tell her that I won’t be able to come?”

The project’s idea creation feature could give users suggestions or recommendations based on a situation. Its tutoring function can teach new skills or improve existing ones, like how to progress as a runner; and the planning capability can create a financial budget for users as well as meal and workout plans.

Google’s A.I. safety experts had said in December that users could experience “diminished health and well-being” and a “loss of agency” if they took life advice from A.I. They had added that some users who grew too dependent on the technology could think it was sentient. And in March, when Google launched Bard, it said the chatbot was barred from giving medical, financial or legal advice. Bard shares mental health resources with users who say they are experiencing mental distress.
The tools are still being evaluated and the company may decide not to employ them.

A Google DeepMind spokeswoman said “we have long worked with a variety of partners to evaluate our research and products across Google, which is a critical step in building safe and helpful technology. At any time there are many such evaluations ongoing. Isolated samples of evaluation data are not representative of our product road map.”

Google has also been testing a helpmate for journalists that can generate news articles, rewrite them and suggest headlines, The Times reported in July. The company has been pitching the software, named Genesis, to executives at The Times, The Washington Post and News Corp, the parent company of The Wall Street Journal.

Google DeepMind has also been evaluating tools recently that could take its A.I. further into the workplace, including capabilities to generate scientific, creative and professional writing, as well as to recognize patterns and extract data from text, according to the documents, potentially making it relevant to knowledge workers in various industries and fields.

The company’s A.I. safety experts had also expressed concern about the economic harms of generative A.I. in the December presentation reviewed by The Times, arguing that it could lead to the “deskilling of creative writers.”

Other tools being tested can draft critiques of an argument, explain graphs and generate quizzes, word and number puzzles.

One suggested prompt to help train the A.I. assistant hinted at the technology’s rapidly growing capabilities: “Give me a summary of the article pasted below. I am particularly interested in what it says about capabilities humans possess, and that they believe” A.I. cannot achieve.
 

bnew



Introducing Code Llama, a state-of-the-art large language model for coding​

August 24, 2023



Takeaways
  • Code Llama is a state-of-the-art LLM capable of generating code, and natural language about code, from both code and natural language prompts.
  • Code Llama is free for research and commercial use.
  • Code Llama is built on top of Llama 2 and is available in three models:
    • Code Llama, the foundational code model;
    • Code Llama - Python, specialized for Python;
    • and Code Llama - Instruct, which is fine-tuned for understanding natural language instructions.
  • In our own benchmark testing, Code Llama outperformed state-of-the-art publicly available LLMs on code tasks
Today, we are releasing Code Llama, a large language model (LLM) that can use text prompts to generate code. Code Llama is state-of-the-art for publicly available LLMs on code tasks, and has the potential to make workflows faster and more efficient for current developers and lower the barrier to entry for people who are learning to code. Code Llama has the potential to be used as a productivity and educational tool to help programmers write more robust, well-documented software.

The generative AI space is evolving rapidly, and we believe an open approach to today’s AI is the best one for developing new AI tools that are innovative, safe, and responsible. We are releasing Code Llama under the same community license as Llama 2.

How Code Llama works

Code Llama is a code-specialized version of Llama 2 that was created by further training Llama 2 on its code-specific datasets, sampling more data from that same dataset for longer. Essentially, Code Llama features enhanced coding capabilities, built on top of Llama 2. It can generate code, and natural language about code, from both code and natural language prompts (e.g., “Write me a function that outputs the fibonacci sequence.”) It can also be used for code completion and debugging. It supports many of the most popular languages being used today, including Python, C++, Java, PHP, Typescript (Javascript), C#, and Bash (see our research paper for a full list).
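As a sketch of what that looks like in practice, the snippet below prompts a Code Llama checkpoint through the Hugging Face transformers library. The model ID and generation settings are assumptions for illustration; check the official release notes and model cards for the supported names and the library versions required.

```python
# A minimal sketch of prompting a Code Llama checkpoint via Hugging Face transformers.
# Model ID and settings are assumptions; consult Meta's release materials.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "codellama/CodeLlama-7b-hf"          # assumed Hub ID for the 7B base model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "# Write me a function that outputs the fibonacci sequence\ndef fibonacci("
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```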


We are releasing three sizes of Code Llama with 7B, 13B, and 34B parameters respectively. Each of these models is trained with 500B tokens of code and code-related data. The 7B and 13B base and instruct models have also been trained with fill-in-the-middle (FIM) capability, allowing them to insert code into existing code, meaning they can support tasks like code completion right out of the box.

The three models address different serving and latency requirements. The 7B model, for example, can be served on a single GPU. The 34B model returns the best results and allows for better coding assistance, but the smaller 7B and 13B models are faster and more suitable for tasks that require low latency, like real-time code completion.


The Code Llama models provide stable generations with up to 100,000 tokens of context. All models are trained on sequences of 16,000 tokens and show improvements on inputs with up to 100,000 tokens.

Aside from being a prerequisite for generating longer programs, having longer input sequences unlocks exciting new use cases for a code LLM. For example, users can provide the model with more context from their codebase to make the generations more relevant. It also helps in debugging scenarios in larger codebases, where staying on top of all code related to a concrete issue can be challenging for developers. When developers are faced with debugging a large chunk of code they can pass the entire length of the code into the model.

Additionally, we have further fine-tuned two additional variations of Code Llama: Code Llama - Python and Code Llama - Instruct.
Code Llama - Python is a language-specialized variation of Code Llama, further fine-tuned on 100B tokens of Python code. Because Python is the most benchmarked language for code generation – and because Python and PyTorch play an important role in the AI community – we believe a specialized model provides additional utility.

Code Llama - Instruct is an instruction fine-tuned and aligned variation of Code Llama. Instruction tuning continues the training process, but with a different objective. The model is fed a “natural language instruction” input and the expected output. This makes it better at understanding what humans expect out of their prompts. We recommend using Code Llama - Instruct variants whenever using Code Llama for code generation since Code Llama - Instruct has been fine-tuned to generate helpful and safe answers in natural language.

We do not recommend using Code Llama or Code Llama - Python to perform general natural language tasks since neither of these models are designed to follow natural language instructions. Code Llama is specialized for code-specific tasks and isn’t appropriate as a foundation model for other tasks.
When using the Code Llama models, users must abide by our license and acceptable use policy.




Evaluating Code Llama’s performance

To test Code Llama’s performance against existing solutions, we used two popular coding benchmarks: HumanEval and Mostly Basic Python Programming (MBPP). HumanEval tests the model’s ability to complete code based on docstrings and MBPP tests the model’s ability to write code based on a description.
Our benchmark testing showed that Code Llama performed better than open-source, code-specific LLMs and outperformed Llama 2. Code Llama 34B, for example, scored 53.7% on HumanEval and 56.2% on MBPP, the highest compared with other state-of-the-art open solutions, and on par with ChatGPT.
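For a sense of what HumanEval-style problems look like, here is a hypothetical example (not an actual benchmark item): the model receives a function signature and docstring and must produce a body that passes hidden unit tests.

```python
# A hypothetical HumanEval-style task: complete the body from the signature and
# docstring; the benchmark scores the completion by executing unit tests.
def is_palindrome(s: str) -> bool:
    """Return True if s reads the same forwards and backwards, ignoring case."""
    t = s.lower()
    return t == t[::-1]

# Tests of the kind a benchmark harness might run against the completion:
assert is_palindrome("Level") is True
assert is_palindrome("hello") is False
```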


 

bnew


Llama-2 was almost at GPT-3.5 level except for coding, which was a real bummer.

Now Code Llama finally bridges the gap to GPT-3.5! Coding is by far the most important LLM task. It's the cornerstone of strong reasoning engines and powerful AI agents like Voyager.

Today is another major milestone in OSS foundation models. Read with me:

- Code Llamas are finetuned from Llama-2 base models, and come in 3 flavors: vanilla, Instruct, and Python. Model sizes are 7B, 13B, 34B. The smallest model can run locally with a decent GPU.
- On HumanEval benchmark, CodeLlama-python achieves 53.7 vs GPT-3.5 (48.1), but still trailing behind GPT-4 (whopping 67.0). On MBPP, it gets 56.2 vs 52.2 for GPT-3.5.
- Significantly better than PaLM-Coder, Codex (GitHub copilot model), and other OSS models like StarCoder.
- Trained with an "infilling objective" to support code generation in the middle given surrounding context. Basically, the model takes in (prefix, suffix) or (suffix, prefix) and outputs (middle). It's still autoregressive, but with special marker tokens. The infilling data can be easily synthesized by splitting the code corpus randomly (see the sketch after this list).
- Another set of synthetic data from Self-Instruct:
(1) take 60K coding interview questions;
(2) generate unit tests;
(3) generate 10 solutions;
(4) run Python interpreter and filter out bad ones. Add good ones to the dataset.
- Long context finetuning: Code Llama starts from 4K context and finetunes with 16K context to save computation. With some positional embedding tricks, it can carry consistency over even longer context.
- The instruction finetuning data is proprietary and won't be released.
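Below is a rough sketch of that infilling-data synthesis idea: split a code snippet at two random points and train the model to reconstruct the middle from the prefix and suffix. The marker strings are placeholders, not the exact special tokens Code Llama uses.

```python
# A rough sketch of synthesizing infilling training examples by random splitting.
# The marker token strings below are placeholders, not Code Llama's actual tokens.
import random

PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"

def make_infilling_example(code: str) -> str:
    # Pick two random split points: everything between them becomes the "middle"
    # span the model must reconstruct from the surrounding prefix and suffix.
    i, j = sorted(random.sample(range(len(code)), 2))
    prefix, middle, suffix = code[:i], code[i:j], code[j:]
    # Prefix-suffix-middle ordering: the model sees prefix and suffix first, then
    # generates the middle autoregressively after the <MID> marker.
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"

snippet = "def add(a, b):\n    return a + b\n"
print(make_infilling_example(snippet))
```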




Code Llama: Open Foundation Models for Code

August 24, 2023

Abstract

We release Code Llama, a family of large language models for code based on Llama 2 providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks. We provide multiple flavors to cover a wide range of applications: foundation models (Code Llama), Python specializations (Code Llama - Python), and instruction-following models (Code Llama - Instruct) with 7B, 13B and 34B parameters each. All models are trained on sequences of 16k tokens and show improvements on inputs with up to 100k tokens. 7B and 13B Code Llama and Code Llama - Instruct variants support infilling based on surrounding content. Code Llama reaches state-of-the-art performance among open models on several code benchmarks, with scores of up to 53% and 55% on HumanEval and MBPP, respectively. Notably, Code Llama - Python 7B outperforms Llama 2 70B on HumanEval and MBPP, and all our models outperform every other publicly available model on MultiPL-E. We release Code Llama under a permissive license that allows for both research and commercial use.
 

bnew


Meet SeamlessM4T, the Meta AI model that can translate 100 languages into speech or text​


Woman using voice assistant on smartphone in the rain. Image Credit: Oscar Wong via Getty


As part of its broader effort to remove language barriers and keep people connected, Meta has developed a multilingual foundational model that can understand nearly 100 languages from speech or text and generate translations into either or both in real time.


Officially dubbed SeamlessM4T, the multimodal technology has been publicly released to help researchers build on the development and introduce universal applications capable of delivering speech-to-speech, speech-to-text, text-to-speech and text-to-text translations. It has been made available along with SeamlessAlign, a multimodal translation dataset totaling 265,000 hours of mined speech and text alignments.


The offering marks a significant development in AI’s application in linguistics given that it’s a single system performing multiple tasks across speech and text. Prior to this, the approach largely involved different systems for different tasks, such as a dedicated system for speech-to-speech translations.

What can SeamlessM4T do?​

As Meta explains, SeamlessM4T implicitly recognizes the source language without the need for a separate language identification model. It can detect speech and text in nearly 100 languages and produce text in nearly as many and speech in 36 languages. More interestingly, it can also figure out when more than one language has been mixed in the same sentence and provide translations in a single targeted language (like a sentence spoken in Telugu and Hindi and translated into English speech).


When tested with BLASER 2.0, which allows for evaluation across speech and text units, the model performed better against background noises and speaker variations in speech-to-text tasks (with average improvements of 37% and 48%, respectively) compared to the current state-of-the-art models.

“SeamlessM4T outperforms previous state-of-the-art competitors,” Meta said in a blog post. “We also significantly improve performance for low and mid-resource languages (with smaller digital footprint) supported, and maintain strong performance on high-resource languages (like English).”


When developed, this can lead to large-scale universal translation systems, allowing people who speak different languages to communicate more effectively.

Notably, Google is also working in this direction and has announced Universal Speech Model (USM), which can perform automatic speech recognition (ASR) for both widely-spoken and under-resourced languages.

How does it all work?​


To bring the model to life, Meta mined web data (tens of billions of sentences) and speech (4 million hours) from public sources and aligned them to create the SeamlessAlign dataset. In total, the company said it was able to align more than 443,000 hours of speech with texts and create about 29,000 hours of speech-to-speech alignments. Using this data, the company trained the multitask UnitY model to produce the desired multimodal outcomes.

“The multitask UnitY model consists of three main sequential components,” Meta explains. “Text and speech encoders have the task of recognizing inputs in nearly 100 languages. The text decoder then transfers that meaning into nearly 100 languages for text, followed by a text-to-unit model to decode into discrete acoustic units for 36 speech languages…The decoded discrete units are then converted into speech using a multilingual HiFi-GAN unit vocoder.”
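To visualize that data flow, here is a schematic sketch with hypothetical stub functions standing in for each component; it mirrors the pipeline Meta describes but is not the actual seamless_communication API.

```python
# A schematic sketch of the sequential UnitY-style pipeline described above.
# All functions are hypothetical stubs illustrating the data flow only.
from typing import List

def speech_encoder(audio: List[float]) -> List[float]:
    """Stand-in for the speech encoder: waveform -> semantic representation."""
    return audio[:16]  # dummy fixed-size "embedding"

def text_decoder(representation: List[float], tgt_lang: str) -> str:
    """Stand-in for the text decoder: representation -> target-language text."""
    return f"[{tgt_lang} translation]"

def text_to_units(text: str) -> List[int]:
    """Stand-in for the text-to-unit model: text -> discrete acoustic units."""
    return [hash(ch) % 1000 for ch in text]

def vocoder(units: List[int]) -> List[float]:
    """Stand-in for the multilingual HiFi-GAN unit vocoder: units -> waveform."""
    return [u / 1000.0 for u in units]

def speech_to_speech(audio: List[float], tgt_lang: str) -> List[float]:
    representation = speech_encoder(audio)         # 1. recognize the input speech
    text = text_decoder(representation, tgt_lang)  # 2. translate meaning into target-language text
    units = text_to_units(text)                    # 3. decode text into discrete acoustic units
    return vocoder(units)                          # 4. synthesize speech from the units

print(len(speech_to_speech([0.0] * 16000, "eng")))
```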

Not perfect yet​

That said, it is important to note that SeamlessM4T is far from perfect right now. Evaluations found that the model has both added toxicity (although 63% less than state-of-the-art models) and gender bias issues.

According to a whitepaper detailing the technology, SeamlessM4T overgeneralizes to masculine forms when translating from neutral terms (with an average preference of approximately 10%) while showing a lack of robustness when varying gender by an amount of about 3%.

“We detect toxicity in both the input and the output for the demo,” Meta said. “If toxicity is only detected in the output, it means that toxicity is added. In this case, we include a warning and do not show the output…Regarding bias, we have started our efforts on evaluating gender bias in languages at scale. We are now able to quantify gender bias in dozens of speech translation directions by extending to speech our previously designed Multilingual HolisticBias dataset.”

The company emphasized that this is an ongoing effort, and that it will continue to research and take action in these areas to further improve the robustness and safety of the SeamlessM4T model.
 

bnew









Meta Debuts SeamlessM4T, the Swiss Army Knife of Translation Models​



It recognizes speech (that is, automatically — as in automatic speech recognition). It translates speech into speech (or text), and text into text (or speech) — in 100+ languages. Meta’s new Massively Multilingual & Multimodal Machine Translation (SeamlessM4T) is the Swiss army knife of language models. Proud parent Meta introduced the new model in a blog post published on August 22, 2023.

The SeamlessM4T launch follows a number of language technology announcements by Meta over the past 12 months. These include low resource massively multilingual MT in mid 2022, massively multilingual speech translation in May 2023, and multilingual speech model Voicebox in June 2023. The social media giant is spending considerable resources on tackling the language problem of its metaverse vision.

On X, one observer described SeamlessM4T as “revolutionary” and called it a “game-changer.” Another gushed, “It’s not just a tool; it’s a step towards a world where everyone can be understood, regardless of language.”


“The code switching support of SeamlessM4T is pretty cool!” shared a fan with a sense of humor. “It doesn’t do very well with my French or Japanese, but then again neither is very good.”

One Dr. Hubertus Becker questioned the model’s reliability for critical translations, noting, “It’s concerning that an experimental demo can alter the meaning of input words.”

Kalev Leetaru, reporting on SeamlessM4T’s performance in translating Weibo social media posts, cited inconsistent results.

“For some posts it yields translations that compare favorably to both NMT and LLM translations, but with the added cost of having to use language-specific punctuation rules to split into sentences to translate a sentence at a time,” Leetaru explained. “For other posts, it yields subpar translations that can remove or truncate key details, suggesting promise but that it is not quite ready for production use.”

Better than Whisper?​

Of course, the more than 60 authors behind the August 22, 2023 paper introducing SeamlessM4T believe in what they dubbed “the first multilingual system” to translate from and into English for both speech and text.


If the stats behind SeamlessM4T’s training seem somewhat disparate, that might be because the model required training in so many (formerly) separate and siloed tasks. Similarly, the number of languages handled by the model varies by task.

SeamlessM4T can provide automatic speech recognition (ASR) for almost 100 languages; speech-to-text (STT) translation for nearly 100 input and output languages; speech-to-speech translation and text-to-speech translation for nearly 100 input languages and 36 output languages (including English); and traditional “text” translation for close to 100 languages.

According to the authors, Meta’s motivation for the new model was to work around the existing separate systems that can complete the above tasks — but generally perform well in only one modality per system.

SeamlessM4T, by contrast, reportedly achieves state-of-the-art results for all these languages while offering “multitask support” in a single model. The paper also asserts that SeamlessM4T outperforms its previous SOTA competitors, namely Whisper and AudioPaLM-2.

Meta has publicly released the contributions to its new model, and encourages researchers and developers to build on this first iteration.
 