bnew

Veteran
Joined
Nov 1, 2015
Messages
40,010
Reputation
6,952
Daps
126,620

notes.aimodels.fyi

LLMs can be extended to infinite sequence lengths without fine-tuning


AI models can analyze thousands of words at a time. A Google researcher has found a way to increase that by millions.



StreamingLLM gives language models unlimited context


 

bnew


No-code retrieval augmented generation (RAG) with LlamaIndex and ChatGPT


By
Ben Dickson
November 22, 2023



Retrieval augmented generation (RAG) is a crucial tool for working with large language models (LLMs). RAG enables LLMs to incorporate external documents into their responses, thereby aligning more closely with user requirements. This is particularly beneficial in areas where LLMs traditionally falter, especially when factuality is important.

Since the advent of ChatGPT and similar LLMs, a plethora of RAG tools and libraries have emerged. Here is what you need to know about how RAG works and how you can get started using it with ChatGPT, Claude, or an LLM of your choice.



The benefits of RAG

When you interact with a large language model, it draws upon the knowledge embedded in its training data to formulate a response. However, the training data is usually far larger than what the model’s parameters can faithfully store, so responses may not be entirely accurate. Moreover, the diverse information used in training can cause the LLM to conflate details, resulting in plausible yet incorrect answers, a phenomenon known as “hallucinations.”

In some instances, you might want the LLM to use information not encompassed in its training data, such as a recent news article, a scholarly paper, or proprietary company documents. This is where retrieval augmented generation comes into play.

RAG addresses these issues by equipping the LLM with pertinent information before it generates a response. This involves retrieving (hence the name) documents from an external source and inserting their contents into the conversation to provide context to the LLM.

This process enhances the model’s accuracy and enables it to formulate responses based on the provided content. Experiments show that RAG significantly curtails hallucinations. It also proves beneficial in applications requiring up-to-date or customer-specific information not included in the training dataset.

To put it simply, the difference between a standard LLM and a RAG-enabled LLM can be likened to two individuals answering questions. The former is like a person responding from memory, while the latter is someone provided with documents to read and answer questions based on their content.



How RAG works

RAG operates on a straightforward principle. It identifies one or more documents pertinent to your query, incorporates them into your prompt, and modifies the prompt to include instructions for the model to base its responses on these documents.

You can manually implement RAG by copy-pasting a document’s content into your prompt and instructing the model to formulate responses based on this document.

A RAG pipeline automates this process for efficiency. It begins by comparing the user’s prompts with a database of documents, retrieving those most relevant to the topic. The pipeline then integrates their content into the prompt and adds instructions to ensure the LLM adheres to the document’s content.
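As a minimal sketch of the prompt-assembly step described above, the snippet below takes documents that have already been retrieved and inserts them into the prompt with an instruction to rely on them. The function name `build_rag_prompt` and the instruction wording are illustrative assumptions, not part of any particular library.

```python
def build_rag_prompt(question: str, documents: list[str]) -> str:
    """Insert retrieved documents into the prompt and tell the model to rely on them."""
    # Label each document so the model (and the reader) can tell them apart.
    context = "\n\n".join(
        f"[Document {i}]\n{doc}" for i, doc in enumerate(documents, start=1)
    )
    return (
        "Answer the question using only the documents below. "
        "If the answer is not in the documents, say you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical documents standing in for the output of a retrieval step.
prompt = build_rag_prompt(
    "When was the policy updated?",
    ["The travel policy was updated in March.", "Expense reports are due monthly."],
)
print(prompt)
```

The resulting string is what would be sent to the LLM in place of the user's bare question.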



What do you need for a RAG pipeline?

Using embeddings and a vector database to retrieve relevant documents

While retrieval augmented generation is an intuitive concept, its execution requires the seamless integration of several components.

Firstly, you need the primary language model that generates responses. Alongside this, an embedding model is necessary to encode both documents and user prompts into numerical lists, or “embeddings,” which represent their semantic content.

Next, a vector database is required to store these document embeddings and retrieve the most relevant ones each time a user query is received. In some cases, a ranking model is also beneficial to further refine the order of the documents provided by the vector database.

For certain applications, you might want to incorporate an additional mechanism that segments the user prompt into several parts. Each of these segments requires its own unique embedding and documents, enhancing the precision and relevance of the responses generated.
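To make the retrieval components concrete, here is a toy sketch of the embed, store, and search loop. It assumes a bag-of-words count vector as a stand-in for a real embedding model and an in-memory list as a stand-in for a vector database; `embed`, `cosine` and `ToyVectorStore` are hypothetical names used only for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """In-memory stand-in for a vector database."""

    def __init__(self) -> None:
        self.items: list[tuple[Counter, str]] = []

    def add(self, text: str) -> None:
        # Store the embedding next to the original text.
        self.items.append((embed(text), text))

    def search(self, query: str, k: int = 2) -> list[str]:
        # Rank stored documents by similarity to the query embedding.
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[0]), reverse=True)
        return [text for _, text in ranked[:k]]


store = ToyVectorStore()
for doc in [
    "llamas are domesticated animals from south america",
    "vector databases store embeddings for similarity search",
    "streamlit builds web apps in python",
]:
    store.add(doc)

top = store.search("how do vector databases search embeddings", k=1)
print(top)
```

A real pipeline would swap in a learned embedding model and a proper vector database, but the flow of embedding the documents, embedding the query, and returning the nearest matches stays the same.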



How to get started with RAG with no code

No-code RAG with LlamaIndex and ChatGPT (source: LlamaIndex blog)


LlamaIndex recently released an open-source tool that allows you to develop a basic RAG application with almost no coding. While currently limited to single-file use, future enhancements may include support for multiple files and vector databases.

The project, named RAGs, is built on the Streamlit web application framework and LlamaIndex, a robust Python library particularly beneficial for RAG. If you’re comfortable with GitHub and Python, installation is straightforward: simply clone the repository, run the install command, and add your OpenAI API token to the configuration file as specified in the readme document.

RAGs is currently configured to work with OpenAI models. However, you can modify the code to use other models such as Anthropic Claude, Cohere models, or open-source models like Llama 2 hosted on your servers. LlamaIndex supports all these models.

The initial run of the application requires you to set up your RAG agent. This involves determining the settings, including the files, the size of chunks you want to break your file into, and the number of chunks to retrieve for each prompt.

Chunking plays a crucial role in RAG. When processing a large file, like a book or a multi-page research paper, it’s necessary to break it down into manageable chunks, such as 500 tokens. This allows the RAG agent to locate the specific part of the document relevant to your prompt.
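A rough sketch of fixed-size chunking is shown below. It assumes whitespace-separated words as a stand-in for model tokens (a real pipeline would use the tokenizer of its embedding model), and `chunk_tokens` is a hypothetical helper name.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 500) -> list[list[str]]:
    """Split a token list into consecutive chunks of at most `chunk_size` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


# A 1,200-word stand-in for a long document.
words = ("word " * 1200).split()
chunks = chunk_tokens(words, chunk_size=500)
print([len(chunk) for chunk in chunks])  # [500, 500, 200]
```

Each chunk is then embedded and stored separately, so retrieval can return the specific part of the file that matches the prompt.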

After completing these steps, the application creates a configuration file for your RAG agent and uses it to run the code. RAGs serves as a valuable tool to begin with retrieval augmentation and build upon. You can find the full guide here.
 

bnew


Stable Signature: A new method for watermarking images created by open source generative AI


October 6, 2023



AI-powered image generation is booming and for good reason: It’s fun, entertaining, and easy to use. While these models enable new creative possibilities, they may raise concerns about potential misuse from bad actors who may intentionally generate images to deceive people. Even images created in good fun could still go viral and potentially mislead people. For example, earlier this year, images appearing to show Pope Francis wearing a flashy white puffy jacket went viral. The images weren’t actual photographs, but plenty of people were fooled, since there weren’t any clear indicators to distinguish that the content was created by generative AI.


At FAIR, we’re excited about driving continued exploratory research in generative AI, but we also want to make sure we do so in a manner that prioritizes safety and responsibility. Today, together with Inria, we are excited to share a research paper and code that details Stable Signature, an invisible watermarking technique we created to distinguish when an image is created by an open source generative AI model. Invisible watermarking incorporates information into digital content. The watermark is invisible to the naked eye but can be detected by algorithms—even if people edit the images. While there have been other lines of research around watermarking, many existing methods create the watermark after an image is generated.

More than 11 billion images have been created using models from three open source repositories, according to Everypixel Journal. In this case, invisible watermarks can be removed simply by deleting the line that generates the watermark.







While the fact that these safeguards exist is a start, this simple tactic shows there’s plenty of potential for this feature to be exploited. The work we’re sharing today is a solution for adding watermarks to images that come from open source generative AI models. We’re exploring how this research could potentially be used in our models. In keeping with our approach to open science, we want to share this research with the AI community in the hope of advancing the work being done in this space.

How the Stable Signature method works



Stable Signature closes off the possibility of removing the watermark by rooting it in the model itself, so that the watermark can be traced back to where the image was created.

Let’s take a look at how this process works with the below chart.




[Chart: the Stable Signature process]



Alice trains a master generative model. Before distributing it, she fine-tunes a small part of the model (called the decoder) to root a given watermark for Bob. This watermark may identify the model version, a company, a user, etc.

Bob receives his version of the model and generates images. The generated images will carry the watermark of Bob. They can be analyzed by Alice or third parties to see if the image was generated by Bob, who used the generative AI model.

We achieve this in a two-step process:



  • First, two convolutional neural networks are jointly trained. One encodes an image and a random message into a watermark image, while the other extracts the message from an augmented version of the watermark image. The objective is to make the encoded and extracted messages match. After training, only the watermark extractor is retained.
  • Second, the latent decoder of the generative model is fine-tuned to generate images containing a fixed signature. During this fine-tuning, batches of images are encoded, decoded, and optimized to minimize the difference between the extracted message and the target message, as well as to maintain perceptual image quality. This optimization process is fast and effective, requiring only a small batch size and a short time to achieve high-quality results.
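The second step can be pictured as minimizing a combined loss. The sketch below is only an illustration under stated assumptions: binary cross-entropy stands in for "the difference between the extracted message and the target message," plain mean squared error stands in for the perceptual image-quality term, and the weight of 0.2 is an arbitrary placeholder. None of these choices are confirmed by the post.

```python
import math


def message_loss(logits: list[float], target_bits: list[int]) -> float:
    """Binary cross-entropy between extracted bit logits and the target signature."""
    total = 0.0
    for z, bit in zip(logits, target_bits):
        # Numerically stable form of -[bit*log(sigmoid(z)) + (1-bit)*log(1-sigmoid(z))].
        total += max(z, 0.0) - z * bit + math.log1p(math.exp(-abs(z)))
    return total / len(target_bits)


def image_loss(original: list[float], watermarked: list[float]) -> float:
    """Mean squared error, used here as a simple stand-in for a perceptual loss."""
    return sum((a - b) ** 2 for a, b in zip(original, watermarked)) / len(original)


def total_loss(logits, target_bits, original, watermarked, weight=0.2):
    """Message term plus a weighted image-quality term (weight is an assumed value)."""
    return message_loss(logits, target_bits) + weight * image_loss(original, watermarked)


signature = [1, 0, 1, 1]  # hypothetical 4-bit target signature
good = total_loss([4.0, -4.0, 4.0, 4.0], signature, [0.5, 0.5], [0.5, 0.6])
bad = total_loss([-4.0, 4.0, -4.0, -4.0], signature, [0.5, 0.5], [0.5, 0.6])
print(good < bad)  # True: matching the signature lowers the loss
```

Fine-tuning the decoder would push this quantity down, so that generated images both carry the fixed signature and stay close to what the unmodified decoder would have produced.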



Assessing the performance of Stable Signature



We know that people enjoy sharing and reposting images. What if Bob shared the image he created with 10 friends, who each then shared it with 10 more friends? During this time, it’s possible that someone could have altered the image, such as by cropping it, compressing it, or changing the colors. We built Stable Signature to be robust to these changes. No matter how a person transforms an image, the original watermark will likely remain in the digital data and can be traced back to the generative model where it was created.






During our research, we discovered two major advantages of Stable Signature over passive detection methods. First, we were able to control and reduce the generation of false positives, which occur when we mistake an image produced by humans for one generated by AI. This is crucial given the prevalence of non-AI-generated images shared online. For example, the most effective existing detection method can spot approximately 50% of edited generated images but still generates a false positive rate of approximately 1/100. Put differently, on a user-generated content platform receiving 1 billion images daily, around 10 million images would be incorrectly flagged to detect just half of the generated ones. On the other hand, Stable Signature detects images with the same accuracy at a false positive rate of 1e-10 (which can be set to a specific desired value). Moreover, our watermarking method allows us to trace images from various versions of the same model—a capability not possible with passive techniques.
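The arithmetic in this paragraph can be checked with a few lines. The sketch also shows one way a false positive rate "can be set to a specific desired value": assuming a non-watermarked image yields independent fair-coin bits, the chance that at least a given number of bits match the signature is a binomial tail, so a match threshold can be chosen to hit a target rate. The 48-bit signature length and the function names are illustrative assumptions, not details stated in the post.

```python
from fractions import Fraction
from math import comb


def expected_false_flags(images_per_day: int, fpr: Fraction) -> Fraction:
    """Expected number of human-made images wrongly flagged per day."""
    return images_per_day * fpr


def false_positive_rate(num_bits: int, threshold: int) -> Fraction:
    """P(at least `threshold` of `num_bits` fair-coin bits match the signature)."""
    tail = sum(comb(num_bits, m) for m in range(threshold, num_bits + 1))
    return Fraction(tail, 2 ** num_bits)


def min_threshold(num_bits: int, target_fpr: Fraction) -> int:
    """Smallest number of matching bits whose chance rate is at most `target_fpr`."""
    for threshold in range(num_bits + 1):
        if false_positive_rate(num_bits, threshold) <= target_fpr:
            return threshold
    raise ValueError("target rate is not reachable with this many bits")


daily_images = 1_000_000_000
passive = expected_false_flags(daily_images, Fraction(1, 100))       # 10,000,000 per day
watermark = expected_false_flags(daily_images, Fraction(1, 10**10))  # 1/10 per day
tau = min_threshold(48, Fraction(1, 10**10))  # 48 bits is an assumed signature length
print(passive, watermark, tau)
```

At a 1/100 rate, a billion daily images produce about 10 million false flags; at 1e-10, the expectation drops to roughly one false flag every ten days.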



How Stable Signature works with fine-tuning



A common practice in AI is to take foundational models and fine-tune them to handle specific use cases that are sometimes even tailored to one person. For example, a model could be shown images of Alice’s dog, and then Alice could ask for the model to generate images of her dog at the beach. This is done through methods like DreamBooth, Textual Inversion, and ControlNet. These methods act at the latent model level, and they do not change the decoder. This means that our watermarking method is not affected by these fine-tunings.

Overall, Stable Signature works well with vector-quantized image modeling (like VQGANs) and latent diffusion models (like Stable Diffusion). Since our method doesn’t modify the diffusion generation process, it’s compatible with the popular models mentioned above. We believe that, with some adaptation, Stable Signature could also be applied to other modeling methods.


Providing access to our technology



The use of generative AI is advancing at a rapid pace. Currently, there aren’t any common standards for identifying and labeling AI-generated content across the industry. In order to build better products, we believe advancements in responsibility research, like the work we’re sharing today, must exist in parallel.

We’re excited to share our work and give the AI research community access to these tools in the hope of driving continued collaboration and iteration. While it’s still early days for generative AI, we believe that by sharing our research, engaging with the community, and listening to feedback, we can all work together to ensure this impressive new technology is built, operated, and used in a responsible way.

The research we’re sharing today focuses on images, but in the future we hope to explore the potential of integrating our Stable Signature method across more generative AI modalities. Our model works with many popular open source models; however, there are still limitations. It does not scale to non-latent generative models, so it may not be future-proof against new generation technologies. By continuing to invest in this research, we believe we can chart a future where generative AI is used responsibly for exciting new creative endeavors.

This blog post reflects the work of Matthijs Douze and Pierre Fernandez. We'd like to acknowledge the contributions of Guillaume Couairon, Teddy Furon, and Hervé Jégou to this research.


Read the paper

Get the code
 

bnew


LLaVA: Visual Instruction Tuning

NeurIPS 2023 (Oral)

Haotian Liu*, Chunyuan Li*, Qingyang Wu, Yong Jae Lee
University of Wisconsin-Madison, Microsoft Research, Columbia University
*Equal Contribution



About

[NeurIPS'23 Oral] Visual Instruction Tuning: LLaVA (Large Language-and-Vision Assistant) built towards GPT-4V level capabilities.
llava.hliu.cc



🌋 LLaVA: Large Language and Vision Assistant

Visual instruction tuning towards large language and vision models with GPT-4 level capabilities.
[Project Page] [Demo] [Data] [Model Zoo]
🤝Community Contributions: [llama.cpp] [Colab] [🤗Space] [Replicate] [AutoGen] [BakLLaVA (LLaVA with Mistral-7B)]
Improved Baselines with Visual Instruction Tuning [Paper]
Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee
Visual Instruction Tuning (NeurIPS 2023, Oral) [Paper]
Haotian Liu*, Chunyuan Li*, Qingyang Wu, Yong Jae Lee (*Equal Contribution)

Release

  • [11/10] LLaVA-Plus is released: Learning to Use Tools for Creating Multimodal Agents, with LLaVA-Plus (LLaVA that Plug and Learn to Use Skills). [Project Page] [Demo] [Code] [Paper]
  • [11/6] Support Intel dGPU and CPU platforms. More details here.
  • [11/2] LLaVA-Interactive is released: Experience the future of human-AI multimodal interaction with an all-in-one demo for Image Chat, Segmentation, Generation and Editing. [Project Page] [Demo] [Code] [Paper]
  • [10/26] 🔥 LLaVA-1.5 with LoRA achieves performance comparable to full-model finetuning, with a reduced GPU RAM requirement (ckpts, script). We also provide a doc on how to finetune LLaVA-1.5 on your own dataset with LoRA.
  • [10/12] Check out the Korean LLaVA (Ko-LLaVA), created by ETRI, who has generously supported our research! [🤗 Demo]
  • [10/12] LLaVA is now supported in llama.cpp with 4-bit / 5-bit quantization support!
  • [10/11] The training data and scripts of LLaVA-1.5 are released here, and evaluation scripts are released here!
  • [10/10] Roboflow Deep Dive: First Impressions with LLaVA-1.5.
  • [10/5] 🔥 LLaVA-1.5 is out! It achieves SoTA on 11 benchmarks with just simple modifications to the original LLaVA, uses only publicly available data, completes training in ~1 day on a single 8-A100 node, and surpasses methods like Qwen-VL-Chat that use billion-scale data. Check out the technical report, and explore the demo! Models are available in Model Zoo.
  • [9/26] LLaVA is improved with reinforcement learning from human feedback (RLHF) to improve fact grounding and reduce hallucination. Check out the new SFT and RLHF checkpoints at project [LLavA-RLHF]
  • [9/22] LLaVA is accepted by NeurIPS 2023 as an oral presentation, and LLaVA-Med is accepted by the NeurIPS 2023 Datasets and Benchmarks Track as a spotlight presentation.
  • [9/20] We summarize our empirical study of training 33B and 65B LLaVA models in a note. Further, if you are interested in a comprehensive review of the evolution and trends of multimodal foundation models, please check out our recent survey paper “Multimodal Foundation Models: From Specialists to General-Purpose Assistants”.
  • [7/19] 🔥 We release a major upgrade, including support for LLaMA-2, LoRA training, 4-/8-bit inference, higher resolution (336x336), and a lot more. We release LLaVA Bench for benchmarking open-ended visual chat with results from Bard and Bing-Chat. We also support and verify training with RTX 3090 and RTX A6000. Check out LLaVA-from-LLaMA-2, and our model zoo!
  • [6/26] CVPR 2023 Tutorial on Large Multimodal Models: Towards Building and Surpassing Multimodal GPT-4! Please check out [Slides] [Notes] [YouTube] [Bilibili].
  • [6/11] We released the preview for the most requested feature: DeepSpeed and LoRA support! Please see the documentation here.
  • [6/1] We released LLaVA-Med: Large Language and Vision Assistant for Biomedicine, a step towards building biomedical domain large language and vision models with GPT-4 level capabilities. Check out the paper and page.
  • [5/6] We are releasing LLaVA-Lightning-MPT-7B-preview, based on MPT-7B-Chat! See here for more details.
  • [5/2] 🔥 We are releasing LLaVA-Lightning! Train a lite, multimodal GPT-4 with just $40 in 3 hours! See here for more details.
  • [4/27] Thanks to the community effort, LLaVA-13B with 4-bit quantization allows you to run on a GPU with as few as 12GB VRAM! Try it out here.
  • [4/17] 🔥 We released LLaVA: Large Language and Vision Assistant. We propose visual instruction tuning, towards building large language and vision models with GPT-4 level capabilities. Check out the paper and demo.
Usage and License Notices: The data and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaMA, Vicuna and GPT-4. The dataset is CC BY NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.
 

bnew


Is This the End of ‘Intel Inside’?

Newcomers pose numerous challenges to decades of ‘Wintel’ chip dominance.


JASON SCHNEIDER

By
Christopher Mims
Dec. 1, 2023 9:00 pm ET



It might not look like it yet, but Intel is in a fight for its life.

The stakes for its employees and investors are high, and are likely to turn on some fierce battles for market share that will play out in 2024 and beyond.

For the everyday consumer, what’s at stake is mostly nostalgia. One day, the little “Intel Inside” sticker that’s been on PCs since 1991 could cease to exist.

Instead of an Intel chip, these computers could have processors from an array of manufacturers, principally Qualcomm, but also possibly Nvidia, AMD, and lesser-known companies like Santa Clara, Calif.-based Amlogic and Taiwan-based MediaTek.

What’s happening now is a tipping point decades in the making. Ever since a little chip-design company called ARM built the mobile processor for Apple’s first Newton personal digital assistant, which came out in 1993, it’s been gaining steam, primarily in the mobile-phone business. By the time Intel sought to enter the mobile-processor business in 2011, it was too late.

Apple was the first company to bet that ARM-based processors—thought by many to be useful only in phones—could be the brains of even the most powerful desktop computers. This gave Apple a huge head start over Intel, and the rest of the industry, in designing chips that prioritized power-sipping performance in a world where that’s become the primary limiting factor in the performance of all devices, not just phones.

Now, Google, Qualcomm, Amazon, Apple and others can use ARM’s blueprints to custom-design the chips that power everything from phones and notebooks to cloud servers. These chips are then typically produced by Samsung or Taiwan-based TSMC, which focus on making chips for other companies.

The threats to Intel are so numerous that it’s worth summing them up: The Mac and Google’s Chromebooks are already eating the market share of Windows-based, Intel-powered devices. As for Windows-based devices, all signs point to their increasingly being based on non-Intel processors. Finally, Windows is likely to run on the cloud in the future, where it will also run on non-Intel chips.

Apple has moved almost entirely away from Intel’s chips, which it used for over a decade for all of its desktop and notebook computers. At the same time, its overall market share for desktops and notebooks has climbed from around 12% of devices in the U.S. in 2013 to nearly one in three today, according to Statcounter.

These days, it’s not just Apple moving away from Intel’s chips. Microsoft is accelerating its yearslong effort to make Windows run on ARM-based processors, so that the entire PC ecosystem isn’t doomed by Intel’s failure to keep up with Apple and TSMC. Google’s Chrome OS, which works with either Intel or ARM-based chips, is also an emerging threat to Microsoft.

This means the threat to Intel comes from a whole ecosystem of companies with deep pockets and sizable profit margins, each trying to take their piece of the company’s market share. In many ways, it really is Intel versus the world—and “the world” includes nearly every tech giant you can name.

It wasn’t always this way. For decades, Intel enjoyed PC market dominance with its ride-or-die partner, Microsoft, through their “Wintel” duopoly.

It’s ironic, then, that Microsoft is one of the companies leading the charge away from Intel’s chips.

This estrangement is taking several forms, which shows how seriously Microsoft is taking this shift away from Intel. Microsoft declined to comment for this column.

Microsoft is working to make Windows and the rest of its software accessible in the cloud, which can save money for customers because it lets them use computers that are much cheaper and simpler than conventional PCs. It also means that ARM-based devices can be put on workers’ desks in place of more powerful, Intel-powered ones. And the version of Windows that workers are accessing remotely, in the cloud, can run on ARM-based chips in the data center too.

In mid-November, Microsoft unveiled its first ARM-based custom chips. One of them, called Cobalt, is intended to live in data centers and could power such cloud-based Windows experiences.

These efforts are getting a boost from Amazon, which recently unveiled a small cube-shaped PC-like device that can stream Windows and applications from the cloud—like Netflix, but for software instead of entertainment. It’s a repurposed Fire TV Cube streaming device, costs $200, and is powered by an ARM-based chip from Amlogic.

Qualcomm also has forthcoming ARM-based chips for notebook computers, but these are intended not merely to connect these devices to the cloud. Rather, they’ll directly replace Intel’s processors, handling heavy workloads within the device itself. At the same time, they’re intended to go head-to-head with Apple’s best chips. Key to their adoption: Microsoft is putting a huge amount of effort into making Windows run on these processors, while encouraging developers of apps to do the same.

I asked Dan Rogers, vice president of silicon performance at Intel, if all of this is keeping him up at night. He declined to comment on Intel’s past, but he did say that since Pat Gelsinger, who had spent the first 30 years of his career at Intel, returned to the company as CEO in 2021, “I believe we are unleashed and focused, and our drive in the PC has in a way never been more intense.”




Intel plans a new generation of chips in what Rogers calls the “thin and light” category of notebooks, where Apple has been beating the pants off Intel-powered Windows devices.

In terms of advanced chip-manufacturing technology, Intel has promised to catch up with its primary competitor, Taiwan-based TSMC, by 2025.

The consumer-electronics business is full of reversals, and Intel is still a strong competitor, so none of this is predestined.

Geopolitical factors, for one, have the potential to change the entire chip industry virtually overnight. Intel could suddenly become the only game in town for the most advanced kind of chip manufacturing, if American tech companies lose access to TSMC’s factories on account of China’s aggression toward Taiwan, says Patrick Moorhead, a former executive at Intel competitor AMD, and now head of tech analyst firm Moor Insights & Strategy.

When it comes to Intel, he adds, “Never count these guys out.”


Write to Christopher Mims at christopher.mims@wsj.com
 

bnew



Introducing Ego-Exo4D: A foundational dataset for research on video learning and multimodal perception

November 30, 2023



Today we are announcing Ego-Exo4D, a foundational dataset and benchmark suite to support research on video learning and multimodal perception. The result of a two-year effort by Meta’s FAIR (Fundamental Artificial Intelligence Research), Meta’s Project Aria, and 15 university partners, the centerpiece of Ego-Exo4D is its simultaneous capture of both first-person “egocentric” views, from a participant’s wearable camera, as well as multiple “exocentric” views, from cameras surrounding the participant. The two perspectives are complementary. While the egocentric perspective reveals what the participant sees and hears, the exocentric views reveal the surrounding scene and the context. Together, these two perspectives give AI models a new window into complex human skill.




Working together as a consortium, FAIR and its university partners captured these perspectives with the help of more than 800 skilled participants in the United States, Japan, Colombia, Singapore, India, and Canada. In December, the consortium will open source the data (including more than 1,400 hours of video) and annotations for novel benchmark tasks. Additional details about the datasets can be found in our technical paper. Next year, we plan to host a first public benchmark challenge and release baseline models for ego-exo understanding. Each university partner followed its own formal review processes to establish the standards for collection, management, informed consent, and a license agreement prescribing proper use. Each member also followed the Project Aria Community Research Guidelines. With this release, we aim to provide the tools the broader research community needs to explore ego-exo video, multimodal activity recognition, and beyond.


How Ego-Exo4D works

Ego-Exo4D focuses on skilled human activities, such as playing sports, music, cooking, dancing, and bike repair. Advances in AI understanding of human skill in video could facilitate many applications. For example, in future augmented reality (AR) systems, a person wearing smart glasses could quickly pick up new skills with a virtual AI coach that guides them through a how-to video; in robot learning, a robot watching people in its environment could acquire new dexterous manipulation skills with less physical experience; in social networks, new communities could form based on how people share their expertise and complementary skills in video.

Such applications demand the ability to move fluidly between the exo and ego views. For example, imagine watching an expert repair a bike tire, juggle a soccer ball, or fold an origami swan—then being able to map their steps to your own actions. Cognitive science tells us that even from a very young age we can observe others’ behavior (exo) and translate it onto our own (ego).

Realizing this potential, however, is not possible using today's datasets and learning paradigms. Existing datasets comprised of both ego and exo views (i.e., ego-exo) are few, small in scale, lack synchronization across cameras, and/or are too staged or curated to be resilient to the diversity of the real world. As a result, the current literature for activity understanding primarily covers only the ego or exo view, leaving the ability to move fluidly between the first- and third-person perspectives out of reach.

Ego-Exo4D constitutes the largest public dataset of time-synchronized first- and third-person video. Building this dataset required the recruitment of specialists across varying domains, bringing diverse groups of people together to create a multifaceted AI dataset. All scenarios feature real-world experts, where the camera-wearer participant has specific credentials, training, or expertise in the skill being demonstrated. For example, among the Ego-Exo4D camera wearers are professional and college athletes; jazz, salsa, and Chinese folk dancers and instructors; competitive boulderers; professional chefs who work in industrial-scale kitchens; and bike technicians who service dozens of bikes per day.

Ego-Exo4D is not only multiview, it is also multimodal. Captured with Meta’s unique Aria glasses, all ego videos are accompanied by time-aligned seven-channel audio, inertial measurement unit (IMU) data, and two wide-angle grayscale camera streams, among other sensor data. All data sequences also provide eye gaze, head poses, and 3D point clouds of the environment through Project Aria’s state-of-the-art machine perception services. Additionally, Ego-Exo4D provides multiple new video-language resources:


  • First-person narrations by the camera wearers describing their own actions.
  • Third-person play-by-play descriptions of every camera wearer action.
  • Third-person spoken expert commentary critiquing the videos. We hired 52 people with expertise in particular domains, many of them coaches and teachers, to provide tips and critiques based on the camera wearer’s performance. At each time step, the experts explain how the participants’ actions, such as their hand and body poses, affect their performance, and provide spatial markings to support their commentary.

All three language corpora are time-stamped against the video. With these novel video-language resources, AI models could learn about the subtle aspects of skilled human activities. To our knowledge, there is no prior video resource with such extensive and high quality multimodal data.

Alongside the data, we introduce benchmarks for foundational tasks for ego-exo video to spur the community's efforts. We propose four families of tasks:


  1. Ego(-exo) recognition: recognizing fine-grained keysteps of procedural activities and their structure from ego (and/or optionally exo) video, even in energy-constrained scenarios;
  2. Ego(-exo) proficiency estimation: inferring how well a person is executing a skill;
  3. Ego-exo relation: relating the actions of a teacher (exo) to a learner (ego) by estimating semantic correspondences and translating viewpoints; and
  4. Ego pose: recovering the skilled movements of experts from only monocular ego-video, namely 3D body and hand pose.

We provide high quality annotations for training and testing each task—the result of more than 200,000 hours of annotator effort. To kickstart work in these new challenges, we also develop baseline models and report their results. We plan to host a first public benchmark challenge in 2024.



406886129_866085048493148_3000829008893060003_n.jpg

406883464_1383969772522837_609021011767765469_n.jpg

405203662_841012317803452_9637809064688481_n.jpg

405314124_3506833022925259_8114322311954892666_n.jpg

405286924_887884849151659_1066589885864524765_n.jpg




Collaboratively building on this research

The Ego4D consortium is a long-running collaboration between FAIR and more than a dozen universities around the world. Following the 2021 release of Ego4D, this team of expert faculty, graduate students, and industry researchers reconvened to launch the Ego-Exo4D effort. The consortium’s strengths are both its collective AI talent and its geographic breadth, which facilitates recording data in a wide variety of visual contexts. Overall, Ego-Exo4D includes video from six countries and seven U.S. states, offering a diverse resource for AI development. The consortium members and FAIR researchers collaborated throughout the project, from developing the initiative’s scope, to each collecting unique components of the dataset, to formulating the benchmark tasks. This project also marks the single largest coordinated deployment of the Aria glasses in the academic research community, with partners at 12 different sites using them.

In releasing this resource of unprecedented scale and variety, the consortium aims to supercharge the research community on core AI challenges in video learning. As this line of research advances, we envision a future where AI enables new ways for people to learn new skills in augmented reality and mixed reality (AR/MR), where how-to videos come to life in front of the user, and the system acts as a virtual coach to guide them through a new procedure and offer advice on how to improve. Similarly, we hope it will enable robots of the future that gain insight about complex dexterous manipulations by watching skilled human experts in action. Ego-Exo4D is a critical stepping stone to enable this future, and we can’t wait to see what the research community creates with it.




Visit the Ego-Exo4D website

Read the paper

Learn more about Project Aria Research Kit





 

bnew



A Language Agent for Autonomous Driving

[Submitted on 17 Nov 2023 (v1), last revised 27 Nov 2023 (this version, v3)]


Jiageng Mao, Junjie Ye, Yuxi Qian, Marco Pavone, Yue Wang
Human-level driving is an ultimate goal of autonomous driving. Conventional approaches formulate autonomous driving as a perception-prediction-planning framework, yet their systems do not capitalize on the inherent reasoning ability and experiential knowledge of humans. In this paper, we propose a fundamental paradigm shift from current pipelines, exploiting Large Language Models (LLMs) as a cognitive agent to integrate human-like intelligence into autonomous driving systems. Our approach, termed Agent-Driver, transforms the traditional autonomous driving pipeline by introducing a versatile tool library accessible via function calls, a cognitive memory of common sense and experiential knowledge for decision-making, and a reasoning engine capable of chain-of-thought reasoning, task planning, motion planning, and self-reflection. Powered by LLMs, our Agent-Driver is endowed with intuitive common sense and robust reasoning capabilities, thus enabling a more nuanced, human-like approach to autonomous driving. We evaluate our approach on the large-scale nuScenes benchmark, and extensive experiments substantiate that our Agent-Driver significantly outperforms the state-of-the-art driving methods by a large margin. Our approach also demonstrates superior interpretability and few-shot learning ability to these methods. Code will be released.
Comments: Project Page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
Cite as: arXiv:2311.10813 [cs.CV] (or arXiv:2311.10813v3 [cs.CV] for this version)

Submission history

From: Jiageng Mao [view email]
[v1] Fri, 17 Nov 2023 18:59:56 UTC (6,479 KB)
[v2] Tue, 21 Nov 2023 01:24:36 UTC (6,479 KB)
[v3] Mon, 27 Nov 2023 20:53:35 UTC (15,211 KB)
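
The abstract above mentions "a versatile tool library accessible via function calls." As a rough illustration only, and not the authors' code, such a library could be sketched as a registry that maps tool names to Python callables, which a planner invokes through a structured request. Every name here (`ToolLibrary`, `detect_objects`, the scene dictionaries, the request format) is a hypothetical stand-in.

```python
from typing import Any, Callable, Dict, List


class ToolLibrary:
    """Hypothetical registry mapping tool names to plain Python callables."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}

    def register(self, name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
        # Decorator that stores the function under the given tool name.
        def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
            self._tools[name] = fn
            return fn
        return decorator

    def call(self, request: Dict[str, Any]) -> Any:
        # A request names a tool and supplies keyword arguments,
        # similar in spirit to an LLM function-call payload.
        name = request["name"]
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return self._tools[name](**request.get("arguments", {}))


tools = ToolLibrary()


@tools.register("detect_objects")
def detect_objects(scene: List[Dict[str, Any]], max_distance_m: float) -> List[str]:
    # Toy perception tool: return the kinds of objects within range.
    return [obj["kind"] for obj in scene if obj["distance_m"] <= max_distance_m]


# Made-up scene data, for illustration only.
scene = [
    {"kind": "pedestrian", "distance_m": 8.0},
    {"kind": "truck", "distance_m": 42.0},
]
request = {"name": "detect_objects", "arguments": {"scene": scene, "max_distance_m": 20.0}}
nearby = tools.call(request)
```

In a real system the `request` dictionary would be produced by the language model rather than written by hand; the registry only makes sure the named tool exists before dispatching to it.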

 

bnew



AutoGen

Build LLM applications via multiple agents

AutoGen_header_1920x720.jpg


AutoGen provides a multi-agent conversation framework as a high-level abstraction. It is an open-source library for enabling next-generation LLM applications with multi-agent collaborations, teachability and personalization. With this framework, users can build LLM workflows. The agent modularity and conversation-based programming simplify development and enable reuse for developers. End-users benefit from multiple agents independently learning and collaborating on their behalf, enabling them to accomplish more with less work. Benefits of the multi-agent approach with AutoGen include agents that can be backed by various LLM configurations; native support for a generic form of tool usage through code generation and execution; and a special agent, the Human Proxy Agent, that enables easy integration of human feedback and involvement at different levels.



Easily build LLM workflows

With AutoGen, building a complex multi-agent conversation system boils down to:


  • Defining a set of agents with specialized capabilities and roles.
  • Defining the interaction behavior between agents, i.e., what to reply when an agent receives messages from another agent.
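
The two steps above can be sketched in plain Python. This is a toy illustration of the idea and deliberately does not use the AutoGen library or its API: `ToyAgent`, `run_chat`, and the two reply functions are hypothetical names, and the "agents" are simple rule-based functions rather than LLM-backed ones.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# A reply function maps (sender name, message) to a reply, or None to stop.
ReplyFn = Callable[[str, str], Optional[str]]


@dataclass
class ToyAgent:
    """Step 1: an agent is a name (its role) plus a reply behavior."""
    name: str
    reply_fn: ReplyFn
    transcript: List[Tuple[str, str]] = field(default_factory=list)

    def receive(self, sender: "ToyAgent", message: str) -> Optional[str]:
        self.transcript.append((sender.name, message))
        return self.reply_fn(sender.name, message)


def coder_reply(sender: str, message: str) -> Optional[str]:
    # Step 2: what the coder says when it receives a message.
    if message.startswith("TASK:"):
        return "CODE: print('hello')"
    return None  # anything else (e.g. approval) ends the conversation


def reviewer_reply(sender: str, message: str) -> Optional[str]:
    # Step 2: what the reviewer says when it receives a message.
    if message.startswith("CODE:"):
        return "APPROVED"
    return None


def run_chat(starter: ToyAgent, responder: ToyAgent, opening: str,
             max_turns: int = 6) -> List[Tuple[str, str]]:
    """Alternate messages between two agents until one stops replying."""
    log = [(starter.name, opening)]
    speaker, listener, message = starter, responder, opening
    for _ in range(max_turns):
        reply = listener.receive(speaker, message)
        if reply is None:
            break
        log.append((listener.name, reply))
        speaker, listener, message = listener, speaker, reply
    return log


coder = ToyAgent("coder", coder_reply)
reviewer = ToyAgent("reviewer", reviewer_reply)
log = run_chat(reviewer, coder, "TASK: print hello")
```

Swapping a rule-based reply function for one that calls a language model would change an agent's capability without touching the conversation loop, which is the kind of modularity the description refers to.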

AutoGen

Read the paper



Related projects

AutoGen is an open-source, community-driven project under active development (as a spinoff from FLAML, a fast library for automated machine learning and tuning), which encourages contributions from individuals of all backgrounds. Many Microsoft Research collaborators have made great contributions to this project, including academic contributors like Pennsylvania State University and the University of Washington, and product teams like Microsoft Fabric and ML.NET. AutoGen aims to provide an effective and easy-to-use framework for developers to build next-generation applications, and already demonstrates promising opportunities to build creative applications and provide a large space for innovation.

More about FLAML
 

bnew










https://medium.com/slope-stories/slope-transformer-the-first-llm-trained-to-understand-the-language-of-banks-88adbb6c8da9


Slope TransFormer: The first LLM trained to understand the language of banks


Alex Wu


1*FbKukA5DdQAbrwWjvlE9Eg.gif

Today, we’re excited to share that we’ve developed the first Large Language Model (LLM) trained specifically to understand the language of banks: Slope TransFormer. It categorizes messy bank transaction data with speed and accuracy that surpass Plaid, ChatGPT, and humans. As the successor to SlopeGPT, it is the first LLM we’ve trained in-house.


We will share the motivation for it, the methodology used, and its results — including how it stacks up to existing solutions. We will end with some immediate applications, and how it fits into our vision of redefining how underwriting is done.


Why do we care about transactions?

First, some context. At Slope, we spend a lot of time with bank transactions. Why? Simply put, we are a payments company, and central to every payments company is risk. To that end, there is no better way to understand a business — what it’s been up to, its financial outlook, its fraud risk — than looking at every $ that flows in and out of it. The transaction, as we see it, is the atomic unit of business. It is the lifeblood.

Additionally, bank transactions have 2 critical properties:


  • Real-time. Thanks to Open Banking (e.g. Plaid), once a business connects their bank accounts, we will see every new $ that flows in and out of the business in real-time.
  • Unfalsifiable. Thanks to banks, a transaction is proof of an exchange of money. One cannot fake a transaction that’s pulled directly from their bank’s records (contrast this to an income statement).

At Slope, we strive to understand our customers deeply. Doing so not only enables us to assess risk, but fundamentally to build better products for our customers: from AR automation, to payments, to financing that’s personalized to a business’s unique needs. Transaction fluency, therefore, is a fundamental problem for Slope.


However, transactions are hard to understand.

The issue is that transactions are not written in English, or even a single language, for that matter. It is a language of many dialects: a single transaction type can be expressed 10 different ways across 10 different banks:

1*lqMOF7-bdjCDgiopH5zzDQ.png

These are all payments from Shopify.

Additionally, a transaction can be complex. It may have components that represent different counterparties, channels, and intermediaries which obscure the true flow of money. This opaqueness is only furthered by the rise of payment processors and middlemen (e.g. PayPal, Zelle, and even Slope). Can you see where the money is going here?

BILL.COM DES:ACCTVERIFY ID:025AYXVFMTBCRRX INDN:DAVID VAN ARCH CO ID:XXXXX634527 CCD

If you consider the combinations of (bank dialects X merchants X intermediaries) — and also that a “merchant” can be any individual or entity in the world, and that new intermediaries are spawning every day — it becomes clear that transactions cannot be solved with traditional, rules-based methods. It is a high-dimensional, long-tail problem that even specialist companies often struggle to get right.

1*D3E6UgIcGnMKxzeETI65pQ.png


What about existing solutions?

Plaid

As our Open Banking provider, Plaid serves us transaction data pulled directly from our customers’ bank accounts. On top of this, Plaid tags the counterparty of each transaction (e.g. Shopify). But only sometimes. We found that Plaid gives us less than 50% coverage across our customers’ transactions:

1*Rc-osijaGJ5LYkcLK20Cqg.png

And even when tags are provided, they can be noisy. Some examples:


  1. Noisy labels for even well-known merchants:

1*Op0td2cfuJwtuNHs7x-V2w.png

  2. Confusing the person, Aldo, for the company, Aldo:


1*ACtHw1CouVmceX2LgXmvZw.png

  3. A single description resulting in a wide range of labels:


1*unmDVeXlOpfMV6mEGjJCqA.png

While some of these mistakes may seem elementary, producing accurate tags on a consistent basis is a deceptively difficult task – with many hidden tradeoffs. For the most part, Plaid does a very good job. But in our application — B2B risk assessment — we have especially strict requirements when it comes to accuracy, coverage, and explainability. We cannot afford a mistake with so much on the line.



ChatGPT

What about LLMs? There are 2 promising properties of LLMs in the context of transaction tagging: 1) their ability to extract meaning from unstructured data and 2) their pre-trained knowledge of the world. Here are some of our experiments with ChatGPT:

1*NkmLb0_gPTtBmknRfmvBxA.png

Wrong answer & super wordy.

1*_b_ikfrmGd4UJsxWd0Q5yA.png

Better with some prompt engineering, but still wordy.

Assuming we solve for accuracy and wordiness, there are still fundamental issues with a chat-based approach: unpredictability (the same prompt asked 10 times may give you 10 different responses) and scalability (it is slow and expensive to hit an API thousands of times for a single customer). Yet, we saw promise. We began to believe that in some form, LLMs held the key to our problem.


SlopeGPT

Earlier this year, we launched SlopeGPT. Using GPT embeddings, we clustered transactions by semantic similarity. This allowed us to reliably group transactions into distinct cashflows without explicitly labeling them. Additionally, as the clustering happened at the customer level, the cashflows were fit uniquely to each business.
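
Clustering by embedding similarity can be sketched as follows. This is not Slope's implementation: the post does not say which clustering algorithm was used, so a simple greedy threshold scheme is assumed here, and `toy_embed` (character trigram counts) is a stand-in for real GPT embeddings so that the example runs without any API. The transaction strings and the 0.5 threshold are made up.

```python
import math
from collections import Counter
from typing import List


def toy_embed(text: str) -> Counter:
    """Stand-in for an embedding model: counts of character trigrams."""
    s = text.lower()
    return Counter(s[i:i + 3] for i in range(len(s) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(descriptions: List[str], threshold: float = 0.5) -> List[List[int]]:
    """Greedy clustering: join the most similar cluster seed, or start a new cluster."""
    vectors = [toy_embed(d) for d in descriptions]
    clusters: List[List[int]] = []
    for i, vec in enumerate(vectors):
        best_cluster, best_sim = None, threshold
        for members in clusters:
            sim = cosine(vec, vectors[members[0]])  # compare against the seed
            if sim >= best_sim:
                best_cluster, best_sim = members, sim
        if best_cluster is None:
            clusters.append([i])
        else:
            best_cluster.append(i)
    return clusters


# Hypothetical descriptions for one customer.
transactions = [
    "SHOPIFY PAYOUT 1234",
    "SHOPIFY PAYOUT 5678",
    "GUSTO PAYROLL 0001",
    "GUSTO PAYROLL 0002",
]
groups = cluster(transactions)
```

Note that the output is a list of index groups with no names attached, which mirrors the first limitation listed in the post: the clusters say which transactions belong together, not what the cashflow is.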

The impact was massive: from raw transactions emerged a rich story. We could now see individual streams of incomes and expenses, how they changed over time, and where they were headed. It was a major leap forward in our ability to understand our customers. Still, it had limitations:

  1. The resulting clusters were unlabeled: it could tell you which transactions likely belonged to the same cashflow streams, but not what those streams were.
  2. It was not optimized for financial data. We used out-of-the-box GPT embeddings, meaning we used English semantic similarity as a proxy for transaction semantic similarity. It worked surprisingly well, but we believed we could do better.
  3. It was slow: ~500 ms/txn. This may seem fast, but a single customer may have thousands of transactions. Our SLA for underwriting is 7s.

We’re excited to say that TransFormer overcomes all these limitations.
 

bnew


Meet Slope TransFormer: A Large Language Model (LLM) Trained Specifically to Understand the Language of Banks

By

Niharika Singh

November 28, 2023

In payments, understanding transactions is crucial for assessing risks in businesses. However, deciphering messy bank transaction data poses a challenge, as it is expressed in various ways across different banks. Existing solutions like Plaid and ChatGPT have limitations, such as low coverage and wordiness. To address this, a new solution called Slope TransFormer has been developed—a Large Language Model (LLM) specifically trained to understand the language of banks.

Transactions are challenging to understand because they come in different forms, making traditional, rules-based methods ineffective. Plaid, a standard Open Banking provider, tags less than 50% of transactions, and its labels can be noisy and confusing. LLMs like ChatGPT promise to extract meaning from unstructured data but struggle with unpredictability and scalability.

Slope TransFormer, the new solution, overcomes these challenges by being a proprietary LLM fine-tuned to extract meaning from bank transactions. It addresses the limitations of its predecessor, SlopeGPT, by providing accurate and concise counterparty labels in an interpretable way. The key to its success lies in defining a new language during training, focusing solely on extracting the merchant name from transactions.

Using an efficient base model, OPT-125M, and a fine-tuning algorithm called LoRA, TransFormer achieves remarkable speed—labeling over 500 transactions per second, a 250x speedup over SlopeGPT. It boasts over 72% exact match accuracy against human experts, outperforming Plaid, which achieves only 62%. The solution is accurate and highly consistent, making it reliable in a production system.
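
For readers unfamiliar with the metric, exact match accuracy is the fraction of predictions that equal the reference label. The sketch below assumes a case-insensitive comparison with whitespace collapsed; the article does not say how Slope normalizes labels, and the label lists are made up for illustration.

```python
from typing import Sequence


def normalize(label: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(label.lower().split())


def exact_match_accuracy(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Share of predictions that equal the reference label after normalization."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("no labels to score")
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predicted, reference))
    return hits / len(reference)


# Hypothetical labels: human expert tags versus model output.
human = ["Shopify", "Bill.com", "Gusto", "Stripe"]
model = ["shopify", "Bill.com", "Gusto Inc", "Stripe "]
score = exact_match_accuracy(model, human)
```

Here three of the four predictions match, so `score` is 0.75. The metric is strict: "Gusto Inc" earns no credit against "Gusto" even though a human might accept it.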




TransFormer’s performance has already led to its deployment in live credit monitoring dashboards. Its efficiency and functionality provide a detailed view into businesses, allowing for monitoring changing risks, alerting to abnormal events, and applying automated adjustments. The ultimate goal is to use TransFormer to power the entire underwriting system, reaching a precise understanding of businesses beyond traditional financials.

In conclusion, Slope TransFormer marks a significant milestone in redefining how underwriting is done in the B2B economy. Its efficiency, accuracy, and interpretability pave the way for a more precise understanding of businesses, unlocking new real-time signals to monitor and manage risks. This advancement aligns with the broader vision of SlopeAI to digitize the world’s B2B economy, using AI to automate workflows and eliminate inefficiencies that have hindered progress for decades.
 

bnew


Law secretly drafted by ChatGPT makes it onto the books

'Unfortunately or fortunately, this is going to be a trend'


Katyanna Quach
Sat 2 Dec 2023 // 17:24 UTC


The council of Porto Alegre, a city in southern Brazil, has approved legislation drafted by ChatGPT.

The ordinance is supposed to prevent the city from charging taxpayers to replace any water meters stolen by thieves. A vote from 36 members of the council unanimously passed the proposal, which came into effect in late November.

But what most of them didn't know was that the text for the proposal had been generated by an AI chatbot, until councilman Ramiro Rosário admitted he had used ChatGPT to write it.

"If I had revealed it before, the proposal certainly wouldn't even have been taken to a vote," he told the Associated Press.

This is the first-ever legislation written by AI to be passed by lawmakers that us vultures know about; if you know of any other robo-written laws, contracts, or interesting stuff like that, do let us know. To be clear, ChatGPT was not asked to come up with the idea but was used as a tool to write up the fine print. Rosário said he used a 49-word prompt to instruct OpenAI's erratic chatbot to generate the complete draft of the proposal.

At first, the city's council president Hamilton Sossmeier disapproved of his colleague's methods and thought Rosário had set a "dangerous precedent." He later changed his mind, however, and said: "I started to read more in depth and saw that, unfortunately or fortunately, this is going to be a trend."

Sossmeier may be right. In the US, Massachusetts state Senator Barry Finegold and Representative Josh Cutler made headlines earlier this year for their bill titled: "An Act drafted with the help of ChatGPT to regulate generative artificial intelligence models like ChatGPT."

The pair believe machine-learning engineers should include digital watermarks in any text generated by large language models to detect plagiarism (and presumably allow folks to know when stuff is computer-made); obtain explicit consent from people before collecting or using their data for training neural networks; and conduct regular risk assessments of their technology.

Using large language models like ChatGPT to write legal documents is controversial and risky right now, especially since the systems tend to fabricate information and hallucinate. In June, attorneys Steven Schwartz and Peter LoDuca representing Levidow, Levidow & Oberman, a law firm based in New York, came under fire for citing fake legal cases made up by ChatGPT in a lawsuit.

They were suing the Colombian airline Avianca on behalf of a passenger who was injured aboard a 2019 flight, and prompted ChatGPT to recall similar cases to cite, which it did, but it also just straight up imagined some. At the time Schwartz and LoDuca blamed their mistake on not understanding the chatbot's limitations, and claimed they didn't know it could hallucinate information.

Judge Kevin Castel of the US District Court for the Southern District of New York realized the cases were bogus when lawyers from the opposing side failed to find the cited court documents, and asked Schwartz and LoDuca to cite their sources. Castel fined them both $5,000 and dismissed the lawsuit altogether.

"The lesson here is that you can't delegate to a machine the things for which a lawyer is responsible," Stephen Wu, shareholder in Silicon Valley Law Group and chair of the American Bar Association's Artificial Intelligence and Robotics National Institute, previously told The Register.

Rosário, however, believes the technology can be used effectively. "I am convinced that ... humanity will experience a new technological revolution. All the tools we have developed as a civilization can be used for evil and good. That's why we have to show how it can be used for good," he said. ®

PS: Amazon announced its Q chatbot at re:Invent this week, a digital assistant for editing code, using AWS resources, and more. It's available in preview, and as it's an LLM system, we imagined it would make stuff up and get things wrong. And we were right: internal documents leaked to Platformer describe the neural network "experiencing severe hallucinations and leaking confidential data."
 