bnew

Veteran
Joined
Nov 1, 2015
Messages
40,370
Reputation
6,972
Daps
127,170

Orca 2​

Orca 2 is a helpful assistant that is built for research purposes only and provides a single turn response in tasks such as reasoning over user given data, reading comprehension, math problem solving and text summarization. The model is designed to excel particularly in reasoning.

We open-source Orca 2 to encourage further research on the development, evaluation, and alignment of smaller LMs.

What is Orca 2’s intended use(s)?​

  • Orca 2 is built for research purposes only.
  • The main purpose is to allow the research community to assess its abilities and to provide a foundation for building better frontier models.

How was Orca 2 evaluated?​

  • Orca 2 has been evaluated on a large number of tasks ranging from reasoning to grounding and safety. Please refer to Section 6 and Appendix in the Orca 2 paper for details on evaluations.

Model Details​

Orca 2 is a finetuned version of LLAMA-2. Orca 2’s training data is a synthetic dataset that was created to enhance the small model’s reasoning abilities. All synthetic training data was moderated using the Microsoft Azure content filters. More details about the model can be found in the Orca 2 paper.

Please refer to LLaMA-2 technical report for details on the model architecture.
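For context, a minimal sketch of how the ChatML-style single-turn prompt described on the Hugging Face model card can be assembled before being tokenized for the `microsoft/Orca-2-7b` or `-13b` checkpoints; the system and user messages here are illustrative only:

```python
def build_orca2_prompt(system_message: str, user_message: str) -> str:
    """Build the ChatML-style single-turn prompt Orca 2 was trained on
    (format as described on the Hugging Face model card)."""
    return (
        f"<|im_start|>system\n{system_message}<|im_end|>\n"
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        f"<|im_start|>assistant"
    )

prompt = build_orca2_prompt(
    "You are Orca, an AI language model created by Microsoft.",
    "How many ways can you arrange the letters in the word CAT?",
)
```

The resulting string is what you would tokenize and pass to `model.generate` when loading the checkpoint with `transformers`.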

License​

Orca 2 is licensed under the Microsoft Research License.

Llama 2 is licensed under the LLAMA 2 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved.

Bias, Risks, and Limitations​

Orca 2, built upon the LLaMA 2 model family, retains many of its limitations, as well as the common limitations of other large language models and limitations caused by its training process, including:

Data Biases: Large language models, trained on extensive data, can inadvertently carry biases present in the source data. Consequently, the models may generate outputs that could be potentially biased or unfair.

Lack of Contextual Understanding: Despite their impressive capabilities in language understanding and generation, these models exhibit limited real-world understanding, resulting in potential inaccuracies or nonsensical responses.

Lack of Transparency: Due to the complexity and size, large language models can act as “black boxes”, making it difficult to comprehend the rationale behind specific outputs or decisions. We recommend reviewing transparency notes from Azure for more information.

Content Harms: There are various types of content harms that large language models can cause. It is important to be aware of them when using these models, and to take actions to prevent them. It is recommended to leverage various content moderation services provided by different companies and institutions. On an important note, we hope for better regulations and standards from government and technology leaders around content harms for AI technologies in future. We value and acknowledge the important role that research and open source community can play in this direction.

Hallucination: It is important to be aware of and cautious about relying entirely on a language model for critical decisions or information with deep impact, as it is not obvious how to prevent these models from fabricating content. Moreover, it is not clear whether smaller models are more susceptible to hallucination in ungrounded generation use cases because of their reduced memorization capacity. This is an active research topic, and we hope there will be more rigorous measurement, understanding, and mitigation around it.

Potential for Misuse: Without suitable safeguards, there is a risk that these models could be maliciously used for generating disinformation or harmful content.

Data Distribution: Orca 2’s performance is likely to correlate strongly with the distribution of the tuning data. This correlation might limit its accuracy in areas underrepresented in the training dataset such as math, coding, and reasoning.

System messages: Orca 2 demonstrates variance in performance depending on the system instructions. Additionally, the stochasticity introduced by the model size may lead to generation of non-deterministic responses to different system instructions.

Zero-Shot Settings: Orca 2 was trained on data that mostly simulates zero-shot settings. While the model demonstrates very strong performance in zero-shot settings, it does not show the same gains from few-shot learning as other, especially larger, models.

Synthetic data: As Orca 2 is trained on synthetic data, it could inherit both the advantages and shortcomings of the models and methods used for data generation. We posit that Orca 2 benefits from the safety measures incorporated during training and safety guardrails (e.g., content filter) within the Azure OpenAI API. However, detailed studies are required for better quantification of such risks.

This model is solely designed for research settings, and its testing has only been carried out in such environments. It should not be used in downstream applications, as additional analysis is needed to assess potential harm or bias in the proposed application.








ML News of the day🚨



Is it hard to keep up with all the amazing things happening in the ML ecosystem? Here is a quick summary of yesterday's top 5 ML releases - from small models to video generation!🚀



1. Microsoft releases Orca 2

Orca 2's goal is to explore the capabilities of small LLMs. It is a Llama 2 fine-tune trained on high-quality synthetic data covering different reasoning techniques.



Why is this interesting? Recent research has shown impressive small-model capabilities, often comparable with models that are 5-10x larger.



Paper: arxiv.org/abs/2311.11045

Model: huggingface.co/microsoft/Orc…



2. SEINE - Video Diffusion Model

SEINE allows short-to-long video generation as well as image-to-video. Its focus is on high-quality long videos that keep consistency and have smooth transitions. For example, you can give an initial and a final image, and it will generate a smooth video out of it.
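SEINE casts transition generation as masked video diffusion: the frames you supply are kept fixed while the masked ones are denoised. A rough sketch of that conditioning mask (simplified; the real model operates on latent tensors, not a list of flags):

```python
def transition_mask(num_frames, given):
    """1 = frame is provided as a condition (kept fixed),
    0 = frame to be generated by the diffusion model.
    SEINE-style transition generation conditions on a few known frames
    (e.g. the first and last) and denoises the rest."""
    return [1 if i in given else 0 for i in range(num_frames)]

# Transition between an initial and a final image over 16 frames:
mask = transition_mask(16, given={0, 15})
```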



Paper: arxiv.org/abs/2310.20700

Repo: github.com/Vchitect/SEINE



3. System 2 Attention (S2A)

Soft attention can assign a probability to irrelevant parts of a context. S2A regenerates the context so irrelevant parts are removed. Using S2A contexts produces more factual and objective responses.
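The mechanism can be sketched as two LLM calls; `llm` below is a stand-in for any text-completion function, and the prompts are illustrative rather than the paper's exact wording:

```python
def system2_attention(llm, context, question):
    """Two-pass querying in the spirit of System 2 Attention (S2A):
    1) ask the model to regenerate the context, dropping irrelevant or
       opinionated material; 2) answer using only the cleaned context."""
    clean_context = llm(
        "Rewrite the following text, keeping only the parts relevant to "
        "the question and removing irrelevant details.\n"
        f"Question: {question}\nText: {context}"
    )
    return llm(f"Context: {clean_context}\nQuestion: {question}\nAnswer:")

# Stub LLM to show the control flow:
calls = []
def stub_llm(prompt):
    calls.append(prompt)
    return "cleaned" if "Rewrite" in prompt else "final answer"

answer = system2_attention(stub_llm, "lots of noisy text...", "Who wrote X?")
```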



Paper: arxiv.org/abs/2311.11829



4. Nous-Yarn-Llama

This model extends Llama 2 70B by further training on long-context data using the YaRN extension method, increasing Llama 2's context window from 4k tokens to 32k.
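YaRN itself uses a per-frequency "NTK-by-parts" interpolation plus an attention-temperature adjustment; the simpler position-interpolation idea it builds on can be sketched as:

```python
def rope_inv_freqs(dim, base=10000.0):
    """Standard RoPE inverse frequencies for a head dimension `dim`."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]

def interpolated_angle(pos, inv_freq, scale):
    """Position interpolation: squeeze positions by `scale` so that, e.g.,
    a 32k-token position maps back into the 4k range seen in pretraining.
    (YaRN refines this by scaling low and high frequencies differently.)"""
    return (pos / scale) * inv_freq

scale = 32768 / 4096  # the 8x extension behind a 4k -> 32k context window
angle = interpolated_angle(8192, rope_inv_freqs(128)[0], scale)
```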



Model: huggingface.co/NousResearch/…



5. Video LlaVA

Video LLaVA is a robust large vision-language baseline model with a mixed dataset of images and videos. This model can answer questions about input videos, images, or both at the same time (e.g. does the flag in the image appear in the video?)



Paper: arxiv.org/abs/2311.10122

Demo: huggingface.co/spaces/Langua…



New resources



Understanding training loss patterns by @stasbekman nitter.unixfox.eu/StasBekman/statu…

Learn a lot about fine-tuning LLMs and LoRAs by @rasbt nitter.unixfox.eu/rasbt/status/172…

Does sketching work? Learn about this tool to reduce matrix dimensions. by @ethanepperly huggingface.co/blog/ethanepp…
 


NOVEMBER 20, 2023



Synthetic imagery sets new bar in AI training efficiency​

by Rachel Gordon, Massachusetts Institute of Technology


An MIT team studies the potential of learning visual representations using synthetic images generated by text-to-image models. They are the first to show that, in large-scale settings, models trained solely with synthetic images outperform counterparts trained with real images. Credit: Alex Shipps/MIT CSAIL via the Midjourney AI image generator

Data is the new soil, and in this fertile new ground, MIT researchers are planting more than just pixels. By using synthetic images to train machine learning models, a team of scientists recently surpassed results obtained from traditional "real-image" training methods.

At the core of the approach is a system called StableRep, which doesn't just use any synthetic images; it generates them through ultra-popular text-to-image models like Stable Diffusion. It's like creating worlds with words.

So what's in StableRep's secret sauce? A strategy called "multi-positive contrastive learning."

"We're teaching the model to learn more about high-level concepts through context and variance, not just feeding it data," says Lijie Fan, MIT Ph.D. student in electrical engineering, affiliate of the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL), lead researcher on the work currently posted to the arXiv preprint server.

"When multiple images, all generated from the same text, are treated as depictions of the same underlying thing, the model dives deeper into the concepts behind the images, say the object, not just their pixels."

This approach considers multiple images spawned from identical text prompts as positive pairs, providing additional information during training, not just adding more diversity but specifying to the vision system which images are alike and which are different. Remarkably, StableRep outshone the prowess of top-tier models trained on real images, such as SimCLR and CLIP, in extensive datasets.
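Roughly, the loss treats all images generated from one caption as interchangeable positives for an anchor. A dependency-free sketch of that multi-positive objective for a single anchor (simplified from the paper's batched-embedding formulation):

```python
import math

def multi_positive_contrastive_loss(sim_row, positives, temperature=0.1):
    """Cross-entropy between the softmax over one anchor's similarity row
    and a uniform target distribution over its positives (images generated
    from the same caption), in the spirit of StableRep's objective."""
    logits = [s / temperature for s in sim_row]
    m = max(logits)                      # stabilize the softmax
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    log_probs = [math.log(e / z) for e in exps]
    target = 1.0 / len(positives)
    return -sum(target * log_probs[j] for j in positives)

# Anchor is most similar to indices 1 and 2, its two same-caption positives:
loss = multi_positive_contrastive_loss([0.2, 0.9, 0.8, 0.1], positives=[1, 2])
```

Note that when the positives carry the highest similarities, the loss is lower than when they do not, which is what pushes same-caption images together.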

"While StableRep helps mitigate the challenges of data acquisition in machine learning, it also ushers in a stride towards a new era of AI training techniques. The capacity to produce high-caliber, diverse synthetic images on command could help curtail cumbersome expenses and resources," says Fan.

The process of data collection has never been straightforward. In the 1990s, researchers had to manually capture photographs to assemble datasets for objects and faces. The 2000s saw individuals scouring the internet for data. However, this raw, uncurated data often contained discrepancies when compared to real-world scenarios and reflected societal biases, presenting a distorted view of reality.

The task of cleansing datasets through human intervention is not only expensive, but also exceedingly challenging. Imagine, though, if this arduous data collection could be distilled down to something as simple as issuing a command in natural language.

A pivotal aspect of StableRep's triumph is the adjustment of the "guidance scale" in the generative model, which ensures a delicate balance between the synthetic images' diversity and fidelity. When finely tuned, synthetic images used in training these self-supervised models were found to be as effective, if not more so, than real images.
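The "guidance scale" here is the classifier-free guidance weight (exposed, for example, as the `guidance_scale` parameter in diffusers pipelines): at each denoising step it blends the unconditional and text-conditional noise predictions, trading diversity for prompt fidelity. A toy sketch on plain lists:

```python
def cfg_combine(eps_uncond, eps_text, guidance_scale):
    """Classifier-free guidance: push the unconditional noise prediction
    toward the text-conditional one. Larger scales yield images that
    follow the prompt more closely but are less diverse, which is the
    balance StableRep tunes."""
    return [u + guidance_scale * (t - u) for u, t in zip(eps_uncond, eps_text)]

eps = cfg_combine([0.0, 1.0], [1.0, 1.0], guidance_scale=2.0)
```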

Taking it a step forward, language supervision was added to the mix, creating an enhanced variant: StableRep+. When trained with 20 million synthetic images, StableRep+ not only achieved superior accuracy but also displayed remarkable efficiency compared to CLIP models trained with a staggering 50 million real images.

Yet, the path ahead isn't without its potholes. The researchers candidly address several limitations, including the current slow pace of image generation, semantic mismatches between text prompts and the resultant images, potential amplification of biases, and complexities in image attribution, all of which are imperative to address for future advancements.

Another issue is that StableRep requires first training the generative model on large-scale real data. The team acknowledges that starting with real data remains a necessity; however, when you have a good generative model, you can repurpose it for new tasks, like training recognition models and visual representations.


While StableRep offers a good solution by diminishing the dependency on vast real-image collections, it brings to the fore concerns regarding hidden biases within the uncurated data used for these text-to-image models. The choice of text prompts, integral to the image synthesis process, is not entirely free from bias, "indicating the essential role of meticulous text selection or possible human curation," says Fan.

"Using the latest text-to-image models, we've gained unprecedented control over image generation, allowing for a diverse range of visuals from a single text input. This surpasses real-world image collection in efficiency and versatility. It proves especially useful in specialized tasks, like balancing image variety in long-tail recognition, presenting a practical supplement to using real images for training," says Fan.

"Our work signifies a step forward in visual learning, towards the goal of offering cost-effective training alternatives while highlighting the need for ongoing improvements in data quality and synthesis."

"One dream of generative model learning has long been to be able to generate data useful for discriminative model training," says Google DeepMind researcher and University of Toronto professor of computer science David Fleet, who was not involved in the paper.

"While we have seen some signs of life, the dream has been elusive, especially on large-scale complex domains like high-resolution images. This paper provides compelling evidence, for the first time to my knowledge, that the dream is becoming a reality. They show that contrastive learning from massive amounts of synthetic image data can produce representations that outperform those learned from real data at scale, with the potential to improve myriad downstream vision tasks."

More information: Yonglong Tian et al, StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners, arXiv (2023). DOI: 10.48550/arxiv.2306.00984

Journal information: arXiv

Provided by Massachusetts Institute of Technology



Computer Science > Computer Vision and Pattern Recognition​

[Submitted on 1 Jun 2023 (v1), last revised 26 Oct 2023 (this version, v2)]

StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners​

Yonglong Tian, Lijie Fan, Phillip Isola, Huiwen Chang, Dilip Krishnan
We investigate the potential of learning visual representations using synthetic images generated by text-to-image models. This is a natural question in the light of the excellent performance of such models in generating high-quality images. We consider specifically the Stable Diffusion, one of the leading open source text-to-image models. We show that (1) when the generative model is configured with proper classifier-free guidance scale, training self-supervised methods on synthetic images can match or beat the real image counterpart; (2) by treating the multiple images generated from the same text prompt as positives for each other, we develop a multi-positive contrastive learning method, which we call StableRep. With solely synthetic images, the representations learned by StableRep surpass the performance of representations learned by SimCLR and CLIP using the same set of text prompts and corresponding real images, on large scale datasets. When we further add language supervision, StableRep trained with 20M synthetic images achieves better accuracy than CLIP trained with 50M real images.
Comments: code is available at: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2306.00984 [cs.CV] (or arXiv:2306.00984v2 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2306.00984

Submission history​

From: Yonglong Tian [view email]
[v1] Thu, 1 Jun 2023 17:59:51 UTC (5,106 KB)
[v2] Thu, 26 Oct 2023 15:16:57 UTC (5,109 KB)
 


NOVEMBER 20, 2023 REPORT



Researchers seek consensus on what constitutes Artificial General Intelligence​

by Peter Grad, Tech Xplore

Credit: Pavel Danilyuk from Pexels

A team of researchers at DeepMind focusing on the next frontier of artificial intelligence—Artificial General Intelligence (AGI)—realized they needed to resolve one key issue first. What exactly, they asked, is AGI?

It is often viewed in general as a type of artificial intelligence that possesses the ability to understand, learn and apply knowledge across a broad range of tasks, operating like the human brain. Wikipedia broadens the scope by suggesting AGI is "a hypothetical type of intelligent agent [that] could learn to accomplish any intellectual task that human beings or animals can perform."

OpenAI's charter describes AGI as a set of "highly autonomous systems that outperform humans at most economically valuable work."

AI expert and founder of Geometric Intelligence Gary Marcus defined it as "any intelligence that is flexible and general, with resourcefulness and reliability comparable to (or beyond) human intelligence."

With so many variations in definitions, the DeepMind team embraced a simple notion voiced centuries ago by Voltaire: "If you wish to converse with me, define your terms."

In a paper published on the preprint server arXiv, the researchers outlined what they termed "a framework for classifying the capabilities and behavior of AGI models."

In doing so, they hope to establish a common language for researchers as they measure progress, compare approaches and assess risks.

"Achieving human-level 'intelligence' is an implicit or explicit north-star goal for many in our field," said Shane Legg, who introduced the term AGI 20 years ago.

In an interview with MIT Review, Legg explained, "I see so many discussions where people seem to be using the term to mean different things, and that leads to all sorts of confusion. Now that AGI is becoming such an important topic we need to sharpen up what we mean."

In the arXiv paper, titled "Levels of AGI: Operationalizing Progress on the Path to AGI," the team summarized several principles required of an AGI model. They include a focus on the capabilities of a system, not the process.

"Achieving AGI does not imply that systems 'think' or 'understand' [or] possess qualities such as consciousness or sentience," the team emphasized.

An AGI system must also have the ability to learn new tasks, and know when to seek clarification or assistance from humans for a task.

Another parameter is a focus on potential, and not necessarily actual deployment of a program. "Requiring deployment as a condition of measuring AGI introduces non-technical hurdles such as legal and social considerations, as well as potential ethical and safety concerns," the researchers explained.

The team then compiled a list of intelligence thresholds ranging from "Level 0, No AGI," to "Level 5, Superhuman." Levels 1–4 included "Emerging," "Competent," "Expert" and "Virtuoso" levels of achievement.

Three programs met the threshold of the label AGI. But those three, generative text models (ChatGPT, Bard and Llama 2), reached only "Level 1, Emerging." No other current AI programs met the criteria for AGI.

Other programs classified included SHRDLU, an early natural language understanding program developed at MIT, at "Level 1, Emerging AI."

At "Level 2, Competent" are Siri, Alexa and Google Assistant. The grammar checker Grammarly ranks at "Level 3, Expert AI."

Higher up this list, at "Level 4, Virtuoso," are Deep Blue and AlphaGo. Topping the list, "Level 5, Superhuman," are DeepMind's AlphaFold, which predicts a protein's 3D structure from its amino acid sequence; and Stockfish, a powerful open-source chess program.
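The taxonomy as reported above can be captured in a small lookup table; the example systems follow the article's assignments (which mix the paper's separate "narrow" and "general" tracks):

```python
# Levels of AGI per the DeepMind paper, with the example systems
# the article assigns to each level.
AGI_LEVELS = {
    0: ("No AGI", []),
    1: ("Emerging", ["ChatGPT", "Bard", "Llama 2", "SHRDLU"]),
    2: ("Competent", ["Siri", "Alexa", "Google Assistant"]),
    3: ("Expert", ["Grammarly"]),
    4: ("Virtuoso", ["Deep Blue", "AlphaGo"]),
    5: ("Superhuman", ["AlphaFold", "Stockfish"]),
}
```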

However, there is no single proposed definition for AGI, and there is constant change.

"As we gain more insights into these underlying processes, it may be important to revisit our definition of AGI," says Meredith Ringel Morris, Google DeepMind's principal scientist for human and AI interaction.

"It is impossible to enumerate the full set of tasks achievable by a sufficiently general intelligence," the researchers said. "As such, an AGI benchmark should be a living benchmark. Such a benchmark should therefore include a framework for generating and agreeing upon new tasks."

More information: Meredith Ringel Morris et al, Levels of AGI: Operationalizing Progress on the Path to AGI, arXiv (2023). DOI: 10.48550/arxiv.2311.02462

Journal information: arXiv




Computer Science > Artificial Intelligence​

[Submitted on 4 Nov 2023]


Levels of AGI: Operationalizing Progress on the Path to AGI​

Meredith Ringel Morris, Jascha Sohl-Dickstein, Noah Fiedel, Tris Warkentin, Allan Dafoe, Aleksandra Faust, Clement Farabet, Shane Legg

We propose a framework for classifying the capabilities and behavior of Artificial General Intelligence (AGI) models and their precursors. This framework introduces levels of AGI performance, generality, and autonomy. It is our hope that this framework will be useful in an analogous way to the levels of autonomous driving, by providing a common language to compare models, assess risks, and measure progress along the path to AGI. To develop our framework, we analyze existing definitions of AGI, and distill six principles that a useful ontology for AGI should satisfy. These principles include focusing on capabilities rather than mechanisms; separately evaluating generality and performance; and defining stages along the path toward AGI, rather than focusing on the endpoint. With these principles in mind, we propose 'Levels of AGI' based on depth (performance) and breadth (generality) of capabilities, and reflect on how current systems fit into this ontology. We discuss the challenging requirements for future benchmarks that quantify the behavior and capabilities of AGI models against these levels. Finally, we discuss how these levels of AGI interact with deployment considerations such as autonomy and risk, and emphasize the importance of carefully selecting Human-AI Interaction paradigms for responsible and safe deployment of highly capable AI systems.

Subjects: Artificial Intelligence (cs.AI)
Cite as: arXiv:2311.02462 [cs.AI] (or arXiv:2311.02462v1 [cs.AI] for this version)
https://doi.org/10.48550/arXiv.2311.02462



Submission history​

From: Meredith Morris [view email]

[v1] Sat, 4 Nov 2023 17:44:58 UTC (423 KB)
 







RoboVQA: Multimodal Long-Horizon Reasoning
for Robotics

  • Pierre Sermanet, Tianli Ding, Jeffrey Zhao, Fei Xia, Debidatta Dwibedi, Keerthana Gopalakrishnan, Christine Chan, Gabriel Dulac-Arnold, Sharath Maddineni, Nikhil J Joshi, Pete Florence, Wei Han, Robert Baruch, Yao Lu, Suvir Mirchandani, Peng Xu, Pannag Sanketi, Karol Hausman, Izhak Shafran, Brian Ichter, Yuan Cao

Scaling up RoboVQA with Google Meet​

Video conference tools like Google Meet provide powerful infrastructure for realtime video streaming, multi-participant support and high-quality speech transcription. This helps scale up data collection and deployment anywhere, with any embodiment, while removing the requirement for a mouse-and-keyboard interface.

 


About​

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

arxiv.org/pdf/2311.10122.pdf


If you like our project, please give us a star ⭐ on GitHub for the latest updates.



💡 I also have other video-language projects that may interest you ✨.


LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, HongFa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, Wancai Zhang, Zhifeng Li, Wei Liu, Li Yuan

📰 News​

  • [2023.11.20] 🤗Demo and code are available now! Welcome to watch 👀 this repository for the latest updates.

😮 Highlights​

Video-LLaVA exhibits remarkable interactive capabilities between images and videos, despite the absence of image-video pairs in the dataset.

💡 Simple baseline, learning united visual representation by alignment before projection​

  • With the binding of unified visual representations to the language feature space, we enable an LLM to perform visual reasoning over both images and videos simultaneously.
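The design point, as a toy sketch: because image and video features are aligned into one visual space before projection (via LanguageBind-style encoders), a single projector can map both modalities into the LLM's input space. The 2x2 weights below are illustrative only:

```python
def project(features, weights):
    """One shared linear projection into the LLM's token-embedding space.
    Since image and video features were already aligned into a unified
    visual space before this step, the same projector serves both."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]

shared_w = [[1.0, 0.0], [0.0, 1.0]]   # toy 2x2 projector (identity)
image_tok = project([0.3, 0.7], shared_w)
video_tok = project([0.5, 0.5], shared_w)
```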

🔥 High performance, complementary learning with video and image​

  • Extensive experiments demonstrate the complementarity of modalities, showcasing significant superiority when compared to models specifically designed for either images or videos.
 