bnew

Veteran
Joined
Nov 1, 2015
Messages
66,029
Reputation
10,206
Daps
179,062

1/1
[CV] Understanding Alignment in Multimodal LLMs: A Comprehensive Study
[2407.02477] Understanding Alignment in Multimodal LLMs: A Comprehensive Study
- This paper examines alignment strategies for Multimodal Large Language Models (MLLMs) to reduce hallucinations and improve visual grounding. It categorizes alignment methods into offline (e.g., DPO) and online (e.g., Online-DPO); a minimal DPO loss sketch follows the bullets below.

- The paper reviews recently published multimodal preference datasets like POVID, RLHF-V, VLFeedback and analyzes their components: prompts, chosen responses, rejected responses.

- It introduces a new preference data sampling method called Bias-Driven Hallucination Sampling (BDHS) which restricts image access to induce language model bias and trigger hallucinations.

- Experiments align the LLaVA 1.6 model and compare offline, online and mixed DPO strategies. Results show combining offline and online can yield benefits.

- The proposed BDHS method achieves strong performance without external annotators or preference data, just using self-supervised data.
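
For readers who want to see what the offline preference objective looks like in practice, here is a minimal sketch of a standard DPO loss over (chosen, rejected) response pairs. The tensor names and the beta default are illustrative assumptions, not the paper's training code.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective on a batch of preference pairs.

    Each argument is a (B,)-shaped tensor of summed per-response log-probabilities
    under the policy or the frozen reference model. `beta` scales the implicit
    reward; the names and default value here are illustrative.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the chosen response's implicit reward above the rejected one's.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```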


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

🚨SIGGRAPH 2024 Paper Alert 🚨

➡️Paper Title: CharacterGen: Efficient 3D Character Generation from Single Images with Multi-View Pose Canonicalization

🌟Few pointers from the paper

🎯In this paper, the authors present “CharacterGen”, a framework developed to efficiently generate 3D characters. CharacterGen introduces a streamlined generation pipeline along with an image-conditioned multi-view diffusion model.

🎯This model effectively calibrates input poses to a canonical form while retaining key attributes of the input image, thereby addressing the challenges posed by diverse poses. A transformer-based, generalizable sparse-view reconstruction model is the other core component of their approach, facilitating the creation of detailed 3D models from multi-view images.

🎯They also adopted a texture-back-projection strategy to produce high-quality texture maps (a minimal back-projection sketch follows these bullets). Additionally, they curated a dataset of anime characters, rendered in multiple poses and views, to train and evaluate their model.

🎯 Their approach has been thoroughly evaluated through quantitative and qualitative experiments, showing its proficiency in generating 3D characters with high-quality shapes and textures, ready for downstream applications such as rigging and animation.
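
As a rough illustration of the texture-back-projection idea (not CharacterGen's actual implementation), the sketch below scatters visible pixel colors from a single rendered view into a UV texture map; multi-view blending and seam handling are omitted, and the function and argument names are assumptions.

```python
import torch

def backproject_texture(view_image, uv_map, visibility_mask, tex_res=1024):
    """Scatter colors from one view into a UV texture at the visible surface points.

    view_image: (H, W, 3) colors; uv_map: (H, W, 2) UV coordinates in [0, 1];
    visibility_mask: (H, W) bool. Returns a (tex_res, tex_res, 3) texture.
    """
    texture = torch.zeros(tex_res, tex_res, 3)
    uv = (uv_map[visibility_mask] * (tex_res - 1)).long()      # (M, 2) texel indices
    texture[uv[:, 1], uv[:, 0]] = view_image[visibility_mask]  # last write wins per texel
    return texture
```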

🏢Organization: @Tsinghua_Uni , @VastAIResearch

🧙Paper Authors: Hao-Yang Peng, Jia-Peng Zhang, @MengHaoGuo1 , @yanpei_cao , Shi-Min Hu

1️⃣Read the Full Paper here: [2402.17214] CharacterGen: Efficient 3D Character Generation from Single Images with Multi-View Pose Canonicalization

2️⃣Project Page: CharacterGen: Efficient 3D Character Generation from Single Images

3️⃣Code: GitHub - zjp-shadow/CharacterGen: [SIGGRAPH'24] CharacterGen: Efficient 3D Character Generation from Single Images with Multi-View Pose Canonicalization

🎥 Be sure to watch the attached Video-Sound on 🔊🔊

Music by Grand_Project from @pixabay

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.


 

🚨Paper Alert 🚨

➡️Paper Title: PointOdyssey: A Large-Scale Synthetic Dataset for Long-Term Point Tracking

🌟Few pointers from the paper

🎯In this paper, the authors introduce “PointOdyssey”, a large-scale synthetic dataset and data-generation framework for the training and evaluation of long-term, fine-grained tracking algorithms.

🎯Their goal was to advance the state of the art by placing emphasis on long videos with naturalistic motion. Toward that goal, they animated deformable characters using real-world motion capture data, built 3D scenes to match the motion-capture environments, and rendered camera viewpoints using trajectories mined via structure-from-motion on real videos.

🎯They created combinatorial diversity by randomizing character appearance, motion profiles, materials, lighting, 3D assets, and atmospheric effects. Their dataset currently includes 104 videos, averaging 2,000 frames long, with orders of magnitude more correspondence annotations than prior work.

🎯They showed that existing methods can be trained from scratch on their dataset and outperform their published variants. Finally, they also introduced modifications to the PIPs point tracking method, greatly widening its temporal receptive field, which improves its performance on PointOdyssey as well as on two real-world benchmarks (a sketch of a standard tracking-accuracy metric follows these bullets).
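
To make the evaluation side concrete, here is a minimal sketch of the threshold-based position-accuracy metric commonly reported for long-term point tracking; the shapes, names, and pixel thresholds are illustrative and may differ from PointOdyssey's exact protocol.

```python
import numpy as np

def position_accuracy(pred_tracks, gt_tracks, visibility, thresholds=(1, 2, 4, 8, 16)):
    """Fraction of visible points tracked within each pixel threshold, averaged
    over thresholds (a delta_avg-style metric).

    pred_tracks, gt_tracks: (T, N, 2) arrays of x/y positions in pixels.
    visibility: (T, N) bool array, True where the ground-truth point is visible.
    """
    err = np.linalg.norm(pred_tracks - gt_tracks, axis=-1)   # (T, N) pixel errors
    accs = [(err[visibility] < t).mean() for t in thresholds]
    return float(np.mean(accs))
```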

🏢Organization: @Stanford

🧙Paper Authors: @yang_zheng18 , @AdamWHarley , @willbokuishen , @GordonWetzstein , Leonidas J. Guibas

1️⃣Read the Full Paper here: [2307.15055] PointOdyssey: A Large-Scale Synthetic Dataset for Long-Term Point Tracking

2️⃣Project Page: PointOdyssey: A Large-Scale Synthetic Dataset for Long-Term Point Tracking

3️⃣Simulator: GitHub - y-zheng18/point_odyssey: Official code for PointOdyssey: A Large-Scale Synthetic Dataset for Long-Term Point Tracking (ICCV 2023)

4️⃣Model: GitHub - aharley/pips2: PIPs++

🎥 Be sure to watch the attached Demo Video-Sound on 🔊🔊

🎵 Music by Breakz Studios from @pixabay

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.


 

🚨ECCV 2024 Paper Alert 🚨

➡️Paper Title: LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning

🌟Few pointers from the paper

🎯In this paper, the authors introduce a novel problem -- egocentric action frame generation. The goal is to synthesize an image depicting an action in the user's context (i.e., an action frame) by conditioning on a user prompt and an input egocentric image.

🎯Notably, existing egocentric action datasets lack the detailed annotations that describe the execution of actions. Additionally, existing diffusion-based image manipulation models are sub-optimal in controlling the state transition of an action in egocentric image pixel space because of the domain gap.

🎯To this end, they proposed to Learn EGOcentric (LEGO) action frame generation via visual instruction tuning. First, they introduced a prompt enhancement scheme to generate enriched action descriptions from a visual large language model (VLLM) by visual instruction tuning.

🎯Then they proposed a novel method to leverage image and text embeddings from the VLLM as additional conditioning to improve the performance of a diffusion model (a minimal conditioning sketch follows these bullets). They validated their model on two egocentric datasets -- Ego4D and Epic-Kitchens.

🎯 Their experiments show substantial improvement over prior image manipulation models in both quantitative and qualitative evaluation. They also conducted detailed ablation studies and analysis to provide insights into their method.
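
A minimal sketch of the conditioning idea mentioned above: the VLLM's image and text embeddings are projected to the denoiser's cross-attention width and appended to the usual text-token context. The dimensions, module names, and concatenation scheme are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class VLLMConditioning(nn.Module):
    """Project VLLM image/text embeddings and append them to the cross-attention context."""

    def __init__(self, vllm_dim=4096, ctx_dim=768):
        super().__init__()
        self.img_proj = nn.Linear(vllm_dim, ctx_dim)
        self.txt_proj = nn.Linear(vllm_dim, ctx_dim)

    def forward(self, text_tokens, vllm_img_emb, vllm_txt_emb):
        # text_tokens: (B, L, ctx_dim); vllm_*_emb: (B, vllm_dim)
        extra = torch.stack([self.img_proj(vllm_img_emb),
                             self.txt_proj(vllm_txt_emb)], dim=1)  # (B, 2, ctx_dim)
        return torch.cat([text_tokens, extra], dim=1)              # enlarged context
```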

🏢Organization: GenAI,@Meta , @GeorgiaTech , @UofIllinois

🧙Paper Authors: @bryanislucky , Xiaoliang Dai, Lawrence Chen, Guan Pang, @RehgJim ,@aptx4869ml

1️⃣Read the Full Paper here:[2312.03849] LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning

2️⃣Project Page: LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning

3️⃣Code: GitHub - BolinLai/LEGO: This is the official code of LEGO paper.

🎥 Be sure to watch the attached Demo Video -Sound on 🔊🔊

🎵 Music by AlexGrohl from @pixabay

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.

#ECCV2024


 

🚨Paper Alert 🚨

➡️Paper Title: PWM: Policy Learning with Large World Models

🌟Few pointers from the paper

🎯Reinforcement Learning (RL) has achieved impressive results on complex tasks but struggles in multi-task settings with different embodiments. World models offer scalability by learning a simulation of the environment, yet they often rely on inefficient gradient-free optimization methods.

🎯In this paper, the authors introduce “Policy learning with large World Models (PWM)”, a novel model-based RL algorithm that learns continuous control policies from large multi-task world models.

🎯By pre-training the world model on offline data and using it for first-order gradient policy learning (sketched after these bullets), PWM effectively solves tasks with up to 152 action dimensions and outperforms methods that use ground-truth dynamics.

🎯 Additionally, PWM scales to an 80-task setting, achieving up to 27% higher rewards than existing baselines without the need for expensive online planning.
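
The essence of first-order gradient policy learning can be sketched in a few lines: roll the policy out inside a differentiable learned world model and backpropagate predicted reward directly into the policy parameters. The `policy`, `world_model`, and `reward_model` interfaces below are hypothetical stand-ins, not PWM's actual API.

```python
def first_order_policy_update(policy, world_model, reward_model, obs, optimizer, horizon=16):
    """One policy update by differentiating through an imagined model rollout.

    `world_model(state, act)` and `reward_model(state, act)` are assumed to be
    learned, differentiable modules; `obs` is a batch of (latent) start states.
    """
    state, total_reward = obs, 0.0
    for _ in range(horizon):
        act = policy(state)                        # differentiable action
        total_reward = total_reward + reward_model(state, act)
        state = world_model(state, act)            # predicted next state
    loss = -total_reward.mean()                    # maximize imagined return
    optimizer.zero_grad()
    loss.backward()                                # first-order gradients, no score-function estimator
    optimizer.step()
    return loss.item()
```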

🏢Organization: @GeorgiaTech , @UCSanDiego , @nvidia

🧙Paper Authors: @imgeorgiev , @VarunGiridhar3 , @ncklashansen , @animesh_garg

1️⃣Read the Full Paper here: [2407.02466] PWM: Policy Learning with Large World Models

2️⃣Project Page: PWM: Policy Learning with Large World Models

3️⃣Code: GitHub - imgeorgiev/PWM: PWM: Policy Learning with Large World Models

🎥 Be sure to watch the attached Video -Sound on 🔊🔊

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.


 

[CV] OccFusion: Rendering Occluded Humans with Generative Diffusion Priors
[2407.00316] OccFusion: Rendering Occluded Humans with Generative Diffusion Priors
- Most human rendering methods assume humans are fully visible, but occlusions are common in real life. This paper presents OccFusion to render occluded humans using 3D Gaussian splatting supervised by 2D diffusion models.

- The method has three stages - Initialization, Optimization, and Refinement.

- In Initialization, complete human masks are generated from partial visibility masks using diffusion models.

- In Optimization, 3D Gaussians are optimized based on observed regions, and pose-conditioned SDS (Score Distillation Sampling; sketched below) is applied in both posed and canonical space to ensure completeness.

- In Refinement, in-context inpainting is used with coarse renderings to refine appearance.

- The method achieves state-of-the-art efficiency and quality on simulated and real occlusions.
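
For readers unfamiliar with SDS, here is a minimal score-distillation step in which a frozen 2D diffusion model supervises a differentiable rendering; OccFusion's pose conditioning and exact weighting are omitted, and the names (`diffusion_unet`, `alphas_cumprod`, `cond_emb`) are assumptions.

```python
import torch

def sds_loss(render, diffusion_unet, cond_emb, alphas_cumprod):
    """Score Distillation Sampling sketch: the gradient (eps_pred - eps) flows
    into whatever parameters produced `render` (here, the 3D Gaussians).

    render: (B, 3, H, W) differentiable rendering; alphas_cumprod: (T,) schedule;
    diffusion_unet(x_t, t, cond) is a frozen noise-prediction model.
    """
    b = render.shape[0]
    t = torch.randint(50, 950, (b,), device=render.device)      # random noise level
    noise = torch.randn_like(render)
    a = alphas_cumprod[t].view(b, 1, 1, 1)
    x_t = a.sqrt() * render + (1 - a).sqrt() * noise             # noised rendering
    with torch.no_grad():
        eps_pred = diffusion_unet(x_t, t, cond_emb)
    grad = (1 - a) * (eps_pred - noise)                          # common SDS weighting
    return (grad.detach() * render).sum()                        # d(loss)/d(render) == grad
```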


 

🚨ECCV 2024 Paper Alert 🚨

➡️Paper Title: Relightable Neural Actor with Intrinsic Decomposition and Pose Control

🌟Few pointers from the paper

🎯Creating a digital human avatar that is relightable, drivable, and photorealistic is a challenging and important problem in vision and graphics. Humans are highly articulated, creating pose-dependent appearance effects like self-shadows and wrinkles, and skin as well as clothing require complex, spatially varying BRDF models.

🎯While recent human relighting approaches can recover plausible material-light decompositions from multi-view video, they do not generalize to novel poses and still suffer from visual artifacts.

🎯To address this, the authors of this paper proposed “Relightable Neural Actor”, the first video-based method for learning a photorealistic neural human model that can be relighted, allows appearance editing, and can be controlled by arbitrary skeletal poses.

🎯 Importantly, for learning their human avatar, they solely require a multi-view recording of the human under a known but static lighting condition. To achieve this, they represented the geometry of the actor with a drivable density field that models pose-dependent clothing deformations and provides a mapping between 3D and UV space, where normal, visibility, and materials are encoded.

🎯 To evaluate their approach in real-world scenarios, the authors collected a new dataset with four actors recorded under different lighting conditions, indoors and outdoors, providing the first benchmark of its kind for human relighting and demonstrating state-of-the-art relighting results for novel human poses.

🏢Organization: @VcaiMpi , Saarland Informatics Campus, @VIACenterSB , @UniFreiburg

🧙Paper Authors: @DiogoLuvizon , @VGolyanik , @AdamKortylewski , @marc_habermann , Christian Theobalt

1️⃣Read the Full Paper here: [2312.11587] Relightable Neural Actor with Intrinsic Decomposition and Pose Control

2️⃣Project Page: Relightable Neural Actor

3️⃣Code: Coming 🔜

🎥 Be sure to watch the attached Video - Sound on 🔊🔊

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.

#ECCV2024


 

🚨Paper Alert 🚨

➡️Paper Title: LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control

🌟Few pointers from the paper

🎯Portrait Animation aims to synthesize a lifelike video from a single source image, using it as an appearance reference, with motion (i.e., facial expressions and head pose) derived from a driving video, audio, text, or generation.

🎯Instead of following mainstream diffusion-based methods, the authors of this paper explored and extended the potential of the implicit-keypoint-based framework, which effectively balances computational efficiency and controllability.

🎯 Building upon this, the authors developed a video-driven portrait animation framework named LivePortrait, with a focus on better generalization, controllability, and efficiency for practical usage.

🎯To enhance the generation quality and generalization ability, they scaled up the training data to about 69 million high-quality frames, adopted a mixed image-video training strategy, upgraded the network architecture, and designed better motion transformation and optimization objectives.

🎯Additionally, they discovered that compact implicit keypoints can effectively represent a kind of blendshapes, and they meticulously proposed a stitching module and two retargeting modules, which utilize a small MLP with negligible computational overhead, to enhance controllability (an illustrative MLP sketch follows these bullets).

🎯Experimental results demonstrate the efficacy of their framework even compared to diffusion-based methods. The generation speed remarkably reaches 12.8ms on an RTX 4090 GPU with PyTorch.
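
To give a feel for how lightweight such a module can be, here is an illustrative retargeting-style MLP that predicts small offsets for the source implicit keypoints from a (source, driving) keypoint pair; the keypoint count, input layout, and architecture are assumptions, not LivePortrait's exact stitching or retargeting design.

```python
import torch
import torch.nn as nn

class RetargetingMLP(nn.Module):
    """Tiny MLP that nudges source keypoints toward a retargeted configuration."""

    def __init__(self, num_kp=21, hidden=256):
        super().__init__()
        in_dim = num_kp * 3 * 2                        # source + driving 3D keypoints
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, num_kp * 3))

    def forward(self, kp_source, kp_driving):
        # kp_source, kp_driving: (B, num_kp, 3)
        b = kp_source.shape[0]
        x = torch.cat([kp_source, kp_driving], dim=1).reshape(b, -1)
        delta = self.net(x).reshape(b, -1, 3)          # per-keypoint offset
        return kp_source + delta
```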

🏢Organization: Kuaishou Technology, @ustc , @FudanUni

🧙Paper Authors: Jianzhu Guo, Dingyun Zhang, Xiaoqiang Liu, Zhizhou Zhong, Yuan Zhang, Pengfei Wan, Di Zhang

1️⃣Read the Full Paper here: [2407.03168] LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control

2️⃣Project Page: Efficient Portrait Animation with Stitching and Retargeting Control

3️⃣Code: GitHub - KwaiVGI/LivePortrait: Make one portrait alive!

🎥 Be sure to watch the attached Demo Video -Sound on 🔊🔊

🎵Music by Dmitrii Kolesnikov from @pixabay

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.


 

🚨ECCV 2024 Paper Alert 🚨

➡️Paper Title: Fast View Synthesis of Casual Videos

🌟Few pointers from the paper

🎯Novel view synthesis from an in-the-wild video is difficult due to challenges like scene dynamics and lack of parallax. While existing methods have shown promising results with implicit neural radiance fields, they are slow to train and render.

🎯This paper revisits explicit video representations to synthesize high-quality novel views from a monocular video efficiently. The authors treat static and dynamic video content separately. Specifically, they built a global static scene model using an extended plane-based scene representation to synthesize temporally coherent novel video.

🎯Their plane-based scene representation is augmented with spherical harmonics and displacement maps to capture view-dependent effects and model non-planar, complex surface geometry (a minimal spherical-harmonics color sketch follows these bullets). They opt to represent the dynamic content as per-frame point clouds for efficiency.

🎯While such representations are inconsistency-prone, minor temporal inconsistencies are perceptually masked due to motion. Therefore, they developed a method to quickly estimate such a hybrid video representation and render novel views in real time.

🎯The authors' experiments showed that their method can render high-quality novel views from an in-the-wild video with quality comparable to state-of-the-art methods while being 100× faster to train and enabling real-time rendering.
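
A minimal sketch of the spherical-harmonics piece: view-dependent color evaluated from degree-1 SH coefficients (higher orders follow the same pattern). The coefficient layout and names are illustrative assumptions, not the paper's exact representation.

```python
import torch

def sh_color(sh_coeffs, view_dirs):
    """Evaluate per-point RGB from DC + degree-1 spherical harmonics.

    sh_coeffs: (N, 4, 3) coefficients (1 DC + 3 linear terms, per color channel).
    view_dirs: (N, 3) unit viewing directions.
    """
    c0, c1 = 0.28209479177387814, 0.4886025119029199          # SH basis constants
    x, y, z = view_dirs.unbind(-1)
    basis = torch.stack([torch.full_like(x, c0), -c1 * y, c1 * z, -c1 * x], dim=-1)  # (N, 4)
    return torch.einsum('nb,nbc->nc', basis, sh_coeffs)        # (N, 3) RGB
```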

🏢Organization: @University of Maryland, College Park, @AdobeResearch , @Adobe

🧙Paper Authors: @YaoChihLee , @ZhoutongZhang , Kevin Blackburn Matzen, @simon_niklaus , Jianming Zhang, @jbhuang0604 , Feng Liu

1️⃣Read the Full Paper here: [2312.02135] Fast View Synthesis of Casual Videos

2️⃣Project Page: Fast View Synthesis of Casual Videos

🎥 Be sure to watch the attached Demo Video -Sound on 🔊🔊

🎵 Music by Pavel Bekirov from @pixabay

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.

#ECCV24


 

🚨 Paper Alert 🚨

➡️Paper Title: Ctrl-V: Higher Fidelity Video Generation with Bounding-Box Controlled Object Motion

🌟Few pointers from the paper

🎯With recent advances in video prediction, controllable video generation has been attracting more attention. Generating high fidelity videos according to simple and flexible conditioning is of particular interest.

🎯To this end, the authors of this paper propose a controllable video generation model that uses pixel-level renderings of 2D or 3D bounding boxes as conditioning (a minimal box-rasterization sketch follows these bullets).

🎯 In addition, they also created a bounding-box predictor that, given the bounding boxes of the initial and final frames, can predict up to 15 bounding boxes per frame for all the frames in a 25-frame clip.

🎯Given the novelty of their problem formulation, there is no existing standard way to evaluate models that seek to predict vehicle video with high fidelity.

🎯The authors therefore present a new benchmark, consisting of a particular way of evaluating video generation models, using the KITTI, Virtual KITTI 2 (vKITTI 2), and Berkeley DeepDrive (BDD100K) datasets.
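
To illustrate what pixel-level bounding-box conditioning can look like, the sketch below rasterizes 2D boxes into an image-shaped control frame with a fixed color per object id; the rendering scheme (filled boxes, color-by-id) is an assumption, not necessarily Ctrl-V's exact format.

```python
import numpy as np

def render_bbox_frame(boxes, ids, height, width, num_ids=16):
    """Rasterize 2D boxes into a (H, W, 3) conditioning image in [0, 1].

    boxes: (N, 4) array of [x1, y1, x2, y2] in pixels; ids: (N,) integer object ids.
    """
    frame = np.zeros((height, width, 3), dtype=np.float32)
    palette = np.random.default_rng(0).uniform(0.2, 1.0, size=(num_ids, 3))  # fixed per-id colors
    for (x1, y1, x2, y2), obj_id in zip(boxes.astype(int), ids):
        x1, x2 = np.clip([x1, x2], 0, width)
        y1, y2 = np.clip([y1, y2], 0, height)
        frame[y1:y2, x1:x2] = palette[obj_id % num_ids]        # filled box; outlines also work
    return frame
```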

🏢Organization: @Mila_Quebec

🧙Paper Authors: Ge Ya (Olga) Luo, Zhi Hao Luo, Anthony Gosselin, Alexia Jolicoeur-Martineau, Christopher Pal

1️⃣Read the Full Paper here: [2406.05630] Ctrl-V: Higher Fidelity Video Generation with Bounding-Box Controlled Object Motion

2️⃣Project Page: Ctrl-V: Higher Fidelity Video Generation with Bounding-Box Controlled Object Motion

3️⃣Code: GitHub - oooolga/Ctrl-V

🎥 Be sure to watch the attached Demo Video -Sound on 🔊🔊

🎵 Music by Rockot from @pixabay

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.


 

🚨 Paper Alert 🚨

➡️Paper Title: HouseCrafter: Lifting Floorplans to 3D Scenes with 2D Diffusion Models

🌟Few pointers from the paper

🎯In this paper, the authors introduce “HouseCrafter”, a novel approach that can lift a floorplan into a complete, large 3D indoor scene (e.g., a house).

🎯Their key insight is to adapt a 2D diffusion model, which is trained on web-scale images, to generate consistent multi-view color (RGB) and depth (D) images across different locations of the scene.

🎯Specifically, the RGB-D images are generated autoregressively in a batch-wise manner along sampled locations based on the floorplan, where previously generated images are used as conditioning for the diffusion model to produce images at nearby locations (a minimal generation-loop sketch follows these bullets).

🎯 The global floorplan and the attention design in the diffusion model ensure the consistency of the generated images, from which a 3D scene can be reconstructed.

🎯Through extensive evaluation on the 3D-FRONT dataset, the authors demonstrate that HouseCrafter can generate high-quality, house-scale 3D scenes. Ablation studies also validate the effectiveness of different design choices.
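
A rough sketch of the batch-wise autoregressive loop described above; `diffusion.sample` and its arguments are hypothetical placeholders, not HouseCrafter's interface. The point is only the control flow: each batch of new locations is conditioned on the views generated so far.

```python
def generate_scene_rgbd(diffusion, floorplan, locations, batch_size=4):
    """Generate RGB-D views a few locations at a time, reusing earlier views as conditioning."""
    generated = []  # list of (location, rgbd) pairs produced so far
    for i in range(0, len(locations), batch_size):
        batch_locs = locations[i:i + batch_size]
        rgbd_batch = diffusion.sample(floorplan=floorplan,
                                      target_locations=batch_locs,
                                      reference_views=generated)   # hypothetical call
        generated.extend(zip(batch_locs, rgbd_batch))
    return generated
```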

🏢Organization: @Northeastern , @StabilityAI

🧙Paper Authors: Hieu T. Nguyen, Yiwen Chen, @VikramVoleti @jampani_varun , @HuaizuJiang

1️⃣Read the Full Paper here: [2406.20077] HouseCrafter: Lifting Floorplans to 3D Scenes with 2D Diffusion Model

2️⃣Project Page: HouseCrafter: Lifting Floorplans to 3D Scenes with 2D Diffusion Models

3️⃣Code: Coming 🔜

🎥 Be sure to watch the attached Demo Video -Sound on 🔊🔊

🎵 Music by Maksym Dudchyk from @pixabay

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.


 

🚨Product Update 🚨

@elevenlabsio has just introduced 🗣️“VOICE ISOLATOR” 🎤

This new feature allows you to extract crystal-clear speech from any audio.

Their vocal remover strips background noise for film, podcast, and interview post-production.

Try it for Free here: Free Voice Isolator and Background Noise Remover | ElevenLabs

🎥 Be sure to watch the attached Demo Video -Sound on 🔊🔊

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.


 

🚨 Paper Alert 🚨

➡️Paper Title: EquiBot: SIM(3)-Equivariant Diffusion Policy for Generalizable and Data Efficient Learning

🌟Few pointers from the paper

🎯Building effective imitation learning methods that enable robots to learn from limited data and still generalize across diverse real-world environments is a long-standing problem in robot learning.

🎯In this paper, the authors propose “EquiBot”, a robust, data-efficient, and generalizable approach for robot manipulation task learning. Their approach combines SIM(3)-equivariant neural network architectures with diffusion models.

🎯This ensures that their learned policies are invariant to changes in scale, rotation, and translation, enhancing their applicability to unseen environments while retaining the benefits of diffusion-based policy learning, such as multi-modality and robustness (a minimal input-normalization sketch follows these bullets).

🎯They showed on a suite of 6 simulation tasks that their proposed method reduces the data requirements and improves generalization to novel scenarios.

🎯In the real world, with 10 variations of 6 mobile manipulation tasks, they showed that their method can easily generalize to novel objects and scenes after learning from just 5 minutes of human demonstrations in each task.
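
As a partial illustration of where scale and translation invariance can come from, the sketch below normalizes an observed point cloud by its centroid and mean radius; rotation equivariance in EquiBot-style policies comes from the SIM(3)-equivariant network layers themselves, which this snippet does not reproduce, and the names are assumptions.

```python
import torch

def canonicalize_pointcloud(points):
    """Remove translation and scale from a (B, N, 3) point-cloud observation.

    Returns the normalized points plus the (centroid, scale) needed to map
    predicted actions back into the original frame.
    """
    centroid = points.mean(dim=1, keepdim=True)                # (B, 1, 3)
    centered = points - centroid
    scale = centered.norm(dim=-1).mean(dim=1, keepdim=True)    # (B, 1) mean radius
    normalized = centered / scale.unsqueeze(-1).clamp_min(1e-6)
    return normalized, centroid, scale
```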

🏢Organization: @Stanford

🧙Paper Authors: @yjy0625 , Zi-ang Cao , @CongyueD , @contactrika , @SongShuran ,@leto__jean

1️⃣Read the Full Paper here: [2407.01479] EquiBot: SIM(3)-Equivariant Diffusion Policy for Generalizable and Data Efficient Learning

2️⃣Project Page: EquiBot

3️⃣Code: GitHub - yjy0625/equibot: Official implementation for paper "EquiBot: SIM(3)-Equivariant Diffusion Policy for Generalizable and Data Efficient Learning".

🎥 Be sure to watch the attached Video -Sound on 🔊🔊

🎵 Music by Zakhar Valaha from @pixabay

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.


 

🚨Paper Alert 🚨

➡️Paper Title: Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion

🌟Few pointers from the paper

🎯In this paper, the authors present “Diffusion Forcing”, a new training paradigm where a diffusion model is trained to denoise a set of tokens with independent per-token noise levels (a minimal training-step sketch follows these bullets).

🎯They applied Diffusion Forcing to sequence generative modeling by training a causal next-token prediction model to generate one or several future tokens without fully diffusing past ones.

🎯Their approach is shown to combine the strengths of next-token prediction models, such as variable-length generation, with the strengths of full-sequence diffusion models, such as the ability to guide sampling to desirable trajectories.

🎯Their method offers a range of additional capabilities, such as
⚓rolling out sequences of continuous tokens, such as video, with lengths past the training horizon, where baselines diverge, and
⚓ new sampling and guiding schemes that uniquely profit from Diffusion Forcing's variable-horizon and causal architecture, and which lead to marked performance gains in decision-making and planning tasks.

🎯In addition to its empirical success, their method is proven to optimize a variational lower bound on the likelihoods of all subsequences of tokens drawn from the true joint distribution.
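
A minimal sketch of the per-token noise idea: every token in the sequence gets its own independently sampled noise level, and a causal denoiser is trained with a standard epsilon-prediction loss. The `model(noisy, t)` interface and schedule handling are assumptions, not the authors' code.

```python
import torch

def diffusion_forcing_step(model, tokens, alphas_cumprod, optimizer):
    """One training step with independent per-token noise levels.

    tokens: (B, T, D) clean sequence; alphas_cumprod: (num_steps,) noise schedule;
    model(noisy, t) is causal along the sequence axis and predicts the added noise.
    """
    b, seq_len, _ = tokens.shape
    t = torch.randint(0, len(alphas_cumprod), (b, seq_len), device=tokens.device)
    a = alphas_cumprod[t].unsqueeze(-1)                        # (B, T, 1)
    noise = torch.randn_like(tokens)
    noisy = a.sqrt() * tokens + (1 - a).sqrt() * noise         # each token noised independently
    loss = torch.nn.functional.mse_loss(model(noisy, t), noise)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```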

🏢Organization: @MIT_CSAIL

🧙Paper Authors: @BoyuanChen0 , Diego Marti Monso, @du_yilun , @max_simchowitz , @RussTedrake , @vincesitzmann

1️⃣Read the Full Paper here: [2407.01392] Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion

2️⃣Project Page: Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion

3️⃣Code: GitHub - buoyancy99/diffusion-forcing: code for "Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion"

🎥 Be sure to watch the attached Demo Video -Sound on 🔊🔊

🎵 Music by Nick Valerson from @pixabay

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.


 

🚨Paper Alert 🚨

➡️Paper Title: MimicMotion : High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance

🌟Few pointers from the paper

🎯In recent years, generative artificial intelligence has achieved significant advancements in the field of image generation, spawning a variety of applications.

🎯However, video generation still faces considerable challenges in various aspects, such as controllability, video length, and richness of details, which hinder the application and popularization of this technology.

🎯In this work, the authors propose a controllable video generation framework, dubbed “MimicMotion”, which can generate high-quality videos of arbitrary length with any motion guidance.

🎯Compared with previous methods, their approach has several highlights.
⚓Firstly, with confidence-aware pose guidance, temporal smoothness can be achieved so model robustness can be enhanced with large-scale training data.
⚓ Secondly, regional loss amplification based on pose confidence significantly alleviates image distortion (a minimal loss-weighting sketch follows these bullets).

🎯Lastly, for generating long smooth videos, a progressive latent fusion strategy is proposed. By this means, videos of arbitrary length can be generated with acceptable resource consumption. With extensive experiments and user studies, MimicMotion demonstrates significant improvements over previous approaches in multiple aspects.
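
A minimal sketch of confidence-aware loss weighting: the per-pixel diffusion loss is amplified where the pose-confidence map is high, so distortions around reliable keypoint regions (e.g., hands) are penalized more. The weighting form and `amplify` factor are assumptions, not the paper's exact formulation.

```python
def confidence_weighted_loss(eps_pred, eps_target, confidence_map, amplify=2.0):
    """Diffusion loss with regional amplification driven by pose confidence.

    eps_pred, eps_target: (B, C, H, W); confidence_map: (B, 1, H, W) in [0, 1].
    """
    per_pixel = (eps_pred - eps_target) ** 2
    weight = 1.0 + amplify * confidence_map     # baseline weight 1 everywhere
    return (weight * per_pixel).mean()
```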

🏢Organization: @TencentGlobal , @sjtu1896

🧙Paper Authors: Yuang Zhang, Jiaxi Gu, Li-Wen Wang, Han Wang, Junqi Cheng, Yuefeng Zhu, Fangyuan Zou

1️⃣Read the Full Paper here: [2406.19680] MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance

2️⃣Project Page: MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance

3️⃣Code: GitHub - Tencent/MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance

🎥 Be sure to watch the attached Demo Video -Sound on 🔊🔊

🎵 Music by Alexander Lisenkov from @pixabay

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.

