Mira: A Mini-step Towards Sora-like Long Video Generation

ARC Lab, Tencent PCG

For any inquiries, please email mira-x@googlegroups.com

Introduction

We introduce Mira (Mini-Sora), an initial foray into the realm of high-quality, long-duration video generation in the style of Sora. Mira stands out from existing text-to-video (T2V) generation frameworks in several key ways:

1. Extended sequence length: While most frameworks are limited to generating short videos (2 seconds / 16 frames), Mira is designed to produce significantly longer sequences, potentially lasting 10 seconds, 20 seconds, or more.

2. Enhanced dynamics: Mira has the capability to create videos with rich dynamics and intricate motions, setting it apart from the more static outputs of current video generation technologies.

3. Strong 3D consistency: Despite the intricate dynamics and object interactions, Mira ensures the 3D integrity of objects is preserved throughout the video, avoiding noticeable distortions.

Please note that our work on Mira is still in the experimental phase. There is a significant gap between Mira and Sora in several key areas:

1. Interactive objects and environments: Sora supports the generation of videos where objects and surroundings engage in dynamic interactions, adding a layer of complexity and realism.

2. Sustained object consistency: Sora maintains consistent object shapes, even when they temporarily exit and re-enter the frame, ensuring continuity and coherence.

The Mira project is our endeavor to investigate and refine the entire data-model-training pipeline for Sora-like, lightweight T2V frameworks, and to preliminarily demonstrate the aforementioned Sora characteristics. Our goal is to foster innovation and democratize the field of content creation, paving the way for more accessible and advanced video generation tools.

Please note that we are currently conducting preliminary experiments for Mini-Sora. Our primary objective with this project is not to fully reproduce Sora, but rather to explore specific key components within the Sora framework and share our findings with the community.

MiraDiT

More Details

MiraData

MiraData is a large-scale video dataset with long duration and structured captions. It is specifically designed for long video generation tasks.

MiraData YouTube Video

Key Features of MiraData

1. Long Video Duration: Unlike previous datasets, where video clips are often very short (typically less than 6 seconds), MiraData focuses on uncut video segments with durations ranging from 1 to 2 minutes. This extended duration allows for more comprehensive modeling of video content.

2. Structured Captions: Each video in MiraData is accompanied by structured captions. These captions describe the video from several perspectives, enhancing the richness of the dataset. The average caption length is 349 words, ensuring a thorough representation of the video content.
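
The two properties above (clip duration and caption length) can be checked on any clip metadata with a few lines of Python. The sketch below is illustrative only: the `Clip` record, its field names, and the toy entries are assumptions made for this example and are not MiraData's actual schema or data. It keeps clips whose duration falls in the 1 to 2 minute range and reports the mean caption length in words.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Clip:
    """Hypothetical metadata record; not MiraData's actual schema."""
    clip_id: str
    duration_s: float  # clip length in seconds
    caption: str       # caption flattened to plain text


def in_target_range(clip: Clip, lo: float = 60.0, hi: float = 120.0) -> bool:
    """True if the clip lasts between `lo` and `hi` seconds, inclusive."""
    return lo <= clip.duration_s <= hi


def mean_caption_words(clips: list[Clip]) -> float:
    """Average caption length, counting whitespace-separated words."""
    return mean(len(c.caption.split()) for c in clips)


# Toy records, invented for illustration.
clips = [
    Clip("a", 4.0, "a short clip"),
    Clip("b", 75.0, "a man walks through a city street"),
    Clip("c", 118.5, "snow falls on a quiet mountain village at dusk"),
]

kept = [c for c in clips if in_target_range(c)]
print([c.clip_id for c in kept])   # ids of clips inside the duration range
print(mean_caption_words(kept))    # mean words per caption among kept clips
```

Whitespace splitting is only a rough proxy for word count; a real pipeline would likely use the tokenization that matches how the reported average was computed.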

Current Status

In this initial release, MiraData includes two scenarios:
- Gaming: Videos related to gaming experiences.
- City/Scenic Exploration: Videos capturing urban or scenic views.

MiraData is still in its early stages, and we will release more scenarios and improve the quality of the dataset in the near future.

Samples from MiraData

The video features a man navigating through a series of urban environments in a video game. The character is dressed casually in a light-colored shirt and dark pants. He moves with a sense of urgency, suggesting a stealth or action-oriented gameplay scenario. The environments are detailed and realistic, with textures and lighting that give a sense of depth and immersion. The man interacts with the environment, climbing and jumping over obstacles, which indicates the game likely includes parkour or exploration elements. The presence of dialogue text suggests that there is a narrative component to the game, with other characters communicating with the man, possibly indicating cooperative gameplay or a story-driven mission.

A sequence from a video game set in a snowy environment. The player character, a warrior dressed in rugged armor with a fur collar, wields a long staff-like weapon. The character is seen engaging in combat with an unseen opponent, executing a series of dynamic, swirling attacks that leave trails in the snow. The scene transitions to a blizzard-obscured view of a traditional East Asian village, followed by the character trudging through deep snow, visibly struggling against the harsh weather conditions. The sequence ends with the character climbing a rocky, snow-covered mountain path, emphasizing the game's focus on exploration and survival in extreme environments.

The video showcases a man and a woman as they walk down a vibrant city street in a video game setting. The man is dressed casually with a cap, a hoodie, and light-colored pants, while the woman sports a light jacket and jeans. They move through a sunny, palm-lined environment that is reminiscent of a bustling urban area in a Southern Californian city. The virtual world is rich with detail, featuring realistic textures, dynamic lighting, and a variety of architectural styles that contribute to an immersive experience.

The video presents a sequence of events in a dimly lit, atmospheric kitchen setting. The ambiance is eerie, with a focus on stealth and tension. A small, humanoid character with a yellow raincoat sneaks through the environment, which is oversized in comparison to their diminutive size. The kitchen is occupied by large, grotesque chef-like figures, whose attention the small character must avoid to navigate the space. The mise-en-scène is rich with detail, featuring kitchenware, food items, and furniture that contribute to the oppressive and grim atmosphere.

The video captures the dynamic and picturesque environment of a ski resort where people are engaging in winter sports. Skiers and snowboarders of various skill levels descend the slopes, carving through the pristine snow, with some taking leisurely paths while others take more aggressive, swift turns. The snow-covered landscape is dotted with evergreen trees, and the distant mountains provide a majestic backdrop. The clear blue sky above and the bright sunlight cast sharp shadows on the snow, enhancing the visual contrast and the sense of activity on the slopes.

The video captures a serene journey through a picturesque mountain village, showcasing the natural beauty and quaint architecture of the area. The viewer is taken on a visual stroll past charming wooden buildings, some adorned with signage and outdoor seating areas, as they are gradually introduced to the breathtaking mountain range in the background. The vibrant greenery of trees and well-kept grass complements the scene, while the clear blue sky above sets a peaceful tone for the experience.

The video captures a rainy night in a bustling cityscape, where the glow of neon signs and streetlights reflects off the wet pavement. People with umbrellas navigate the slick sidewalks, their movements unhurried as they avoid puddles. The city is alive with the vibrant colors of advertisements and storefronts, casting a cinematic quality over the scene. The rain adds a layer of tranquility to the otherwise busy urban environment, creating a juxtaposition between the calm of nature and the energy of city life.

The video captures the exhilarating experience of a motorcycle ride along a dirt road that cuts through a vast mountainous landscape. The rider's perspective gives a sense of speed and motion, with the road blurring beneath and the scenery rushing by. The mountains loom in the background, their peaks occasionally dusted with snow, suggesting a cooler climate or season. Overhead, the sky is a tapestry of clouds, hinting at the possibility of changing weather, adding a dynamic element to the ride.

We will remove video samples from our dataset, GitHub repository, and project webpage upon request. To make a request, please contact tsaishienchen at gmail dot com.

More Details