Just a year after the initial explosion of interest in AI video generation, the competitive landscape is reportedly undergoing a significant transformation. The focus is shifting from simply achieving video generation capabilities to the critical challenge of demonstrating profitability. This shift appears to be eroding the once seemingly unassailable dominance of OpenAI’s Sora, as a wave of new entrants since 2024 vies for a slice of the burgeoning market.
Is the “Cake-Sharing” Phase Underway in AI Video Generation?
The unveiling of OpenAI’s Sora in February 2024 ignited a frenzy in the AI video generation sector. Startups and major tech companies in China and elsewhere rapidly entered the fray. Many of their models and products have quickly approached, and in some cases even surpassed, Sora in video length, quality, and efficiency, raising questions about its continued dominance.
According to a recent a16z Top 100 AI Applications list, AI video generation tools have made significant strides in quality and controllability over the past six months. Notably, the report suggests that these tools have a higher potential for user monetization compared to other more hyped generative AI products.
The a16z analysis further indicates that the most popular applications don’t necessarily generate the most revenue. Tools focused on image/video editing, visual enhancement, “ChatGPT-like” imitations, and image/video generation are reportedly seeing higher revenue despite potentially narrower use cases.
Interestingly, three AI video generation applications – HailuoAI, Kling, and Sora – made their debut on the web-based version of the a16z list. Data up to January 2025 showed that both Hailuo and Kling had surpassed Sora in user traffic.
The monetization strategies employed by these AI video generation tools are largely similar, encompassing pay-as-you-go models, subscription services, free basic versions with premium features, enterprise customization, and combinations of these approaches.
A potential turning point in the shift towards prioritizing profitability was OpenAI’s recent adjustment to Sora’s pricing strategy in late March 2025. The company removed credit limits for paid users, allowing Plus and Pro subscribers to generate an unlimited number of videos. However, this change has not universally resonated with users.
Numerous users on platforms like X and Reddit reportedly expressed that despite the removal of credit restrictions, they are not inclined to use Sora. Many indicated a preference for perceived superior alternatives like Google’s Veo 2 or the open-source Wan2.1. Some users also pointed out that OpenAI’s decision to lift credit limits might be due to a lack of user adoption and expressed disappointment that the adjusted Sora still isn’t a complete, final product. This sentiment echoes earlier criticisms following Sora’s initial release in December 2024, where it reportedly received negative feedback regarding its video generation quality.
Amidst this evolving landscape, when users discuss video generation models and products they are more willing to use or pay for, names like Meta’s Emu, Google’s Veo 2, Alibaba’s Wan 2.1, and Kuaishou’s Kling 1.6 are frequently mentioned. These models are reportedly catching up to, and in some aspects exceeding, Sora in terms of generation quality and video length capabilities.
How AI Video Generation Players Are Monetizing Their Offerings
Following the surge in popularity of AI video generation, early entrants are now leveraging their products’ unique advantages and features to attract paying users, including individual creators, advertising studios, e-commerce bloggers, and professionals in the film and television industries.
While OpenAI’s Sora was initially a leader in generating high-definition 60-second videos, this is no longer a unique advantage; several competitors have matched or even surpassed it in video length, clarity, and visual quality. Sora’s pricing page indicates that Plus users can generate 10-second videos and Pro users 20-second videos (with the possibility of extension). In contrast, newer models such as Luma’s Ray2 and Shengshu’s Vidu can generate one-minute high-definition videos, and Kuaishou’s Kling 1.6 can generate 5- or 10-second clips that can be extended to as long as two minutes.
Functionally, popular video generation models and products currently offer features such as text-to-video, image-to-video, real-time video editing, and automatic addition of sound effects. Furthermore, many are incorporating new features based on specific application needs in their updates.
Beyond basic capabilities like video length and resolution, the ongoing iteration of AI video generation is focusing on crucial aspects for industries like film and advertising, including precise text control, consistent character portrayal, style customization, and even control over different camera angles and perspectives.
Some companies are also focusing on enhancing the scalability and adaptability of their products to suit video projects of varying sizes and complexities, supporting diverse video formats and resolutions, and integrating with other tools and platforms to meet a wider range of application scenarios.
To boost revenue, some companies are also employing technical strategies to reduce the development and computational costs of their video generation models, thereby increasing profit margins. This includes improving model architectures and adopting more efficient algorithms to raise operational efficiency and cut the compute consumed during video generation. Tencent’s Hunyuan Video model, for example, reportedly reduced computational consumption by 80% through scaling techniques. Research teams from Peking University, Kuaishou, and Beijing University of Posts and Telecommunications have proposed the Pyramidal Flow Matching method, which lowers training cost by operating on downsampled embeddings and progressively upsampling them during training. And the recently open-sourced Open-Sora 2.0 from Colossal-AI claims to achieve commercial-grade performance with an 11B-parameter model trained for $200,000 on 224 GPUs, rivaling models such as HunyuanVideo and the 30B-parameter Step-Video.
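To make the compute-saving intuition concrete, here is a minimal PyTorch sketch, not the published Pyramidal Flow Matching algorithm, of a training step that evaluates a flow-matching loss across a pyramid of spatial resolutions so that most work happens on cheap, downsampled latents. The function name, the `stages` factors, and the `model` interface are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def pyramid_flow_matching_step(model, latents, timesteps, stages=(4, 2, 1)):
    """Illustrative training step: compute a flow-matching loss at several
    spatial resolutions so most compute is spent on downsampled latents.
    `model` is assumed to be any callable taking (noisy_latents, timesteps)
    that can handle variable spatial sizes; latents are (B, C, T, H, W)."""
    total_loss = 0.0
    for factor in stages:
        if factor > 1:
            # Spatially downsample each frame; a 4x factor cuts per-step FLOPs roughly 16x.
            x = F.interpolate(
                latents, scale_factor=(1.0, 1.0 / factor, 1.0 / factor), mode="trilinear"
            )
        else:
            x = latents
        noise = torch.randn_like(x)
        t = timesteps.view(-1, 1, 1, 1, 1)      # broadcast over C, T, H, W
        noisy = (1.0 - t) * x + t * noise       # linear path from data to noise
        target = noise - x                      # flow-matching velocity target
        pred = model(noisy, timesteps)
        total_loss = total_loss + F.mse_loss(pred, target)
    return total_loss / len(stages)
```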
Areas for Improvement in Video Generation Models
The models and products emerging from domestic and international startups, unicorns, and internet giants are already impacting content creators in industries like advertising and entertainment. While some products are beginning to generate revenue for companies, current video generation models still face significant limitations.
You Yang, the founder of Colossal-AI, recently shared his views on the future development of video generation models, emphasizing the need for capabilities such as precise text control, arbitrary camera angles, consistent character portrayal, and style customization. He noted that while current text-to-image applications do not yet offer fully precise control, future video generation models have significant potential to accurately translate textual descriptions into video. He also highlighted the importance of large AI video models being able to freely adjust camera angles and positions, much as in real-world filming, and of maintaining consistent character appearance across shots and scenes, which is crucial for advertising and film production.
Given the ongoing need for improvement, researchers at companies and universities continue to explore new methods. Researchers from Tsinghua University and Tencent recently proposed “Video-T1,” inspired by the use of Test-Time Scaling in LLMs, to explore its potential for video generation models. Their work frames Test-Time Scaling in video generation as a trajectory search problem from Gaussian noise space to the target video distribution and introduces Random Linear Search as a basic implementation: multiple videos are sampled from random noise, each is scored by a vision-language model (VLM), and the best sample is selected as the output.

They also proposed the Tree-of-Frames (ToF) method, which adaptively expands and prunes video branches to balance computational cost against generation quality, improving both search speed and output quality. ToF uses a test-time verifier to evaluate intermediate results and heuristics to navigate the search space efficiently, scoring candidates at appropriate points in the generation process so that only promising trajectories are pursued. The researchers observed that the first frame strongly influences overall video alignment and that different parts of a video (beginning, middle, end) have different prompt-alignment needs; they therefore apply chain-of-thought reasoning to single-frame image generation and hierarchical prompting to improve frame generation and prompt alignment, building up the overall Tree-of-Frames process. With ToF, Video-T1 achieved a score improvement of up to 5.86% on the VBench benchmark, and its capability continues to grow with the number of samples explored at inference time, demonstrating continuous scaling potential.
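As a rough illustration of the Random Linear Search baseline described above, the following Python sketch samples several candidate videos from different noise seeds, scores each with a VLM, and keeps the best; `generate_video` and `vlm_score` are hypothetical callables standing in for the real generation pipeline and verifier, not functions from the Video-T1 codebase.

```python
import random

def random_linear_search(prompt, generate_video, vlm_score, num_samples=8, seed=0):
    """Basic test-time scaling recipe: sample several candidate videos from
    different noise seeds, score each with a vision-language model, and keep
    the best. `generate_video` and `vlm_score` are assumed interfaces."""
    rng = random.Random(seed)
    best_video, best_score = None, float("-inf")
    for _ in range(num_samples):
        noise_seed = rng.randrange(2**31)      # a fresh starting point in Gaussian noise space
        video = generate_video(prompt, seed=noise_seed)
        score = vlm_score(video, prompt)       # prompt alignment / quality judged by the VLM
        if score > best_score:
            best_video, best_score = video, score
    return best_video, best_score
```

Tree-of-Frames replaces this flat loop with a branching search over partial frame sequences, scoring intermediate frames so that unpromising branches can be pruned before full videos are generated.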
In March 2025, researchers from Kuaishou Technology and the Chinese University of Hong Kong proposed FullDiT, which integrates multi-task conditions (such as identity transfer, depth maps, and camera motion) into the training of a video generation model, giving users more granular control over the generation process. FullDiT builds ControlNet-style controllability directly into training, unifying the multi-task conditions in a single trained model. It employs a unified attention mechanism to capture spatiotemporal relationships across conditions, converting all condition inputs (text, camera motion, identity, and depth) into a common token format and processing them through a series of Transformer layers with full self-attention. Training relies on labeled datasets tailored to each condition type and follows a progressive schedule that introduces the more challenging conditions earlier. In testing, FullDiT achieved state-of-the-art results on text, camera-motion, identity, and depth control metrics and generally outperformed other methods on overall quality, although its smoothness score was slightly lower than ConceptMaster’s.
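The following is a minimal PyTorch sketch of the unified-token idea attributed to FullDiT, not the paper’s actual architecture: each condition is projected into a shared token space, concatenated with the video tokens, and processed with full self-attention. The module name, dimensions, and condition list are illustrative assumptions.

```python
import torch
import torch.nn as nn

class UnifiedConditionBlock(nn.Module):
    """Sketch of the unified-token idea: project each condition (here text,
    camera, identity, depth) into a shared token space, concatenate with the
    video tokens, and apply full self-attention so every condition and the
    video attend to one another. Dimensions and names are illustrative."""

    def __init__(self, dim=512, num_heads=8, cond_dims=(768, 6, 512, 1)):
        super().__init__()
        # One linear projection per condition modality into the shared dimension.
        self.cond_proj = nn.ModuleList([nn.Linear(d, dim) for d in cond_dims])
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, video_tokens, conditions):
        # video_tokens: (B, L_video, dim); conditions: list of (B, L_i, cond_dims[i]).
        cond_tokens = [proj(c) for proj, c in zip(self.cond_proj, conditions)]
        x = torch.cat([video_tokens] + cond_tokens, dim=1)  # one unified sequence
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # full self-attention
        x = x + self.mlp(self.norm2(x))
        # Return only the video positions; the condition tokens served as shared context.
        return x[:, : video_tokens.shape[1]]
```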
This dynamic environment underscores the intense competition and rapid innovation within the AI video generation sector, as players increasingly focus on building sustainable and profitable businesses while continuing to push the boundaries of video generation technology.