What Can Runway or Pika Learn from OpenAI’s Sora?
PLUS: “Chinese Vision Pro”; SoftBank’s Son’s AI strategy; China’s most expensive AI startup
By: JINPENG LI
Last week, when OpenAI unveiled Sora, the term “text-to-video” immediately caught my attention, and I quickly shared the news with Oliver. The internet has been discussing Sora for a week now, from the background of the Sora research team to worries about the future of text-to-video players like Runway and Pika. Recently, Yann LeCun (Chief AI Scientist at Meta) became the center of discussion again for his view that Sora cannot understand the physical world.
Everyone on the Chinese internet is talking about how to make money off Sora (it’s not publicly available yet), and a lot of doubt hangs over Pika’s future too. Last year, Pika’s founder, Demi Guo, caught people’s attention after being labeled a “Stanford genius beauty.” Guo has pointed out that the advancement of generative video technology is currently bottlenecked by algorithms. When asked about Sora, she expressed excitement and a readiness to compete.
I hope to use this piece to explain, in plain language, what kind of algorithmic approach Sora offers to Runway and Pika.
First off, let’s break down two key terms: “Diffusion Transformer” and “patch.” The foundation of Sora lies in its training on a vast dataset of videos and images, leveraging a “diffusion transformer” architecture that operates on spacetime patches of video.
Here’s a bit more detail:
Sora builds on the significant work of OpenAI’s DALL·E and GPT. For example, the detailed descriptions of video segments (covering everything from characters and settings to style and cinematography) are produced with the highly descriptive re-captioning technique introduced in DALL·E 3, applied to the visual training data. Moreover, OpenAI leverages GPT to expand short user prompts into elaborate, detailed captions before they’re sent to the video model.
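OpenAI hasn’t published the exact prompt it uses for this expansion step, but the idea is easy to sketch. Here’s a hypothetical Python example using OpenAI’s chat API; the system prompt and model name below are my own placeholders, not anything OpenAI has disclosed:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def expand_prompt(short_prompt: str) -> str:
    """Sketch of Sora's prompt-expansion step: ask GPT to turn a terse
    user prompt into a detailed video description. The system prompt
    and model choice are guesses; OpenAI hasn't published its own."""
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder; the actual model is not disclosed
        messages=[
            {
                "role": "system",
                "content": (
                    "Rewrite the user's idea as a richly detailed video "
                    "description covering characters, setting, style, "
                    "and cinematography."
                ),
            },
            {"role": "user", "content": short_prompt},
        ],
    )
    return response.choices[0].message.content

print(expand_prompt("a corgi surfing at sunset"))
```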
Sora first compresses videos by feeding them into a network that reduces the dimensionality of the visual data. The compressed representation is then sliced into many small blocks, which OpenAI refers to as “patches.”
Patches act much like the basic data units or “tokens” in LLMs, forming the core material for training. This method makes video data preprocessing more efficient by removing the need for extensive preliminary tasks, such as ensuring uniform resolution and aspect ratio for model training.
No matter the format, all videos ultimately get cut into patches of the same shape, similar to how all LEGO pieces are uniform blocks. Each patch also carries an additional dimension, time, which upgrades it to a spacetime latent patch.
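Sora’s code isn’t public, but the patchify step is easy to picture. Below is a minimal NumPy sketch of slicing a compressed video latent into flattened spacetime patches; the tensor shape and patch sizes are made up for illustration:

```python
import numpy as np

def patchify(latent_video, patch_t=2, patch_h=4, patch_w=4):
    """Slice a latent video into flattened spacetime patches ("tokens").
    A sketch only: Sora's real patch sizes and layout aren't public.

    latent_video: array of shape (T, H, W, C), the compressed video.
    Returns: array of shape (num_patches, patch_t * patch_h * patch_w * C).
    """
    T, H, W, C = latent_video.shape
    assert T % patch_t == 0 and H % patch_h == 0 and W % patch_w == 0
    x = latent_video.reshape(T // patch_t, patch_t,
                             H // patch_h, patch_h,
                             W // patch_w, patch_w, C)
    # Bring the three patch-grid axes together, then flatten each patch.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(-1, patch_t * patch_h * patch_w * C)

latent = np.random.randn(16, 32, 32, 8)  # (frames, height, width, channels)
tokens = patchify(latent)
print(tokens.shape)  # (512, 256)
```

Because every token has the same shape, the model never needs videos to share a resolution or duration; a longer or larger clip simply becomes a longer sequence of tokens.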
At this stage, the “diffusion transformer” begins its work. The diffusion model, an algorithm used by image and video model developers like Runway, Pika, and Midjourney, starts from noisy patches and gradually denoises them, step by step, until the final video emerges. But Sora is more than just a diffusion model.
OpenAI’s report states that Sora is a diffusion model built on the Transformer architecture, the same architecture that underpins GPT.
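Putting the two ideas together: a diffusion transformer is, at heart, a transformer trained to predict and remove noise from patch tokens over many steps. The toy PyTorch sketch below shows the shape of the idea; every size, the conditioning scheme, and the update rule are simplified placeholders, not Sora’s actual design:

```python
import torch
import torch.nn as nn

class ToyDiffusionTransformer(nn.Module):
    """A toy diffusion transformer: a transformer backbone that denoises
    a sequence of spacetime patch tokens. Illustrative only, not Sora."""

    def __init__(self, token_dim=256, n_heads=8, n_layers=6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=token_dim, nhead=n_heads,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.time_embed = nn.Linear(1, token_dim)        # crude timestep conditioning
        self.to_noise = nn.Linear(token_dim, token_dim)  # predicted noise per token

    def forward(self, noisy_tokens, t):
        # Condition every token on the diffusion timestep, let attention mix
        # information across all patches in space and time, predict the noise.
        h = noisy_tokens + self.time_embed(t.view(-1, 1, 1).float())
        return self.to_noise(self.backbone(h))

model = ToyDiffusionTransformer()
tokens = torch.randn(1, 512, 256)             # one batch of noisy patch tokens
for step in reversed(range(10)):              # toy reverse-diffusion loop
    t = torch.full((1,), step)
    tokens = tokens - 0.1 * model(tokens, t)  # strip away a bit of noise each step
```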
Back in 2021, the Google Brain team introduced a model called Vision Transformer (ViT), which recognizes images by using self-attention to compute the dependencies between patches of the same image. Before this, language and vision were considered distinct realms: language as linear, sequential data, and vision as spatial, parallel data. However, Transformers have shown that images can be treated as sequences, much like sentences, with patches playing the role of words in a structured order.
Not just images: most problems can be recast as sequence problems, for example, analyzing social media trends through the sequence of user engagement events. And a video, essentially, is just a sequence of images over time.
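To make ViT’s core move concrete, here is a tiny PyTorch sketch with illustrative sizes rather than the paper’s: an image is cut into patches, each patch is embedded as a vector, and self-attention scores how strongly every patch depends on every other patch:

```python
import torch
import torch.nn.functional as F

image = torch.randn(3, 32, 32)                       # (channels, H, W)
patches = image.unfold(1, 8, 8).unfold(2, 8, 8)      # 4x4 grid of 8x8 patches
patches = patches.permute(1, 2, 0, 3, 4).reshape(16, -1)  # 16 tokens of dim 192

embed = torch.nn.Linear(192, 64)                     # patch -> token embedding
tokens = embed(patches)                              # the "sentence" of patches

# Single-head self-attention with untrained weights, just to show the
# mechanics: every patch attends to every other patch, giving a 16x16
# map of pairwise dependencies.
q = k = v = tokens
attn = F.softmax(q @ k.T / 64 ** 0.5, dim=-1)        # (16, 16) attention weights
out = attn @ v                                       # patches mixed by relevance
print(attn.shape, out.shape)
```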
With the “Diffusion Model + Transformer Architecture,” Sora can generate natural-looking videos. However, its simulation of real-world interactions still requires improvement.
Asia Must Reads
Chinese Vision Pro?
Since Apple released its first preview of the Vision Pro, customers in China, like people in most parts of the world, have gone wild. Someone quickly joked, “When can we buy the Huaqiangbei version?”, referring to the famed consumer electronics market in Shenzhen known for its knock-off gadgets.
And true to form, you can already purchase a much cheaper Chinese version of the Vision Pro for only $280. EmdoorVR, the company behind this, has been waiting for such an opportunity. Their headset, while lacking eye-tracking and hand-gesture features, focuses on providing an entertainment experience.
Founded in 2015, EmdoorVR began as a contract manufacturer of VR devices for Chinese tech firms. The enterprise-facing business accounts for about 90% of the company’s total revenue. EmdoorVR is banking on Apple to lead the way in popularizing this technology. (Iris Deng / SCMP)
SoftBank’s Son’s AI strategy
Masayoshi Son, the CEO of SoftBank, is planning to raise up to $100 billion for a chip venture poised to compete with Nvidia, with 70% of the funding potentially coming from Middle Eastern institutions.
Why it matters: This venture marks another of Son’s forays into the chip industry; SoftBank already owns about a 90% stake in chip designer Arm. According to Bloomberg, Son aims for the AI chip venture to complement Arm’s capabilities. After adopting a defensive tech-investment strategy during the pandemic, SoftBank returned to profitability earlier this month. (Min Jeong Lee / Bloomberg)
China’s most expensive AI startup
Chinese AI startup Moonshot.AI has recently secured over $1 billion in new funding from Sequoia China, Xiaohongshu, Meituan, and Alibaba. Founded in March last year, the startup has a core team that contributed to major projects like Google Gemini and Google Bard, establishing it as a talent-rich AI company.
Why it matters: The startup’s valuation has surged to $2.5 billion, overtaking the previously top-ranked MiniMax and Zhipu.ai. It has launched its Moonshot model and the Kimi assistant, which supports inputs of up to 200,000 Chinese characters.
Bonus
ByteDance denied claims that it launched a “Chinese Sora” named Boximator, a project for controlling motion in videos via text. The company clarified that Boximator is still a research project and falls short of leading international models in quality and video length.
Renesas Electronics, Japan’s leading car-chip supplier, is paying $5.8 billion in cash to acquire Australian software-tools provider Altium. The move comes as Renesas, which makes chips for Japan’s Toyota and Nissan, tries to diversify and scale up its business and hedge against geopolitical tensions, with China a key source of revenue. (Nic Fildes and David Keohane / FT)
India’s technology sector is projected to grow at a slower pace this fiscal year, with clients reducing spending and delaying decision-making. (Reuters)
Japan’s ruling party will propose that the government introduce a new law to regulate generative AI (GAI) technologies within 2024, aiming to address issues such as disinformation and rights infringement. (Satoshi Sugiyama and Kantaro Komiya / Nikkei Asia)
Microsoft has tracked hacking groups from Russia, China, and Iran using OpenAI’s tools to refine their skills and deceive their targets. China denies the claims. (Raphael Satter / Reuters)
TikTok and Bigo Live hired many Pakistanis for content moderation as a way to make amends with the Pakistani government during the pandemic. Now these workers are struggling due to the lack of career pathways. (Zuha Siddiqui / Rest of World)
A campaign from China is utilizing AI and a network of social media accounts to amplify American discontent and division ahead of the U.S. presidential election, focusing on issues like urban decay, homelessness, fentanyl abuse, gun violence, and crumbling infrastructure. (Aren’t these issues real anyway?) (Tiffany Hsu / New York Times)
YouTube co-founder Steve Chen plans to launch a project in Taiwan to connect its tech talent with Silicon Valley’s resources. (Ralph Jennings / SCMP)
Originally published at https://theabacus.substack.com.