People's Daily to Build World's Largest Chinese Corpus for Sora

BEIJING, February 21 (TMTPOST) - People’s Data, a subsidiary of the state-owned People’s Daily, has built a semantic corpus of nearly 300 million data entries, including news and Q&A content, for large models and generative artificial intelligence.

The company disclosed the information on Tuesday in an article titled "People's Data to Build the World's Largest Chinese Corpus to Support Sora's New Usage Scenario," a few days after the global debut of OpenAI’s text-to-video generation tool Sora.

Following the news, the stock price of the People's Daily rose by the 10% daily limit to 25.64 yuan (US$3.57) per share.

"What we talked about in the article is our business focus in 2024," an insider told TMTPost.

According to its official website, People's Data Management Co., Ltd. is a platform of the People's Daily and the People's Daily Online. The company is dedicated to building a comprehensive big data operation ecosystem so that big data can serve economic and social development across industries more conveniently and efficiently.

As a "national team" in the field of big data in the new era, People's Data Management undertakes national-level big data projects such as the National Big Data Disaster Recovery Center, the National Emergency Data Center, and the Smart Party Building Data Center. It aims to create a secure, efficient, open, and shared national-level big data platform and is committed to handling the "storage, management, and use" of big data for party and government departments at various levels, central state-owned enterprises, private enterprises, and others.

Last Thursday, U.S.-based OpenAI announced the launch of Sora, a new generative artificial intelligence model. Sora can generate videos of up to 60 seconds directly from text instructions, with highly detailed backgrounds, complex multi-angle shots, and emotionally expressive characters. The launch attracted global attention.

OpenAI said that Sora serves as a foundation for models that can understand and simulate the real world, a capability it considers a significant milestone toward achieving artificial general intelligence (AGI).

People's Data noted in the article that the semantic corpus is designed for applications in large AI models, artificial general intelligence, the intelligent internet, and other scenarios. It addresses more than 10,000 key questions that current large models often struggle to answer adequately. Enriching and supplementing the corpus could make data retrieval more convenient, lowering the barrier for ordinary people to use AI and helping them access more comprehensive information more simply.

However, People's Data did not disclose the Chinese name of the corpus or further specifics. Sora is not yet available to the public, and many lectures related to Sora or AI are on sale in China.

People's Data also emphasized the importance of "compliance" in AI technology and application innovation. It highlighted the need for future work to strengthen the security, standardization, and sustainable development of large AI models.

It is crucial to fully exploit the value of various data resources and to use corpora reflecting mainstream values as a starting point for promoting the safe development of China's AI industry, the company said.