Are family videos on YouTube training AI models?

The promised artificial intelligence revolution requires data. Lots and lots of data. OpenAI and Google have begun using YouTube videos to train their text-based AI models. But what does the YouTube archive actually include? 

News Arena Network - New Delhi - UPDATED: June 28, 2024, 05:29 AM - 7 mins read

Our team of digital media researchers at the University of Massachusetts Amherst collected and analysed random samples of YouTube videos to learn more about that archive. We published an 85-page paper about that dataset and set up a website called TubeStats for researchers and journalists who need basic information about YouTube.

 

Now, we’re taking a closer look at some of our more surprising findings to better understand how these obscure videos might become part of powerful AI systems. We’ve found that many YouTube videos are meant for personal use or for small groups of people, and a significant proportion were created by children who appear to be under 13.

 

Bulk of the YouTube iceberg 

 

Most people’s experience of YouTube is algorithmically curated: Up to 70% of the videos users watch are recommended by the site’s algorithms. Recommended videos are typically popular content such as influencer stunts, news clips, explainer videos, travel vlogs and video game reviews, while content that is not recommended languishes in obscurity.

 

Some YouTube content emulates popular creators or fits into established genres, but much of it is personal: family celebrations, selfies set to music, homework assignments, video game clips without context and kids dancing. The obscure side of YouTube – the vast majority of the estimated 14.8 billion videos created and uploaded to the platform – is poorly understood.

 

Illuminating this aspect of YouTube – and social media generally – is difficult because big tech companies have become increasingly hostile to researchers.

 

We’ve found that many videos on YouTube were never meant to be shared widely. We documented thousands of short, personal videos that have few views but high engagement – likes and comments – implying a small but highly engaged audience. These were clearly meant for a small audience of friends and family. Such social uses of YouTube contrast with videos that try to maximize their audience, suggesting another way to use YouTube: as a video-centered social network for small groups.
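To make the "few views but high engagement" pattern concrete, here is a minimal Python sketch of how such videos might be picked out of a sample. It is not the researchers' method: the `Video` fields, the `max_views` and `min_rate` thresholds and the sample records are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Video:
    # Hypothetical record; field names are assumptions, not a real API schema.
    video_id: str
    views: int
    likes: int
    comments: int


def engagement_rate(video: Video) -> float:
    """Likes plus comments per view; a zero-view video is treated as one view."""
    return (video.likes + video.comments) / max(video.views, 1)


def small_audience_videos(videos, max_views=100, min_rate=0.1):
    """Keep videos with few views but a high share of likes and comments.

    Both thresholds are illustrative assumptions, not values from the study.
    """
    return [
        v for v in videos
        if v.views <= max_views and engagement_rate(v) >= min_rate
    ]


# Made-up sample data for demonstration only.
sample = [
    Video("family_reunion", views=40, likes=9, comments=6),
    Video("influencer_stunt", views=2_000_000, likes=90_000, comments=4_000),
    Video("untitled_clip", views=12, likes=0, comments=0),
]

selected = small_audience_videos(sample)
print([v.video_id for v in selected])
```

Under these assumed thresholds, only the low-view video with active likes and comments is kept; the viral video fails the view cap and the silent clip fails the engagement floor.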

 

Other videos seem intended for a different kind of small, fixed audience: recorded classes from pandemic-era virtual instruction, school board meetings and work meetings. While not what most people think of as social uses, they likewise imply that their creators have a different expectation about the audience for the videos than creators of the kind of content people see in their recommendations.

 

Fuel for the AI machine 

 

It was with this broader understanding that we read The New York Times exposé on how OpenAI and Google turned to YouTube in a race to find new troves of data to train their large language models. An archive of YouTube transcripts makes an extraordinary dataset for text-based models.

 

There is also speculation, fueled in part by an evasive answer from OpenAI’s chief technology officer Mira Murati, that the videos themselves could be used to train AI text-to-video models such as OpenAI’s Sora.

 

The New York Times story raised concerns about YouTube’s terms of service and, of course, the copyright issues that pervade much of the debate about AI. But there’s another problem: How could anyone know what an archive of more than 14 billion videos, uploaded by people all over the world, actually contains? It’s not entirely clear that Google knows or even could know if it wanted to.

 

Kids as content creators 

 

We were surprised to find an unsettling number of videos featuring kids or apparently created by them. YouTube requires uploaders to be at least 13 years old, but we frequently saw children who appeared to be much younger than that, typically dancing, singing or playing video games.

 

In our preliminary research, our coders determined nearly a fifth of random videos with at least one person’s face visible likely included someone under 13. We didn’t take into account videos that were clearly shot with the consent of a parent or guardian.

 

Our current sample size of 250 is relatively small – we are working on coding a much larger sample – but the findings thus far are consistent with what we’ve seen in the past. We’re not aiming to scold Google. Age validation on the internet is infamously difficult and fraught, and we have no way of determining whether these videos were uploaded with the consent of a parent or guardian. But we want to underscore what is being ingested by these large companies’ AI models.
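To give a sense of why a sample of this size is described as small, the sketch below computes a 95% Wilson score interval for a proportion. The inputs are assumptions for illustration: it treats "nearly a fifth" as exactly 50 out of 250, whereas the actual counts are not reported here and the relevant denominator (videos with a visible face) may well be smaller than 250, which would widen the interval.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Hypothetical counts: 50 of 250 coded videos, standing in for "nearly a fifth".
low, high = wilson_interval(successes=50, n=250)
print(f"Plausible range: {low:.1%} to {high:.1%}")
```

With these assumed counts the interval spans roughly ten percentage points, which is one reason a much larger coded sample would give a firmer estimate.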

 

Small reach, big influence 

 

It’s tempting to assume OpenAI is using highly produced influencer videos or TV newscasts posted to the platform to train its models, but previous research on large language model training data shows that the most popular content is not always the most influential in training AI models. A virtually unwatched conversation between three friends could have much more linguistic value in training a chatbot language model than a music video with millions of views.

 

Unfortunately, OpenAI and other AI companies are quite opaque about their training materials: They don’t specify what goes in and what doesn’t. Most of the time, researchers can infer problems with training data through biases in AI systems’ output. But when we do get a glimpse at training data, there’s often cause for concern. For example, Human Rights Watch released a report on June 10, 2024, that showed that a popular training dataset includes many photos of identifiable kids.

 

The history of big tech self-regulation is filled with moving goalposts. OpenAI in particular is notorious for asking for forgiveness rather than permission and has faced increasing criticism for putting profit over safety.

 

Concerns over the use of user-generated content for training AI models typically center on intellectual property, but there are also privacy issues. YouTube is a vast, unwieldy archive, impossible to fully review.

 

A subset of professionally produced videos could conceivably serve as an AI company’s first training corpus. But without strong policies in place, any company that ingests more than the popular tip of the iceberg is likely including content that violates the Federal Trade Commission’s Children’s Online Privacy Protection Rule, which prevents companies from collecting data from children under 13 without notice.

 

With last year’s executive order on AI and at least one promising proposal on the table for comprehensive privacy legislation, there are signs that legal protections for user data in the US might become more robust.

 

Have you unwittingly helped train ChatGPT? 

 

The intentions of a YouTube uploader simply aren’t as consistent or predictable as those of someone publishing a book, writing an article for a magazine or displaying a painting in a gallery. But even if YouTube’s algorithm ignores your upload and it never gets more than a couple of views, it may be used to train models like ChatGPT and Gemini.

 

As far as AI is concerned, your family reunion video may be just as important as those uploaded by influencer giant MrBeast or CNN.

 

This article first appeared in The Conversation




2024 News Arena India Pvt Ltd | All rights reserved | The Ideaz Factory