Leaked list reveals websites Meta is scraping to train its AI (LSA, many Black porn sites among them)

east · Aug 6, 2025

...but not the coli, abw, boxden, nt, or the rest of the multiverse :mjlol:

list here: https://www.dropsitenews.com/api/v1/file/b3555944-e204-4f5e-9a64-e44281b19a82.pdf

Meta has scraped data from the most-trafficked domains on the internet, including news organizations, education platforms, niche forums, personal blogs, and even revenge porn sites, to train its artificial intelligence models, according to a leaked list obtained by Drop Site News. By scraping data from roughly 6 million unique websites, including 100,000 of the top-ranked domains, Meta has generated millions of pages of content for its AI-training pipeline.

The sites that Meta scrapes consist of copyrighted content, pirated content, and adult videos, some of whose content is potentially illegally obtained or recorded, as well as news and original content from prominent outlets and content publishers. They include mainstream businesses like Getty Images, Shopify, and Shutterstock, but also extreme pornographic content, including websites advertising explicit sexual content and humiliation porn that exploits teenagers.

While high-profile sites like The New York Times, which has engaged in litigation to prevent its content from being used to train AI models, are absent from the list, the leak shows that Meta often found ways to stop sites from defending themselves from being scraped. The scrapers ignored common web protocols that site owners use to block automated scraping, including “robots.txt,” a text file placed on websites that is meant to prevent the indexing of content. The data were shared with Drop Site by whistleblowers frustrated over Meta’s support for Israel in conducting its genocide in the Gaza Strip. According to the whistleblowers, the data are indicative of Meta’s unethical and potentially illegal business practices more broadly.
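
For anyone wondering how the robots.txt part works, here is a minimal sketch in Python using the standard library's urllib.robotparser. Everything in it is an assumption for illustration: the robots.txt contents, the bot names (ExampleAIBot, SomeOtherBot), and the example.com URLs are made up and do not come from the leaked list or from Meta's actual crawlers.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site owner might publish.
# The bot name and paths are invented for this example.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text directly, without any network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

# The site blocks ExampleAIBot everywhere.
print(may_fetch(parser, "ExampleAIBot", "https://example.com/forum/thread-1"))
# Other bots fall under the "*" rules and may read public pages.
print(may_fetch(parser, "SomeOtherBot", "https://example.com/forum/thread-1"))
# Nobody is supposed to read /private/.
print(may_fetch(parser, "SomeOtherBot", "https://example.com/private/page"))
```

Note that the check is voluntary: a well-behaved crawler asks this question before every request and backs off when the answer is no. Nothing in the file itself stops a scraper that never asks, which is the behavior the article describes.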

LEAKED: A New List Reveals Top Websites Meta Is Scraping of Copyrighted Content to Train Its AI

The tech giant is sidestepping guardrails that websites use to prevent being scraped, data show, in a move whistleblowers say is unethical and potentially illegal.

www.dropsitenews.com

Pazzy · Aug 6, 2025

east said:
...but not the coli, abw, boxden, nt, or the rest of the multiverse


They are here collecting data too. Read the terms of service.

Capitol · Aug 6, 2025

The internet is one big data mine. Collecting information is big business. They have been trying to curate the perfect echo chamber for each individual

east · Aug 6, 2025

Capitol said:
The internet is one big data mine. Collecting information is big business. They have been trying to curate the perfect echo chamber for each individual

that's the facial purpose, there's always an ulterior one tho

In a ceremony in June at Joint Base Myer-Henderson Hall in Arlington, Va., four current and former executives from Meta, OpenAI and Palantir lined up onstage to swear an oath to support and defend the United States. At the event, they were pronounced lieutenant colonels in the Army's new technical innovation unit, Detachment 201, which will advise the Army on new technologies for potential combat.

Last year, Meta changed its policies to allow its A.I. technologies to be used for military purposes. Andrew Bosworth, Meta’s chief technology officer and one of the new lieutenant colonels in Detachment 201, said America’s “national security benefits enormously from American industry bringing these technologies to life.”

Meta declined to comment.

https://www.nytimes.com/2025/08/04/technology/google-meta-openai-military-war.html

LadyJ2 · Aug 6, 2025

“It’s just the internet it’s not real life” with another L.

Everyone is on the internet.

summwunn · Aug 6, 2025

east said:
that's the facial purpose, there's always an ulterior one tho

https://www.nytimes.com/2025/08/04/technology/google-meta-openai-military-war.html

....so now people can get recruited into the Army Reserve as O-5 without any prior enlistments or ROTC ?!?!!! . . . and getting the age-waiver at the same time ?!?!!!

Wargames · Aug 6, 2025

east said:
...but not the coli, abw, boxden, nt, or the rest of the multiverse


Can’t have the black male AI say fukk these crackers in a middle of an answer Based on source material.

:yeshrug:

Though give it time, the AI will eventually figure it out on its own

:ufdup:
