
Data harvesting and privacy — what a non-technical audience needs to know

AI runs on harvested data. The big legal wins so far are piracy cases, not privacy ones; defaults still leak, and every prompt you type is logged unless you turn it off.


TL;DR — Modern AI is built on three pipelines of harvested data: the scraped public web, your chats and posts (on by default), and a $294B data-broker industry that now also feeds AI [22]. The legal news of 2025–26 is mixed: AI companies have paid out billions — but for piracy (Anthropic, $1.5B [10]) and biometrics (Clearview, ~$51.75M in equity [12]), not for privacy. Europe’s flagship privacy fine — Italy’s €15M against OpenAI — was overturned in March 2026 [4]. For this audience, the practical message is workflow-level, not legal: don’t paste secrets into a chatbot, opt out where the toggle exists, and assume any “share” link can end up on Google [20].

The three pipelines that feed AI

Pipeline 1: Public-web scraping
  What gets harvested: anything reachable by a crawler — articles, forums, photos, code, even videos OpenAI/Google allegedly transcribed from YouTube [2]
  Who’s affected: authors, photographers, anyone who ever posted online
  Real example: Clearview AI scraped 60B+ facial images from social media, Venmo, and news sites [13]

Pipeline 2: First-party (your inputs)
  What gets harvested: prompts, uploads, voice notes, “memories” — on by default on free ChatGPT [27]; also LinkedIn profiles and posts since 3 Nov 2025 [18]
  Who’s affected: every chatbot user, every active LinkedIn member
  Real example: the OmniGPT breach leaked 34M user messages plus thousands of API keys [15]

Pipeline 3: Data brokers → AI
  What gets harvested: location, purchases, health proxies, and app behavior, sold for both training and inference
  Who’s affected: anyone with a phone
  Real example: the FBI buys location histories on US citizens from brokers — no warrant needed [29]

The data-broker market alone was ~$294B in 2025, with ~5,000 brokers globally; California’s 2025 disclosures revealed 33 brokers admitting they sell US-resident data to entities in China, Russia, North Korea, or Iran — five of them including precise GPS location [22] [23].

What’s gone wrong recently

Concrete incidents the audience may have heard about — useful as examples in the talk:

Mar 2023: Samsung engineers pasted semiconductor source code, defect-detection algorithms, and meeting notes into ChatGPT — three leaks in 20 days, then a company-wide ban [14]. Takeaway: “helping the AI” can equal “publishing to a third party.”

Jul–Aug 2025: ~4,500 shared ChatGPT chats showed up in Google search results; researchers found ~100K had been scraped. Names, resumes, kids’ names, and “emotionally sensitive disclosures” were exposed [20]. Takeaway: the “Share” button plus one extra checkbox equals a public web page.

Aug 2025: OpenAI killed the discoverable-share feature; older links and cached copies still surface [21]. Takeaway: once indexed, the genie doesn’t fully go back in the bottle.

Sep 2025: Anthropic settled Bartz et al. v. Anthropic for $1.5B over 7M+ pirated books used to train Claude (~$3,000/book × ~500K books) [10]. Takeaway: even AI labs admit some of their training data was unlawfully obtained.

2025: OmniGPT, a chatbot aggregator, was breached — 34M+ user messages plus API keys leaked [15]. Takeaway: “it’s just a chatbot” is wrong; every prompt is a server-side log.

Nov 2025: LinkedIn flipped AI training ON by default for EU/EEA/Swiss/Canadian/HK users; the opt-out is buried two settings deep, and pre-cutover posts are non-revocable [18] [19]. Takeaway: defaults matter — “consent” via inertia.

Dec 2024: Italy’s Garante fined OpenAI €15M for training on personal web data without a legal basis — the first GDPR enforcement action to bite a GenAI provider [3]. Takeaway: privacy regulators have teeth, in theory.

Mar 2026: the Court of Rome annulled that same fine, leaving Europe’s only successful GenAI-privacy enforcement in limbo [4] [5]. Takeaway: EU privacy enforcement against AI is still mostly noise.

Where the law sits in 2026

The settled-ish parts. The EU AI Act is in effect; from 2 Aug 2026 every general-purpose AI provider must publish a “sufficiently detailed summary” of its training data, using a mandatory European Commission template [9]. High-risk systems carry data-governance duties under Article 10 — data-quality and bias checks, plus full provenance records [8]. Non-compliance: fines up to €15M or 3% of global annual revenue [8]. California’s DROP (Delete Request and Opt-out Platform) launched 1 Jan 2026; from 1 Aug 2026 brokers must scan it every 45 days and delete matched records [24].

The unsettled part. Whether scraping public web data containing personal information is lawful under GDPR is still open. EU DPAs have converged on “legitimate interest with safeguards + opt-out” [6]; the proposed Digital Omnibus would codify that, though the Council may strip the provision [7]. The Italy reversal removed the strongest enforcement precedent. Net: Europe says you have rights; courts so far have not made AI labs pay for violating them.

The settled parts that aren’t about privacy. AI labs are paying out heavily — but for copyright (Bartz: the training itself was ruled fair use, but how the data was acquired wasn’t [11]) and biometrics (Clearview’s BIPA settlement: a 23% equity stake, ~$51.75M [12]). Useful framing for the audience: the lawsuits that stick aren’t “you used my data” but “you stole my book” or “you scanned my face.”

The technical wrinkle: models can leak training data

Worth one slide so the audience understands why harvesting is a privacy issue and not just a copyright one:

  • Carlini et al. (2020) showed an outside attacker could coax GPT-2 into reciting hundreds of verbatim training examples — names, phone numbers, emails, code, even 128-bit UUIDs — using only the public API [25].
  • In Nov 2025, “Retracing the Past” demonstrated Confusion-Inducing Attacks that still extract memorized data from production-aligned models by steering them into high-entropy states [26].

→ If a piece of personal data is in the training set, a model can — under adversarial pressure — emit it. This is the empirical backbone of the GDPR claim.
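
To make the mechanism concrete, here is a minimal sketch of a Carlini-style extraction test [25], assuming the Hugging Face transformers library and GPT-2 as a stand-in model; the 100-sample budget and the zlib-versus-perplexity ranking are simplifications of the paper’s setup, not a faithful reproduction:

    # A minimal sketch of a Carlini-style extraction test [25].
    # Assumptions of this sketch (not from the article): Hugging Face
    # `transformers`, GPT-2 as the target, 100 samples instead of ~600K.
    import torch
    import zlib
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    @torch.no_grad()
    def perplexity(text: str) -> float:
        """How 'surprised' the model is by text (low = familiar, maybe memorized)."""
        ids = tok(text, return_tensors="pt").input_ids
        return float(torch.exp(model(ids, labels=ids).loss))

    @torch.no_grad()
    def sample(n_tokens: int = 64) -> str:
        """Unconditional sampling: let the model free-associate from <|endoftext|>."""
        start = torch.tensor([[tok.eos_token_id]])
        out = model.generate(start, do_sample=True, top_k=40,
                             max_new_tokens=n_tokens,
                             pad_token_id=tok.eos_token_id)
        return tok.decode(out[0], skip_special_tokens=True)

    # Rank by zlib-compressed size vs. model perplexity: text that a generic
    # compressor finds high-entropy but the model finds easy to predict
    # (a phone number or UUID it "knows") is a memorization suspect.
    samples = [sample() for _ in range(100)]
    scored = sorted(
        ((len(zlib.compress(s.encode())) / perplexity(s), s)
         for s in samples if len(s) > 20),
        reverse=True,
    )
    for score, text in scored[:5]:
        print(f"{score:6.1f}  {text[:70]!r}")  # top suspects for manual review

Against aligned production models, the same search additionally needs adversarial prompting to get past refusals, which is what the confusion-inducing attacks automate [26].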

What the audience should actually do

The defaults are not on your side. Three concrete moves, in order of payoff:

  1. Don’t paste secrets. ⚠ 73.8% of workplace ChatGPT accounts are personal, not enterprise [17]; 47% of GenAI-using employees use personal accounts at work [16]; shadow-AI-related breaches now cost $4.63M on average — ~$670K more than a baseline breach [28]. Treat any chatbot prompt as if you’d posted it on a public forum (a minimal pre-send filter sketch follows this list).
  2. Opt out where the toggle exists. ChatGPT: Settings → Data Controls → “Improve the model for everyone” → off [27]. LinkedIn: Settings → Data Privacy → off (and accept that pre-3-Nov-2025 content is gone) [19]. Use Temporary Chat for one-shot questions you don’t want logged.
  3. Be careful with “Share” links and browser AI extensions. Sharable chat links are public web pages unless explicitly hidden — and have already leaked once at scale [20]. Browser extensions advertising “AI assistant” features have been caught intercepting and reselling chat content [1].
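
For move 1, a minimal sketch of what enforcing the rule mechanically can look like; the patterns and the redact helper are illustrative assumptions of this sketch, not any vendor’s tooling:

    # A minimal sketch of a client-side "don't paste secrets" filter.
    # The patterns and the `redact` helper are illustrative assumptions,
    # not any vendor's API; real scanners add entropy checks and more formats.
    import re

    PATTERNS = {
        "api_key":     re.compile(r"\b(?:sk|ghp|AKIA)[A-Za-z0-9_\-]{16,}\b"),
        "email":       re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
        "private_key": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    }

    def redact(prompt: str) -> tuple[str, list[str]]:
        """Swap likely secrets for placeholders before a prompt leaves the
        machine; return the cleaned text and the categories that matched."""
        hits = []
        for name, pattern in PATTERNS.items():
            if pattern.search(prompt):
                hits.append(name)
                prompt = pattern.sub(f"[REDACTED:{name}]", prompt)
        return prompt, hits

    clean, hits = redact(
        "Deploy with key AKIAIOSFODNN7EXAMPLE and page ops@example.com if it fails"
    )
    print(hits)   # ['api_key', 'email']
    print(clean)  # same sentence with placeholders instead of live values

The design point for the talk: the check has to run on your side, before the prompt becomes someone else’s server-side log [15].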

One-line takeaway for the talk

The harvest is the product. Privacy law has barely touched it; copyright and biometrics law have. Until that changes, the user-side defense is workflow hygiene and a handful of hidden opt-out toggles.
