OpenAI is reportedly asking third-party contractors applying for gig work to submit examples of "real, paid work on production-level projects" they produced in previous or current jobs. The request signals both how hungry the industry has become for high-quality training data and, depending on whom you ask, how slippery ethical standards around confidentiality, copyright, and risk management have become.
What OpenAI Is Requesting From Contractors Now
Now, Wired reports that OpenAI and data provider Handshake AI are telling contractors to describe the work they did as part of their employment and upload actual artifacts — Word documents, PDFs, PowerPoint files, spreadsheets, images, or code repositories — instead of summaries. The idea seems to be that the corpus would be filled with “real-world examples from a specific domain,” reflecting the kind of office work you might do every day, such as project proposals, analysis decks, customer emails, and technical documentation.

Contractors say they are instructed to strip proprietary and personally identifiable information from files before uploading them, with the instructions referencing a ChatGPT-powered "Superstar Scrubbing" tool designed to assist with redaction. OpenAI did not respond to Wired's request for comment.
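To make the redaction step concrete, here is a minimal, hypothetical sketch of the kind of pattern-based scrub a contractor might run before uploading. It is not the reported "Superstar Scrubbing" tool; the regexes and the sample text are illustrative assumptions, and the example deliberately shows how much such a pass can miss.

```python
import re

# Hypothetical pattern-based scrub pass; not the reported "Superstar Scrubbing"
# tool. The regexes are illustrative and will miss plenty (names, client-specific
# figures, context that identifies a project).
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scrub(text: str) -> str:
    """Replace obvious PII patterns with bracketed placeholders."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED {label.upper()}]", text)
    return text

# The person's name sails straight through, exactly the kind of gap critics worry about.
print(scrub("Reach Jane Doe at jane.doe@acme-corp.com or (415) 555-0199."))
# -> Reach Jane Doe at [REDACTED EMAIL] or [REDACTED PHONE].
```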
Why Real-World Data Matters For Training AI Models
Today's AI models are starved for examples that capture the nuance of real workflows. Public web data rarely meets that bar: it is too generic, uneven in quality, or missing the formatting, formulas, and context found in business documents. Internal memos, product specs, budget models, and process docs reveal how knowledge work actually gets done. That is exactly the signal models need to draft a market analysis, build a financial spreadsheet, or turn meeting notes into an actionable plan.
The data squeeze is real. Analysts at Epoch AI, among others, have cautioned that a wall for high-quality text could arrive soon, forcing labs to license existing archives or commission bespoke datasets. Gartner has put the share of business data that is unstructured at more than 80 percent, pointing to PDFs, slides, and emails as the bulk of enterprise knowledge that never shows up in the web text labs train on. Paying contractors to curate realistic examples is one way to narrow that gap without outright raiding a customer's private corpus.
The Legal and Ethical Tripwires in This Approach
This method is not without risk, even with redaction. Any lab that relies on contractors to decide what's safe to upload is "placing themselves at great risk," intellectual-property attorney Evan Brown told Wired. Confidentiality depends on context, and the potential for mishaps is high. Scrubbing tools can overlook subtle information buried in metadata, document revision histories, hidden spreadsheet tabs, or comments. "Just because you black out names and numbers does not remove protections under trade secrets or copyright if the underlying content is still identifiable," he wrote.
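As a rough illustration of that residue problem, here is a minimal sketch that flags hidden sheets, cell comments, and author metadata a surface-level scrub can miss. It assumes an .xlsx artifact and the openpyxl library; the file name is hypothetical, not anything from the reported instructions.

```python
# Rough sketch: flags the kinds of residue a surface-level scrub can miss in an
# .xlsx artifact. Assumes openpyxl is installed; "sanitized_budget.xlsx" is a
# hypothetical file name.
from openpyxl import load_workbook

def audit_xlsx(path: str) -> list[str]:
    """Return human-readable findings about leftover metadata and hidden content."""
    findings = []
    wb = load_workbook(path)

    # Core document properties often carry author names even after cell edits.
    for field in ("creator", "lastModifiedBy"):
        value = getattr(wb.properties, field, None)
        if value:
            findings.append(f"metadata field {field!r} still set: {value}")

    for ws in wb.worksheets:
        # Sheets can be 'hidden' or 'veryHidden' yet still ship inside the file.
        if ws.sheet_state != "visible":
            findings.append(f"non-visible sheet present: {ws.title}")
        # Cell comments are easy to forget during manual redaction.
        for row in ws.iter_rows():
            for cell in row:
                if cell.comment is not None:
                    findings.append(f"comment left on {ws.title}!{cell.coordinate}")
    return findings

if __name__ == "__main__":
    for issue in audit_xlsx("sanitized_budget.xlsx"):
        print("LEAK RISK:", issue)
```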
There's a compliance angle, too. Many employment agreements and NDAs (and basic common sense) say you cannot share work product beyond your employer, no matter how many private fields are stripped. And copyright doesn't go away when you remove PII: a report's structure, phrasing, charts, and novel analysis can all still be protected. Recent lawsuits over AI training, brought by news publishers, authors, and image libraries, are a reminder that permission and provenance matter. Those cases center on scraped and licensed datasets, but the same questions apply to curated uploads.

A Broader Industry Pattern Emerging in AI Training
OpenAI is not the only one seeking better training material. Data vendors and AI labs increasingly employ managed workforces to generate, label, and evaluate complex examples (legal reasoning problems, spreadsheet modeling tasks, software debugging questions) in hopes of lifting performance on the tasks that drive enterprise interest. Companies like Scale AI and Surge AI have popularized expert labeling. Others license archives from publishers or arrange access to private repositories. The common thread: synthetic or scraped text can't substitute for task-grounded data that reflects actual jobs.
But the line between "representative" and "repurposed" is thin. If contractors recreate templates or paraphrase employer work too closely, the models may still be fed protected expression. If, instead, the guidance is to invent documents from scratch, the resulting dataset risks losing the messy details that make real office documents valuable as training signal. Finding that balance, between authenticity and staying on the right side of the law, is where things get tricky.
What Contractors and Businesses Should Do Now
The safe course for contractors is simple: don't upload anything produced under an employment or client agreement unless you own the work and have secured the rights to repurpose it.
If you are asked for samples, consider creating them from scratch: short, made-up documents that demonstrate your abilities without drawing on prior work product, stripped of anything traceable to an employer, client, or project.
For AI companies, a defensible approach would involve an unambiguous prohibition on employer documents, automated and manual checks that artifacts contain no sensitive material, explicit permissions for any licensed content, and an audit trail tracking provenance. Human-in-the-loop review needs to consider more than the visible text; it must also examine metadata and embedded objects. Legal teams will also want policies that hold up if regulators or plaintiffs ask how a dataset was obtained.
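As one way to picture that audit trail, here is a minimal, hypothetical sketch of a provenance record that could be attached to each submitted artifact. The field names, attestation wording, and file path are assumptions for illustration, not anything OpenAI or Handshake AI is reported to use.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class ProvenanceRecord:
    """Hypothetical audit-trail entry for one submitted artifact."""
    contractor_id: str
    artifact_sha256: str   # hash of the file actually ingested
    declared_source: str   # e.g. "created from scratch for this task"
    rights_attested: bool  # contractor affirms they own or licensed the material
    reviewed_by: str       # human reviewer who checked text, metadata, embedded objects
    reviewed_at: str       # ISO 8601 timestamp of the review

def make_record(path: str, contractor_id: str, declared_source: str,
                rights_attested: bool, reviewed_by: str) -> ProvenanceRecord:
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return ProvenanceRecord(
        contractor_id=contractor_id,
        artifact_sha256=digest,
        declared_source=declared_source,
        rights_attested=rights_attested,
        reviewed_by=reviewed_by,
        reviewed_at=datetime.now(timezone.utc).isoformat(),
    )

# Usage sketch: append one JSON line per artifact to an immutable log.
# "sample_proposal.docx" and the IDs below are placeholder values.
record = make_record("sample_proposal.docx", "c-1042",
                     "created from scratch for this task",
                     rights_attested=True, reviewed_by="reviewer-07")
print(json.dumps(asdict(record)))
```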
The Bottom Line on Training Data and Ownership Risks
OpenAI's reported request highlights a fundamental tension in AI development: to learn white-collar tasks, models need real-world, high-signal data, but the best examples are often someone else's protected work. Until licensed pipelines become the industry norm, or better ways emerge to approximate real-world artifacts without borrowing protected expression, expect more friction at the intersection of innovation, privacy, and IP.
