Open Data and the AI Black Box


We're taking part in Copyright Week, a series of actions and discussions supporting key principles that should guide copyright policy. Every day this week, various groups are taking on different elements of copyright law and policy, and addressing what's at stake and what we need to do to make sure that copyright promotes creativity and innovation.

Artificial Intelligence (AI) grabs headlines with new tools like ChatGPT and DALL-E 2, but it is already here and having major impacts on our lives. Increasingly we see law enforcement, medical care, schools and workplaces all turning to the black box of AI to make life-altering decisions—a trend we should challenge at every turn.

The vast and often secretive data sets behind this technology, used to train AI with machine learning, come with baggage. Data collected through surveillance and exploitation reflects systemic biases, and those biases are “learned” in the process. In their worst form, the buzzwords of AI and machine learning are used to "tech wash" this bias, allowing the powerful to buttress oppressive practices behind the supposed objectivity of code.

It's time to break open these black boxes. Embracing collaboratively maintained Open Data sets in the development of AI would not only be a boon to transparency and accountability for these tools, but would also make it possible for the would-be subjects to create their own innovative and empowering work and research. We need to reclaim this data and harness the power of a democratic and open science to build better tools and a better world.

Garbage in, Gospel out

Machine Learning is a powerful tool, and there are many impressive use cases, like searching for signs of life on Mars or building synthetic antibodies. But at their core these algorithms are only as "intelligent" as the data they're fed. You know the saying: "garbage in, garbage out." Machine Learning ultimately relies on training data to learn how to make good guesses, and the logic behind those guesses is typically unknown even to the developers. But even the best guesses shouldn’t be taken as gospel.

Things turn dire when this veiled logic is used to make life-altering decisions. Consider the impact of predictive policing tools, which are built on a foundation of notoriously inaccurate and biased crime data. This AI-enabled search for "future crimes" is a perfect example of how this new tool launders biased police data into biased policing—with algorithms putting an emphasis on already over-policed neighborhoods. This self-fulfilling prophecy even gets rolled out to predict criminality by the shape of your face. Then when determining cash bail, another algorithm can set the price using data riddled with the same racist and classist biases.

Fortunately, transparency laws let researchers identify and bring attention to these issues. Crime data, warts and all, is often made available to the public. This same transparency is not expected from private actors like your employer, your landlord, or your school.

The answer isn’t simply to make all this data public. Some AI is trained on legitimately sensitive information, even if that information is publicly available. These data sets are toxic assets, sourced through a mix of surveillance and compelled data disclosures. Preparation of this data is itself dubious, often relying on armies of highly exploited workers with no avenues to flag issues with the data or its processing. And despite many "secret sauce" claims, anonymizing these large datasets is very difficult and maybe even impossible, and a breach would disproportionately harm the people tracked and exploited to produce them.

Instead, embracing collaboratively maintained open data sets would empower data scientists, who are already experts in transparency and privacy issues pertaining to data, to maintain them more ethically. By pooling resources in this way, consensual and transparent data collection would not only help address these biases but also unlock the creative potential of open science for the future of AI.

An Open and Empowering Future of AI

As we see elsewhere in Open Access, this removal of barriers and paywalls helps less-resourced people access and build expertise. The result could be an ecosystem where AI doesn’t just serve the haves over the have-nots, but in which everyone can benefit from the development of these tools.

Open Source software has long proven the power of pooling resources and collective experimentation. The same holds true of Open Data: making data openly accessible can identify deficits and let people build on one another's work more democratically. Purposefully biasing data (or "data poisoning") is possible, but this unethical behavior already happens in less transparent systems, where it is harder to catch. While a move towards using Open Data in AI development would help mitigate bias and phony claims, it’s not a panacea; even harmful and secretive tools can be built with good data.

But an open system for AI development, from data, to code, to publication, can bring many humanitarian benefits, such as AI’s use in lifesaving medical research. The ability to remix and quickly collaborate on medical research can supercharge the research process and uncover missed discoveries in the data. The result? Tools for lifesaving medical diagnosis and treatments for all peoples, mitigating the racial, gender, and other biases in medical research.

Open Data makes data work for the people. While the expertise and resources needed for machine learning remain a barrier for many, crowd-sourced projects like Open Oversight already empower communities by making information about law enforcement visible and transparent. Being able to collect, use, and remix data to make their own tools brings AI research from the ivory towers to the streets and breaks down oppressive power imbalances.

Open Data is not just about making data accessible. It's about embracing the perspectives and creativity of all people to set the groundwork for a more equitable and just society. It's about tearing down exploitative data harvesting and making sure everyone benefits from the future of AI.
