We're taking part in Copyright Week, a series of actions and discussions supporting key principles that should guide copyright policy. Every day this week, various groups are taking on different elements of copyright law and policy, and addressing what's at stake and what we need to do to make sure that copyright promotes creativity and innovation.
Artificial Intelligence (AI) grabs headlines with new tools like ChatGPT and DALL-E 2, but it is already here and having major impacts on our lives. Increasingly we see law enforcement, medical care, schools and workplaces all turning to the black box of AI to make life-altering decisions—a trend we should challenge at every turn.
The vast and often secretive data sets behind this technology, used to train AI with machine learning, come with baggage. Data collected through surveillance and exploitation will reflect systemic biases and be “learned” in the process. In their worst form, the buzzwords of AI and machine learning are used to "tech wash" this bias, allowing the powerful to buttress oppressive practices behind the supposed objectivity of code.
It's time to break open these black boxes. Embracing collaboratively maintained Open Data sets in the development of AI would not only be a boon to transparency and accountability for these tools, but makes it possible for the would-be subjects to create their own innovative and empowering work and research. We need to reclaim this data and harness the power of a democratic and open science to build better tools and a better world.
Garbage in, Gospel out
Machine Learning is a powerful tool, and there are many impressive use-cases: like searching for signs of life on Mars or building synthetic antibodies. But at their core these algorithms are only as "intelligent" as the data they're fed. You know the saying: "garbage in, garbage out." Machine Learning ultimately relies on training data to learn how to make good guesses—the logic behind which is typically unknown even to the developers. But even the best guesses shouldn’t be taken as gospel.
Things turn dire when this veiled logic is used to make life-altering decisions. Consider the impact of predictive policing tools, which are built on a foundation of notoriously inaccurate and biased crime data. This AI-enabled search for "future crimes" is a perfect example of how this new tool launders biased police data into biased policing—with algorithms putting an emphasis on already over-policed neighborhoods. This self-fulfilling prophecy even gets rolled out to predict criminality by the shape of your face. Then when determining cash bail, another algorithm can set the price using data riddled with the same racist and classist biases.
Fortunately, transparency laws let researchers identify and bring attention to these issues. Crime data, warts and all, is often made available to the public. This same transparency is not expected from private actors like your employer, your landlord, or your school.
The answer isn’t simply to make all this data public. Some AI is trained on legitimately sensitive information, even if publicly available. They are toxic assets sourced by a mix of surveillance and compelled data disclosures. Preparation of this data is itself dubious, often relying on armies of highly exploited workers with no avenues to flag issues with the data or its processing. And despite many "secret sauce" claims, anonymizing these large datasets is very difficult and maybe even impossible, and the impacts of a breach would disproportionately impact the people tracked and exploited to produce it.
Instead, embracing collaboratively maintained open data sets would empower data scientists, who are already experts in transparency and privacy issues pertaining to data, to maintain them more ethically. By pooling resources in this way, consensual and transparent data collection would help address these biases, but unlock the creative potential of open science for the future of AI.
An Open and Empowering Future of AI
As we see elsewhere in Open Access, this removal of barriers and paywalls helps less-resourced people access and build expertise. The result could be an ecosystem where AI doesn’t just serve the haves over the have-nots, but in which everyone can benefit from the development of these tools.
Open Source software has long proven the power of pooling resources and collective experimentation. The same holds true of Open Data—making data openly accessible can identify deficits and let people build on one another's work more democratically. Purposefully biasing data (or "data poisoning") is possible and this unethical behavior already happens in less transparent systems and is harder to catch. While a move towards using Open Data in AI development would help mitigate bias and phony claims, it’s not a panacea; even harmful and secretive tools can be built with good data.
But an open system for AI development, from data, to code, to publication, can bring many humanitarian benefits, like in AI’s use in life-saving medical research. The ability to remix and quickly collaborate on medical research can supercharge the research process and uncover missed discoveries in the data. The result? Tools for lifesaving medical diagnosis and treatments for all peoples, mitigating the racial, gender, and other biases in medical research.
Open Data makes data work for the people. While the expertise and resources needed for machine learning remain a barrier for many, crowd-sourced projects like Open Oversight already empower communities by making information about law enforcement visibility and transparency. Being able to collect, use, and remix data to make their own tools brings AI research from the ivory towers to the streets and breaks down oppressive power imbalances.
Open Data is not just about making data accessible. It's about embracing the perspectives and creativity of all people to set the groundwork for a more equitable and just society. It's about tearing down exploitative data harvesting and making sure everyone benefits from the future of AI.