Last week, Motherboard discovered that one of Google’s machine learning algorithms was biased against certain racial and religious groups, as well as LGBT people. The Cloud Natural Language API analyzes paragraphs of text and then determines whether they have a positive or negative “sentiment.” The algorithm rated statements like “I’m a homosexual” and “I’m a gay black woman” as negative. After we ran our story, Google apologized.
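For a sense of what that looks like in practice, here is a minimal sketch of how a developer might ask the API for a sentiment score using Google’s Python client library; the exact class and method names shown reflect one version of the library and may differ in others.

```python
# Minimal sketch: asking the Cloud Natural Language API to score a sentence.
# Assumes the google-cloud-language package is installed and credentials are
# configured; names may vary between client-library versions.
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()

document = language_v1.Document(
    content="I'm a gay black woman.",
    type_=language_v1.Document.Type.PLAIN_TEXT,
)

sentiment = client.analyze_sentiment(
    request={"document": document}
).document_sentiment

# The score runs from roughly -1.0 (negative) to 1.0 (positive).
print(f"score: {sentiment.score:.2f}, magnitude: {sentiment.magnitude:.2f}")
```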
The incident is the latest in a series of cases in which artificial intelligence algorithms have been found to be biased. The problem is the way they’re trained: to “teach” an artificial intelligence to identify patterns, it has to be “fed” a massive trove of documents or images, referred to as “training data.” Training data can include photographs, books, articles, social media posts, movie reviews, videos, and other types of content.
Oftentimes, the data given to an AI includes human biases, and so it learns to be biased too. By feeding artificially intelligent systems racist, sexist, or homophobic data, we’re teaching them to hold the same prejudices as humans. As computer scientists love to say: “garbage in, garbage out.”
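As a toy illustration of that dynamic (a hypothetical sketch, not any company’s actual pipeline), consider a simple sentiment classifier trained on reviews where an identity term happens to co-occur with negative labels. The model dutifully learns the association and carries it over to a neutral statement of identity.

```python
# Toy "garbage in, garbage out" demo: the bias comes from the labels in the
# training data, not from the learning algorithm itself. Hypothetical data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Imagine scraped movie reviews in which the word "gay" mostly shows up in
# texts the annotators labeled negative (1 = negative, 0 = positive).
train_texts = [
    "what a wonderful heartfelt film",
    "brilliant acting and a great script",
    "an uplifting and joyful story",
    "the gay subplot ruined it for me",
    "gay characters again, awful movie",
    "terrible pacing and a dull plot",
]
train_labels = [0, 0, 0, 1, 1, 1]

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

# A neutral statement of identity inherits the learned negative association.
prob_negative = model.predict_proba(["I'm gay"])[0][1]
print(f"probability the model calls this 'negative': {prob_negative:.2f}")
```

The point isn’t the specific model; any learner fit to skewed labels will reproduce the skew.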
AI bias is made worse, however, by an unlikely factor: copyright law.
Much of the data used to train algorithms is protected by copyright restrictions in the United States, and courts haven’t yet decided whether training an AI amounts to infringement. Due to potential legal implications, major AI companies keep the data they use to train their products a secret, preventing journalists and academics from uncovering biases, as well as stifling competition.
The most prominent AI corporations, like DeepMind, Facebook, Google, Apple, IBM, and Microsoft, often don’t release the underlying datasets on which their algorithms are trained, according to a forthcoming article about how copyright law affects AI bias, to be published in the Washington Law Review.
The legal minefield presented by US copyright law also forces artificial intelligence researchers to resort to easily accessible but biased datasets to train their algorithms.
“Friction created by copyright law plays a role in AI bias by encouraging AI creators to use easily accessible, legally low-risk works as training data, even when those works are demonstrably biased,” Amanda Levendowski, a clinical teaching fellow at the Technology Law & Policy Clinic at New York University Law School and the author of the forthcoming article, told me in an email.
Levendowski mentioned a notoriously biased source of training data often used by computer scientists: the Enron emails. More than one million emails sent and received by employees of the energy company Enron were released by the Federal Energy Regulatory Commission in 2003, following a government probe into the corporation’s fraudulent financial practices.
The emails became a notorious treasure trove for computer scientists and other academics. They’re easily accessible in machine-readable formats and, most importantly, don’t pose much of a legal risk. Former Enron employees are unlikely to sue for copyright infringement if their correspondence is used for AI research.
The Enron emails might seem like a machine learning professor’s dream, but they come plagued with encoded biases. “It’s worth remembering why the emails were released in the first place. If you think there’s implicit biases embedded in emails sent among employees of a Texas oil-and-gas company that was investigated for fraud, you’d be right,” Levendowski said. “Researchers have used the Enron emails specifically to analyze gender and power bias.”
Emails released by the government aren’t the only source of low-risk data accessible to AI researchers. They can also use works in the Public Domain, like Shakespeare’s plays, which aren’t protected by copyright restrictions. They can be readily found in machine-readable formats, and are totally legal to use.
But Public Domain works pose their own problems. “Many works in the Public Domain were published prior to 1923, when the ‘literary canon’ was wealthier, whiter, and more Western,” Levendowski told me. “A dataset reliant exclusively on these works would reflect the biases of that time—and so would any AI system trained using that dataset.”
AI developers also use works licensed under Creative Commons, a set of licenses created by a nonprofit of the same name founded in 2001. A Creative Commons license allows the author of a work to grant certain rights to others, like the right to freely reproduce the work.
Every article on Wikipedia, for example, is a Creative Commons work. Artificial intelligence researchers have gone to town with Wikipedia: most AI systems now crawl the online encyclopedia to learn facts, according to Levendowski’s article.
Wikipedia isn’t a perfect data source either. As Levendowski notes, only 8.5 percent of Wikipedia editors said they identified as female in 2011. Such a stark gender disparity can cause articles—especially those that characterize women or address issues that particularly concern them—to be written in a biased way. Therefore, any AI trained on Wikipedia will be biased, too.
AI developers shouldn’t have to rely solely on works in the Public Domain or those with Creative Commons licenses to train their algorithms, Levendowski argues. Instead, they should be able to use the fair use doctrine, part of the 1976 Copyright Act, to gain access to copyrighted works—legally.
The fair use doctrine is a legal provision that helps balance the interests of copyright owners with the interests of the public and potential competitors. Fair use is what allows you to quote a portion of a copyrighted work, for example.
“Fair use makes it possible to use copyrighted works without authorization, meaning that AI creators and researchers can compete to create fairer AI systems by using copyrighted works,” Levendowski told me.
If training an AI were classified as fair use in most cases, computer scientists would be free to use any work to teach their algorithms. They could also disclose what they used without fear of legal repercussion. This would allow academics and journalists to more easily check for bias in artificial intelligence. It also would mean AI researchers would have access to a vastly larger set of data, which would help them to build even “smarter” algorithms.
Copyright law has historically exacerbated bias in artificially intelligent algorithms, but it also has the potential to vastly improve them. If companies knew they were protected legally, they would also be more likely to release the data their products were trained on, allowing academics and journalists to do their jobs.
The fair use doctrine is far from a cure-all for biased AI, however. “Copyrighted works are not a panacea for AI bias. Copyrighted works aren’t neutral or bias-free—copyrighted works are created by humans, and humans can be a biased bunch,” Levendowski told me.
“Fair use should be the beginning of a conversation, not the end of it. Fair use can, quite literally, promote the creation of fairer AI systems. But we need to be thinking about how to distinguish what is legally permissible from what is ethically acceptable.”
Got a tip? You can contact this reporter securely on Signal at +1 201-316-6981, or by email at louise.matsakis@vice.com