Breaking the Black Box

How Machines Learn to Be Racist

This is the fourth installment in a series that aims to explain and peer inside the black-box algorithms that increasingly dominate our lives.

Early computers were mostly just big calculators, helping us process large numbers. Now, however, computers are so powerful that they are learning how to make decisions on their own in the rapidly growing field of artificial intelligence.

But AI-enabled machines are only as smart as the knowledge they have been fed. Microsoft learned that lesson the hard way earlier this year when it released an AI Twitter bot called Tay that had been trained to talk like a Millennial teen. Within 24 hours, however, a horde of Twitter users had retrained Tay to be a racist Holocaust-denier, and Microsoft was forced to kill the bot.

This was not the first episode of an AI system learning the wrong lessons from its data inputs. Last year, Google’s automatic image recognition engine tagged a photo of two black people as “gorillas,” presumably because the machine had been trained on a database that didn’t include enough photos of either animals or people. The company apologized and said it would fix the problem.

To illustrate how sensitive AI systems are to their information diet, we built an AI engine that deduced synonyms from news articles published by different types of news organizations. We used word2vec, an algorithm created by Google. It is one of the neural nets that Google uses in its search engine, in its image recognition tool and to generate automatic email responses.

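We trained the synonym picker by having it “read” hundreds of thousands of articles from six different categories of news outlets: Left (The Nation, Huffington Post), Right (The Daily Caller, Breitbart), Mainstream (The New York Times, The Washington Post), Digital (The Daily Beast, Vox), Tabloids (New York Post, New York Daily News) and ProPublica.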

Then we let the synonym picker guess which words appeared to have similar meanings, based on the knowledge it gained from each news database. The varied results generated by each category were striking.
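For readers who want to see the mechanics, here is a minimal sketch in Python of why the training text matters. It is not the code behind our synonym picker and it is not word2vec: as a much simpler stand-in, it counts which words appear near each other and then compares words by the cosine similarity of those counts. The function names and the two tiny, stopword-free "corpora" are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def closest_words(vectors, query, top_n=3):
    """Return the `top_n` words whose contexts most resemble the query's contexts."""
    if query not in vectors:
        return []
    target = vectors[query]
    scored = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != query]
    # Highest similarity first; ties broken alphabetically.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]


# Two made-up corpora that use the word "bank" in different company.
FINANCE_CORPUS = ["bank approved loan", "lender approved loan", "bank raised rates"]
RIVER_CORPUS = ["boat reached bank", "boat reached shore", "river bank flooded"]

for name, corpus in [("finance", FINANCE_CORPUS), ("river", RIVER_CORPUS)]:
    vectors = cooccurrence_vectors(corpus)
    print(name, closest_words(vectors, "bank", top_n=1))
```

The same query word comes back with a different nearest neighbor depending on which text the sketch has read: "lender" for the finance sentences, "shore" for the river sentences. A real word2vec model learns dense vectors with a neural net rather than raw counts, but the final step of ranking neighbors by similarity is comparable.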

Consider the synonyms generated for “BlackLivesMatter.” For the Left-trained AI, “hashtag” was the closest synonym; for the Right-trained AI, it was “AllLivesMatter.” For the AI trained with Digital news outlets, close synonyms were “Ferguson” and “Bernie.”

Or consider synonyms for “woman.” In the Tabloids, “victim” ranked high, while in ProPublica (admittedly trained on the smallest amount of data), “knifepoint” ranked as a close synonym. For “man,” the words “son,” “lover” and “gentleman” were ranked about as high on the list of synonyms by news outlets on the Left as “stabs,” “suspect” and “burglar” were by outlets on the Right.

And for “abortion,” the Left-trained AI chose “contraception” as a close synonym, while the Right-trained AI chose “parenthood” and “late-term.” The Mainstream-media-trained AI chose “clinics” among its top synonyms.

Try it for yourself here.

See What AI Learns

We’ve created this AI system using Google’s open source technology, and trained it to produce synonyms based on what it learned from different news sources. We trained it on six different datasets, each composed of tens of thousands of articles published by the news outlets described below.

The synonyms are ranked in descending order by how closely the AI system judged each one to match the word entered. Highlighted words are synonyms that are unique to a dataset.
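The ranking and highlighting rules can be sketched in a few lines of Python. This is an illustration under assumptions, not the code that runs the tool: the function names are hypothetical, and the input is only the top five entries from three of the lists on this page, so the words it flags as unique apply to that excerpt and not to the full 20-item lists.

```python
from collections import Counter


def rank(scores):
    """Sort (word, score) pairs by score, highest first; ties keep input order."""
    return sorted(scores, key=lambda pair: -pair[1])


def unique_to_dataset(ranked_by_dataset):
    """For each dataset, list the words that appear in no other dataset's results."""
    # Count how many datasets each word shows up in (a word counts once per dataset).
    appearances = Counter(
        word
        for ranked in ranked_by_dataset.values()
        for word in {w for w, _ in ranked}
    )
    return {
        name: [w for w, _ in ranked if appearances[w] == 1]
        for name, ranked in ranked_by_dataset.items()
    }


# Excerpt only: the first five entries of three lists, scores in percent.
EXCERPT = {
    "Right": rank([("donald", 48), ("trump", 47), ("bachmann", 46),
                   ("holt", 44), ("joenbc", 44)]),
    "Tabloids": rank([("donald", 82), ("clinton", 68), ("hillary", 65),
                      ("nominee", 61), ("romney", 60)]),
    "ProPublica": rank([("donald", 65), ("gingrich", 51), ("hillary", 49),
                        ("cruz", 49), ("clinton", 48)]),
}

for name, words in unique_to_dataset(EXCERPT).items():
    print(name, words)
```

Because a word is highlighted only when no other dataset returns it, adding more datasets or longer lists can remove a highlight. "gingrich," for example, is flagged in this excerpt but also appears further down the full Right list.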

Left

The Nation, Huffington Post

  1. osama 44%
  2. trump 42%
  3. laden 42%
  4. huckabee 42%
  5. romney 42%
  6. jeffress 41%
  7. thrice-married 41%
  8. bin 41%
  9. newt 41%
  10. falwell 40%
  11. gibes 40%
  12. rubio 40%
  13. nomineedonald 40%
  14. mitt 40%
  15. gingrich 39%
  16. mackowiak 39%
  17. r-kan 39%
  18. pawlenty 39%
  19. santorum 39%
  20. lewandowski 39%

Right

The Daily Caller, Breitbart

  1. donald 48%
  2. trump 47%
  3. bachmann 46%
  4. holt 44%
  5. joenbc 44%
  6. cruz 43%
  7. gingrich 42%
  8. donaldtrump 42%
  9. trump 42%
  10. realdonaldtrump 42%
  11. lauer 42%
  12. santorum 42%
  13. coulter 41%
  14. bouie 41%
  15. cain 41%
  16. lester 41%
  17. nominee 41%
  18. scarborough 41%
  19. candidacy 41%
  20. voight 41%

Mainstream

The New York Times, The Washington Post

  1. dnainfo 38%
  2. donald 37%
  3. aired 36%
  4. wfan 36%
  5. fox 36%
  6. cnbc 35%
  7. in 34%
  8. bastard 34%
  9. calling 34%
  10. univision 34%
  11. univisiondeportes 33%
  12. upfront 33%
  13. msnbc 33%
  14. nbcolympics 33%
  15. podcast 33%
  16. cbc 33%
  17. reported 33%
  18. azteca 32%
  19. adding 32%
  20. trump 32%

Digital

The Daily Beast, Vox

  1. mogul 43%
  2. billionaire 41%
  3. trump 40%
  4. goeas 39%
  5. yuge 39%
  6. bloomberg 39%
  7. kelly 39%
  8. long-anticipated 39%
  9. mcmullin 38%
  10. megyn 38%
  11. zhirinovsky 38%
  12. adelson 38%
  13. nunberg 37%
  14. he 37%
  15. non-endorsement 37%
  16. christie 36%
  17. fiorina 36%
  18. carson 36%
  19. lewandowski 36%
  20. huckabee 36%

Tabloids

New York Post, New York Daily News

  1. donald 82%
  2. clinton 68%
  3. hillary 65%
  4. nominee 61%
  5. romney 60%
  6. gop 58%
  7. presidential 57%
  8. campaign 54%
  9. realdonaldtrump 54%
  10. debate 52%
  11. kasich 51%
  12. republican 51%
  13. kellyanne 51%
  14. candidate 50%
  15. mccain 50%
  16. obama 50%
  17. sexist 49%
  18. pence 48%
  19. mitt 48%
  20. kaine 47%

ProPublica

  1. donald 65%
  2. gingrich 51%
  3. hillary 49%
  4. cruz 49%
  5. clinton 48%
  6. huntsman 47%
  7. usfl 46%
  8. pence 44%
  9. pac 43%
  10. cain 43%
  11. mate 42%
  12. mitt 41%
  13. candidate 40%
  14. dayton 40%
  15. lahood 40%
  16. bachmann 40%
  17. obama 40%
  18. palin 39%
  19. blago 39%
  20. presidential 39%

Check out our previous episodes, including our tool that shows you what Facebook knows about you.

Additional design and production by Rob Weychert and David Sleight.