Walt Frick: Alright, thank you everybody. So we are Watch Your Words, and the premise of our project is really that we are surrounded by machines that are reading what we write, and judging us based on whatever they think we’re saying.

The results of these systems can really matter. You could imagine a chatbot that’s doing customer service or potentially even doing a job interview. These use cases are not necessarily new. What’s new is that natural language processing, an older field concerned with computers understanding language, now offers really powerful systems that any developer can pick up and do pretty unbelievable things with. And our premise is essentially: what could go wrong when that happens?

And so our first answer is: actually, quite a lot. You could imagine a non-native speaker looking for medical advice from a healthcare bot, not being understood, and essentially going untreated as a result. You could imagine an employee finding out that they’ve been passed over for a key promotion because an analysis of their Slack and email messages deemed them a poor collaborator.

These decisions have real weight, and unfortunately we have good reason to think that they’re quite biased. So, as part of our project we conducted a literature review, finding evidence both that these systems work poorly for historically marginalized groups, and also that they can pretty quickly learn very problematic stereotypes and potentially exacerbate them, like the idea that some people are better suited for some jobs than others, based purely on their gender.

Beyond that literature review, we also tested these systems ourselves, and for that I’ll turn it over to my colleague Bernease.

Bernease Herman: Hi everyone. So, what I want to say here is that NLP services are brittle. And what I mean by brittle is that if we give them two inputs that we would consider fairly similar or innocuous, they give unexpectedly different results. And this is largely true for algorithmic systems, but in the NLP systems that we studied, misspellings (even just differences in spacing) and changing the pronouns or proper names within a sentence give different results.

We chose natural language processing in particular because we believe that the misunderstanding of text may impact groups that are less studied, different from the gender and race categories that we typically speak about in algorithmic bias. And that’s extremely interesting to us, and important.

So to conduct our analysis we queried the natural language processing services of four large tech companies: IBM Watson, Microsoft, Google, and Amazon. This was done using public endpoints which can be used by anyone, including those with no machine learning expertise, and certainly no bias mitigation expertise. And we passed sentences to these services programmatically, using what’s called an API. We focus on sentiment analysis here: a numerical value expressing whether an opinion expressed in the text is negative, neutral, or positive.
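For illustration, here is a minimal sketch of what querying one of these sentiment endpoints programmatically can look like, using Google’s Python client as an example. This is not our exact harness; the client calls shown are from Google’s public library, and the other providers expose similar request/response APIs.

```python
# A minimal sketch of querying a cloud sentiment API programmatically.
# Assumes the google-cloud-language client library is installed and
# credentials are configured; the other three providers follow a
# similar request/response pattern through their own SDKs.
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()

def sentiment_score(text: str) -> float:
    """Return a document-level sentiment score, roughly -1.0 (negative) to 1.0 (positive)."""
    document = language_v1.Document(
        content=text, type_=language_v1.Document.Type.PLAIN_TEXT
    )
    response = client.analyze_sentiment(request={"document": document})
    return response.document_sentiment.score

print(sentiment_score("That was very disappointing."))
```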

Okay. So our first data set of two is of non-native English speakers. And this data set comes from the Treebank of Learner English. It’s a little over 5,000 sentences written by adult non-native speakers during a certification exam for English. It was collected at the University of Cambridge but annotated with these corrections at MIT. The data set consists of an original sentence, annotations of things like spelling errors, missing words, and out-of-order words, and a corrected sentence. And these annotations were done by graduate students at MIT.
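A single record in that data set can be pictured roughly like this. The field names below are hypothetical, for illustration only, and are not the corpus’s actual file format.

```python
# A hypothetical sketch of one learner-English record: the original sentence,
# its annotated errors, and the corrected form. Field names are illustrative,
# not the Treebank of Learner English's actual schema.
example = {
    "original": "That was very disappointed.",
    "annotations": [
        {"error_type": "word form", "span": "disappointed", "correction": "disappointing"},
    ],
    "corrected": "That was very disappointing.",
}
```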

So the next thing we do is pass these to the APIs, as I mentioned. And what we find is that spelling and grammar mistakes influence performance in a lot of these cases. So for this example, we have two sentences that we would expect to be very similar. The original sentence written by the non-native speaker was “That was very disappointed.” So they got a couple of things wrong: a misspelling, and maybe a slightly different word form. And so it was corrected to “That was very disappointing.” And what you find is that there’s a large difference in the results of some of these APIs.
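Concretely, the comparison we ran for each pair looks something like the sketch below, reusing the sentiment_score() helper sketched earlier (a hypothetical wrapper; any of the four providers could sit behind it). The pair shown is the one from this slide.

```python
# Compare sentiment for an original learner sentence and its correction.
# Assumes the sentiment_score() helper from the earlier sketch.
pairs = [
    ("That was very disappointed.", "That was very disappointing."),
]

for original, corrected in pairs:
    delta = sentiment_score(corrected) - sentiment_score(original)
    print(f"{original!r} -> {corrected!r}: corrected minus original = {delta:+.2f}")
```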

And then what’s very interesting is that those results aren’t even consistent across the different companies and services. Google finds that the corrected sentence is more positive. But IBM, Microsoft, and Amazon find that the original sentence seems to be more positive.

So here we have another example, and this is actually not a spelling error, which for lots of reasons you might expect natural language tools not to handle well. This is simply a grammatical error. So the correction changes the word “satisfying” and replaces it with “satisfactory.” There’s also a small grammatical error. And we actually see something we would hope to see for every single example in our data set, and that is that Microsoft and Amazon find the same sentiment for both sentences.

And unfortunately that’s not the case for the other two APIs. And in addition to that, they are also flipped. So Google finds the first sentence more positive, IBM finds the second more positive, and if you look at the IBM example, the difference is by a large margin.

So our second data set is where we investigate these four proprietary services using the Equity Evaluation Corpus. This is an existing corpus that was built for research on gender and racial bias in sentiment analysis systems. And we extended that work to investigate proprietary APIs like Google and Amazon, which were not explored in the original research.

So, they created a data set using templates like the one above: “<person> made me feel <emotion>.” And they have lists that they’re replacing things like “person” with. So on the left, we see a list that they used for analyzing gender. They might replace it with some gendered subject: my daughter, this boy, she, he, him. And then on the right, they are exploring both gender and race, using traditionally African American names and European names.
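To make that concrete, here is a small sketch of how template sentences like these can be expanded. The word lists are an abbreviated, illustrative subset, not the full Equity Evaluation Corpus.

```python
# Expand an EEC-style template by substituting person and emotion terms.
# The lists below are a small illustrative subset, not the full corpus.
from itertools import product

template = "{person} made me feel {emotion}."
persons = ["my daughter", "this boy", "she", "he", "my uncle", "my mom"]
emotions = ["irritated", "angry", "happy"]

sentences = [template.format(person=p, emotion=e) for p, e in product(persons, emotions)]
for s in sentences[:6]:
    print(s)
```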

So one example from this preliminary analysis shows the sentiment for a number of sentences with this particular template. What I think is really interesting is, if you look at the right of this, “my uncle” has the most positive sentiment when you say “my uncle made me irritated.” “My mom” is next. And with the least positivity is “she”: “she made me irritated.”

So this mostly illustrates just the brittleness and the messiness of these systems: seemingly very similar sentences, whose meaning shouldn’t really change between “my mom” and “my mother,” have different results all the way across.

And with that I will pass it on to Joseph to speak a little bit about the pipeline.

Joseph Williams: Thank you. So…who’s responsible for this brittleness and this set of really odd results, right, inconsistent results across everything? So I investigated, through interviews with twenty companies who have revenue-generating operations in this space, asking them what they are doing, as they build their models, to normalize for bias and those kinds of results.

And initially what we discovered was that this is a very complex ecosystem. There’s a shortage of NLP scientists out there. A severe shortage. So, at the very top, companies like Comcast and Hipmunk and Amtrak want to build these things but they don’t have the right people. So they’re either motivated to build their own API engine, or they’re going to use the existing API engines that are out there.

But even that is hard. And so we end up with a lot of platform vendors. We end up with a lot of third-party consulting companies, a lot of work-for-hire companies that are trying to help these companies develop chatbots and other types of vehicles. By the way, these are economically important, because we have these rankings in terms of net scores, these Net Promoter Scores, that customer VPs are using for actually getting their bonuses and things like that. And so this is a way to get the metrics to derive these NPS outcomes.

So what we have is a very extended ecosystem, not a lot of expertise, and a reliance on the API providers. And so when you ask “Do you care about bias?” they all sort of say, “Well…you know…we don’t really think about it. Our focus is on developing a chatbot or something that actually works. So it’s: does it work?” Functionality is more important than taking a look at bias.

And so then when you interview further and you say, “Well, who should be responsible for bias? Is it you, or whoever?” they all do the same thing. They all point to the API providers, and they say, “Well, it should be Google or Microsoft. We expect that they will debias, and so we don’t really worry about it.” And so what we ended up with is an ecosystem that really isn’t thinking about this at all.

And with that, I’ll pass to Erich. 

Erich Ludwig: Thanks. So I’m gonna just summarize this stuff and then give some recommendations, because obviously coming out of this I think we have some things we would like to say and recommend for folks to do. And one of the questions I, as a product manager, always ask is, “It works, but…for whom does it work? For whom does it not work?”

So, our key findings here, three key findings. First, based on what we’ve seen, and based on the articulation of harm that can happen from these systems, we believe that real harm is happening, or can happen, when these systems are used blindly.

And we believe that because of the second key finding: the APIs and systems that we tested produce these wildly inconsistent and, what we’re calling, brittle responses. So based on that inconsistency and that brittleness, going back to the first piece, we believe that there is harm happening.

And the third thing, as Joseph just mentioned, is that nobody’s thinking about this, and when they are thinking about it, they’re assuming somebody else is taking care of it. That’s not a good way to build a responsible system.

So we have some recommendations. The first set is for the API providers.

Number one, transparency. Could you tell us a little bit about your training data? Maybe you can’t tell us exactly what it is, but can you tell us: is it news? Where was that news collected? Was that news corpus collected over the last five years? Is it Twitter? Where is it coming from? There are widely different sets of people that use and create that training data, and that will impact who’s able to use these systems effectively or not. So tell us a little more about what’s going on.

Number two, give us some expectations of when the system should work and when you expect it to fall over. You have tested this stuff; you know where it is going to work. Please tell us a little bit about that.

And three, please do some audits for specific biases and publish those results, so you can tell us: this works well for these communities, and this works less well for these other communities. Especially when you’re talking about a market with choice, help your customers make an informed choice.

Second, third-party developers: if you’re anywhere in that stack above the API providers and you’re doing engineering and development, here are some recommendations for you.

Please be bias-aware. Understand that these API results can be biased, and take responsibility for mitigating that in the products we build.

Second, especially think about the language of the humans that are using the thing you are building. Are those humans English-as-a-first-language or English-as-a-second-language speakers? Do they use particular dialects or accents that may show up in their written language? Test against that.

So, going to the third one here: incorporate those vulnerable groups into your testing. If you’re building a government services system for a variety of people, understand what groups exist within that population and test against them. And this kind of also incorporates the second one: think about your users, right. Who’s actually going to use this, and how might that challenge the APIs that you’re relying on?

And third, researchers: for folks who are in academic institutions, there are also recommendations in this space.

We would like to see an expansion of the machine learning fairness conversation to think about the full stack. Often, and I would say we did this to some extent, we look at a single layer of it. But really what you see with that stack is that the opportunity for bias to come in exists throughout, and it may not be totally transparent. So we have to look at the whole system, from the training data all the way to the users. And we would like to see more of that happen, potentially with our group; certainly many other people can do this too.

And then we’d like to see the creation of templates for disclosure. So, even if I work at one of these big companies and I want to tell the world, “Hey, this is what our API is good for and not good for,” there’s no standard format for that. I think the Data Nutrition Project did a great job of putting something out there into the world, but there could be more of this: telling and helping companies understand how they can talk about the things they’re building in ways that practitioners who are implementing this stuff can understand.

So with that, I would just like to take my moment at the end of this to give a big thanks to Hilary specifically for guiding us along this path, and to all the MIT Media Lab and Berkman staff who’ve helped this program exist. And if you’d like to come talk to us, we have a poster out there, with a little more data on it. We’d love to talk to you about our project. Thank you very much.