Elizabeth Dubois: So, I am Elizabeth. I’m going to talk to you today about positionality-aware machine learning. We are going to start off with a question. Tomato: fruit or vegetable, what do you guys think? [various answers shouted from audience] Fruit, vegetable, right.

Okay. It’s a matter of perspective. It’s also a matter of the context in which you want to use that answer. If you’re a botanist, you say fruit. If you’re a nutritionist, you say vegetable. Lawyers and judges in the US have agreed: vegetable. Computer vision researchers? They say it’s miscellaneous.

This idea of classification, it’s the process of assigning a name or a category to a particular idea, concept, or thing. It’s a process that we go through in our daily lives continually. The idea of classification is also the idea of creating taxonomies for understanding the world around us and usefully reducing the large amounts of nuance and detail that there are in the world. We see it when we’re trying to understand how diseases spread around the world. We use it when we’re trying to understand what online harassment looks like. We use it to understand differences in race or gender.

Gender’s a particularly interesting one, particularly in the kind of Western societal context where at one point we saw gender as pretty much agreed upon as a binary variable. There were two options. But now that’s no longer the case. So if we’re thinking about these classification processes and trying to embed them into the machines we’re building, we need to be thinking about it critically, in the context that we’re currently in, and about what might change moving forward.

We do not always design models that affect people’s lives or have potential harms. […] Like, our auto-toner model for image re-colorization, that does not inflict harm.

Or, maybe it does?

It is kind of a weird algorithm that may lead to whitewashing … actually you know, … I take it back.
Extract from user interviews, Senior Data Scientist, Major US News Publication [slide]

This is a quote that we had from one of the many user interviews we conducted with different ML and AI engineers. This woman is at a major US news organization, and she talked about the idea of classification and when it might present problems in terms of harms it could cause. She said there are times when it just doesn’t matter, when it’s not an issue to do with harm. She started with the idea of “okay, our auto-toner model for image re-colorization, well that’s not going to cause anyone harm.” Paused. Thought about it. Actually maybe it does. It’s kind of a weird algorithm that may lead to whitewashing.

And so this is something we saw time and time again as we were asking these practitioners to think about their classification choices and when they might be problematic: once we started digging in, they realized there was this potential for problematic decisionmaking in what on the surface didn’t look like an issue in the first place.

And so this is where we come to our idea of positionality. Positionality is the specific position or perspective that an individual takes given their past experiences and their knowledge; their worldview is shaped by positionality. It’s a unique but partial view of the world. And when we’re designing machines, we’re embedding positionality into those machines with all of the choices we’re making about what counts and what doesn’t count.

So this is a very very simplified data pipeline, okay. This is when we go from data into the ML model that we are trying to train. I’m going to use the context of online harassment. Let’s imagine we have a whole bunch of tweets and we want to find whether or not those tweets exemplify harassment.

Well, we would grab our data. We would apply new labels to that data: harassment/not harassment. And then we’d train a model on it so it could predict.

This requires a really complex classification system, right? We have a system to decide what counts as harassment and what doesn’t. We have to train a whole bunch of annotators to literally go through the data piece by piece and assign those labels. Then they apply that system, and that’s what we get to feed into the model.
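To make the shape of that pipeline concrete, here is a minimal sketch in Python. The tweet IDs, labels, and the `majority_label` helper are all made up for illustration; they are not from the project described in the talk. The idea is just: multiple trained annotators label each item under the classification system, disagreements get resolved, and the resulting labels become the training data.

```python
from collections import Counter

# Hypothetical annotations: each tweet labeled by three trained annotators
# using a two-category classification system (harassment / not_harassment).
annotations = {
    "tweet_1": ["harassment", "harassment", "not_harassment"],
    "tweet_2": ["not_harassment", "not_harassment", "not_harassment"],
    "tweet_3": ["harassment", "harassment", "harassment"],
}

def majority_label(labels):
    """Resolve annotator disagreement by simple majority vote."""
    label, _count = Counter(labels).most_common(1)[0]
    return label

# The labeled dataset we would then feed into model training.
training_data = {tweet: majority_label(labels)
                 for tweet, labels in annotations.items()}

print(training_data["tweet_1"])  # harassment (2 of 3 annotators agreed)
```

Every step here embeds choices: the categories themselves, who the annotators are, and how disagreement is resolved (majority vote is only one of several options).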

So thinking about online harassment, in a project I worked on we started with three categories. That was our classification system, our taxonomy. We had positive, neutral, and negative. Every tweet was going to fit into one of these categories.

Our annotators could not agree. Three categories did not work, because there were a bunch of boundary cases between neutral and negative. It caused tons of problems; we could not get good inter-coder reliability, or inter-rater reliability, which is a common tool used to assess agreement.

We added a fourth category called “critical,” and all of a sudden our annotators agreed the majority of the time. We had to redesign our classification system in order to respond to the actual data the way that it was being presented and the way humans interact with and understand it.
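One common inter-rater reliability measure is Cohen’s kappa, which corrects raw agreement for the agreement you would expect by chance. Here is a small pure-Python sketch; the two annotators’ label sequences are invented for illustration and are not the project’s data.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators: observed agreement corrected for chance."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement: probability both annotators independently pick the
    # same category, given their own label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Invented labels under the four-category taxonomy.
labels_a = ["positive", "neutral", "negative", "critical", "critical", "neutral"]
labels_b = ["positive", "neutral", "negative", "critical", "neutral", "neutral"]

print(round(cohens_kappa(labels_a, labels_b), 2))  # 0.77
```

A kappa near 1 means near-perfect agreement beyond chance; values much lower signal that the taxonomy (or the annotator training) needs rework, which is exactly what adding the “critical” category addressed.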

So what we’re saying is, to interrogate these classification systems we need to be thinking about what counts, in what context. We need to be thinking about who those annotators are, why we’ve selected them, how they’ve been trained, at what moment in time. And we need to think about the actual application of the classification systems, and question whether or not there is sufficient agreement and whether or not our approach has been reliable.

That was an example of a home-grown classification system for a very specific project, but this idea of positionality is embedded even in the very old, institutionalized classification systems that are used around the world. So, the International Classification of Diseases is a tool that’s used internationally to identify and classify health problems. And it actually underpins a lot of the US healthcare billing system.

This is an example of the different codes you can use in the ICD for being harmed by birds. So there is a code for having been harmed by a chicken, or a goose, or a parrot. There is no code for ostrich, though. Okay. Think about how big an ostrich is. And then think about maybe living in Australia. If you ask an Australian which is going to be the riskier, more harmful health situation, being kicked by an ostrich or being bitten by a goose? Probably they’re gonna think the ostrich is the more important thing to count.

But the ICD, it wasn’t developed with that in mind. The ICD was developed with its origins in the 1850s. It’s now maintained by the WHO. And it was designed primarily by white men in Western Europe and North America. Their positionality, it’s embedded in the ICD today and it will continue to be unless we routinely question what that positionality looks like.

Ultimately, choices here are inevitable. And this idea of removing bias, it just doesn’t jibe when we understand that these choices are going to happen regardless. A lot of the conversation about debiasing algorithms is about adding rows: if you just add enough data, you’ll be able to get a representative view of the world. But if you limit yourself to only the columns for parrot, chicken, and goose and you don’t have a column for ostrich, you will never capture how many ostrich kicks there were.
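The rows-versus-columns point can be shown in a few lines. The incident reports and the `SCHEMA` set below are invented for illustration: if the recording schema has no category for ostrich, more data volume never surfaces ostrich kicks.

```python
from collections import Counter

# Hypothetical recording schema: the only bird-harm codes available.
SCHEMA = {"parrot", "chicken", "goose"}

# Hypothetical incident reports as they actually happened.
reports = ["goose", "ostrich", "chicken", "ostrich", "ostrich", "parrot"]

# Anything outside the schema collapses into an unspecified bucket.
coded = Counter(r if r in SCHEMA else "unspecified" for r in reports)

print(coded)  # ostrich kicks vanish into "unspecified", however many rows we add
```

Adding more rows only grows the “unspecified” bucket; recovering the ostrich count requires changing the classification system itself, which is the positionality-aware move.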

So, if the debiasing debate isn’t helpful, what do we do instead? We argue that you could look towards being positionality-aware. And we suggest that there are three basic steps that machine learning engineers and others involved in the process can take.

The first is to uncover positionality in your own workflows. Look not only at the classification systems but also the data and the models that you’re making use of, and think about where positionality enters. Keep track of it.

Next is to try to ensure there’s context alignment. That’s an alignment between the classification system and the context in which it was developed, and the actual application scenario for the machine learning tool that you are creating.

And here let’s return to that online harassment example. We developed that for Twitter. Maybe we want to use it on Reddit now. If you’re thinking about just taking the model that was created for Twitter and applying it to Reddit, there are very few options for embedding a positionality-aware approach. If you’re thinking, well, maybe if I just feed in a bunch of new data I can solve the problem (you trained it on Twitter data, now you’re going to train it on Reddit data), that’ll get you closer. But what you actually need to do is question that classification system. You need to go back and look at how you’re actually assessing what counts as harassment and what doesn’t.

Because the way people communicate on Twitter is different from Reddit. On Twitter you have a short character count. You might use hashtags, at-replies. On Reddit you are probably talking in very specific subreddits. You’re probably engaging in particular language because you know there’s a moderator watching what you’re doing and keeping track to make sure that you’re within the bounds of what that community has deemed to be acceptable. You have way more space to do it, right. And so the ways that we classify content for Twitter and Reddit, they’re probably going to be different. Certainly the ways we train our annotators have to be different, because those approaches do not work when the content and the context are completely changed.

The last step here is to remember that you need to be continually trying to ensure that that alignment exists. The models might change, the data might change, the classification systems themselves might change. The ICD, it’s changed by the WHO relatively routinely. And so if you’re making use of it, you need to update your approaches.

It’s also important to recognize that the context in which you’re building something might change whether you like it or not. And so having a lack of control there kind of requires you to be aware of what’s shifting, in order to build a reasonable and responsible tool.

So, with all of this in mind, what we did was run a workshop with ML engineers, and we’ve got a number of other workshops already submitted. So we’ve submitted to EPIC and FAT*. We’ve created a white paper that’s available on our web site and plan to write a more detailed position paper that we can make available widely.

I will let you go explore the web site on your own, but before I do that I just want to leave you with this: Right now, ML and AI systems are kind of like a one-size-fits-all t-shirt. They fit very few people, and a lot of us end up kind of unhappy. But we can do better. We can harness this opportunity to be aware of the very specific contexts in which these tools are deployed, think about how they can be tracked over time, and find ways to serve the specific needs of the users and the developers, aware of the particular perspective from which we are designing and the perspective which is embedded in all of the tools we’re creating. Thanks.
