Elizabeth Dubois: So, I am Elizabeth. I’m going to talk to you today about positionality-aware machine learning. We are going to start off with a question. Tomato: fruit or vegetable, what do you guys think? [various answers shouted from audience] Fruit, vegetable, right.

Okay. It’s a matter of perspective. It’s also a matter of the context in which you want to use that answer. If you’re a botanist, you say fruit. If you’re a nutritionist, you say vegetable. Lawyers and judges in the US have agreed: vegetable. Computer vision researchers? They say it’s miscellaneous.

This idea of classification, it’s the process of assigning a name or a category to a particular idea, concept, or thing. It’s a process that we go through in our daily lives continually. The idea of classification is also the idea of creating taxonomies for understanding the world around us and usefully reducing the large amounts of nuance and detail that there are in the world. We see it when we’re trying to understand how diseases spread around the world. We use it when we’re trying to understand what online harassment looks like. We use it to understand differences in race or gender.

Gender’s a particularly interesting one, particularly in the kind of Western societal context where at one point we saw gender as pretty much agreed upon as a binary variable. There were two options. But now that’s no longer the case. So if we’re thinking about these classification processes and trying to embed them into the machines we’re building, we need to be thinking about it critically, in the context that we’re currently in, and about what might change moving forward.

We do not always design models that affect people’s lives or have potential harms. […] Like, our auto-toner model for image re-colorization, that does not inflict harm.

Or, maybe it does?

It is kind of a weird algorithm that may lead to whitewashing … actually you know, … I take it back.
Extract from user interviews, Senior Data Scientist, Major US News Publication [slide]

This is a quote that we had from one of the many user interviews we conducted with different ML and AI engineers. This woman is at a major US news organization, and she talked about the idea of classification and when it might present problems in terms of harms it could cause. She said there are times when it just doesn’t matter, when it’s not an issue to do with harm. She started with the idea of “okay, our auto-toner model for image re-colorization, well that’s not going to cause anyone harm.” Paused. Thought about it. Actually maybe it does. It’s kind of a weird algorithm that may lead to whitewashing.

And so this is something we saw time and time again as we were asking these practitioners to think about their classification choices and when they might be problematic: once we started digging in, they realized there was this potential for problematic decisionmaking in what on the surface didn’t look like an issue in the first place.

And so this is where we come to our idea of positionality. Positionality is the specific position or perspective that an individual takes given their past experiences and their knowledge; their worldview is shaped by positionality. It’s a unique but partial view of the world. And when we’re designing machines, we’re embedding positionality into those machines with all of the choices we’re making about what counts and what doesn’t count.

So this is a very very simplified data pipeline, okay. This is when we go from data into the ML model that we are trying to train. I’m going to use the context of online harassment. Let’s imagine we have a whole bunch of tweets and we want to find whether or not those tweets exemplify harassment.

Well, we would grab our data. We would apply new labels to that data: harassment/not harassment. And then we’d train a model on it so it could predict.

This requires a really complex classification system, right? We have a system to decide what counts as harassment and what doesn’t. We have to train a whole bunch of annotators to literally go through the data piece by piece and assign those labels. Then they apply that system, and that’s what we get to feed into the model.
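To make the shape of that pipeline concrete, here is a minimal sketch in Python. The tweet IDs, labels, and the `majority_label` helper are all made up for illustration; they are not from the project described in the talk. The idea is just: multiple trained annotators label each item under the classification system, disagreements get resolved, and the resulting labels become the training data.

```python
from collections import Counter

# Hypothetical annotations: each tweet labeled by three trained annotators
# using a two-category classification system (harassment / not_harassment).
annotations = {
    "tweet_1": ["harassment", "harassment", "not_harassment"],
    "tweet_2": ["not_harassment", "not_harassment", "not_harassment"],
    "tweet_3": ["harassment", "harassment", "harassment"],
}

def majority_label(labels):
    """Resolve annotator disagreement by simple majority vote."""
    label, _count = Counter(labels).most_common(1)[0]
    return label

# The labeled dataset we would then feed into model training.
training_data = {tweet: majority_label(labels)
                 for tweet, labels in annotations.items()}

print(training_data["tweet_1"])  # harassment (2 of 3 annotators agreed)
```

Every step here embeds choices: the categories themselves, who the annotators are, and how disagreement is resolved (majority vote is only one of several options).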

So thinking about online harassment, in a project I worked on we started with three categories. That was our classification system, our taxonomy. We had positive, neutral, and negative. Every tweet was going to fit into one of these categories.

Our annotators could not agree. Three categories did not work, because there were a bunch of boundary cases between neutral and negative. It caused tons of problems; we could not get good inter-coder reliability, or inter-rater reliability, which is a common tool used to assess agreement.

We added a fourth category called “critical,” and all of a sudden our annotators agreed the majority of the time. We had to redesign our classification system in order to respond to the actual data the way that it was being presented and the way humans interact with and understand it.
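One common inter-rater reliability measure is Cohen’s kappa, which corrects raw agreement for the agreement you would expect by chance. Here is a small pure-Python sketch; the two annotators’ label sequences are invented for illustration and are not the project’s data.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators: observed agreement corrected for chance."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement: probability both annotators independently pick the
    # same category, given their own label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Invented labels under the four-category taxonomy.
labels_a = ["positive", "neutral", "negative", "critical", "critical", "neutral"]
labels_b = ["positive", "neutral", "negative", "critical", "neutral", "neutral"]

print(round(cohens_kappa(labels_a, labels_b), 2))  # 0.77
```

A kappa near 1 means near-perfect agreement beyond chance; values much lower signal that the taxonomy (or the annotator training) needs rework, which is exactly what adding the “critical” category addressed.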

So what we’re saying is, to interrogate these classification systems we need to be thinking about what counts, in what context. We need to be thinking about who those annotators are, why we’ve selected them, how they’ve been trained, at what moment in time. And we need to think about the actual application of the classification systems, and question whether or not there is sufficient agreement and whether or not our approach has been reliable.

That was an example of a home-grown classification system for a very specific project, but this idea of positionality is embedded even in the very old, institutionalized classification systems that are used around the world. So, the International Classification of Diseases is a tool that’s used internationally to identify and classify health problems. And it actually underpins a lot of the US healthcare billing system.

This is an example of the different codes you can use in the ICD for being harmed by birds. So there is a code for having been harmed by a chicken, or a goose, or a parrot. There is no code for ostrich, though. Okay. Think about how big an ostrich is. And then think about maybe living in Australia. If you ask an Australian which is going to be the riskier, more harmful health situation, being kicked by an ostrich or being bitten by a goose? Probably they’re gonna think the ostrich is the more important thing to count.

But the ICD, it wasn’t developed with that in mind. The ICD was developed with its origins in the 1850s. It’s now maintained by the WHO. And it was designed primarily by white men in Western Europe and North America. Their positionality, it’s embedded in the ICD today and it will continue to be unless we routinely question what that positionality looks like.

Ultimately, choices here are inevitable. And this idea of removing bias, it just doesn’t jibe when we understand that these choices are going to happen regardless. A lot of the conversation about debiasing algorithms is about adding rows: if you just add enough data, you’ll be able to get a representative view of the world. But if you limit yourself to only the columns for parrot, chicken, and goose and you don’t have a column for ostrich, you will never capture how many ostrich kicks there were.
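The rows-versus-columns point can be shown in a few lines. The incident reports and the `SCHEMA` set below are invented for illustration: if the recording schema has no category for ostrich, more data volume never surfaces ostrich kicks.

```python
from collections import Counter

# Hypothetical recording schema: the only bird-harm codes available.
SCHEMA = {"parrot", "chicken", "goose"}

# Hypothetical incident reports as they actually happened.
reports = ["goose", "ostrich", "chicken", "ostrich", "ostrich", "parrot"]

# Anything outside the schema collapses into an unspecified bucket.
coded = Counter(r if r in SCHEMA else "unspecified" for r in reports)

print(coded)  # ostrich kicks vanish into "unspecified", however many rows we add
```

Adding more rows only grows the “unspecified” bucket; recovering the ostrich count requires changing the classification system itself, which is the positionality-aware move.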

So, if the debiasing debate isn’t helpful, what do we do instead? We argue that you could look towards being positionality-aware. And we suggest that there are three basic steps that machine learning engineers and others involved in the process can take.

The first is to uncover positionality in your own workflows. Look not only at the classification systems but also the data and the models that you’re making use of, and think about where positionality enters. Keep track of it.

Next is to try to ensure there’s context alignment. That’s an alignment between the classification system and the context in which it was developed, and the actual application scenario for the machine learning tool that you are creating.

And here let’s return to that online harassment example. We developed that for Twitter. Maybe we want to use it on Reddit now. If you’re thinking about just taking the model that was created for Twitter and applying it to Reddit, there are very few options for embedding a positionality-aware approach. If you’re thinking, well, maybe if I just feed in a bunch of new data I can solve the problem (you trained it on Twitter data, now you’re going to train it on Reddit data), that’ll get you closer. But what you actually need to do is question that classification system. You need to go back and look at how you’re actually assessing what counts as harassment and what doesn’t.

Because the way people communicate on Twitter is different from Reddit. On Twitter you have a short character count. You might use hashtags, at-replies. On Reddit you are probably talking in very specific subreddits. You’re probably engaging in particular language because you know there’s a moderator watching what you’re doing and keeping track to make sure that you’re within the bounds of what that community has deemed to be acceptable. You have way more space to do it, right. And so the ways that we classify content for Twitter and Reddit, they’re probably going to be different. Certainly the ways we train our annotators have to be different, because those approaches do not work when the content and the context are completely changed.

The last step here is to remember that you need to be continually trying to ensure that that alignment exists. The models might change, the data might change, the classification systems themselves might change. The ICD, it’s changed by the WHO relatively routinely. And so if you’re making use of it, you need to update your approaches.

It’s also important to recognize that the context in which you’re building something might change whether you like it or not. And so having a lack of control there kind of requires you to be aware of what’s shifting, in order to build a reasonable and responsible tool.

So, with all of this in mind, what we did was run a workshop with ML engineers, and we’ve got a number of other workshops already submitted. So we’ve submitted to EPIC and FAT*. We’ve created a white paper that’s available on our web site and plan to write a more detailed position paper that we can make available widely.

I will let you go explore the web site on your own, but before I do that I just want to leave you with this: Right now, ML and AI systems are kind of like a one-size-fits-all t-shirt. They fit very few people, and a lot of us end up kind of unhappy. But we can do better. We can harness this opportunity to be aware of the very specific contexts in which these tools are deployed, think about how they can be tracked over time, and find ways to serve the specific needs of the users and the developers, aware of the particular perspective from which we are designing and the perspective which is embedded in all of the tools we’re creating. Thanks.
