Hi. I’m going to give a talk this evening about what I like to call activist meta­da­ta. First off, I’m Harlo. I cur­rent­ly work for a group called The Guardian Project. We build pri­mar­i­ly mobile (but not entire­ly) soft­ware that is built for cir­cum­ven­tion tech­nol­o­gy, and our clientele—well I mean every­one, we want every­one to use the apps—but we also do a lot of hands-on sup­port for jour­nal­ists, legal clin­ics, human rights activists, and oth­er do-gooders all over the world. I also just fin­ished a fel­low­ship at the New York Times spon­sored by both the Mozilla Foundation and the Knight Foundation, which are both large sup­port­ers of the press. And I worked with the computer-assisted report­ing team on their work­flows of doc­u­ments.

My research interests right now, and for the past couple of years, are in metadata. And that’s a really really sexy word right now, so: What’s metadata? I Googled it. I Googled it. Okay. “So there are two types of metadata, structural metadata and design metadata” or whatever. Actually metadata is data about data. Data that describes the data that you’re looking at. By way of illustrating what metadata is I decided to actually go to the image search because this is what metadata is. You will get the answer to your question via being around the question and seeing what kind of data pops up to the fore.

From video @1:42

You’ll notice that one of the most popular associations with the term metadata, because it’s been pushed to the forefront of the Google image search, is “NSA.” And of course living in the post-Snowden world, you guys all understand exactly how this word got associated with the NSA in such a way that it was pushed to the top of Google’s image searches. So that’s a very very interesting way of thinking about what metadata actually is.

So how can it be activist-y? I’ll give you a couple of examples. This one I really really like by a journalist from Reuters called Megan Twohey, where she investigates these illegal exchanges of adopted children in the United States via advanced searches on Yahoo! Groups. And she was able to cull a lot of this data in an automated way, massage it using data-scraping techniques and natural language processing in order to paint this picture about how children in our country are just being tossed around from family to family like you would a car or a used Prada bag or something like that.

From video @3:30

Here’s another example that I love from a group called SITU Studio out in London. This is a ballistic forensic analysis of an event that took place in the Palestinian region having to do with a protest. A protestor was shot by an Israeli soldier, and the culpability of that soldier was eventually proven via forensic analysis of the video of that protest event that was taken haphazardly from three different angles. And then SITU Studio, which is actually an architectural firm, took those three videos and were able to prove culpability based off of the image metadata captured inside the videos along with matching up audio, matching up camera angles, and things like that.

These are real­ly real­ly excel­lent, excel­lent exam­ples of activist meta­da­ta. However these exam­ples are kind of few and far between, for a cou­ple of rea­sons, and actu­al­ly that pre­vents them from being activist-y. I’ll tell you why. The bal­lis­tic foren­sic infor­ma­tion, in order to achieve that par­tic­u­lar domain knowl­edge actu­al­ly takes a lot of study. It takes a lot of invest­ment to actu­al­ly peruse the sci­en­tif­ic jour­nals that aren’t nec­es­sar­i­ly open to you because you don’t have the domain knowl­edge and you’re not part of the uni­ver­si­ty sys­tem, or what­ev­er bar­ri­ers are in your way that pre­vent you from achiev­ing this domain knowl­edge and put[ting] it to data.

From video @5:13

Also there’s a kind of stereotype of the nerd once again. When you Google a nerd, there you go. When you Google that profile (Google of course being the absolute unmitigated truth), that profile doesn’t necessarily match the profile of those who are sitting in human rights organizations, or in press outlets trying to answer these questions.

Another thing that stands in people’s way is money. And there’s a system that works on a lot of these domain-specific questions in an elastic way called Palantir, which is a really really great program, however it’s incredibly expensive. It’s not open-source, despite the fact that they use the terms “open” and “source” so much on their web sites, it’s not. It’s a very closed-source program. The “open source” that they talk about is the data that they pull down from it and work on. That said, what we decided to do was to find a way to answer questions using metadata, creating tools that were elastic enough for people to use under any scenario they can imagine that kind of looked like Palantir but was 100% open-source and for the people.

This started out with a project that I started with The Guardian Project, an organization called WITNESS, and the International Bar Association called InformaCam. It’s a horrible name and I’m very sorry. But, for instance, you take a photograph or a video and we all are aware of EXIF data but we decided that we were going to add a whole bunch of extra metadata. So in addition to your EXIF data you also have your accelerometer data, so the way that you actually hold the device as you’re taking a photo or framing your perfect shot. We also sample things like light meter values, because that actually corroborates any story you might tell about what time of day it is. Geo is latitude and longitude, that’s kind of old. We decided that we could better corroborate location by adding extra data points such as visible cell tower IDs, visible wifi devices, and stuff like that. I’ll tell you an interesting story about the wifi devices. But we also allowed people to add in human-readable bits of data. So in the protest example, you can say, “That’s the policeman that shoved me.” or whatever. And all of this information was then taken and imported into a program that I’ll talk to you about very shortly, and made automatically indexable and searchable. So you could actually run Google-like queries, that say “Show me all of the photos that were taken on the Brooklyn bridge around this one particular cop in the area.”

From video @8:32

In our initial experiments with this particular program, we actually kind of noticed that (and you’ll see on the right side of this screen) we have some wifi networks. I unfortunately don’t have a picture of it here, but we started to notice that a lot of those BSSIDs, which actually correspond to routers that you might see in a room, were showing up as all zeroes. What that means is that you probably have a case of IMSI catchers or “StingRays” or something like that on your hands. So that became an interesting data point, to say like “show me all the protest photos where there was a StingRay.”
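[A query like that can be sketched in a few lines of Python. This is only an illustration, not InformaCam's actual schema or code: the `CaptureRecord` class, its field names, and the `find_zeroed_bssid_captures` helper are all hypothetical, and the sample values are made up. It assumes each photo carries a list of the wifi BSSIDs that were visible when it was taken.]

```python
from dataclasses import dataclass, field
from typing import List

ZERO_BSSID = "00:00:00:00:00:00"


@dataclass
class CaptureRecord:
    """Hypothetical per-photo metadata bundle; field names are illustrative."""
    filename: str
    latitude: float
    longitude: float
    cell_tower_ids: List[str] = field(default_factory=list)
    wifi_bssids: List[str] = field(default_factory=list)
    note: str = ""  # human-readable annotation added by the photographer


def normalize_bssid(bssid: str) -> str:
    """Lower-case a BSSID and use ':' as the separator so comparisons are stable."""
    return bssid.strip().lower().replace("-", ":")


def find_zeroed_bssid_captures(records: List[CaptureRecord]) -> List[CaptureRecord]:
    """Return the captures in which at least one visible BSSID was all zeroes."""
    return [
        record
        for record in records
        if any(normalize_bssid(b) == ZERO_BSSID for b in record.wifi_bssids)
    ]


if __name__ == "__main__":
    # Made-up sample data, only to show the shape of the query.
    sample = [
        CaptureRecord("a.jpg", 0.0, 0.0, wifi_bssids=["00-00-00-00-00-00"]),
        CaptureRecord("b.jpg", 0.0, 0.0, wifi_bssids=["aa:bb:cc:dd:ee:ff"]),
    ]
    for record in find_zeroed_bssid_captures(sample):
        print(record.filename)
```

[A match only flags a photo for a closer look; whether an all-zero BSSID really points to an IMSI catcher is the speaker's inference, not something the code can decide.]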

We built a couple of versions of this, one being the version that we built for the International Bar Association that is a little bit more streamlined. In addition to taking all that metadata, you are able to tag stuff and say “that’s the victim, that’s the perpetrator” or whatever. That was indexable and searchable. We also built a version of it that had a simpler and more graphically-intensive interface for a legal clinic in the Southern United States, for migrant farm workers. So they could actually take selfies to check in at work, and if ever a farmer said that “So-and-so was not working seven hours on this farm, he was only working six,” we could actually show them the metadata and say that you’re going to have to pay this person for the work that they did. This app also allowed people to take incident reports for on-the-job safety violations, and also kept track of how long their lunch breaks were so we could file more long-term, further-reaching reports.

The soft­ware that we use to pull down and make use of all this data is our open-source ver­sion of Palantir, which I call Unveillance. It’s kind of like a mix­ture of Palantir, which I described before, and Dropbox. So you can just take a group of files, dump them in your fold­er and then in the back­ground, just as Dropbox does, it per­forms cal­cu­la­tions on the doc­u­ments that you put into this fold­er, and it allows you to take that data from those cal­cu­la­tions and mas­sage them into what­ev­er ques­tions you might have to answer. Working with The Guardian Project and the New York Times, I was able to bring this about, and I start­ed to use it to answer some more ques­tions, because what good are you if you’re not try­ing to muck­rake, right?
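[The drop-a-file-in-a-folder idea can be sketched roughly like this. It is not Unveillance's code: `process_new_files`, the `seen` set, and the `analyze` callback are hypothetical names, and a real system would run as a background service rather than a one-off scan. The sketch assumes the folder is scanned periodically and that each file should be analyzed once, identified by a hash of its contents.]

```python
import hashlib
from pathlib import Path
from typing import Callable, Dict, Set


def process_new_files(
    folder: str,
    seen: Set[str],
    analyze: Callable[[Path], object],
) -> Dict[str, object]:
    """Run `analyze` once on every file in `folder` that has not been seen yet.

    `seen` holds SHA-256 digests of files already processed, so dropping the
    same document in twice does not repeat the work.
    """
    results: Dict[str, object] = {}
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        results[path.name] = analyze(path)
    return results


def word_count(path: Path) -> int:
    """Stand-in analysis step: count whitespace-separated words in a text file."""
    return len(path.read_text(encoding="utf-8", errors="replace").split())
```

[Calling `process_new_files("inbox", seen, word_count)` on a schedule would pick up only the documents added since the last pass; swapping `word_count` for a different function changes the question being asked without touching the folder logic.]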

So I thought about the recent group of documents that came out of Darren Wilson’s grand jury trial, and how there are a lot of really interesting reports circulating about how the officer’s perception of Michael Brown colored the way that he treated him, and having to do with his size and his race, and how that gave him the illusion of a more dangerous suspect. So I put the grand jury testimony data through some accelerated test searches using the Unveillance engine having to do with how they talk about this man’s size. And this is a little video here where we’ve done some topic modelling based off of the search terms that we had here.

[Over the next para­graph, Harlo is talk­ing over a block of video run­ning from approx­i­mate­ly 12:00 to 13:40. Inline time­stamps are linked to screen­shots of some spe­cif­ic ref­er­ences, for con­text.]

[12:29] Down at the bottom we have groups of subjects that come out of natural language processing that inform where in these various parts of the depositions people are talking about his size and how they’re talking about his size. And we noticed that a lot of these things are usually linked to drugs and a paranoia that this is a crazy person on some sort of hallucinogenic drug or whatever. And so we were able to then search deeper within the corpus of documents. [12:50] That pink document, number 5, is Darren Wilson’s testimony himself. The OCR-ing, which is how you use optical character recognition in order to get text out of PDF documents, is a little bit imperfect, so our engine allows you to edit that if you need to and then run those processes again. But this is the unedited document. And then finally we come to, this is once again the grand jury testimony of Darren Wilson, [13:20] able to run this through more natural language processing where we can map certain terms, like “marijuana” and “gun” and stuff like that, and his largeness or whatever onto specific parts of the document, [13:36] in order to draw certain conclusions about stuff like that.
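[Mapping terms onto parts of a document can be approximated with plain keyword matching, sketched below. This is a much simpler stand-in for the natural language processing and topic modelling described in the talk, not the engine's method. The `map_terms_to_passages` name is hypothetical, and the passages are placeholder text, not testimony. It assumes the OCR output has already been split into passages such as pages or paragraphs.]

```python
import re
from collections import defaultdict
from typing import Dict, List


def map_terms_to_passages(passages: List[str], terms: List[str]) -> Dict[str, List[int]]:
    """Return, for each term, the indexes of the passages that mention it.

    Matching is case-insensitive and on whole words only, so "gun" does not
    match "begun". Terms with no matches are left out of the result.
    """
    patterns = {
        term: re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for term in terms
    }
    hits: Dict[str, List[int]] = defaultdict(list)
    for index, text in enumerate(passages):
        for term, pattern in patterns.items():
            if pattern.search(text):
                hits[term].append(index)
    return dict(hits)


if __name__ == "__main__":
    # Placeholder passages, invented purely to exercise the function.
    passages = [
        "Placeholder passage one mentions a gun.",
        "Placeholder passage two has only just begun.",
        "Placeholder passage three mentions a GUN and marijuana.",
    ]
    print(map_terms_to_passages(passages, ["gun", "marijuana"]))
    # prints {'gun': [0, 2], 'marijuana': [2]}
```

[Because OCR output is imperfect, a term that was misread in the scan will be missed until the text is corrected and the pass is run again.]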

So where next? Something that I find interesting (This is a project that we’re working on this week with the help of some of the other speakers that you’ll hear from tonight and the other evenings), [we’re] working on a project called Foxy Doxxing, which is inspired by this interesting case that came out a couple of weeks ago, maybe two weeks ago, about how a woman who had been attacked on Twitter by, you know, “the trolls” decided to take forensic analysis into her own hands in order to find out who was attacking her online. What I find interesting about this particular scenario is that the woman here, she’s a security researcher who works for the Tor Project, so she’s an incredibly technically savvy developer, as you can imagine. And the tools that she had used, and the techniques that she had used, are not necessarily available to anyone who might seek protection. Unfortunately what I’ve come to learn from working with several newsrooms and speaking to several journalists, particularly women but not always, is that there’s a huge disconnect between…actually there’s no technical capability at any of their newsrooms to protect them from these particular threats. And that’s kind of sad, given that this one particular security researcher was able to fend for herself and find her attackers, yet you go to somebody who works for the Washington Post and she can’t do anything. The Washington Post can’t do anything. So that’s where we’re going next with this particular engine.

And so the strengths and weaknesses are, as I was mentioning before, domain-specific knowledge. Like that forensic ballistic example. I personally didn’t spend decades researching natural language processing. I don’t plan on spending much more doing natural language processing. But because this engine that we’ve created is open-source, it actually runs on gists on GitHub from specific users. So if someone who has more domain-specific knowledge than I do looks at my little snippets of code that run those integral pieces of programming and says, “Well, you know you might want to change that.” they can submit some sort of edits to GitHub and if I accept them and run the documents through the processes over again, we can actually get better at analyzing things, together.

And that’s it. So thank you for lis­ten­ing to my lit­tle show and tell. Thanks.

