Hi. I’m going to give a talk this evening about what I like to call activist meta­da­ta. First off, I’m Harlo. I cur­rent­ly work for a group called The Guardian Project. We build pri­mar­i­ly mobile (but not entire­ly) soft­ware that is built for cir­cum­ven­tion tech­nol­o­gy, and our clientele—well I mean every­one, we want every­one to use the apps—but we also do a lot of hands-on sup­port for jour­nal­ists, legal clin­ics, human rights activists, and oth­er do-gooders all over the world. I also just fin­ished a fel­low­ship at the New York Times spon­sored by both the Mozilla Foundation and the Knight Foundation, which are both large sup­port­ers of the press. And I worked with the computer-assisted report­ing team on their work­flows of documents. 

My research interests right now, and for the past couple of years, are in metadata. And that’s a really really sexy word right now, so: What’s metadata? I Googled it. I Googled it. Okay. So there are two types of metadata, “structural metadata and design metadata” or whatever. Actually metadata is data about data. Data that describes the data that you’re looking at. By way of illustrating what metadata is I decided to actually go to the image search because this is what metadata is. You will get the answer to your question via being around the question and seeing what kind of data pops up to the fore.

From video @1:42

You’ll notice that one of the most popular associations with the term metadata, because it’s been pushed to the forefront of the Google image search, is “NSA.” And of course living in the post-Snowden world, you guys all understand exactly how this word got associated with the NSA in such a way that it was pushed to the top of Google’s image searches. So that’s a very very interesting way of thinking about what metadata actually is.

So how can it be activist-y? I’ll give you a couple of examples. This one I really really like by a journalist from Reuters called Megan Twohey, where she investigates these illegal exchanges of adopted children in the United States via advanced searches on Yahoo! Groups. And she was able to cull a lot of this data in an automated way, massage it using data-scraping techniques and natural language processing in order to paint this picture about how children in our country are just being tossed around from family to family like you would a car or a used Prada bag or something like that.

From video @3:30

Here’s another example that I love from a group called SITU Studio out in London. This is a ballistic forensic analysis of an event that took place in the Palestinian region having to do with a protest. A protestor was shot by an Israeli soldier, and the culpability of that soldier was eventually proven via forensic analysis of the video of that protest event that was taken haphazardly from three different angles. And then SITU Studio, which is actually an architectural firm, took those three videos and were able to prove culpability based off of the image metadata captured inside the videos, along with matching up audio, matching up camera angles, and things like that.

These are real­ly real­ly excel­lent, excel­lent exam­ples of activist meta­da­ta. However these exam­ples are kind of few and far between, for a cou­ple of rea­sons, and actu­al­ly that pre­vents them from being activist‑y. I’ll tell you why. The bal­lis­tic foren­sic infor­ma­tion, in order to achieve that par­tic­u­lar domain knowl­edge actu­al­ly takes a lot of study. It takes a lot of invest­ment to actu­al­ly peruse the sci­en­tif­ic jour­nals that aren’t nec­es­sar­i­ly open to you because you don’t have the domain knowl­edge and you’re not part of the uni­ver­si­ty sys­tem, or what­ev­er bar­ri­ers are in your way that pre­vent you from achiev­ing this domain knowl­edge and put[ting] it to data.

From video @5:13

Also there’s a kind of stereotype of the nerd once again. When you Google a nerd, there you go. When you Google that profile (Google of course being the absolute unmitigated truth) that profile doesn’t necessarily match the profile of those who are sitting in human rights organizations, or in press outlets trying to answer these questions.

Another thing that stands in people’s way is money. And there’s a system that works on a lot of these domain-specific questions in an elastic way called Palantir, which is a really really great program, however it’s incredibly expensive. It’s not open-source, despite the fact that they use the terms “open” and “source” so much on their web sites, it’s not. It’s a very closed-source program. The “open source” that they talk about is the data that they pull down from it and work on. That said, what we decided to do was to find a way to answer questions using metadata, creating tools that were elastic enough for people to use under any scenario they can imagine that kind of looked like Palantir but was 100% open-source and for the people.

This started out with a project that I started with The Guardian Project, an organization called WITNESS, and the International Bar Association, called InformaCam. It’s a horrible name and I’m very sorry. But, for instance, you take a photograph or a video and we all are aware of EXIF data but we decided that we were going to add a whole bunch of extra metadata. So in addition to your EXIF data you also have your accelerometer data, so the way that you actually hold the device as you’re taking a photo or framing your perfect shot. We also sample things like light meter values, because that actually corroborates any story you might tell about what time of day it is. Geo is latitude and longitude, that’s kind of old. We decided that we could better corroborate location by adding extra data points such as visible cell tower IDs, visible wifi devices, and stuff like that. I’ll tell you an interesting story about the wifi devices. But we also allowed people to add in human-readable bits of data. So in the protest example, you can say, “That’s the policeman that shoved me.” or whatever. And all of this information was then taken and imported into a program that I’ll talk to you about very shortly, and made automatically indexable and searchable. So you could actually run Google-like queries that say “Show me all of the photos that were taken on the Brooklyn Bridge around this one particular cop in the area.”
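
[Illustration, not from the talk: a minimal Python sketch of the kind of record and query described above. The CaptureRecord fields, the search() helper, the coordinates, and the sample data are all hypothetical; they are not InformaCam's actual schema or API, and the location filter here is a plain great-circle distance check rather than anything the project is known to use.]

```python
from dataclasses import dataclass, field
from datetime import datetime
from math import asin, cos, radians, sin, sqrt


@dataclass
class CaptureRecord:
    """Hypothetical per-photo record; NOT InformaCam's real schema."""
    filename: str
    taken_at: datetime
    latitude: float
    longitude: float
    light_lux: float                 # light meter reading at capture time
    accelerometer: tuple             # one (x, y, z) sample of how the device was held
    cell_tower_ids: list = field(default_factory=list)
    wifi_bssids: list = field(default_factory=list)
    notes: list = field(default_factory=list)  # human-readable annotations


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (haversine formula)."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def search(records, near=None, radius_km=0.5, note_contains=None):
    """Keep records within radius_km of `near` whose notes mention a phrase."""
    hits = []
    for rec in records:
        if near is not None and distance_km(rec.latitude, rec.longitude, *near) > radius_km:
            continue
        if note_contains is not None and not any(
            note_contains.lower() in note.lower() for note in rec.notes
        ):
            continue
        hits.append(rec)
    return hits


BROOKLYN_BRIDGE = (40.7061, -73.9969)  # approximate coordinates, for illustration

# Made-up sample records.
records = [
    CaptureRecord("a.jpg", datetime(2014, 12, 1, 18, 5), 40.7058, -73.9965, 12.0,
                  (0.1, 9.7, 0.3), notes=["That's the policeman that shoved me."]),
    CaptureRecord("b.jpg", datetime(2014, 12, 1, 18, 9), 40.7580, -73.9855, 300.0,
                  (0.0, 9.8, 0.1), notes=["crowd in the square"]),
]

matches = search(records, near=BROOKLYN_BRIDGE, note_contains="policeman")
print([rec.filename for rec in matches])
```

[Each filter is optional, so the same helper can answer "everything near the bridge" or "everything annotated with this phrase" on its own.]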

From video @8:32

In our initial experiments with this particular program, we actually kind of noticed that (and you’ll see on the right side of this screen) we have some wifi networks. I unfortunately don’t have a picture of it here, but we started to notice that a lot of those BSSIDs, which actually correspond to routers that you might see in a room, were showing up as all zeroes. What that means is that you probably have a case of IMSI catchers or “StingRays” or something like that on your hands. So that became an interesting data point, to say, like, “Show me all the protest photos where there was a StingRay.”
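
[Illustration, not from the talk: a minimal Python sketch of flagging photos whose recorded wifi data includes an all-zero BSSID. The function names and the sample data are hypothetical. The code only finds the zeroed identifiers; whether that indicates an IMSI catcher is the speaker's inference, not something the code can establish.]

```python
def is_zeroed_bssid(bssid):
    """True when a BSSID such as '00:00:00:00:00:00' consists only of zeroes."""
    digits = bssid.replace(":", "").replace("-", "")
    return len(digits) == 12 and set(digits) == {"0"}


def photos_with_zeroed_bssids(photos):
    """photos: mapping of filename -> list of BSSIDs visible at capture time."""
    return sorted(
        name for name, bssids in photos.items()
        if any(is_zeroed_bssid(b) for b in bssids)
    )


# Made-up sample data.
sample = {
    "march_01.jpg": ["a4:2b:b0:12:34:56", "00:00:00:00:00:00"],
    "march_02.jpg": ["a4:2b:b0:12:34:56"],
}
print(photos_with_zeroed_bssids(sample))
```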

We built a couple of versions of this, one being the version that we built for the International Bar Association that is a little bit more streamlined. In addition to taking all that metadata, you are able to tag stuff and say “that’s the victim, that’s the perpetrator” or whatever. That was indexable and searchable. We also built a version of it that had a simpler and more graphically-intensive interface for a legal clinic in the Southern United States, for migrant farm workers. So they could actually take selfies to check in at work, and if ever a farmer said that So-and-so was not working seven hours on this farm, he was only working six, we could actually show them the metadata and say that you’re going to have to pay this person for the work that they did. This app also allowed people to take incident reports for on-the-job safety violations, and also kept track of how long their lunch breaks were so we could file more long-term, further-reaching reports.

The soft­ware that we use to pull down and make use of all this data is our open-source ver­sion of Palantir, which I call Unveillance. It’s kind of like a mix­ture of Palantir, which I described before, and Dropbox. So you can just take a group of files, dump them in your fold­er and then in the back­ground, just as Dropbox does, it per­forms cal­cu­la­tions on the doc­u­ments that you put into this fold­er, and it allows you to take that data from those cal­cu­la­tions and mas­sage them into what­ev­er ques­tions you might have to answer. Working with The Guardian Project and the New York Times, I was able to bring this about, and I start­ed to use it to answer some more ques­tions, because what good are you if you’re not try­ing to muck­rake, right?

So I thought about the recent group of documents that came out of Darren Wilson’s grand jury trial, and how there are a lot of really interesting reports circulating about how the officer’s perception of Michael Brown colored the way that he treated him, having to do with his size and his race, and how that gave him the illusion of a more dangerous suspect. So I put the grand jury testimony data through some accelerated test searches using the Unveillance engine having to do with how they talk about this man’s size. And this is a little video here where we’ve done some topic modelling based off of the search terms that we had here.

[Over the next para­graph, Harlo is talk­ing over a block of video run­ning from approx­i­mate­ly 12:00 to 13:40. Inline time­stamps are linked to screen­shots of some spe­cif­ic ref­er­ences, for context.]

[12:29] Down at the bottom we have groups of subjects that come out of natural language processing that inform where in these various parts of the depositions people are talking about his size and how they’re talking about his size. And we noticed that a lot of these things are usually linked to drugs and a paranoia that this is a crazy person on some sort of hallucinogenic drug or whatever. And so we were able to then search deeper within the corpus of documents. [12:50] That pink document, number 5, is Darren Wilson’s testimony himself. The OCR-ing, which is how you use optical character recognition in order to get text out of PDF documents, is a little bit imperfect, so our engine allows you to edit that if you need to and then run those processes again. But this is the unedited document. And then finally we come to, this is once again the grand jury testimony of Darren Wilson, [13:20] able to run this through more natural language processing where we can map certain terms, like “marijuana” and “gun” and stuff like that, and his largeness or whatever on to specific parts of the document, [13:36] in order to draw certain conclusions about stuff like that.
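
[Illustration, not from the talk: a minimal Python sketch of mapping search terms onto parts of a document. It is plain whole-word keyword matching per page, a much simpler stand-in for the topic modelling and natural language processing described above, and not the Unveillance engine's code. The map_terms() helper is hypothetical and the page text is placeholder wording, not quotations from any testimony.]

```python
import re
from collections import defaultdict


def map_terms(pages, terms):
    """Return {term: [page numbers]} for whole-word, case-insensitive matches."""
    locations = defaultdict(list)
    for page_number, text in enumerate(pages, start=1):
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            if re.search(pattern, text, flags=re.IGNORECASE):
                locations[term].append(page_number)
    return dict(locations)


# Placeholder text standing in for OCR output, one string per page.
pages = [
    "The witness described the size of the man.",
    "Counsel asked about the gun and about his size.",
    "No further questions.",
]
print(map_terms(pages, ["size", "gun"]))
```

[Because the OCR text can be corrected and the processes run again, a mapping like this would simply be recomputed from the edited pages.]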

So where next? Something that I find interesting (this is a project that we’re working on this week with the help of some of the other speakers that you’ll hear from tonight and the other evenings): [we’re] working on a project called Foxy Doxxing, which is inspired by this interesting case that came out a couple of weeks ago, maybe two weeks ago, about how a woman who had been attacked on Twitter by, you know, “the trolls” decided to take forensic analysis into her own hands in order to find out who was attacking her online. What I find interesting about this particular scenario is that the woman here, she’s a security researcher who works for the Tor Project, so she’s an incredibly technically savvy developer, as you can imagine. And the tools that she had used, and the techniques that she had used, are not necessarily available to anyone who might seek protection. Unfortunately what I’ve come to learn from working with several newsrooms and speaking to several journalists, particularly women but not always, is that there’s a huge disconnect between…actually there’s no technical capability at any of their newsrooms to protect them from these particular threats. And that’s kind of sad, given that this one particular security researcher was able to fend for herself and find her attackers, yet you go to somebody who works for the Washington Post and she can’t do anything. The Washington Post can’t do anything. So that’s where we’re going next with this particular engine.

And so the strengths and weaknesses are, as I was mentioning, domain-specific knowledge before. Like that forensic ballistic example. I personally didn’t spend decades researching natural language processing. I don’t plan on spending much more doing natural language processing. But because this engine that we’ve created is open-source, it actually runs on gists on GitHub from specific users. So if someone who has more domain-specific knowledge than I do looks at my little snippets of code that run those integral pieces of programming and says, “Well, you know you might want to change that.” they can submit some sort of edits to GitHub and if I accept them and run the documents through the processes over again, we can actually get better at analyzing things, together.

And that’s it. So thank you for lis­ten­ing to my lit­tle show and tell. Thanks.