Moderator: Okay. Welcome to the colloquium tonight. The colloquium tonight is entitled “The End of the Virtual: Digital Methods” by Richard Rogers, and it’s a very important aspect that Richard will be talking about, namely what happens when material, analog material, moves into the digital space but even more importantly what happens when the material that more and more we deal with is born digital. So how do the methods actually change? How do we need to think about research methods when the material is all digital? How do we research what’s happening on the Internet? How do we research the cultural aspects of the Internet, of people connecting, issues of people combining different aspects? How do we come up with different methods that really make sure that we can research that appropriately? But also, can we just transfer existing methods, scholarly methods, into the digital realm? Or do we need to develop new methods? Is it something that also the new methods might translate back into the more analog realm?
So these are all the questions, and many more, that Richard will address. Let me briefly introduce him. Richard is a professor at the University of Amsterdam, professor of media studies. And he’s the chair of the New Media and Digital Culture program at the University of Amsterdam. He’s also the director of govcom.org, and that’s a group that’s responsible for the Issue Crawler. Some of you might have seen that, a very interesting visualization tool for the Internet, and other political tools. And he’s also one of the founders of the Digital Methods Initiative, and that’s dedicated to reworking methods for Internet research.
He has published quite a few books. One of them is Information Politics on the Web, and that’s MIT Press, 2004. That was also awarded the best book of the year award by the American Society for Information Science & Technology. And he’s working on a new book that’s called Digital Methods. Hence the title of tonight’s talk. And that also is going to appear at MIT Press. So please join me in welcoming Richard Rogers.
Richard Rogers: What I’m going to do today is situate digital methods as an approach, as an outlook, in the history of Internet-related research. I’d like to divide up the history of Internet research largely into three eras, the first being where we thought of the Web as a kind of cyberspace. And these particular periods that I’m going to tell you about, they’re transhistorical—they overlap. But I think there’ve been some changes over the last ten, twenty years in how we do research with the Internet. So this is what I would like to— I’d like to highlight the changes in the dominant ways of thinking.
So, in the early days we had arguably this idea of the Web as cyberspace, where the dominant form of Internet-related research was kind of cybercultural studies. And one of the interesting things about cybercultural studies was looking at and promoting the Internet as being something very very different. In fact as being a kind of other realm, a virtual realm, something that stood apart. Seeing the Web as cyberspace treated the Internet, and the Web, as a virtual realm. And it was also promoted and thought of as being quite transformative—it would transform identity, it would transform corporeality, it would transform ideas of politics, etc.
Now, around 1998 with the Steve Jones volume Doing Internet Research, and 1999, 2000 with a couple of important monographs by virtual ethnographers, in particular by Slater and Miller, they in some sense sought to debunk all of the various claims of the Internet as being transformative. So in marched the ethnographers first and later the social scientists. And they surveyed, and they visited Internet cafes. And what they did was in some sense grounded…it grounded Internet-related research.
And interestingly enough, the move that they made in doing user studies was to go offline. So they interviewed, they observed, and what they found was that all of the various transformative qualities were a little bit different than one had previously thought. So, one’s identity is not just rooted in the online but is in fact also rooted in the offline. All of these things are a bit mixed.
Now, this went on for some time and it’s still going on. The social scientific impact on Internet-related research has been great. But what I would like to argue is something happened sometime around 2007, 2008. And this is the first time when I came up— I saw a number of the developments that went on and I came up with a term called “online groundedness.” And online groundedness is a term that I coined in order to try to think about research that takes data, online data, about the real and does research about society using the Internet. Right, so no longer is the Internet this realm apart, this virtual space, this cyberspace. No longer do we go offline in order to find out about what’s going on online. But rather nowadays, arguably we’ve moved into a period where the online… Or online data sets—so the Web is data. Online data sets serve as a means to study not just online culture but rather culture, and society.
So this is the move that I’m making with digital methods. So let me just get directly to an example so you know what I mean. It was in August 2007 when I read quite an innocent article in a Dutch newspaper. And investigative journalists wrote that they were researching hate, basically. And the Internet of course has always been this…beginning with Cass Sunstein’s observation in Republic.com, the Internet’s always been the site for hate and extremism research.
In any case, Dutch investigative journalists asked the question of whether or not Dutch culture is hardening. And in order to answer that question, they didn’t go native. So they didn’t embed themselves like journalists do studying hooliganism, for example, in writing a book about hooliganism. They didn’t go native. They didn’t visit the social history library and the special pamphlets collection, looking up handbills and things like this. They didn’t interview extremism experts.
They went online. And they in fact went to the Internet Archive and looked at web pages. And looked at the history of about a hundred different web sites. They compared right-wing web sites—right-of-center web sites—with extremist web sites. And they looked, and they saw that over time the language on the right-wing sites was beginning to approximate the language on extremist sites. So right-of-center web sites themselves, in their word choice and in the issue language that they would use and the slogans etc., were becoming more and more extremist. And thereby, on the basis of studying web sites, they concluded that Dutch culture is hardening.
Now for those of us who have spent the last…I dunno ten years hearing about and thinking about the Web as a virtual realm, as Web as cyberspace, as something with an asterisk on it, for those of us who’ve only gone to the Web to study online culture, this was radical. Right, so using the Web to make a finding about what’s going on in society.
Now, interestingly enough, they grounded their claim—and now this is the tricky point, and this is where a lot of people get a little bit…well, start asking questions. They grounded their claim with online data. So they grounded their claim—so the claim that the Dutch culture is hardening, and using the data of web sites—they grounded it in the online. So this is why I came up with this term online groundedness. They used the online as the baseline, as the means of calibration. Which is radical.
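To make the journalists’ move concrete in a small way, here is a minimal sketch, assuming you already have the text of archived pages from the two groups of sites (fetching them from the Internet Archive is left out), of how one might check whether a list of charged terms shows up in both corpora. The term list and the overlap measure are illustrative stand-ins, not the journalists’ actual method.

```python
from collections import Counter
import re

def term_frequencies(text):
    """Lowercase, tokenize, and count word occurrences."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

def overlap_score(freq_a, freq_b, vocabulary):
    """Share of the tracked vocabulary that both corpora use at least once."""
    if not vocabulary:
        return 0.0
    used_a = {w for w in vocabulary if freq_a[w] > 0}
    used_b = {w for w in vocabulary if freq_b[w] > 0}
    return len(used_a & used_b) / len(vocabulary)

# Hypothetical inputs: archived page text per group for one year,
# and a hand-made list of charged terms to track (illustrative only).
tracked_terms = ["invasion", "traitors", "deport"]
right_of_center_2004 = "..."   # text of archived right-of-center pages
extremist_2004 = "..."         # text of archived extremist pages

score = overlap_score(
    term_frequencies(right_of_center_2004),
    term_frequencies(extremist_2004),
    tracked_terms,
)
print(f"Shared charged vocabulary in 2004: {score:.0%}")
```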
So I’m going to give you a few other examples of this, just so you’ve seen them, or just so you can think about them in these terms. Now, you will have heard of Google Flu Trends. Google Flu Trends is very very interesting because Google Flu Trends uses search query log data. Those folks searching for flu and flu-related symptoms online, their locations are found—they’re located, and then the places of flu are thereby plotted. So they’re using online data, data gained through what I would call registrational interactivity, data gained through search engine logs, to find out where flu’s at.
Now the interesting thing about Google Flu Trends is that immediately there was an uproar. So hang on. This method is very very different from the traditional method. The traditional method is we rely on emergency room reports, other traditional data-collecting that then is fed to the Centers for Disease Control, and they come out with officially where flu is at. And where other disease is as well. Google Flu Trends, interestingly—and this is why there was such a great deal of interest around it—anticipates flu by…they’re approximately seven days ahead of the Centers for Disease Control.
However, before they could make claims about how well their data…well, how well they work, they had to check it against the CDC data, right. So they had to ground their claims in the traditional data. So it’s an interesting project because it finds out what’s happening in society and culture through Web data, yet it doesn’t use the Web as the baseline. So it’s not grounding the findings online. I just want to put— I mean, this was a few… This was last year. Google Flu Trends has now expanded to something like fifteen, seventeen countries.
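The grounding step he describes amounts to comparing the query-based estimates against the official counts they are supposed to anticipate. A minimal sketch, with invented placeholder numbers rather than real data:

```python
# Compare weekly flu-query volumes against official CDC counts for the same weeks.
# The numbers below are invented placeholders, not real data.
flu_query_volume = [120, 180, 260, 410, 390, 300]   # e.g. weekly flu-symptom searches
cdc_case_counts  = [100, 150, 240, 380, 400, 310]   # official counts, same weeks

def pearson(x, y):
    """Pearson correlation between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

print(f"Correlation with the CDC baseline: {pearson(flu_query_volume, cdc_case_counts):.2f}")
```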
This project’s different. I don’t know if you saw this…2009, I think. The day before Thanksgiving. A series of graphics were published in The New York Times. And what you see here is where people were querying a particular popular recipe site, allrecipes.com—I don’t know if you use it, allrecipes.com. I think it’s probably the most popular in the US. I use the BB— There’s also Epicurious, is the other one? Yeah. And then there’s the BBC—anyway. This is the biggest one. And what you see here is of course a map of the US, and the darker, more purple the area, the higher the incidence of queries for a recipe. And so this is sweet potato pie. This is the day before Thanksgiving. People looking for macaroni and cheese, which I liked. Sweet potato. Corn casserole, you see the Corn Belt. Green beans. Turkey brine. Yams.
So what you see here is a kind of geography of taste, if you will. A geography of taste, geography of preference. And when I was looking at this I thought to myself well how else would you do this? You know, we can… How else would you sort of chart geography of taste? I mean, we could get supermarket data. We could interview. We could survey. And then I thought to myself, are those types of activities, are they actually fundable? Quite difficult.
However, this is quite— You know, I mean there’s a lot of validity checking to be done, etc. But what you have before you suddenly is a means by which one can do research about preference, distributed preference, using online data. Which you probably couldn’t do, at least as quickly, in any other way.
Now, what I want to do very very briefly is then contrast digital methods—which I will go into in more detail in a minute—with the other paradigm, if you will, that came out of the social sciences, beginning with anthropologists and then later with a very important research program in Britain called the Virtual Society Program, from 1997 to 2002, which I write about in a little booklet called The End of the Virtual. These are the sort of standard ways in which one does kind of Internet-related research, with “virtual methods.” And what I would like to argue is that a lot of these virtual methods are being in some sense ported onto, or transferred onto, the Internet without necessarily the needed sensitivity to digital culture. And increasingly what’s happening with these kinds of methods is that what’s resulting are not necessarily findings or grounded findings but rather indicators. So we live in an age where the output of a lot of Internet-related research using virtual methods is “indicators.”
And so what I want to talk a little bit about is how the methods might change, or perhaps even should change. Or at least how other methods can live alongside virtual methods. Now, one of the things that interests me is fact-checking. Because fact-checking…I mean, not only in the US context where… Was it after—which presidential debate was it when factcheck.org went down the next day or that evening because everyone was checking factcheck.org and then Soros took over the domain for the night and it all became quite messy. And not only because of fact-checking and its traditional association with the blogosphere. But rather as an everyday sort of method, right. Either for investigative journalism more formally, or for a lot of different work that we do.
It’s interesting that traditionally we ask at the end of the interview, if it’s gone well, who else do we interview? And we ask the second person about what we found out from the first person. And this is how we snowball. Now, when we think about the online being mixed into this, we can look up people in advance. So I don’t know if you’ve looked up, you know, me before this talk or whatever. But then the question is, does the order of checking now change? So after all of this, do you now— After the interview, do you look the person up again to check the veracity or the context of what the person said in the interview, right. So where’s the baseline? Where’s the grounding going on?
So, what I would like to talk about is to think about how the methods, or at least how the sort of philosophy or theory of methods, might change if we begin to take the online a lot more seriously. If we begin to take online data, Web data, more seriously. Now what I would like to do is I would like to introduce to you a kind of methodological philosophy which I have called “digital methods.” And what digital methods does is it has a number of principles.
And the first one is, or the major one is, to follow the medium. To follow the medium. And to think that the medium itself has methods built in, has in-built methods. And so to think about what the medium has to offer in terms of methods. And specifically, digital methods has a particular outlook or approach. What it does…like many software projects, what it does is it looks for what are the natively-digital objects that are available. Links, tags, date stamps, edits, reversion—whatever; loads of them. It looks at what kind of natively-digital objects are on offer online.
And then it asks itself the question of, how do the dominant devices handle these objects? What do search engines do with links, for example? How do the dominant devices online handle these objects? And then subsequently the question is, how can you repurpose the methods of the medium for social and cultural research? So it is a question of looking at how do we repurpose a search engine? How do we repurpose Facebook? How do we repurpose Wikipedia? How do we repurpose…you name it. What can we build on top of these things? Or beside them. Or how can we learn from how they handle the natively-digital objects?
And then the tricky part comes. When we make our findings, the question is are they grounded in the online? So we’re constantly in some sense playing epistemological chicken. Do we need to go offline to ground? Or can we ground them in the online? And how confident are we when we ground them in the online?
So what I’m gonna do is I’m gonna take you through digital methods from sort of like the ground up, if you will. From some of the more basic elements of the Web. Natively-digital objects: links, tags, etc. How do you study links, and make findings about them for social and cultural research? So I’ll go from like, the micro to the macro. So from the link… So how does Google…or how do search engines treat the links, and how can you learn from them? And what else can you do with them? The web site…I treat the web site as an archived object, and ask myself the question how does the Internet Archive, how does the Wayback Machine treat web sites? And how can we repurpose how they treat web sites for other purposes of research? Engines, etc.
And what I’m gonna do, I mean I have a couple in parentheses. I won’t have time to treat them all but I’m gonna go through the link, the web site, the engine. I’ll just tell you that I also study spheres: the blogosphere, the websphere, the newssphere, tagosphere, the imagesphere, the videosphere. I see spheres as engine-demarcated spaces.
The webs; the Web these days is no longer in the singular but rather plural, largely because of geolocation technology so that we have the emergence of national webs. You’re in France typing google.com and you get redirected to google.fr. You’re sent home by default. So with geolocation technology we now have the rise of webs.
I’ll talk about social networking sites and introduce you to a research practice called post-demographics. How do you study Wikipedia? How do you repurpose Wikipedia? How do you repurpose Twitter? These are some of the things that the digital methods research program does. Each of these particular levels, if you will, has an associated PhD candidate, and they will be attending the MIT 7 conference in a couple of weeks.
The link. How are links sort of normally studied, and how else can we study them using the insights from digital methods? Well, links traditionally have been studied sort of like, two or three ways. From hypertext theory of course you will know that the links have been thought of as sort of paths that when applied to the Web…sort of off-author paths, where the surfer authors one’s own story through the Web. It’s…a bit old-fashioned. I mean it’s old-fashioned not only because of the fact that surfing is dead. So there’s not habitual visitation of web sites. People no longer surf, they… However they do WWILF. This is a sort of British term, WWILFing, I don’t know if you’ve heard of it. Stands for “what was I looking for?” WWILFing.
And also this speaks to these ideas of the cognitive impact of the Web and of engines. And also because engines increasingly organize our paths, right. So it’s not the surfer with that will, but rather the engine as an ordering device. Nevertheless, links are also traditionally studied through small worlds and path theory, where what’s studied is the optimal route. The optimal path between two— I mean, it’s interesting the…it was Barabási in Linked: The New Science of Networks who wrote that Bill Clinton asked Vernon Jordan to get Monica Lewinsky a job after the incident, because Vernon Jordan was the closest distance of anyone to the Fortune 500 CEOs. Something like…they calculated this, he was 2.2 handshakes away. So this is path. That’s the path.
And then of course there is classic social network analysis, which is about one’s position—not the path, but one’s position. Is one central, is one peripheral, is one highly between, etc. And therefore are you a broker, or are you…
What does the medium do? How does the medium treat links? And what can we learn from them? Well Google, as the dominant medium device, treats links as reputation markers, as relevance markers. So what we did is we decided to capture links. And this is a picture from 1999. This is one of the earlier maps that we made, where we’re looking at how sites link to one another. And on a very micro, a very fine-grained level. You know, you’ve seen these sort of massive link maps, right? And you’re like, what do they say?
Well, I mean if you zoom in, what they tell you about is a kind of micro politics of association, if you will. And it’s very normal, as well. So who links to whom, and who doesn’t. The missing links. So this is a classic one. This is the multinational, in yellow, linking to Greenpeace; Greenpeace doesn’t link back. No way. And then both the multinational corporation and the large NGO link to government—those are all sort of government or international organizations. And government does not link back, no way. And this is all very normal.
This is an output of the Issue Crawler, issuecrawler.net. It’s software that I developed. It recently had its ten-year anniversary. It’s a crawler. So you insert URLs. It crawls them. It grabs all the outlinks of each of the URLs you’ve inserted. And then it does hyperlink analysis and it outputs a variety of visualizations. This one’s the cluster map. And what we’re mapping here is the Armenian NGO space. So we inputted all these Armenian NGOs, and they are in blue and red. And you see the network they organize, where the blue and red ones are quite interlinked, and then they also link to a lot of international organizations. A lot of UN organizations. And a lot of donors and funders. So all of the Armenians link to all the funders and donors, and all the funders and donors don’t link back.
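The crawl-and-count step can be sketched in a few lines. This is not the Issue Crawler itself, which does considerably more (iterated crawling, co-link analysis over pages, visualization); it is a minimal illustration, with hypothetical seed URLs, of the basic idea of keeping the hosts that at least two of your starting points link to.

```python
import urllib.request
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def outlink_hosts(seed_url):
    """Fetch one page and return the set of external hosts it links to."""
    try:
        with urllib.request.urlopen(seed_url, timeout=10) as resp:
            html = resp.read().decode("utf-8", errors="replace")
    except Exception:
        return set()
    parser = LinkExtractor()
    parser.feed(html)
    seed_host = urlparse(seed_url).netloc
    hosts = {urlparse(urljoin(seed_url, link)).netloc for link in parser.links}
    return {h for h in hosts if h and h != seed_host}

seeds = ["https://example.org", "https://example.net"]   # hypothetical starting points
inlink_counts = Counter()
for seed in seeds:
    for host in outlink_hosts(seed):
        inlink_counts[host] += 1

# Keep only hosts that receive links from at least two seeds ("co-linked").
network = {host for host, count in inlink_counts.items() if count >= 2}
print(sorted(network))
```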
This is another map. On the left is the Fatah network, on the right is the Hamas network. We took all Fatah-related URLs, crawled them, and what you see here in Fatah is a sort of civic web of links to newspapers, media sources. Links to also local NGOs as well as international NGOs. Hamas is kind of underground. A very very differ—sort of underground way. It links only to sort of RSS readers. That’s a very very different style of linking, indicating a very very different style of communication. And also one can draw— I mean, if one compares various groups… If you compare Hamas to Hezbollah, they’ll have the same sort of linking behaviors. All to RSS feeds, for subscribers.
[indistinct question from audience member]
Location-free. Hamas-related web sites and Fatah-related web sites.
[indistinct question from other audience member]
Well no— I mean, it’s… Well. So, Hamas has… And also a lot of the organizations of that RSS ilk have…yeah, about ten, fifteen, twenty, twenty-five web sites, and then they’re in a variety of languages and a variety of countries. A variety of top-level domains, country domains. And then when you crawl them, what you find is that they only link to one another, and only link to RSS readers. They don’t link to anything else. Whereas Fatah, all the Fatah-related web sites, so those and those as well. They disclose a very very different kind of network, linking to the newspapers—and a very very different kind of infoculture, if you will, linking to newspapers, to local NGOs, to international NGOs.
What else can you do with links? This is work that I did for the OpenNet Initiative, which comprises the Internet censorship researchers at the Berkman Center and the University of Toronto. Those folks asked me to try to come up with a way in which to contribute to Internet censorship research using link analysis. And this particular piece of work was inspired by an observation made in the cyber-dissident handbook (I think it was 2005) of Reporters Without Borders (rsf.org, the Paris-based organization), where the Saudi Minister of Information boasted that they were blocking or censoring 400,000 web sites. And the OpenNet Initiative, in their traditional methodological way, traditional sampling operation, was checking 2,000 web sites per country. And so I was like, well if they’re boasting that they’re blocking 400,000 and you’re only checking 2,000 per country, how do we build out the list? How do we discover previously unknown censored web sites?
What I did is I took one of the categories of their web sites, put it into the Issue Crawler, crawled the web sites, and then I annotated the map. And so what you see here are nodes in red that are blocked, censored, in Iran. This is for Iran. In blue, sites that are not blocked. And then in red with those little pins on them are sites that we discovered were blocked, previously unknown censored web sites.
How did we do it? Very very simple. We ran them through one of our tools that we built, which just checks proxies. And this is the tricky thing, right. Can we ground this just through this kind of tool or do we need to go to Iran and sit at a computer there and know for sure that it’s blocked? In any case, what the researchers in Toronto…they were checking the BBC and they kept continually finding that the BBC was not blocked. And on our link map, the BBC page that was linked to as being most relevant according to the network actors was actually the Persian language page. The regular BBC site gets a response code of “okay” whereas the Persian-language one is blocked in Iran.
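Roughly, the proxy check he mentions comes down to fetching each URL both directly and through a proxy inside the country in question, and comparing what comes back. A minimal sketch, where the proxy address is a placeholder and the real OpenNet Initiative testing is far more careful than this:

```python
import urllib.error
import urllib.request

def status_via(url, proxy=None, timeout=10):
    """Return the HTTP status code for url, optionally routed through a proxy."""
    handlers = [urllib.request.ProxyHandler({"http": proxy, "https": proxy})] if proxy else []
    opener = urllib.request.build_opener(*handlers)
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code
    except Exception:
        return None   # connection reset, timeout, DNS failure, etc.

IN_COUNTRY_PROXY = "http://proxy.example.ir:8080"   # placeholder, not a real proxy
candidate_urls = ["http://www.bbc.co.uk/", "http://www.bbc.co.uk/persian/"]

for url in candidate_urls:
    direct = status_via(url)
    proxied = status_via(url, proxy=IN_COUNTRY_PROXY)
    flag = "possibly blocked" if direct == 200 and proxied != 200 else "ok"
    print(f"{url}: direct={direct} via proxy={proxied} -> {flag}")
```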
I’m just gonna move along, and if there are questions we can probably take them at the end. The web site. How is the web site normally studied? The web site is normally studied in sort of usability circles. I mean there’s a debate…or maybe the debate’s over, between the “don’t make me think” school of thought versus the poetics of navigation. Actually I mean, it’s a neverending debate—I guess it’s not over.
Also the color… I don’t know if you know that the Web is blue, or predominantly blue, if you do a sort of color analysis of the Web. And it’s interesting because even in sector-specific areas… So, medical sites, environmental sites…you know, you’d think they would be predominantly green but there’s a lot of blue in there.
Eye tracking. I don’t know if people are familiar with this work. This is a very famous heatmap with eye tracking… The more red it is, the more attention to the particular spot. Then you see immediately a sense of the real estate of a web page. This is a Western Web visitor.
Site optimization; SEO. Also trying to detect whether sites have been optimized. So first of all there’s optimization. And then there’s manipulation. And then whether you can detect manipulation—that’s quite tricky, actually. Whether or not— People say, “Oh you know, search engine results, they’re all manipulated anyway.” Well…show me. It’s quite difficult.
Site features. Now this is a classic from a lot of social science and not even social science, where one makes a sort of code book with a long list of site features and you go through a number of sites and you check off whether or not it has a feature. And then you try to draw conclusions. And some of the ones that I’m most critical of are ideas that the more interactivity a set of sites have, the more participation, and then the more democracy…these sorts of things. Anyway, site feature analysis is one of the more dominant forms of analysis.
Now, I showed you this heatmap. I don’t know if you remember the day when Google moved its menu upper left. I thought that was a sort of concrete outcome of heatmaps.
How else to study the web site? Now, following the digital methods sort of principles or protocol, you think okay, what’s the dominant device? And for this one, arguably it’s the Wayback Machine, of the Inter— Arguably it’s the Internet Archive, and the way you get to the Internet Archive is through the Wayback Machine. So if you think about how does the Wayback Machine sort of organize web sites? Well, it organizes web sites…I showed you a picture earlier. It organizes web sites— You type a URL, hit return. And you see the history of a web site as sort of columns; which ones are available.
And so what strikes the user of the Wayback Machine, for those accustomed to using search engines, is you type in a URL, not keywords. And so you type in a URL, you hit return, and then you get the pages from the past of this particular URL. So in some sense the Wayback Machine has a particular inbuilt historiography. It organizes the history of the Web in single-site history, like a kind of biographical approach if you will.
So what I thought to do was, well, how can we follow the medium? How can we learn from the dominant device that treats web sites? And then how can we repurpose it for the purposes of social research? So I’m gonna show you the outcome, it’s a three-and-a-half-minute video. What you can do is you can capture a site’s history, and you can replay it like time-lapse photography, showing, in the sort of biographical tradition, the life and times of a web site as also encapsulating the life and times of the Web, in the classic biographical approach.
So let me just show you… I want to preface this very very briefly by asking you, do you remember the Google Directory? You know what a directory is. A directory is human editors organizing the Web according to subject matters, and then per subject matter there are a series of web sites. I mean Yahoo pioneered this, then there was later DMOZ, the Open Directory Project. No? But it’s interesting because as the years go by—and this is the subject of this short sketch—people don’t remember that there were directories, because it’s been taken over by the search engine. So I’m just gonna play this for you.
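For those curious how the site histories behind a movie like this can be gathered: a minimal sketch that lists one snapshot per year of a URL using the Internet Archive’s public CDX API. Loading each snapshot and screenshotting it into time-lapse frames would need a browser-automation step not shown here, and the parameters are a reasonable reading of the API rather than the tool actually used for the video.

```python
import json
import urllib.request
from urllib.parse import urlencode

def list_snapshots(url, yearly=True):
    """Query the Internet Archive's CDX API for captures of a URL."""
    params = {"url": url, "output": "json", "filter": "statuscode:200"}
    if yearly:
        params["collapse"] = "timestamp:4"   # at most one capture per year
    api = "https://web.archive.org/cdx/search/cdx?" + urlencode(params)
    with urllib.request.urlopen(api, timeout=30) as resp:
        rows = json.load(resp)
    if not rows:
        return []
    header, captures = rows[0], rows[1:]
    ts = header.index("timestamp")
    # Each snapshot can be replayed at web.archive.org/web/<timestamp>/<url>.
    return [f"https://web.archive.org/web/{row[ts]}/{url}" for row in captures]

# One archived page per year for a site; these URLs could then be loaded
# in sequence and screenshotted to assemble the time-lapse movie.
for snapshot in list_snapshots("google.com"):
    print(snapshot)
```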
This is interesting, maybe you saw the Google anniversary timeline, the ten-year timeline. It was something that Google made. Anyway, this was specifically an alternative history to Google’s history. And I wanted to point out something about the rise of the backend just very very briefly. Now, if you go to Yahoo these days, they still have the directory. It’s become increasingly commercialized, less and less robust, and Open Directory Project similarly now becoming commercialized. Fewer and fewer expert volunteers.
Anyway, if you go to Yahoo, what’s interesting when talking about or thinking through the impact of the rise of the backend and the rise of algorithmic culture, this is the list of human rights organizations in Yahoo. And you’ll notice that…I don’t know if you can notice but perhaps, by default they’re listed by popularity. By default. Not in alphabetical order. So the egalitarian alphabetical-order listing, well-known from the history of library science, encyclopedias, etc., has given way to the algorithm, to the hierarchy based on relevance—however it is measured in this case.
The engine. Second to last one. How are engines normally studied? Engines initially were studied by the famous articles in 1998 and 1999 by Lawrence and Giles, one in Science and one in Nature, as being…not that complete in their coverage. I don’t know if you remember these. So they came out, it was on all the news channels, that engines only index something like 30% of the Web. So the result of that was the creation of a few ideas that still pervade us. And one is the Dark Web. So there is this other Web, the Dark Web. Which is also a sad Web, because it’s dark because it’s not linked to. So they’re orphan sites. So there’s all these sort of particular kinds of aesthetics associated with the Dark Web.
But the other one, more of an info-political critique, was that engines, not only do they provide information, but they also exclude. So they exclude by not including. They exclude by not indexing. And they also bury sites by not listing them very high up.
That’s number one. Number two I mean, oftentimes engines are studied according to—and this is sort of Nicholas Carr and this interesting idea, that they encourage attention deficits. I mean like, yet another thing that it does, right. But anyway, the engines in the way they are used encourage attention deficit. Why? Well, if you go to the studies of how engines are used what you’ll find out is that increasingly over the last…I don’t know, eight years now I think, people are looking at fewer and fewer engine result pages, and clicking on higher and higher results. And one of the things that Nicholas Carr said—and this was not in The Shallows but in the “Is Google Making Us Stupid?” piece, in The Atlantic, not The New Yorker—he asked himself the question whether engines, in encouraging this kind of clicking behavior, are causing us to no longer be contemplative.
Googlization…so Siva, a sort of colleague of mine whose last name I can never pronounce. V, Siva V., coming out with a book called The Googlization of Everything very soon. So Googlization is a term that was coined by…well it’s a library science critique. And it was coined right around the time when Google came out with the books project. And that was it. That’s when they crossed the line. You enter the library, and now we’re gonna start talking about you in these kinds of terms, googlization.
Googlization connotes globalization, hegemony, these sorts of ideas. And thus turns the Web more generally and certainly Google in particular, into objects of mass media critiques, right. So suddenly, there’s talk of media concentration. There’s a political economy critique of the Web. There’s a dominant engine. In fact, there’s a dominant algorithm. Bing and Yahoo are basically trying to read your PageRank. And all alternative algorithms are in decline. Even the highly-touted Wolfram Alpha that came out not so long ago when everyone was like okay, this is an old-school kind of 50’s-sounding name, you know, real old information retrieval. No.
Surveillance studies. Google, or search engines more generally, are oftentimes studied, interestingly enough, as bringing into being a new subject. And that is the data body. I don’t know if this stuff was in the press a lot. The 2006 AOL search engine log data release, at the information retrieval conference in Seattle in 2006, AOL Labs, being good scientists, gave a gift to the scientific community in the form of logs. A lot of data. 500,000 users over three months—or was it six months?—of all their queries. And then each of the users was anonymized. A number was put to them. So ever since then…in search engine studies this is an example of how not to anonymize, but anyway. They were anonymized with a number.
Now, just to give you a sense of these sorts of— User 3110455: how to change brake pads; Florida state cham—; how to get revenge on an ex; how to get revenge on an ex-girlfriend; how to get revenge on a girlfriend; replacement bumper for Scion xB. The intimacy on the one hand, and all the amateur detective work that was then subsequently— I mean, people were figuring out who these user— I mean, first The New York Times did it most famously. But then lots of other people as well.
So anyway, engines by virtue of saving log files and sometimes releasing them and sometimes not doing so well in their anonymization practices, create another data body. So another collection of data that represents you, or is you, or can stand in for you, can have in some instances greater agency, like in identity theft.
Now, I just want to touch really really briefly on—there’s a couple of sort of solutions to this problem. I don’t know if you’ll ever use them or if you know about them.
Scroogle. Does anyone use Scroogle? Only those real geeky kind of paranoid folks. Scroogle sits on top of Google and you can query it, and it doesn’t place a cookie. It doesn’t know your location. It’s sort of a covert user’s Google. And TrackMeNot—this is Helen Nissenbaum and colleagues at NYU, in Neil Postman’s former department—is a Firefox extension that, in the background, when you’re querying Google, also sends random queries to Google.
Okay so, how do we do Google? So I’ve been spending a lot of time building stuff on top of Google. And Google doesn’t like that. And they blocked me, a lot. And I’ll show you why. Apart from in the evidentiary arena, this is I think the first fully-documented case of the apparent removal of a site from Google results. So what you see before you is the PageRank for three web sites. The PageRank being: if it’s the top site, it gets the rank of 1 in a results query. And you know, engines only serve a maximum of a thousand results. So it says 6,700,000 and then someone says oh, it would take thirteen lifetimes to go through those results. No, they only serve a thousand results. So it would take you not very long.
These are the ranks of sites in Google’s results for a particular query. The green one is the New York City government. The red one is 911truth.org. And the blue one is The New York Times. The query is “9/11.” So since about early 2007 we’ve been saving Google results for the query “9/11.” And also a bunch of other queries too.
And what you see here on September the 17th—this was in 2007—911truth.org suddenly went from its top-five ranking, to 200, to off the charts. And they stayed there for about two weeks. And then they returned to the top.
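The bookkeeping behind a chart like this is simple once you have the saved result pages. A minimal sketch, with invented placeholder data standing in for the archived daily results (collecting them from Google day after day is its own story):

```python
# Given saved result lists (one ordered list of result URLs per date),
# record the rank of a few watched sites for the query over time.
saved_results = {                     # placeholder data, not the real archive
    "2007-09-10": ["911truth.org", "nytimes.com", "nyc.gov", "..."],
    "2007-09-17": ["nytimes.com", "nyc.gov", "...", "911truth.org"],
}
watched = ["nyc.gov", "911truth.org", "nytimes.com"]

def rank_of(site, results):
    """1-based rank of the first result from this site, or None if absent."""
    for position, url in enumerate(results, start=1):
        if site in url:
            return position
    return None

for date in sorted(saved_results):
    ranks = {site: rank_of(site, saved_results[date]) for site in watched}
    print(date, ranks)
```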
So this opens up all sorts of questions. Why did this happen? The interesting thing is if you go to 911truth.org they also noticed. But 911truth.org of course, if you are familiar with them, is sort of quite a conspiracy-style organization. So they come up with this huge conspiracy theory of why it was that they were removed. So it’s quite tricky to enter into that realm when you have all this kind of conspiracy talk around why it is that they were removed.
I think I know why. And that has to do with the web site template and the fact that 911truth.org is a franchise organization. So you could start one up: memphis.911truth.org. And when you start one up, you automatically link to all its other franchises. San Francisco, Boston, whatever. And around this time, around the anniversary of 9/11, I surmise that a number of franchises were started. So it looked like, suddenly, 911truth.org and all their franchises were getting a lot of links, an artificially high count, and so therefore they were demoted. That’s my theory. It is not a conspiracy theory. However we also blogged about it more seriously than 911truth.org. So it could be that Google read our blog.
How else to repurpose Google? I want to just very very briefly show you a new tool. And I built this I think about two years ago, and now it’s pretty stable. It sits on top of Google, it’s called the Lippmannian Device. It’s named after Walter Lippmann. And in fact it answers a call that Lippmann made in quite a famous— Well not his most famous book…so the follow-up to the Public Opinion book, called The Phantom Public which is my personal favorite—for Lippmann fans, it’s probably your personal favorite as well—where he goes on about not only critiquing the means by which public opinion is formed, but also begins to call for what we ended up calling new equipment for interpreting and mapping societal controversies—I don’t want to just…throw around the word “democracy” too easily. New equipment. And in particular to provide a means by which one can get a coarse sense of partisanship. Is an actor partisan or not?
And so we built the Lippmannian Device. What it does is it sits on top of Google. And it measures resonance. So I’ll just show you immediately. So this is a source cloud. And what it shows is the number of times a particular source mentions a particular name. And the name in this particular case is Craig Venter. You may know him, he’s the guy who supposedly wants to take out patents on life. The synthetic biology pioneer. Has a few really famous TED Talks. I mean if you get into the hierarchy of TED Talks, Craig Venter is quite close to the top of them.
So what we did is we queried “synthetic biology,” we got all of the sources, the most important sources for synthetic biology. Then we queried each of them individually for this name. So you see a sort of huge distribution of who recognizes Venter, who mentions Venter, who purposefully does not. So you get a sense of the extent to which Venter is important, significant, per source.
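The resonance measure itself is little more than per-source query counting. A minimal sketch, where the helper is a stand-in for asking an engine how many results a site-restricted query returns, and the counts are invented so the sketch runs without an API key:

```python
def estimated_result_count(query):
    """Stand-in: with an API key this would ask a search engine for its
    estimated number of results. Canned, invented values keep the sketch runnable."""
    canned = {
        'site:nature.com "Craig Venter"': 312,
        'site:etcgroup.org "Craig Venter"': 58,
        'site:syntheticbiology.org "Craig Venter"': 7,
    }
    return canned.get(query, 0)

sources = ["nature.com", "etcgroup.org", "syntheticbiology.org"]   # illustrative sources
name = "Craig Venter"

# Restrict the query to one source at a time and count mentions of the name there.
resonance = {source: estimated_result_count(f'site:{source} "{name}"') for source in sources}

# Sorted from most to least mentions: the raw material for the source cloud.
for source, count in sorted(resonance.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{source}: {count}")
```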
Let me just show you how to do this. I’m going to show you very very briefly about the climate change skeptics. It’s everyone’s favorite. What we did is we tried to find out what are the most important sources on climate change, and then do these sources recognize the skeptics. Can we figure out whether or not we can detect or diagnose skeptic-friendly sources, quickly. So we queried Google. In fact, we queried Scroogle; this is what Scroogle looks like. The reason why we queried Scroogle is because it doesn’t give you personalized results. It gives you pure Google results, if you will. There’s nothing pure about Google. And there’s nothing organic about the results, nothing natural about them, they’re all very highly synthetic. But anyway. It gives you depersonalized Google results.
And they kinda look like this. So what I did is I copied them. Select all, copy. I pasted them into a tool called The Harvester. The Harvester is a really fantastic tool because you can paste in all this stuff, including URLs. And then hit “harvest,” and it just gives you a clean list of URLs. This is a working tool, which you can just use. You don’t need logins.
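What the “harvest” step does can be approximated in a few lines: pull every URL out of whatever text you pasted, clean it up, and de-duplicate. The regular expression here is an illustrative stand-in, not the tool’s own logic.

```python
import re

def harvest_urls(pasted_text):
    """Pull a clean, de-duplicated list of URLs out of copy-pasted engine results."""
    pattern = re.compile(r"https?://[^\s\"'<>]+")
    seen, urls = set(), []
    for match in pattern.findall(pasted_text):
        url = match.rstrip(".,)")          # strip trailing punctuation from prose
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls

pasted = """Climate change - latest research http://www.realclimate.org/ ...
Skeptical Science http://www.skepticalscience.com/, and more results..."""
print(harvest_urls(pasted))
```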
[inaudible question from the audience]
Yeah. I’ll tell you at the end. digitalmethods.net. I’ll tell you now.
You take all those URLs, put them in the top box. In the bottom box, put the names of the most prominent climate change skeptics. We got these names… You can get them a variety of ways; we triangulated three sources. Those names found in at least two sources we retained. And there you have the graphics, the output.
Sallie [Baliunas?] you see gets mentioned by hardly any of the top climate change sites. But marshall.org stands out. marshall.org is a major skeptic funder. It funds the skeptic conferences together with the Heartland Institute. I’ll just show you these briefly. This is an interesting one, climatescience.gov jumps out. So you can get a sense of issue commitments, or partisanship, quite quickly, per source, using this technique. Using the Lippmannian Device.
Okay, the last one. Social networking sites. How are they often studied? The number of times Erving Goffman is cited in relation to social media is quite a lot. And Presentation of Self, this kind of thing. That is one of the dominant approaches. Another one is to sort of think of social networking sites as somehow reenacting different sort of cultural clashes. I mean, my favorite one is a story that was told in one of danah boyd’s blog postings, about how the US military banned MySpace and did not ban Facebook. And MySpace was used primarily by the enlisted folks, whereas Facebook was used primarily by the officers. So again, you get this sort of class struggle enacted. There’s also the distinction between friends and friended friends. There’s also the impact of defriending, the amplification effects, these sorts of things.
How else might they be studied? Thinking through, following the digital methods principles of okay, follow the medium. What natively-digital objects are available? How are they treated by the dominant devices? We came up with the notion of post-demographics. The sort of natively-digital object dominant in social media is the profile, if you will. Now what’s interesting about profiles is that they provide all these different interests. Kind of media interests. And then, profiles have friends.
So what we did… I mean this was more of an art project—this was in a few art magazines—is we created a means by which we can see what the interests are of the friends of Obama and McCain, in this particular sense. I mean, you can… We also did what the interests are of the friends of Islam and Christianity, for example. You can do a range. But anyway, just to give you a sense.
So this sat on top of MySpace until MySpace changed their query string. We couldn’t tweak it to work again. They kind of just shut us down, basically. Nevertheless. What we did is we took in this case the top 1,000 friends of Obama and the top 1,000 friends of McCain. We aggregated their profiles. And we then ranked the interests and provided aggregate profiles of the friends of the politicians.
And then we also did a compatibility check. Do the friends of Obama have similar interests to the friends of McCain? And we call this post-demographics. So it’s in some sense the study of the organization of groups, not according to age, gender, income, level of education. But rather according to the data that’s regularly given online through social media. Interests, movies, books, etc.
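The aggregation and the compatibility check reduce to counting and comparing interests across the two sets of friends. A minimal sketch with invented stand-in profiles (the real exercise aggregated the top 1,000 friends’ profile pages):

```python
from collections import Counter

# Invented stand-in data: each friend's profile reduced to a list of interests.
friends_of_a = [["the office", "lost"], ["lost", "the daily show"], ["the office"]]
friends_of_b = [["family guy", "project runway"], ["family guy"], ["top model"]]

def aggregate_profile(friend_profiles, top_n=5):
    """Rank the interests that appear most often across a set of friends."""
    counts = Counter(interest for profile in friend_profiles for interest in profile)
    return counts.most_common(top_n)

def compatibility(friend_profiles_a, friend_profiles_b, top_n=5):
    """Overlap (Jaccard) between the two groups' top interests."""
    top_a = {i for i, _ in aggregate_profile(friend_profiles_a, top_n)}
    top_b = {i for i, _ in aggregate_profile(friend_profiles_b, top_n)}
    return len(top_a & top_b) / max(len(top_a | top_b), 1)

print(aggregate_profile(friends_of_a))
print(aggregate_profile(friends_of_b))
print(f"Compatibility: {compatibility(friends_of_a, friends_of_b):.0%}")
```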
Anyway. So I just wanted to mention really really briefly Obama, the friends watch on TV The Office, The Daily Show, Lost. And the friends of McCain are into Family Guy, Project Runway, America’s Next Top Model, Desperate Housewives.
So you get a real sense. And then you can do this… I mean, so it’s like are there divi… I mean, you see quite a divide here, cultural divide, between— And you can do this for other cultures. I mean I did this also for Fatah and Hamas, oddly enough. And you see far more overlap. Same interest, same movies.
Okay, just to conclude, the idea of digital methods is to take Web data seriously. And to think about the Web, or the Internet, not as this separate realm, not as the virtual, not as something that has an asterisk—not something that you only study for its culture in and of itself, but rather to take Web data seriously as a means by which one can study society and culture more generally.
But how to do that? Well, one way of doing it is not necessarily to import the standard methods, or port them onto the Internet, because what you get are only indicators. And you get a lot of problems as well. But rather I propose a research practice where you actually follow the medium and think about the methods in the medium. And I have laid out for you a practice whereby one looks at the natively-digital objects, how dominant devices handle them, and then how you can learn from them and repurpose them in order to undertake social research.
And then the last sort of trick… And it’s going to be endlessly tricky, and endlessly debated, is whether or not you can ground your findings in the online. Or whether you need to go offline in order to ground them. If we have another chance at some other time, venue, place…happy to tell you about approaches to studying these other things. Spheres, webs, Wikipedia, as well as Twitter. But for now, thank you.
Moderator: Thank you very much for a fascinating talk. Questions? Comments?
Audience 1: Richard, thanks very much. So one… I’m curious, your chronology says around 2007 things changed. And indeed they do in a lot of ways. And the tools you’re showing here are one sign of those changes. The emergence of tools like… Oh, stuff like what, Newsglobe, News Positioning System, MediaCloud—I mean there’s dozens of these things that sort of scrape news, that sort of look at the feeds, whether it’s from the world’s various wire services or whether it’s the destination and target cities— There’s a lot of really interesting ways to play with the data.
And I wonder if that hasn’t…you know, this is coincident with the rise of this critical discourse of Googleization. That oh, Google’s so flat and so commercial, and so one-size-fits-all. And I wonder if it hasn’t been relieved of a burden to actually be sharper or be…pretend to be more objective or whatever that objectivity would be. In other words, isn’t there a kind of relationship between the rise of all these highly-specialized tools that allow us to make data dance and allow us to… We have quite a bit of independence about where we draw our data from. With on the other hand the kind of both demonization and flattening of something like Google. It’s that relationship I guess I’m interested in.
Rogers: So… Yeah thanks. So I mean I think what’s… I mean…does anyone wanna answer that? I think— Because it’s a really difficult question. I mean, first of all, Google has taken itself off the hook recently. And they’ve done so in an extremely clever way. I wrote a piece called The Inculpable Engine, and it’s about Google. And the reason why I call it the inculpable engine is because now we are coauthors of our results. So, with the rise of personalization, now the results are partly our own, of our own making. And then we studied it empirically and that’s another story. But in any case.
So, there is no longer one set of Google results that one then can critique for the new hierarchies. So I mean, this is how I started my work on information politics, as it is called. The book that came out in 2004, 2005. I started that book with the observation right around—I think it was 2003—I typed “terrorism” into Google. Terrorism. And what I got back was whitehouse.gov, cia.gov, fbi.gov, Heritage Foundation… CNN and Al Jazeera. The top 20. And I said oh gee you know, it’s just like the TV news. So Google is beginning to align itself with the kind of—or output sources which are quite familiar to us. And so then could be critiqued. So no longer was the Web providing a diversity of viewpoints, etc., if one saw the Web as something that was most significantly in some sense organized or even offered by engines.
However, all those interesting critiques that could be made no longer apply as forcefully because of personalized results. So anyway. So I think Google—cleverly—has taken itself off the hook, has become increasingly inculpable in terms of the critique of the results in a sort of infopolitical sense. I mean, it’s become the object of critique in many many other senses, but its core, what it does apart from serve advertising and sort of…you know, info results. So it’s becoming increasingly sort of…I don’t know, inculpable is the term that I use.
That’s one thing. But then the rise of the… So the other thing that struck me is the rise of the tools and all the visualization. So the rise of infoviz and dataviz, alright. So there’s just huge, really huge areas. And they’re only now beginning to be critiqued. I mean, there are a lot of pent-up, I think, there’s a lot of pent-up critique waiting to burst out. I don’t know, maybe it’s well-developed here. But in a lot of circles that I’m familiar with like, people are dying to hate the rise of infoviz but they don’t really…haven’t formulated it yet, you know.
Well, I mean there are a number of critiques of infoviz. One that is beginning to emerge for me is the amount of spuriousness, or the amount of…it’s the celebration of amateur data analysis is what’s quite interesting. Gapminder. So with Gapminder you can take any two variables…any two. Any two… I may have made my point.
Okay. I mean, but the relation between Google and the rise of datav— I mean, Google also of course does a lot of dataviz and infoviz. I mean, I haven’t thought through that relationship yet, but anyway.
Are there other questions?
Feel free to bring up anything. We have an Internet…guy.
Audience 2: Well, probably along the lines of the tools and especially the critique of the visualization tools and infoviz and so on. As you mentioned, this has had a huge rise. And last year we had a visualization conference on visual interpretations, and actually also the critique of that, here—
Rogers: Oh, good.
Audience 2: —at MIT. And Johanna Drucker, you know, offers a very distinct critique of the data that’s being fed into those tools as on the one hand being already an interpretation, or not making it transparent where the interpretation part comes in. So you know, that’s one of the questions also here, you know. Sort of, to what extent can we see it also in the research, what is the data that’s fed into these tools that then give us those results? That’s one question.
Another question that I had in terms of the dark side of the Web. What do we do with the other 70%?
Rogers: That’s no longer true.
Audience 2: Yes.
Rogers: It’s no longer true. If you talk to Google engineers, they’ll tell you that they’ve basically indexed it all. I mean, that the Web that’s not indexed is only one click away. So it’s…pretty much all indexed. I mean of course that’s just massive. But that’s no longer the case, at least according to the web science I know. But…
Okay so, dataviz, infoviz… So let me just say a little bit about my research practice in relation to what one might think of when one thinks of dataviz and infoviz. So number one is I make bespoke tools, or what Clay Shirky once called—I mean, I thought this is a very clever term some years ago, and people don’t use it: situated software. So, it’s software where the research questions in some sense, and the approach, are all built in. Now, that’s very very very very different from Many Eyes or whatever, where they’re toolboxes, right.
So the standard way of thinking about it is that here are all these worl— Go and visualize away. And then— I mean Many Eyes is kind of interesting because it gives everyone a little lesson in the kind of data sets that match with certain visualization types. I think that’s one major contribution of Many Eyes, is actually teaching that. But anyway, so my research practice is very bespoke, is very situated, in the sense that the methods are built in. And then the other thing that’s different is that they do the data collection, the analysis, and the visualization all together. So it doesn’t separate the data collection: go out there and get the leaves and the acorns or whatever; bring ’em back; and lay them out. So that’s the difference. These are all-in-one tools, and not all-purpose. So that’s a really big difference. I mean I only showed two tools. I showed the Issue Crawler, and I didn’t really show you how it worked, I just showed you some output. And I showed you the Lippmannian Device and I showed you how to use it.
But if you go to digitalmethods.net you’ll notice that there are about thirty tools at different… Yeah, some are very simple. In fact I find them all very simple. All very simple things. But anyway, they’re open and usable and we maintain them all.
Audience 3: [indistinct] —you just said if— On the point of your data being used within the tool, the method and the data together. Does that mean that it’s… I’ll make sure I understood it. You could only do that with born-digital materials.
Rogers: Right.
Audience 3: You couldn’t do that with data that you’ve digitized…
Rogers: Correct.
Audience 3: …and applied a tool to afterwards.
Rogers: I’m really glad you said that, yeah.
Audience 3: Is that right?
Rogers: Yeah.
Audience 3: Okay.
Rogers: But I mean that’s very— I mean, maybe that’s something that I should make even more explicit. Thank you for that. So, all of this work that I presented to you is analysis of the natively digital. I mean I use that term, it sounds very provocative or very—“natively digital.” But anyway. But what that term does is it makes extremely clear, I hope, that it’s not digitized. So a lot of the digital humanities work, arguably…all of it, or most of it, relies on digitized books and digitized— I mean cultural analytics as an approach. Lev Manovich; you will have heard of him, perhaps. It’s all digitized material. So, Rothko paintings, covers of Time magazine. And then we have this digitized material. And then we use—which is a separate data set. Import that into visualization tools. It’s not the research practice that I do at all. I do the…well I explained it. I hope.
Audience 4: Although increasingly in most cultural sectors, there are digitally-born films and video as opposed to the stuff that’s been ported back over. Causing no end of misery for the folks working on it. So, as a historian maybe just kind of a naive question but… I mean Pentium is what, ’93, ’94, Mosaic. So we’re not even talking about a twenty-year window here. And you’ve mapped a trajectory of kind of steps. Is what we’re seeing here about the affordances of better bandwidth, faster processors, the ability to manipulate more material that we have access to more quickly? Is this about a generational shift? Folks who’ve grown up in this era and have a kind of fluency and facility that some other folks lack? Is it about— I mean it seems to me that it’s taking the contours of a kind of epistemological shift in terms of what constitutes knowledge, the way we’re asking questions about it. But…probably a bunch of other ways to think of it. But were you to sort of look for causalities or factors that help to chart that movement from say the ’93, ’94, the emergence of the Web and where we are now, how would you account for that in broad terms?
Rogers: So, I mean, one of the efforts— I mean I’m not gonna— I could’ve rehearsed the argument I made but one of the initial efforts was to try to show over the last I don’t know, fifteen years what’s changed in thinking about what to do with the Internet in terms of research. And I think there has been a shift and I think that it’s happened fairly recently, whereby we—and it’s taken a long time, whereby we no longer necessarily first think about the Internet or the Web as being the realm of pirates, pornographers…you know…rumormongers…the jungle… You know, all this stuff. But it’s still there, you know. I mean you hear it in the arguments in Congress often about using these sorts of…“Oh, did you get that from some blogger or whatever.” So the historically low epistemological value of the Internet in general, I think that is slowly starting to change. And that’s very very recent. So I would say it’s about the…sort of from a historical point of view, the slow normalization. It still has the asterisk, still a little bit different—I mean, the slow normalization of this technology in the history of technology and the history of individual technologies. I mean, who’s the most famous historian that talked about this, Thomas Hughes maybe? talked about… And then in science it was Kuhn of course, with normal science.
So the slow normalization of the Internet and then suddenly people saying okay, so…the establishment’s online. Online is also very normal so we can use the—you know, we can go online and it’s trustworthy. We can go to a governmental site, it’s trustworthy. So that took a while, and now I’m thinking about okay, we can use the data online. That’s what’s different. So it’s not just computing power, although that helps. I think that’s different. I think it’s the mindset that’s changed. Or I think it’s the— The end—well, I call it the end of cyberspace. Or I call it the death of cyberspace, in fact. I’m a little bit more dramatic about it.
I call it the death of cyberspace, and there’s a number of reasons why it’s died. The fir— I mean it started— I mean, shall I just…briefly? It started when there was a lawsuit by two Jewish groups, Jewish NGOs in Paris, against Yahoo, because Yahoo was making available on their web site auction pages for Nazi memorabilia. This was 2000. And they were sued in France—Yahoo USA, or Yahoo, was sued.
And what came out of that was geolocation technology. Like, specifically. So French users would be loca— Okay, you’re French. So you can’t see these pages. And so from there came sort of the rise of what I call the national webs. I didn’t get into it in my talk, but… So what we have is slowly…with the regulatory frameworks, with the legal frameworks, etc. being applied to the Internet, we had the slow and gradual but indeed steady and sudden death of cyberspace. So I think that’s the difference. So cyberspace needs to die before we can use the Web as a data set.
Audience 5: I was wondering how we access the Internet. I was thinking about wireless devices. So how we access the net, would that impact your conceptualization of digital methods?
Rogers: Good question. Um…
I don’t have an answer to that question. I mean, I’ve thought about it a bit—
Audience 5: [indistinct comment]
Rogers: No no, but I mean…um… So… So I mean, since I’m at MIT… So, one thing that I will say is that it is very interesting to look at, in development studies circles, the debate between the One Laptop Per Child versus the mobile phone, right. So obviously, if you have far more users of mobile phones, which we do, than of computers, then one would think about the need to study the data generated through mobile phone use, broadly conceived, whether it’s kind of mobile Internet-related or not. And you know, thinking through specifically okay, what if we applied these methodological principles to that situation. So that’s the challenge. And that’s how I’ll answer your question. It’s a challenge.
Audience 6: There might be another aspect to it, because there’s currently a debate going on, spearheaded by Tim Berners-Lee, that the app-driven mobile phones are destroying the Web. So people are no longer using search engines or the Web in general in order to find information, but highly-specialized apps to tap into specific pieces of information that lie on the Internet. So that might have an influence also on the research methods, and also on how people perceive what’s out there on the Web.
Audience 7: The commodified side of the tools [indistinct]
Audience 8: Or a different version of the death of cyberspace.
Rogers: But I mean Tim Berners-Lee is the great protector of the Web, huh.
Audience 6: Right.
Rogers: So it’s like [indistinct]. Yeah no, I mean that is a… Is that a Wired article or something, “the apps are coming,” or I forget. It was…
Yeah I mean I don’t really have a view on that. I mean it’s more like, similarly, right. So if apps become dominant, or if particular types of apps, then you can still apply the same principles. You can see the extent to which the principles will work or not, continue to work. So you know, follow the medium, etc. So I’ll follow them, the apps. The rise of the apps.
Any other…anything else?
Moderator: Otherwise, thank you very much Richard for a fascinating talk.
Rogers: Yeah, thank you. Thanks.