Reverse Engineering Netflix

If you use Netflix, you may have seen the weird way that they seem to categorize their content, like “Movies starring Gary Busey,” which is not the way I would choose to spend my Saturday night browsing entertainment options. Or “Wacky Cult Films” or “Medical Movies based on Books” or what have you. This is sort of interesting and bizarre. Where do these come from? Not just how does Netflix know what you want or what you might like, but how do they even know that Coma is a medical movie based on a book? Some of these things, they seem like, okay movies starring Gary Busey, that’s just metadata that you’d get from anywhere, but what makes a cult film “wacky?” How would you know it’s wacky? And yet these films seem wacky and cultish, so how did all that happen?

So about a year ago, Alexis Madrigal at the Atlantic and I started thinking about this. Really, Alexis started thinking about it first and then he roped me in. Alexis put together this story and I put together the strange software that lives inside of it. I just want to talk a little bit about why we did this, what we did, what it meant to people. And then I’m going to make some aesthetic judgements about bots, deriving those judgements from the experience we had doing this.

One of the things that happened is that as Alexis started looking at these genres (Netflix is calling them “altgenres,” so I’ll try to use that term), he was able to just scrape them all down because they were sequentially numbered in URL query string variables. There turned out to be about 76,897 of them. They weren’t all in sequential order, but it was possible to set up a script, and his account didn’t get disabled for doing this, even after he went down to Netflix and interviewed them about this whole process and how they collected this information.

So they had 76,000 of these altgenres, these ways of describing movies, and in some ways the process of writing this bot became about recreating, to some extent, what Netflix had already done, which is a very weird, kind of seemingly pointless act to do until you do it. But the process was kind of [interesting?]. We talked a little bit about corpora already and this bot (if it’s indeed a bot; it’s really a text generator), we had to create the corpus for it. First that involved gathering all of these altgenres and they’re unstructured; we just get them as a text string. So we ended up using a concordance program which you can get, AntConc, to try to structure that data.

There were some patterns that began to emerge, like we’d see “about” pop up and there’s some kind of subject that’s present, and likewise time periods or “set in Asia,” “set in Europe.” These were data that we were able to extract and structure with this concordance program. Then after all that was done, all we ended up with was a big spreadsheet of these chunks of data. It still wasn’t clear exactly how they would get put together, either in Netflix style or in another style.
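As a rough illustration of the kind of structuring described here, the sketch below splits an altgenre string on marker phrases such as "about" and "set in." It is not the AntConc workflow used for the project; it swaps in plain regular expressions, and the marker list and the `structure_altgenre` function name are assumptions made for the example.

```python
import re

# Hypothetical marker phrases, taken from the examples mentioned in the talk.
MARKERS = ["based on", "set in", "from the", "about", "for", "starring"]

# Match any marker as a whole phrase, ignoring case.
MARKER_RE = re.compile(
    r"\b(" + "|".join(re.escape(m) for m in MARKERS) + r")\b",
    re.IGNORECASE,
)


def structure_altgenre(name):
    """Split an altgenre string into a head phrase plus marker -> value fields."""
    matches = list(MARKER_RE.finditer(name))
    # Everything before the first marker is treated as adjectives + genre.
    head_end = matches[0].start() if matches else len(name)
    fields = {"head": name[:head_end].strip()}
    for i, m in enumerate(matches):
        # A field's value runs until the next marker or the end of the string.
        value_end = matches[i + 1].start() if i + 1 < len(matches) else len(name)
        fields[m.group(1).lower()] = name[m.end():value_end].strip()
    return fields


row = structure_altgenre("Bounty-Hunter Fantasy Movies Based on Books About Cats")
print(row)
```

Each resulting dictionary could become one row of a spreadsheet. This naive split has obvious limits: a marker word appearing inside a genre name would be misread, and a repeated marker overwrites the earlier value.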

Screenshot of the Atlantic article, with a generated Netflix-style genre reading "Blockbuster Dramas Based on Books from the 1910s"

So we started analyzing the actual genres and I developed some grammars for trying to recreate first of all the Netflix-style genre which you can see here. What’s interesting about generating Netflix genre names is that they might or might not actually exist. They’re using the same data that we pulled out of Netflix, and then we’re rearranging according to the logic of a grammar that I wrote based on our analysis of what we thought the altgenre structure looked like. But these may not actually correspond with any actual genres that Netflix advertises or any films, for that matter: Witty Werewolf Mysteries, or Quirky Detective Disney Fairy Tales, or Hit-Man Spy Dramas, or whatever.

That was relatively straightforward, and I’m going to come back to talk about grammars in a second, but from there once we had that it occurred to us, what else can we do? What other ways of intersecting this data are there? You could make these Hollywood pitch room kinds of concepts pretty easily: Heartfelt Tortured Genius Provocative Tearjerkers is not the best Hollywood pitch, but Morality Immigrant-Life Comedies might be, or Prison Post-Apocalyptic Mockumentaries. I would watch that.

But most interesting was just going bonkers with this data in “gonzo mode” [inaudible] and incorporating as much as possible: Viral Plague Sci-Fi Movies Based on Children’s Books Set in Europe for Ages 8 to 10; or First Love Slice of Life Musicals Set in Europe From the 19820s For Hopeless Romantics; Bounty-Hunter Fantasy Movies Based on Books About Cats. This is stuff that you would put in your bot if you were making a bot, and indeed it’s not a bot but you can tweet it, and the gonzo ultraniche genres, these are the ones that people wanted to talk about, or they wanted to reflect on.

So that’s what we did and what it looks like at the end. This article got a lot of reads and tweets. I haven’t counted them all. But I want to go back and say a couple of things about the aesthetics of this project.

The first is that this isn’t a bot, it’s a text generator that you can tweet out as a bot, and I think that’s important in this case because often as bot makers we celebrate the ambiguity of bots, especially on Twitter. Is that real or not, and we don’t know. We love that we don’t know. But sometimes you actually want to know. You want to know this is text generation, and it’s meant to be text generation, or you can see the structure of something else, in this case the Netflix altgenres. That’s one observation I would make.

The second observation is that the generation of your own corpora is sometimes really freeing, and it also forces you to think about a smaller set of data and how you can interact with it programmatically more deliberately. So as much as I love Wordnik and streaming data off of Twitter and just using the essentially infinite amount of content that you get from that channel, I think there’s also a reason to prefer other kinds of methods. There’s nothing new about writing a context-free grammar and operating it on a small data set. That’s all that the gonzo grammar is doing. You can find code like this in Python or anywhere, but this is what I wrote quickly for the project in JavaScript. It’s like “10% of the time add a region, and then three or fewer adjectives and the genre name, and half the time add data from the description,” and because this context-free grammar can recurse I can have things like stars, which build into the thing that we call roles like “starring Gary Busey” but it’s actually “starring #star” or “created by #creator” and then we have three levels of stars based on their popularity or their frequency.
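The rule described above can be sketched in a few lines. This is a minimal Python illustration, not the JavaScript written for the project: the word lists are tiny placeholders borrowed from examples in the talk, "Some Creator" is a stand-in name, the `#symbol#` syntax with a closing hash is an assumption made to simplify parsing, and the three popularity tiers of stars are not modeled.

```python
import random
import re

# Placeholder rules; the real project drew these lists from the scraped altgenres.
RULES = {
    "region": ["European", "Asian"],
    "adjective": ["Witty", "Quirky", "Heartfelt", "Wacky"],
    "genre": ["Werewolf Mysteries", "Spy Dramas", "Mockumentaries"],
    "description": ["Based on Books", "Set in Europe", "About Cats", "#role#"],
    "role": ["Starring #star#", "Created by #creator#"],
    "star": ["Gary Busey"],
    "creator": ["Some Creator"],  # stand-in, not a real credit
}

SYMBOL_RE = re.compile(r"#(\w+)#")


def expand(text, rng):
    """Recursively replace each #symbol# with a random expansion of its rule."""
    return SYMBOL_RE.sub(
        lambda m: expand(rng.choice(RULES[m.group(1)]), rng), text
    )


def gonzo(rng):
    """Build one genre name following the probabilities quoted in the talk."""
    parts = []
    if rng.random() < 0.10:  # 10% of the time add a region
        parts.append("#region#")
    # Zero to three distinct adjectives.
    parts.extend(rng.sample(RULES["adjective"], k=rng.randint(0, 3)))
    parts.append("#genre#")
    if rng.random() < 0.5:  # half the time add data from the description
        parts.append("#description#")
    return expand(" ".join(parts), rng)


rng = random.Random()
for _ in range(5):
    print(gonzo(rng))
```

Because "description" can expand to "#role#", which in turn expands to "Starring #star#", the recursion in `expand` is what lets a phrase like "starring Gary Busey" be assembled from nested rules.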

So in this kind of data vs. process mode that we are always in when we’re writing software, I feel like the bot world has been very very data-oriented, and there’s reasons to be more process-oriented. Grammars are just one example of how you might do it, but focusing on different ways of putting together smaller sets of data is also an option that’s available to us.

The final thing I want to note, again I don’t know if this is a bot or just a text generator, but if it is a bot it’s a bot that has a rhetorical function. Someone is meant to interact with this in order to gain purchase on the idea of what these Netflix altgenres are and what they mean. And what they mean for Netflix is they actually sit people down in front of movies and pay them to write down that this is a wacky movie, and then they get that in their database and then they’re using that as a way of trying to reconstruct what these movies are about or how they appeal to people. It’s kind of replacing the recommendation systems that were previously used and also at least supposedly informing the original content development that Netflix is doing now. And seeing into the process is interesting and important maybe, to get a sense of how is it that these kinds of media objects get constructed. This is not the only way that ideas are being considered at Netflix, but it’s one of those ways. So there’s this additional future of sort of rhetorical bots or bots that are trying to show you something about the world rather than trying to be an experience that you have with it.

Further Reference

Mark Sample created a Netflix Genres Twitter bot, inspired by Alexis and Ian’s Atlantic piece.

There’s a (sadly empty) “Ian Bogost Movies” altgenre.

Ian discusses Netflix’s altgenres again as part of “The Cathedral of Computation” for The Atlantic.

Darius Kazemi’s home page for Bot Summit 2014, with YouTube links to individual sessions, and a log of the IRC channel.

Open Transcripts

presented by Ian Bogost
in Bot Summit » Bot Summit 2014
on 11/08/2014
