Fun With Corpora Manipulation

My thing is that I guess my day job is that I’m a grad student and I work on natural language processing stuff. So my favorite kind of bot to make is one that plays with language and does stuff with it. I’m going to talk a little bit about how I choose resources to use in my bots that play with language and little tricks that I have for manipulating this language to make it do things.

I guess the two questions that I ask myself when I’m making a bot are what resources and corpora should I use, and how can I manipulate this in interesting ways to come up with tweets that are more signal than noise. A lot of the time it’s actually really hard to generate language that is signal and not noise. And I aim to get a lot of signal in my bots.

When I’m looking for corpora, I look for texts that have really interesting tones and styles, like when you read it you’re like “okay this is from this genre.” There are a few different ways that I do this. Sometimes you can search for lexical patterns. You can go onto Twitter and, like with Rob’s [Dubbin] Olivia Taters bot… I don’t know exactly how he does it, but I have theories, and it looks like there are certain adverbs that he searches for, and these adverbs tend to show up in tweets that are I guess made by teens. And a tone just kind of arises just from searching for certain words, which is really interesting.

I also look at web sites with specific writing styles. For instance my bot @wikisext looks at the web site wikiHow and everything there is an instructional step, and that kind of fit well with the sex template, which sort of leads to this next point which is just texts in general that lend themselves to juxtaposition. Whether that’s juxtaposing it with some kind of template, or mashing it up with some other text, and I sort of call that “meta-templating” for lack of a better term, or hearing anyone else talk about it.

That was pretty vague, so actually finding corpora concretely. Twitter and other social media, obviously. This is where you can do things like these lexical searches to turn up interesting patterns and interesting language. So doing keyword searches or searches for phrases. Somebody mentioned in the IRC chat I made the bot @whatsgamergate, and what that does is it searches Twitter for the phrase “is about” and then some words. I do a bit of filtering on that and some of the filtering involves rejecting any tweets that say “ethics” and “games.” Then it takes the part that says “is about blah blah blah” and posts in some series of templates “Gamergate is about this thing” which it’s not about, but God knows what it is about.
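
A minimal sketch of the “is about” approach described above, assuming the tweets have already been fetched as plain strings (the search step is left out). The pattern, the blocked-word list, and the templates are illustrative guesses, not the bot’s actual code, and rejecting a tweet that contains either blocked word is one reading of the filter.

```python
import re

# Hypothetical pattern, block list, and templates; the talk does not give the real ones.
IS_ABOUT = re.compile(r"\bis about\s+(.+?)(?:[.!?]|$)", re.IGNORECASE)
BLOCKED = ("ethics", "games")
TEMPLATES = ("Gamergate is about {}", "Actually, Gamergate is about {}.")


def extract_claim(tweet):
    """Return the text following 'is about', or None if the tweet is rejected."""
    lowered = tweet.lower()
    # Assumption: a tweet is rejected if it contains either blocked word.
    if any(word in lowered for word in BLOCKED):
        return None
    match = IS_ABOUT.search(tweet)
    if match is None:
        return None
    claim = match.group(1).strip()
    return claim or None


def make_tweets(tweets, template=TEMPLATES[0]):
    """Fill the template with every claim that survives filtering."""
    claims = (extract_claim(t) for t in tweets)
    return [template.format(c) for c in claims if c]
```

Calling `make_tweets` on a list of search results returns one candidate tweet per surviving match, so further filtering can still happen before anything is posted.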

Then there’s also scraping web sites. There’s so many little things you can get off so many web sites. [Slide lists IMDB, Craigslist posts, Wikipedia abstracts, ingredients from allrecipes, steps from wikiHow, reviews from Yelp.] As I was writing this I was like, “I gotta come up with examples” and there’s just a zillion things you can get out there and twist into whatever you want, really. Some of these I’ve used, some of them I haven’t used, and some of them are things that I want to use and haven’t figured out a good way to use.

And you can also just get books off Project Gutenberg and lots of other places. They recently organized things into genres and categories on Project Gutenberg, so there’s a lot of really really good stuff there.

At the point when you have corpora, you want to know what makes this interesting? What makes you want to make a bot with it? What is the part that will bring out something really cool in this text? So, how to manipulate? I look for what aspect is this text capturing. If it’s capturing things lexically, then you can extract certain parts of speech, like what kind of verbs are used in this text, what kind of nouns are used in this text? Or even go bigger than that. You can grab noun phrases that are being used, or just entire constituents of the sentence and then do something with those. You can either put those in a template or replace other text with these things. If it’s the actual language, the surface words that are interesting, these are strategies that I use for that kind of text.

Sometimes it’s the actual style and way that the text is written which is interesting. If the syntax itself is important to the tone I’m trying to go for, then generally I want to keep around the entire structure of the text and maybe use it as a template. I have a bot called @moonmurmur, and it’s very not safe for work. What it does is it pulls sentences from dirty stories and it replaces the nouns in those sentences with space words, and always at least one of the words would be moon. I think it does interesting stuff. So that’s an example of something where I really want to preserve the structure of the text that I’m getting but sort of twist the meaning of it around by replacing certain phrases in it.
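
A rough sketch of the noun-replacement idea, assuming the sentence has already been run through a part-of-speech tagger and arrives as (word, tag) pairs with Penn Treebank-style tags. The word list and the function name `spaceify` are made up for illustration; the talk only says the replacements are space words and that at least one is always “moon.”

```python
import random

# Hypothetical word list; only "moon" is named in the talk.
SPACE_WORDS = ["moon", "comet", "nebula", "orbit", "asteroid"]


def spaceify(tagged_tokens, rng=random):
    """Swap nouns for space words while keeping the sentence structure.

    tagged_tokens is a list of (word, tag) pairs. Returns None when the
    sentence has no common noun to replace.
    """
    noun_slots = [i for i, (_, tag) in enumerate(tagged_tokens)
                  if tag in ("NN", "NNS")]
    if not noun_slots:
        return None
    replacements = [rng.choice(SPACE_WORDS) for _ in noun_slots]
    # Guarantee at least one "moon" by overwriting a random replacement.
    if "moon" not in replacements:
        replacements[rng.randrange(len(replacements))] = "moon"
    words = [word for word, _ in tagged_tokens]
    for i, new_word in zip(noun_slots, replacements):
        is_plural = tagged_tokens[i][1] == "NNS"
        words[i] = new_word + "s" if is_plural else new_word
    # Tokens are joined with spaces; punctuation spacing is not handled here.
    return " ".join(words)
```

Everything that is not a noun passes through untouched, which is what keeps the syntax and tone of the source sentence intact.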

And then I threw this in, so what aspects are captured semantically? For @moonmurmur I want to keep the dirty stuff around, too, because that’s also what makes it interesting. So that’s sort of like whatever’s capturing meaning the best and that sort of bubbled up to the syntax. But sometimes that can be the words. That can be something else entirely. You kind of have to feel around for that. At that point it’s an intuitive thing, like what sticks out to you in this text the most?

The voice of my bots is really important to me. I put a lot of thought into how the tweets themselves are formatted. So you’ll see some of my bots will capitalize things, some of them won’t, and some of them use emoticons. I don’t think I have any that put emoji in things right now. I think that if you’re making a bot and you’re generating this content, it’s also important to think about how you want this bot to be “talking.” I feel like my more intimate bots, I always put them all in lowercase for some reason. I don’t know why. But if I didn’t reformat the text like that, I think they would sound very different and they wouldn’t be as effective.

So how about interaction with bots? This is my slide with the question marks, because I think there’s a lot of really really interesting things that can be done with @-interaction in language bots. And sometimes nothing because there are bots that are just designed to serve one purpose. Sometimes there are stock responses and that works well. In @whatsgamergate, I gave it a list of five words to say to people who actually respond to it, and they’re things like, “What? IDK.” and then ‘Gaters will totally keep talking to it, and explain to my bot. Then some bots just respond by doing what they always do, and in my interaction bots that’s the path that I take right now, for lack of having time to think about this better.

But taking user input into account is a very cool thing, and it’s something I think can be done a lot better with language right now. “_ebooks” bots tend to seed their replies with some word from the tweet, and that’s a thing that can work and sometimes not work. Olivia does interesting things where she’ll take a word from your tweet and then say some nonsense at you about it. A trick that I put into my bots which interact is to generate a reply, usually using the same algorithm that it’s using to generate a general tweet, but it will specifically address the person. So if you talk to @wikisext, for instance, I don’t know if people have noticed this but when it talks back to you it always says that it’s doing something to you or involving you in it in some way. I think that’s actually made it cooler to interact with.
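
One way to picture seeding a reply with a word from the incoming tweet is a tiny word-level Markov chain. This is only a sketch under that assumption: the talk does not say how any particular bot generates text, and `build_chain`, `generate`, and `reply_to` are hypothetical names. The same `generate` function serves for ordinary tweets and for replies, and the reply addresses the sender.

```python
import random
from collections import defaultdict


def build_chain(sentences):
    """Map each word to the list of words that follow it in the corpus."""
    chain = defaultdict(list)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return chain


def generate(chain, seed, max_words=12, rng=random):
    """Walk the chain from the seed word until a dead end or the word limit."""
    words = [seed]
    while len(words) < max_words and chain.get(words[-1]):
        words.append(rng.choice(chain[words[-1]]))
    return " ".join(words)


def reply_to(chain, username, incoming, rng=random):
    """Seed a reply with a word from the incoming tweet and address the sender."""
    candidates = [w for w in incoming.lower().split() if w in chain]
    # Fall back to any known word when the tweet shares nothing with the corpus.
    seed = rng.choice(candidates) if candidates else rng.choice(sorted(chain))
    return "@{} {}".format(username, generate(chain, seed, rng=rng))
```

With a very small corpus the output is mostly an echo of the source sentences; the seeding only starts to feel responsive once the chain has many branches.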

Here’s some other tricks that I use. I take corpora and just replace very specific pieces of it, that takes custom engineering depending on which corpora I’m working with. I also sometimes restructure text a bit. So in my bot @storyofglitch, who is a cat on Twitter, I search for tweets that say “My cat is blahblahblah” or “my cat just blahblahblah” and then I rewrite those so that the tenses of the verbs are in the first person rather than in the third person, and it’s like Glitch is actually doing those things. This is another case where the orthography thing comes in, where Glitch is kind of a bratty, dumb little kitty and the way that I made her talk is how I imagine a bratty, dumb little kitty would tweet. She certainly doesn’t spell things correctly a lot of the time and things like that.
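
A small sketch of the “my cat is / my cat just” rewrite, assuming the matching tweets are already in hand as strings. The pattern and the function name `to_first_person` are illustrative, and lowercasing the result is just a stand-in for the bot’s voice, not its actual formatting rules.

```python
import re

# Hypothetical pattern; the talk only names the two search phrases.
CAT_PATTERN = re.compile(r"\bmy cat (is|just)\s+(.+)", re.IGNORECASE)


def to_first_person(tweet):
    """Rewrite 'my cat is/just ...' so the cat is the speaker, or return None."""
    match = CAT_PATTERN.search(tweet)
    if match is None:
        return None
    verb = match.group(1).lower()
    rest = match.group(2).strip()
    subject = "i am" if verb == "is" else "i just"
    # Lowercasing stands in for the bot's deliberately sloppy voice.
    return "{} {}".format(subject, rest.lower())
```

This does nothing about pronouns later in the sentence (“my keyboard” stays “my keyboard”), which is one reason a simple rewrite like this will not get the grammar right every time.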

And sometimes there’s no perfect way to actually get the grammar right on things, and people have tweeted at me and they’re like, “This doesn’t make any sense. You have a bug here.” And I’m like, “It’s a bot thing. You know that I’m not writing these myself, okay?”

A lot of the stuff I talked about is still surface and lexical information, and that’s cool. You can get somewhere with that. But it’d be really neat to start using richer linguistic information and richer semantic information in bots. The tools exist out there to be doing this stuff. The problem is that a lot of them are hard to use. There are a few libraries that I know of. I think there’s NLP in Ruby, and there’s NLTK and Pattern and TextBlob in Python, which have some of this stuff, or baby versions of this stuff in there. Dependency parsing, for instance, would let you figure out things like the subject and object and indirect objects in the sentence if you’re going through corpora, and interesting things could be done with that.

In the IRC chat earlier there was a side thing about using semantic resources, and I think those are super interesting. Some of the stuff like WordNet information, you can already get through the Wordnik API. I’m not sure about FrameNet. FrameNet’s really interesting. It has a lot of information about verbs that you can play with. Freebase is a gigantic ontology that is ripe for doing bot stuff with. Then there’s a whole bunch of other stuff, and I hope that people are sharing things in the IRC chat that they know about.

Greg Borenstein: It seems like some of my favorite bots, a lot of their power comes from the almost double-exposure quality of seeing the source as well as— Like @wikisext, it reads as a sext, but I can also read it as instructional material and the co-existence of those is where the humor comes from. I wonder if you have any thoughts about that, because one direction is like fully leeching out the source, where the corpora came from, versus how much of the corpora [and the feeling that ends up existing?]

Thrice: That’s really interesting to think about. For @wikisext, in the bio of the bot it says I learned everything it knows from wikiHow, and it’s kind of clear to see where that comes from. In @moonmurmur, I don’t tell people the source of that bot, and I think it’s sort of led to a lot of people seeing retweets of that bot and thinking that it’s just a person running a weird moon account. So that’s one where I want to keep the source hidden, for various reasons. I guess it’s up to the bot maker what they’re going for and whether exposing the text they’re working with will add to the bot or not.

Audience 2: @wikisext just kind of works out, but do you ever have a bot where maybe 20% of, after you go through transforming—

Thrice: I have like twenty bots that that happens to.

Audience 2: [crosstalk; inaudible] If you just like, you know how you as a person can look at tweets from a bot that doesn’t quite work out and see this one’s good, these eight are bad. How about a way to have a program figure out “this worked out, and these are garbage.” That’d be awesome.

Thrice: Two ways. One of these is coming from me as a natural language processing person who works with machine learning a lot, which is you can just generate a few hundred of those and then manually go through and say “this one’s good, this one’s bad” and then use a machine learning package to see if you could actually get something to discriminate the two. But you’d have to design features for that and stuff. And then the other way would just be write a classifier yourself. By that I mean you could go through manually and mark ones [inaudible] again, and then maybe write a custom scoring function that takes in some string and then says “Is this grammatical enough? Is it using interesting words?” and stuff like that. I’ve never tried that personally, but I think I’ve wanted to do stuff like me ranking tweets or potential tweets [inaudible]

Greg: I do something a lot like that with my Uncanny X‑Bot which is like [?] sheet summaries of X‑Men comics. I was finding that exact same problem that one in every ten is really good, so I wrote a heuristic which is like it gets some points for being short, it gets some points for including better-known characters, that kind of thing. Then I filter ones that make it past a certain number and it got a lot better.
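
The kind of hand-written scoring heuristic discussed in this exchange might look roughly like the sketch below. The weights, the name list, and the threshold are all invented for illustration; neither speaker gives exact values, and this is not the code behind either bot.

```python
# Hypothetical names, weights, and threshold; the talk gives no exact values.
KNOWN_NAMES = {"wolverine", "storm", "cyclops"}


def score(text, max_len=140):
    """Give points for brevity and for mentioning well-known names."""
    points = 0.0
    # Up to 1 point for being short relative to the length limit.
    points += max(0.0, 1.0 - len(text) / max_len)
    # One point for each well-known name that appears.
    words = {w.strip(".,!?;:\"'").lower() for w in text.split()}
    points += len(words & KNOWN_NAMES)
    return points


def keep_good(candidates, threshold=1.5):
    """Keep candidates whose score clears the threshold, best first."""
    scored = [(score(c), c) for c in candidates]
    return [c for s, c in sorted(scored, reverse=True) if s >= threshold]
```

Generating a batch of candidates and posting only what `keep_good` returns is the filtering step; tuning the threshold by eye against a few dozen hand-marked examples is a cheap substitute for training a classifier.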

Further Reference

Thrice maintains a Twitter list of their bots.

Darius Kazemi’s home page for Bot Summit 2014, with YouTube links to individual sessions, and a log of the IRC channel.

Open Transcripts

presented by Thrice Dotted
in Bot Summit » Bot Summit 2014
on 11/08/2014
