My thing is that I guess my day job is that I’m a grad student and I work on natural language processing stuff. So my favorite kind of bot to make is one that plays with language and does stuff with it. I’m going to talk a little bit about how I choose resources to use in my bots that play with language and little tricks that I have for manipulating this language to make it do things.

I guess the two questions that I ask myself when I’m making a bot are what resources and corpora should I use, and how can I manipulate this in interesting ways to come up with tweets that are more signal than noise. A lot of the time it’s actually really hard to generate language that is signal and not noise. And I aim to get a lot of signal in my bots.

When I’m looking for corpora, I look for texts that have really interesting tones and styles, like when you read it you’re like “okay, this is from this genre.” There are a few different ways that I do this. Sometimes you can search for lexical patterns. You can go onto Twitter and, like with Rob’s [Dubbin] Olivia Taters bot… I don’t know exactly how he does it, but I have theories, and it looks like there are certain adverbs that he searches for, and these adverbs tend to show up in tweets that are I guess made by teens. And a tone just kind of arises just from searching for certain words, which is really interesting.

I also look at web sites with specific writing styles. For instance my bot @wikisext looks at the web site wikiHow and everything there is an instructional step, and that kind of fit well with the sext template, which sort of leads to this next point which is just texts in general that lend themselves to juxtaposition. Whether that’s juxtaposing it with some kind of template, or mashing it up with some other text, and I sort of call that “meta-templating” for lack of a better term, or hearing anyone else talk about it.

That was pretty vague, so actually finding corpora concretely. Twitter and other social media, obviously. This is where you can do things like these lexical searches to turn up interesting patterns and interesting language. So doing keyword searches or searches for phrases. Somebody mentioned in the IRC chat that I made the bot @whatsgamergate, and what that does is it searches Twitter for the phrase “is about” and then some words. I do a bit of filtering on that, and some of the filtering involves rejecting any tweets that say “ethics” and “games.” Then it takes the part that says “is about blah blah blah” and posts in some series of templates “Gamergate is about this thing,” which it’s not about, but God knows what it is about.
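The search, filter, and template steps described for @whatsgamergate could be sketched roughly as follows. This is not the bot's actual code: the regular expression, the `BLOCKED_WORDS` set, the `TEMPLATES` list, and the function names are all assumptions made for illustration, and the sketch rejects a tweet if it contains either blocked word, which is only one reading of the filter described in the talk.

```python
import re

# Hypothetical filter list and templates. The talk mentions rejecting tweets
# that say "ethics" and "games" but does not give the exact rules or templates.
BLOCKED_WORDS = {"ethics", "games"}
TEMPLATES = [
    "Gamergate is about {thing}",
    "Actually, Gamergate is about {thing}",
]

# Capture the words following "is about", up to sentence-ending punctuation.
IS_ABOUT = re.compile(r"\bis about\s+([^.!?\n]+)", re.IGNORECASE)


def extract_about(tweet):
    """Return the text following 'is about', or None if the tweet is rejected."""
    words = set(re.findall(r"[a-z']+", tweet.lower()))
    if words & BLOCKED_WORDS:
        return None
    match = IS_ABOUT.search(tweet)
    if match is None:
        return None
    thing = match.group(1).strip()
    return thing or None


def make_tweet(tweet, template_index=0):
    """Fill one of the templates with the extracted phrase, or return None."""
    thing = extract_about(tweet)
    if thing is None:
        return None
    template = TEMPLATES[template_index % len(TEMPLATES)]
    return template.format(thing=thing)
```

Fetching tweets is left out on purpose, since it depends on whatever Twitter client the bot uses; `make_tweet` only needs the text of one search result.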

Then there’s also scraping web sites. There’s so many little things you can get off so many web sites. [Slide lists IMDB, Craigslist posts, Wikipedia abstracts, ingredients from allrecipes, steps from wikiHow, reviews from Yelp.] As I was writing this I was like, “I gotta come up with examples,” and there’s just a zillion things you can get out there and twist into whatever you want, really. Some of these I’ve used, some of them I haven’t used, and some of them are things that I want to use and haven’t figured out a good way to use.

And you can also just get books off Project Gutenberg and lots of other places. They recently organized things into genres and categories on Project Gutenberg, so there’s a lot of really really good stuff there.


At the point when you have corpora, you want to know what makes this interesting. What makes you want to make a bot with it? What is the part that will bring out something really cool in this text? So, how to manipulate? I look for what aspect this text is capturing. If it’s capturing things lexically, then you can extract certain parts of speech, like what kind of verbs are used in this text, what kind of nouns are used in this text? Or even go bigger than that. You can grab noun phrases that are being used, or just entire constituents of the sentence and then do something with those. You can either put those in a template or replace other text with these things. If it’s the actual language, the surface words that are interesting, these are strategies that I use for that kind of text.


Sometimes it’s the actual style and way that the text is written which is interesting. If the syntax itself is important to the tone I’m trying to go for, then generally I want to keep around the entire structure of the text and maybe use it as a template. I have a bot called @moonmurmur, and it’s very not safe for work. What it does is it pulls sentences from dirty stories and it replaces the nouns in those sentences with space words, and always at least one of the words would be moon. I think it does interesting stuff. So that’s an example of something where I really want to preserve the structure of the text that I’m getting but sort of twist the meaning of it around by replacing certain phrases in it.
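The noun-swapping idea behind @moonmurmur can be sketched in a few lines. This is an illustration under stated assumptions, not the bot's implementation: `KNOWN_NOUNS` and `SPACE_WORDS` are made-up lists, and a fixed noun set stands in for the part-of-speech tagging a real bot would probably need. The sentence structure is kept intact, and one replacement is forced to be "moon".

```python
import random
import re

# Hypothetical word lists for illustration. A real bot would likely use a
# part-of-speech tagger to find nouns instead of a fixed set.
KNOWN_NOUNS = {"hand", "door", "room", "night", "window"}
SPACE_WORDS = ["moon", "comet", "nebula", "orbit", "star"]


def spaceify(sentence, rng=None):
    """Swap known nouns for space words, keeping the sentence structure.

    Returns None if the sentence has no replaceable nouns.
    """
    rng = rng or random.Random()
    # Split into alternating word and non-word tokens so that joining the
    # tokens again reproduces the original spacing and punctuation.
    tokens = re.findall(r"\w+|\W+", sentence)
    noun_positions = [
        i for i, tok in enumerate(tokens) if tok.lower() in KNOWN_NOUNS
    ]
    if not noun_positions:
        return None
    # Guarantee that at least one replacement is "moon".
    moon_position = rng.choice(noun_positions)
    for i in noun_positions:
        tokens[i] = "moon" if i == moon_position else rng.choice(SPACE_WORDS)
    return "".join(tokens)
```

Passing a seeded `random.Random` makes the output repeatable, which helps when reviewing a batch of candidates before letting the bot post.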


And then I threw this in, so what aspects are captured semantically? For @moonmurmur I want to keep the dirty stuff around, too, because that’s also what makes it interesting. So that’s sort of like whatever’s capturing meaning the best and that sort of bubbled up to the syntax. But sometimes that can be the words. That can be something else entirely. You kind of have to feel around for that. At that point it’s an intuitive thing, like what sticks out to you in this text the most?

The voice of my bots is really important to me. I put a lot of thought into how the tweets themselves are formatted. So you’ll see some of my bots will capitalize things, some of them won’t, and some of them use emoticons. I don’t think I have any that put emoji in things right now. I think that if you’re making a bot and you’re generating this content, it’s also important to think about how you want this bot to be “talking.” I feel like my more intimate bots, I always put them all in lowercase for some reason. I don’t know why. But if I didn’t reformat the text like that, I think they would sound very different and they wouldn’t be as effective.


So how about interaction with bots? This is my slide with the question marks, because I think there’s a lot of really really interesting things that can be done with @-interaction in language bots. And sometimes nothing, because there are bots that are just designed to serve one purpose. Sometimes there are stock responses and that works well. In @whatsgamergate, I gave it a list of five things to say to people who actually respond to it, and they’re things like, “What? IDK.” and then Gaters will totally keep talking to it, and explain to my bot. Then some bots just respond by doing what they always do, and in my interaction bots that’s the path that I take right now, for lack of having time to think about this better.

But taking user input into account is a very cool thing, and it’s something I think can be done a lot better with language right now. “_ebooks” bots tend to seed their replies with some word from the tweet, and that’s a thing that can work and sometimes not work. Olivia does interesting things where she’ll take a word from your tweet and then say some nonsense at you about it. A trick that I put into my bots which interact is to generate a reply, usually using the same algorithm that it’s using to generate a general tweet, but it will specifically address the person. So if you talk to @wikisext, for instance, I don’t know if people have noticed this but when it talks back to you it always says that it’s doing something to you or involving you in it in some way. I think that’s actually made it cooler to interact with.

Here’s some other tricks that I use. I take corpora and just replace very specific pieces of them; that takes custom engineering depending on which corpora I’m working with. I also sometimes restructure text a bit. So in my bot @storyofglitch, who is a cat on Twitter, I search for tweets that say “My cat is blahblahblah” or “my cat just blahblahblah” and then I rewrite those so that the verbs are in the first person rather than in the third person, and it’s like Glitch is actually doing those things. This is another case where the orthography thing comes in, where Glitch is kind of a bratty, dumb little kitty and the way that I made her talk is how I imagine a bratty, dumb little kitty would tweet. She certainly doesn’t spell things correctly a lot of the time and things like that.
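A rough sketch of the third-person to first-person rewrite described for @storyofglitch might look like this. Everything here is an assumption for illustration: the pattern, the `PRONOUN_SWAPS` table, the function name, and the choice to lowercase the output. It only handles the two phrasings mentioned in the talk and will get some sentences wrong, for example when "her" is an object rather than a possessive, or when a present-tense verb follows "just".

```python
import re

# Match "my cat is ..." or "my cat just ..." anywhere in the tweet.
MY_CAT = re.compile(r"\bmy cat (is|just)\s+(.+)", re.IGNORECASE)

# Hypothetical third-to-first-person swaps. This table is deliberately small
# and will mis-handle some cases, such as "her" used as an object.
PRONOUN_SWAPS = {
    "her": "my",
    "his": "my",
    "its": "my",
    "herself": "myself",
    "himself": "myself",
    "itself": "myself",
    "she": "i",
    "he": "i",
}


def glitchify(tweet):
    """Rewrite a 'my cat ...' tweet in the cat's own first-person voice.

    Returns None if the tweet does not match either phrasing.
    """
    match = MY_CAT.search(tweet)
    if match is None:
        return None
    verb = match.group(1).lower()
    rest = match.group(2)
    words = [PRONOUN_SWAPS.get(word.lower(), word) for word in rest.split()]
    lead = "i am" if verb == "is" else "i just"
    # Lowercasing everything is an assumed stand-in for the bot's orthography.
    return (lead + " " + " ".join(words)).lower()
```

For example, `glitchify("My cat is sleeping on her back")` gives `"i am sleeping on my back"`. Deliberate misspellings in the cat's voice would be a separate pass over the result.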

And sometimes there’s no perfect way to actually get the grammar right on things, and people have tweeted at me and they’re like, “This doesn’t make any sense. You have a bug here.” And I’m like, “It’s a bot thing. You know that I’m not writing these myself, okay?”

A lot of the stuff I talked about is still surface and lexical information, and that’s cool. You can get somewhere with that. But it’d be really neat to start using richer linguistic information and richer semantic information in bots. The tools exist out there to be doing this stuff. The problem is that a lot of them are hard to use. There are a few libraries that I know of. I think NLP in Ruby and there’s NLTK and Pattern and TextBlob in Python, which have some of this stuff, or baby versions of this stuff in there. Dependency parsing, for instance, would let you figure out things like the subject and object and indirect objects in the sentence if you’re going through corpora, and interesting things could be done with that.

In the IRC chat earlier there was a side thing about using semantic resources, and I think those are super interesting. Some of the stuff like WordNet information, you can already get through the Wordnik API. I’m not sure about FrameNet. FrameNet’s really interesting. It has a lot of information about verbs that you can play with. Freebase is a gigantic ontology that is ripe for doing bot stuff with. Then there’s a whole bunch of other stuff, and I hope that people are sharing things in the IRC chat that they know about.

Greg Borenstein: It seems like some of my favorite bots, a lot of their pow­er comes from the almost double-exposure qual­i­ty of see­ing the source as well as— Like @wikisext, it reads as a sext, but I can also read it as instruc­tion­al mate­r­i­al and the co-existence of those is where the humor comes from. I won­der if you have any thoughts about that, because one direc­tion is like ful­ly leech­ing out the source, where the cor­po­ra came from, ver­sus how much of the cor­po­ra [and the feel­ing that ends up existing?]

Thrice: That’s really interesting to think about. For @wikisext, in the bio of the bot it says it learned everything it knows from wikiHow, and it’s kind of clear to see where that comes from. In @moonmurmur, I don’t tell people the source of that bot, and I think it’s sort of led to a lot of people seeing retweets of that bot and thinking that it’s just a person running a weird moon account. So that’s one where I want to keep the source hidden, for various reasons. I guess it’s up to the bot maker what they’re going for and whether exposing the text they’re working with will add to the bot or not.

Audience 2: @wikisext just kind of works out, but do you ever have a bot where maybe 20% of, after you go through transforming—

Thrice: I have like twenty bots that that happens to.

Audience 2: [crosstalk; inaudible] If you just like, you know how you as a person can look at tweets from a bot that doesn’t quite work out and see “this one’s good, these eight are bad.” How about a way to have a program figure out “this worked out, and these are garbage.” That’d be awesome.

Thrice: Two ways. One of these is coming from me as a natural language processing person who works with machine learning a lot, which is you can just generate a few hundred of those and then manually go through and say “this one’s good, this one’s bad” and then use a machine learning package to see if you could actually get something to discriminate the two. But you’d have to design features for that and stuff. And then the other way would just be write a classifier yourself. By that I mean you could go through manually and mark ones [inaudible] again, and then maybe write a custom scoring function that takes in some string and then says “Is this grammatical enough? Is it using interesting words?” and stuff like that. I’ve never tried that personally, but I think I’ve wanted to do stuff like me ranking tweets or potential tweets [inaudible]

Greg: I do something a lot like that with my Uncanny X‑Bot which is like [?] sheet summaries of X‑Men comics. I was finding that exact same problem that one in every ten is really good, so I wrote a heuristic which is like it gets some points for being short, it gets some points for including better-known characters, that kind of thing. Then I filter ones that make it past a certain number and it got a lot better.
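The kind of hand-written scoring heuristic discussed in this exchange could be sketched as below. It is not the actual Uncanny X-Bot code: the `WELL_KNOWN` set, the point values, the length limit, and the threshold are all assumptions chosen for illustration. The idea is only that each candidate earns points for being short and for mentioning recognizable names, and that candidates under a cutoff are dropped.

```python
import re

# Hypothetical list of recognizable names; a real bot would have its own.
WELL_KNOWN = {"wolverine", "storm", "cyclops"}


def score(candidate, max_length=140):
    """Score a candidate tweet: shorter is better, known names add points."""
    points = 0.0
    # Up to one point for brevity, falling to zero at max_length characters.
    points += max(0.0, 1.0 - len(candidate) / max_length)
    # One point for each distinct well-known name mentioned.
    words = set(re.findall(r"[a-z]+", candidate.lower()))
    points += len(words & WELL_KNOWN)
    return points


def keep_good(candidates, threshold=1.5):
    """Return only the candidates whose score reaches the threshold."""
    return [c for c in candidates if score(c) >= threshold]
```

With these made-up weights, a candidate needs at least one known name to pass the default threshold. Tuning the weights against a hand-labeled batch of good and bad candidates is one way to choose them.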

Further Reference

Thrice maintains a Twitter list of their bots.

Darius Kazemi’s home page for Bot Summit 2014, with YouTube links to individual sessions, and a log of the IRC channel.