My thing is that I guess my day job is that I’m a grad student and I work on natural language processing stuff. So my favorite kind of bot to make is one that plays with language and does stuff with it. I’m going to talk a little bit about how I choose resources to use in my bots that play with language and little tricks that I have for manipulating this language to make it do things.

I guess the two questions that I ask myself when I’m making a bot are what resources and corpora should I use, and how can I manipulate this in interesting ways to come up with tweets that are more signal than noise. A lot of the time it’s actually really hard to generate language that is signal and not noise. And I aim to get a lot of signal in my bots.

When I’m looking for corpora, I look for texts that have really interesting tones and styles, like when you read it you’re like “okay this is from this genre.” There are a few different ways that I do this. Sometimes you can search for lexical patterns. You can go onto Twitter and, like with Rob’s [Dubbin] Olivia Taters bot… I don’t know exactly how he does it, but I have theories, and it looks like there are certain adverbs that he searches for, and these adverbs tend to show up in tweets that are I guess made by teens. And a tone just kind of arises just from searching for certain words, which is really interesting.

I also look at web sites with specific writing styles. For instance my bot @wikisext looks at the web site wikiHow and everything there is an instructional step, and that kind of fit well with the sex template, which sort of leads to this next point which is just texts in general that lend themselves to juxtaposition. Whether that’s juxtaposing it with some kind of template, or mashing it up with some other text, and I sort of call that “meta-templating” for lack of a better term, or hearing anyone else talk about it.

That was pretty vague, so actually finding corpora concretely. Twitter and other social media, obviously. This is where you can do things like these lexical searches to turn up interesting patterns and interesting language. So doing keyword searches or searches for phrases. Somebody mentioned in the IRC chat I made the bot @whatsgamergate, and what that does is it searches Twitter for the phrase “is about” and then some words. I do a bit of filtering on that and some of the filtering involves rejecting any tweets that say “ethics” and “games.” Then it takes the part that says “is about blah blah blah” and posts in some series of templates “Gamergate is about this thing” which it’s not about, but God knows what it is about.
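A minimal sketch of the search, filter, and template steps described above, not the bot's actual code. The template strings, the `BANNED_WORDS` set, and the function names are hypothetical, and it assumes a tweet is rejected if it contains either filtered word. Fetching tweets from Twitter is left out; the functions just take tweet text.

```python
import re

# Hypothetical filter list and templates. The talk only says that tweets
# mentioning "ethics" and "games" are rejected and that the bot posts in
# "some series of templates".
BANNED_WORDS = {"ethics", "games"}
TEMPLATES = [
    "Gamergate is about {thing}",
    "Actually, Gamergate is about {thing}",
]

# Capture the words that follow "is about", up to clause-ending punctuation.
IS_ABOUT = re.compile(r"\bis about\s+([^.!?\n]+)", re.IGNORECASE)


def extract_topic(tweet):
    """Return the text after "is about", or None if the tweet is unusable."""
    words = set(re.findall(r"[a-z']+", tweet.lower()))
    if words & BANNED_WORDS:
        return None  # assumption: either banned word is enough to reject
    match = IS_ABOUT.search(tweet)
    if match is None:
        return None
    topic = match.group(1).strip()
    return topic or None


def make_tweet(tweet, template_index=0):
    """Drop the extracted topic into one of the templates."""
    topic = extract_topic(tweet)
    if topic is None:
        return None
    template = TEMPLATES[template_index % len(TEMPLATES)]
    return template.format(thing=topic)
```

For example, `make_tweet("My whole weekend is about doing laundry. Ugh")` keeps only the clause after "is about" and places it in the first template.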

Then there’s also scraping web sites. There’s so many little things you can get off so many web sites. [Slide lists IMDB, Craigslist posts, Wikipedia abstracts, ingredients from allrecipes, steps from wikiHow, reviews from Yelp.] As I was writing this I was like, “I gotta come up with examples” and there’s just a zillion things you can get out there and twist into whatever you want, really. Some of these I’ve used, some of them I haven’t used, and some of them are things that I want to use and haven’t figured out a good way to use.

And you can also just get books off Project Gutenberg and lots of other places. They recently organized things into genres and categories on Project Gutenberg, so there’s a lot of really really good stuff there.


At the point when you have corpora, you want to know what makes this interesting? What makes you want to make a bot with it? What is the part that will bring out something really cool in this text? So, how to manipulate? I look for what aspect is this text capturing. If it’s capturing things lexically, then you can extract certain parts of speech, like what kind of verbs are used in this text, what kind of nouns are used in this text? Or even go bigger than that. You can grab noun phrases that are being used, or just entire constituents of the sentence and then do something with those. You can either put those in a template or replace other text with these things. If it’s the actual language, the surface words that are interesting, these are strategies that I use for that kind of text.


Sometimes it’s the actual style and way that the text is written which is interesting. If the syntax itself is important to the tone I’m trying to go for, then generally I want to keep around the entire structure of the text and maybe use it as a template. I have a bot called @moonmurmur, and it’s very not safe for work. What it does is it pulls sentences from dirty stories and it replaces the nouns in those sentences with space words, and always at least one of the words would be moon. I think it does interesting stuff. So that’s an example of something where I really want to preserve the structure of the text that I’m getting but sort of twist the meaning of it around by replacing certain phrases in it.
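The noun-replacement idea can be sketched roughly as follows. This is an illustration under assumptions, not the bot's implementation: it assumes the sentence has already been part-of-speech tagged by some tagger (the input is a list of `(word, tag)` pairs with Penn Treebank-style tags), and the `SPACE_WORDS` list and function name are made up for the example.

```python
import random

# Hypothetical word list. The talk only says "space words", with at least
# one replacement always being "moon".
SPACE_WORDS = ["moon", "comet", "nebula", "orbit", "asteroid"]


def replace_nouns(tagged_tokens, rng):
    """Swap each common noun for a space word, forcing at least one "moon".

    tagged_tokens is a list of (word, tag) pairs where "NN" marks a singular
    noun and "NNS" a plural noun. Everything else is kept in place, so the
    sentence structure survives.
    """
    noun_positions = [
        i for i, (_, tag) in enumerate(tagged_tokens) if tag in ("NN", "NNS")
    ]
    if not noun_positions:
        return None  # nothing to replace, so skip this sentence
    moon_position = rng.choice(noun_positions)
    words = []
    for i, (word, tag) in enumerate(tagged_tokens):
        if i in noun_positions:
            word = "moon" if i == moon_position else rng.choice(SPACE_WORDS)
            if tag == "NNS":
                word += "s"  # naive pluralization, fine for this word list
        words.append(word)
    return " ".join(words)
```

Passing in a `random.Random` instance keeps the output reproducible while testing. Joining on spaces is a simplification that ignores punctuation spacing.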


And then I threw this in, so what aspects are captured semantically? For @moonmurmur I want to keep the dirty stuff around, too, because that’s also what makes it interesting. So that’s sort of like whatever’s capturing meaning the best and that sort of bubbled up to the syntax. But sometimes that can be the words. That can be something else entirely. You kind of have to feel around for that. At that point it’s an intuitive thing, like what sticks out to you in this text the most?

The voice of my bots is really important to me. I put a lot of thought into how the tweets themselves are formatted. So you’ll see some of my bots will capitalize things, some of them won’t, and some of them use emoticons. I don’t think I have any that put emoji in things right now. I think that if you’re making a bot and you’re generating this content, it’s also important to think about how you want this bot to be “talking.” I feel like my more intimate bots, I always put them all in lowercase for some reason. I don’t know why. But if I didn’t reformat the text like that, I think they would sound very different and they wouldn’t be as effective.


So how about interaction with bots? This is my slide with the question marks, because I think there’s a lot of really really interesting things that can be done with @-interaction in language bots. And sometimes nothing because there are bots that are just designed to serve one purpose. Sometimes there are stock responses and that works well. In @whatsgamergate, I gave it a list of five words to say to people who actually respond to it, and they’re things like, “What? IDK.” and then Gaters will totally keep talking to it, and explain to my bot. Then some bots just respond by doing what they always do, and in my interaction bots that’s the path that I take right now, for lack of having time to think about this better.

But taking user input into account is a very cool thing, and it’s something I think can be done a lot better with language right now. “_ebooks” bots tend to seed their replies with some word from the tweet, and that’s a thing that can work and sometimes not work. Olivia does interesting things where she’ll take a word from your tweet and then say some nonsense at you about it. A trick that I put into my bots which interact is that, to generate a reply, it’s usually using the same algorithm that it’s using to generate a general tweet, but it will specifically address the person. So if you talk to @wikisext, for instance, I don’t know if people have noticed this but when it talks back to you it always says that it’s doing something to you or involving you in it in some way. I think that’s actually made it cooler to interact with.
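One way to picture "same algorithm, but address the person" is a generator with a slot that gets filled differently for replies. This is only a sketch; the `{obj}` slot convention, the generic filler "it", and the function name are assumptions made for the example, not how any of the bots above are actually written.

```python
def generate(action, reply_to=None):
    """Build a tweet from an action phrase that contains an {obj} slot.

    With no reply_to, the slot is filled generically. With a handle, the
    same phrase is aimed at the person and prefixed with an @-mention.
    """
    if reply_to is None:
        return action.format(obj="it")
    return "@{} {}".format(reply_to, action.format(obj="you"))
```

So `generate("i wrap {obj} in a warm towel")` is a general tweet, and `generate("i wrap {obj} in a warm towel", reply_to="friend")` is the same phrase turned toward whoever @-mentioned the bot.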

Here’s some other tricks that I use. I take corpora and just replace very specific pieces of it; that takes custom engineering depending on which corpora I’m working with. I also sometimes restructure text a bit. So in my bot @storyofglitch, who is a cat on Twitter, I search for tweets that say “My cat is blahblahblah” or “my cat just blahblahblah” and then I rewrite those so that the tenses of the verbs are in the first person rather than in the third person, and it’s like Glitch is actually doing those things. This is another case where the orthography thing comes in, where Glitch is kind of a bratty, dumb little kitty and the way that I made her talk is how I imagine a bratty, dumb little kitty would tweet. She certainly doesn’t spell things correctly a lot of the time and things like that.
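A rough sketch of that kind of rewrite, using only regular expressions. It is a simplification under assumptions: the pattern, the pronoun swap, and the all-lowercase output are illustrative choices, not the bot's real rules, and it makes no attempt to re-conjugate verbs.

```python
import re

# Matches "my cat is ..." or "my cat just ..." anywhere in a tweet and
# captures the rest of the clause up to sentence-ending punctuation.
MY_CAT = re.compile(r"\bmy cat (is|just)\s+([^.!?\n]+)", re.IGNORECASE)


def to_first_person(tweet):
    """Rewrite a third-person cat report as if the cat were speaking."""
    match = MY_CAT.search(tweet)
    if match is None:
        return None
    verb = match.group(1).lower()
    rest = match.group(2).strip()
    subject = "i am" if verb == "is" else "i just"
    # Naive assumption: his/her/its in the clause refer to the cat, so they
    # become "my". This is wrong when "her" is an object or another person.
    rest = re.sub(r"\b(his|her|its)\b", "my", rest, flags=re.IGNORECASE)
    # Lowercasing everything is a guess at the "voice", not a quoted rule.
    return "{} {}".format(subject, rest).lower()
```

Present-tense input such as "my cat just sits there" comes out as "i just sits there", which is the kind of imperfect grammar a surface-level rewrite like this cannot avoid.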

And sometimes there’s no perfect way to actually get the grammar right on things, and people have tweeted at me and they’re like, “This doesn’t make any sense. You have a bug here.” And I’m like, “It’s a bot thing. You know that I’m not writing these myself, okay?”

A lot of the stuff I talked about is still surface and lexical information, and that’s cool. You can get somewhere with that. But it’d be really neat to start using richer linguistic information and richer semantic information in bots. The tools exist out there to be doing this stuff. The problem is that a lot of them are hard to use. There are a few libraries that I know of. I think NLP in Ruby and there’s NLTK and Pattern and TextBlob in Python, which have some of this stuff, or baby versions of this stuff in there. Dependency parsing, for instance, would let you figure out things like the subject and object and indirect objects in the sentence if you’re going through corpora, and interesting things could be done with that.

In the IRC chat earlier there was a side thing about using semantic resources, and I think those are super interesting. Some of the stuff like WordNet information, you can already get through the Wordnik API. I’m not sure about FrameNet. FrameNet’s really interesting. It has a lot of information about verbs that you can play with. Freebase is a gigantic ontology that is ripe for doing bot stuff with. Then there’s a whole bunch of other stuff, and I hope that people are sharing things in the IRC chat that they know about.


Greg Borenstein: It seems like some of my favorite bots, a lot of their power comes from the almost double-exposure quality of seeing the source as well as— Like @wikisext, it reads as a sext, but I can also read it as instructional material and the co-existence of those is where the humor comes from. I wonder if you have any thoughts about that, because one direction is like fully leeching out the source, where the corpora came from, versus how much of the corpora [and the feeling that ends up existing?]

Thrice: That's really interesting to think about. For @wikisext, in the bio of the bot it says it learned everything it knows from wikiHow, and it's kind of clear to see where that comes from. In @moonmurmur, I don't tell people the source of that bot, and I think it's sort of led to a lot of people seeing retweets of that bot and thinking that it's just a person running a weird moon account. So that's one where I want to keep the source hidden, for various reasons. I guess it's up to the bot maker what they're going for and whether exposing the text they're working with will add to the bot or not.

Audience 2: @wikisext just kind of works out, but do you ever have a bot where maybe 20% of, after you go through transforming—

Thrice: I have like twenty bots that that happens to.

Audience 2: [crosstalk; inaudible] If you just like, you know how you as a person can look at tweets from a bot that doesn't quite work out and see this one's good, these eight are bad. How about a way to have a program figure out "this worked out, and these are garbage." That'd be awesome.

Thrice: Two ways. One of these is coming from me as a natural language processing person who works with machine learning a lot, which is you can just generate a few hundred of those and then manually go through and say "this one's good, this one's bad" and then use a machine learning package to see if you could actually get something to discriminate the two. But you'd have to design features for that and stuff. And then the other way would just be write a classifier yourself. By that I mean you could go through manually and mark ones [inaudible] again, and then maybe write a custom scoring function that takes in some string and then says "Is this grammatical enough? Is it using interesting words?" and stuff like that. I've never tried that personally, but I think I've wanted to do stuff like me ranking tweets or potential tweets [inaudible]

Greg: I do something a lot like that with my Uncanny X-Bot which is like [?] sheet summaries of X-Men comics. I was finding that exact same problem that one in every ten is really good, so I wrote a heuristic which is like it gets some points for being short, it gets some points for including better-known characters, that kind of thing. Then I filter ones that make it past a certain number and it got a lot better.
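The heuristic Greg describes, and the custom scoring function mentioned just before it, could look something like the sketch below. The weights, the threshold, the length limit, and the `WELL_KNOWN` name list are all hypothetical; the talk gives no numbers and no code.

```python
# Hypothetical names, weights, and threshold chosen for illustration.
WELL_KNOWN = {"wolverine", "storm", "cyclops"}
MAX_LENGTH = 140


def score(candidate):
    """Give points for being short and for naming better-known characters."""
    points = 0.0
    # Shorter candidates earn up to 1 point; at or over the limit earns 0.
    points += max(0.0, 1.0 - len(candidate) / MAX_LENGTH)
    # One point for each distinct well-known name that appears.
    words = {w.strip(".,!?\"'").lower() for w in candidate.split()}
    points += len(words & WELL_KNOWN)
    return points


def keep_good(candidates, threshold=1.5):
    """Keep only the candidates whose score clears the threshold."""
    return [c for c in candidates if score(c) >= threshold]
```

Generating many candidates and posting only what `keep_good` returns is the filtering step. A learned classifier would replace `score` with a model trained on hand-labelled good and bad examples.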

Further Reference

Thrice maintains a Twitter list of their bots.

Darius Kazemi's home page for Bot Summit 2014, with YouTube links to individual sessions, and a log of the IRC channel.
