http://youtu.be/4CsYtensv94?t=56m36s

My thing is that I guess my day job is that I’m a grad stu­dent and I work on nat­ur­al lan­guage pro­cess­ing stuff. So my favorite kind of bot to make is one that plays with lan­guage and does stuff with it. I’m going to talk a lit­tle bit about how I choose resources to use in my bots that play with lan­guage and lit­tle tricks that I have for manip­u­lat­ing this lan­guage to make it do things.

I guess the two ques­tions that I ask myself when I’m mak­ing a bot are what resources and cor­po­ra should I use, and how can I manip­u­late this in inter­est­ing ways to come up with tweets that are more sig­nal than noise. A lot of the time it’s actu­al­ly real­ly hard to gen­er­ate lan­guage that is sig­nal and not noise. And I aim to get a lot of sig­nal in my bots.

When I’m looking for corpora, I look for texts that have really interesting tones and styles, like when you read it you’re like “okay, this is from this genre.” There are a few different ways that I do this. Sometimes you can search for lexical patterns. You can go onto Twitter and, like with Rob’s [Dubbin] Olivia Taters bot… I don’t know exactly how he does it, but I have theories, and it looks like there are certain adverbs that he searches for, and these adverbs tend to show up in tweets that are I guess made by teens. And a tone just kind of arises just from searching for certain words, which is really interesting.

I also look at web sites with specific writing styles. For instance my bot @wikisext looks at the web site wikiHow and everything there is an instructional step, and that kind of fit well with the sex template, which sort of leads to this next point which is just texts in general that lend themselves to juxtaposition. Whether that’s juxtaposing it with some kind of template, or mashing it up with some other text, and I sort of call that “meta-templating” for lack of a better term, or hearing anyone else talk about it.

That was pretty vague, so actually finding corpora concretely. Twitter and other social media, obviously. This is where you can do things like these lexical searches to turn up interesting patterns and interesting language. So doing keyword searches or searches for phrases. Somebody mentioned in the IRC chat I made the bot @whatsgamergate, and what that does is it searches Twitter for the phrase “is about” and then some words. I do a bit of filtering on that and some of the filtering involves rejecting any tweets that say “ethics” and “games.” Then it takes the part that says “is about blah blah blah” and posts in some series of templates “Gamergate is about this thing” which it’s not about, but God knows what it is about.
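An illustrative sketch of the search-filter-template pipeline described above, not the bot's actual code. The template strings, the function names, and the choice to reject a tweet when either blocked word appears are assumptions; fetching tweets from Twitter is left out, so the functions take plain strings.

```python
import re

# Hypothetical templates; the talk does not list the bot's real ones.
TEMPLATES = [
    "Gamergate is about {thing}",
    "Actually, Gamergate is about {thing}",
]

# Words that cause a tweet to be rejected. Assumption: either word is enough.
BLOCKED = {"ethics", "games"}

# Capture whatever follows the phrase "is about".
ABOUT = re.compile(r"\bis about\s+(.+)", re.IGNORECASE)


def extract_about(tweet):
    """Return the text following 'is about', or None if the tweet is rejected."""
    words = set(re.findall(r"[a-z']+", tweet.lower()))
    if words & BLOCKED:
        return None
    match = ABOUT.search(tweet)
    if match is None:
        return None
    # Keep only the first clause and trim stray punctuation.
    thing = re.split(r"[.!?\n]", match.group(1))[0].strip(" ,;:")
    return thing or None


def make_tweets(tweets, template=TEMPLATES[0]):
    """Fill the template with every usable 'is about' fragment."""
    results = []
    for tweet in tweets:
        thing = extract_about(tweet)
        if thing is not None:
            results.append(template.format(thing=thing))
    return results
```

Calling `make_tweets` on a list of search results returns one templated tweet per usable match; a real bot would probably also rotate through the templates and drop very long fragments.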

Then there’s also scraping web sites. There’s so many little things you can get off so many web sites. [Slide lists IMDB, Craigslist posts, Wikipedia abstracts, ingredients from allrecipes, steps from wikiHow, reviews from Yelp.] As I was writing this I was like, “I gotta come up with examples” and there’s just a zillion things you can get out there and twist into whatever you want, really. Some of these I’ve used, some of them I haven’t used, and some of them are things that I want to use and haven’t figured out a good way to use.

And you can also just get books off Project Gutenberg and lots of oth­er places. They recent­ly orga­nized things into gen­res and cat­e­gories on Project Gutenberg, so there’s a lot of real­ly real­ly good stuff there.

At the point when you have cor­po­ra, you want to know what makes this inter­est­ing? What makes you want to make a bot with it? What is the part that will bring out some­thing real­ly cool in this text? So, how to manip­u­late? I look for what aspect is this text cap­tur­ing. If it’s cap­tur­ing things lex­i­cal­ly, then you can extract cer­tain parts of speech, like what kind of verbs are used in this text, what kind of nouns are used in this text? Or even go big­ger than that. You can grab noun phras­es that are being used, or just entire con­stituents of the sen­tence and then do some­thing with those. You can either put those in a tem­plate or replace oth­er text with these things. If it’s the actu­al lan­guage, the sur­face words that are inter­est­ing, these are strate­gies that I use for that kind of text.

Sometimes it’s the actu­al style and way that the text is writ­ten which is inter­est­ing. If the syn­tax itself is impor­tant to the tone I’m try­ing to go for, then gen­er­al­ly I want to keep around the entire struc­ture of the text and maybe use it as a tem­plate. I have a bot called @moonmurmur, and it’s very not safe for work. What it does is it pulls sen­tences from dirty sto­ries and it replaces the nouns in those sen­tences with space words, and always at least one of the words would be moon. I think it does inter­est­ing stuff. So that’s an exam­ple of some­thing where I real­ly want to pre­serve the struc­ture of the text that I’m get­ting but sort of twist the mean­ing of it around by replac­ing cer­tain phras­es in it.
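A minimal sketch of that keep-the-structure, swap-the-nouns idea, under loud assumptions: the word lists are made up, and a fixed set of known nouns stands in for a real part-of-speech tagger (such as the ones in NLTK or TextBlob). It is not the bot's code.

```python
import random
import re

# Hypothetical word lists; a real bot would find nouns with a
# part-of-speech tagger instead of a fixed set.
KNOWN_NOUNS = {"hand", "door", "room", "letter", "garden"}
SPACE_WORDS = ["moon", "comet", "nebula", "asteroid", "orbit"]


def replace_nouns(sentence, rng=random):
    """Swap known nouns for space words, guaranteeing at least one 'moon'."""
    # Split into runs of word and non-word characters so that joining the
    # tokens back together preserves spacing and punctuation exactly.
    tokens = re.findall(r"\w+|\W+", sentence)
    noun_positions = [i for i, tok in enumerate(tokens)
                      if tok.lower() in KNOWN_NOUNS]
    if not noun_positions:
        return sentence  # nothing to replace, so leave the sentence alone
    # One slot is forced to be "moon"; the rest are picked at random.
    moon_slot = rng.choice(noun_positions)
    for i in noun_positions:
        tokens[i] = "moon" if i == moon_slot else rng.choice(SPACE_WORDS)
    return "".join(tokens)
```

Because only the noun tokens change, the syntax of the source sentence survives intact, which is the point of using the text itself as the template.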

And then I threw this in, so what aspects are cap­tured seman­ti­cal­ly? For @moonmurmur I want to keep the dirty stuff around, too, because that’s also what makes it inter­est­ing. So that’s sort of like what­ev­er’s cap­tur­ing mean­ing the best and that sort of bub­bled up to the syn­tax. But some­times that can be the words. That can be some­thing else entire­ly. You kind of have to feel around for that. At that point it’s an intu­itive thing, like what sticks out to you in this text the most?

The voice of my bots is really important to me. I put a lot of thought into how the tweets themselves are formatted. So you’ll see some of my bots will capitalize things, some of them won’t, and some of them use emoticons. I don’t think I have any that put emoji in things right now. I think that if you’re making a bot and you’re generating this content, it’s also important to think about how you want this bot to be “talking.” I feel like my more intimate bots, I always put them all in lowercase for some reason. I don’t know why. But if I didn’t reformat the text like that, I think they would sound very different and they wouldn’t be as effective.

So how about interaction with bots? This is my slide with the question marks, because I think there’s a lot of really really interesting things that can be done with @-interaction in language bots. And sometimes nothing because there are bots that are just designed to serve one purpose. Sometimes there are stock responses and that works well. In @whatsgamergate, I gave it a list of five words to say to people who actually respond to it, and they’re things like, “What? IDK.” and then Gaters will totally keep talking to it, and explain to my bot. Then some bots just respond by doing what they always do, and in my interaction bots that’s the path that I take right now, for lack of having time to think about this better.

But taking user input into account is a very cool thing, and it’s something I think can be done a lot better with language right now. “_ebooks” bots tend to seed their replies with some word from the tweet, and that’s a thing that can work and sometimes not work. Olivia does interesting things where she’ll take a word from your tweet and then say some nonsense at you about it. A trick that I put into my bots which interact is to generate a reply, usually using the same algorithm that it’s using to generate a general tweet, but it will specifically address the person. So if you talk to @wikisext, for instance, I don’t know if people have noticed this but when it talks back to you it always says that it’s doing something to you or involving you in it in some way. I think that’s actually made it cooler to interact with.
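A small sketch of that reply trick: the reply path reuses the general generator and then involves the person being answered. The `ACTIONS` list, both function names, and the "with you" phrasing are hypothetical stand-ins, not how @wikisext actually builds its replies.

```python
import random

# Hypothetical stand-in for whatever the bot normally draws from.
ACTIONS = ["fold the paper in half", "preheat the oven", "label each box"]


def generate_tweet(rng=random):
    """Generate a general tweet the way the bot always does."""
    return "i " + rng.choice(ACTIONS)


def generate_reply(username, rng=random):
    """Reuse the general generator, but involve the person being answered."""
    tweet = generate_tweet(rng)
    # Strip a leading "@" so the handle is not doubled in the output.
    return "@{} {} with you".format(username.lstrip("@"), tweet)
```

Keeping one generator for both paths means replies stay in the same voice as ordinary tweets; only the final framing changes.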

Here’s some other tricks that I use. I take corpora and just replace very specific pieces of it; that takes custom engineering depending on which corpora I’m working with. I also sometimes restructure text a bit. So in my bot @storyofglitch, who is a cat on Twitter, I search for tweets that say “My cat is blahblahblah” or “my cat just blahblahblah” and then I rewrite those so that the tenses of the verbs are in the first person rather than in the third person, and it’s like Glitch is actually doing those things. This is another case where the orthography thing comes in, where Glitch is kind of a bratty, dumb little kitty and the way that I made her talk is how I imagine a bratty, dumb little kitty would tweet. She certainly doesn’t spell things correctly a lot of the time and things like that.
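A rough sketch of the third-person-to-first-person rewrite, assuming only the two search phrases mentioned above. The pronoun table, the lowercasing, and the function name are assumptions for illustration; the bot's real rules are not given in the talk.

```python
import re

# Matches "my cat is ..." or "my cat just ..." anywhere in a tweet.
CAT_PATTERN = re.compile(r"\bmy cat (is|just)\s+(.+)", re.IGNORECASE)

# Third-person pronouns that should now point back at the cat herself.
# This is naive: "her" as an object ("pet her") would also become "my".
PRONOUNS = {"her": "my", "his": "my", "its": "my",
            "herself": "myself", "himself": "myself", "itself": "myself"}


def to_first_person(tweet):
    """Rewrite 'my cat is/just X' as a first-person statement, or return None."""
    match = CAT_PATTERN.search(tweet)
    if match is None:
        return None
    link, rest = match.group(1).lower(), match.group(2).strip()
    # "is" has to agree with "i"; after "just" the verb is usually past
    # tense, which has the same form in first and third person.
    lead = "i am" if link == "is" else "i just"
    words = [PRONOUNS.get(word.lower(), word) for word in rest.split()]
    # Lowercasing everything is a voice choice, not a grammar rule.
    return "{} {}".format(lead, " ".join(words)).lower()
```

Rules this simple will sometimes produce ungrammatical output, which matches the caveat that there is no perfect way to get the grammar right.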

And sometimes there’s no perfect way to actually get the grammar right on things, and people have tweeted at me and they’re like, “This doesn’t make any sense. You have a bug here.” And I’m like, “It’s a bot thing. You know that I’m not writing these myself, okay?”

A lot of the stuff I talked about is still sur­face and lex­i­cal infor­ma­tion, and that’s cool. You can get some­where with that. But it’d be real­ly neat to start using rich­er lin­guis­tic infor­ma­tion and rich­er seman­tic infor­ma­tion in bots. The tools exist out there to be doing this stuff. The prob­lem is that a lot of them are hard to use. There are a few libraries that I know of. I think NLP in Ruby and there’s NLTK and Pattern and TextBlob in Python, which have some of this stuff, or baby ver­sions of this stuff in there. Dependency pars­ing, for instance, would let you fig­ure out things like the sub­ject and object and indi­rect objects in the sen­tence if you’re going through cor­po­ra, and inter­est­ing things could be done with that.

In the IRC chat ear­li­er there was a side thing about using seman­tic resources, and I think those are super inter­est­ing. Some of the stuff like WordNet infor­ma­tion, you can already get through the Wordnik API. I’m not sure about FrameNet. FrameNet’s real­ly inter­est­ing. It has a lot of infor­ma­tion about verbs that you can play with. Freebase is a gigan­tic ontol­ogy that is ripe for doing bot stuff with. Then there’s a whole bunch of oth­er stuff, and I hope that peo­ple are shar­ing things in the IRC chat that they know about.


Greg Borenstein: It seems like some of my favorite bots, a lot of their pow­er comes from the almost double-exposure qual­i­ty of see­ing the source as well as— Like @wikisext, it reads as a sext, but I can also read it as instruc­tion­al mate­r­i­al and the co-existence of those is where the humor comes from. I won­der if you have any thoughts about that, because one direc­tion is like ful­ly leech­ing out the source, where the cor­po­ra came from, ver­sus how much of the cor­po­ra [and the feel­ing that ends up exist­ing?]

Thrice: That’s really interesting to think about. For @wikisext, in the bio of the bot it says it learned everything it knows from wikiHow, and it’s kind of clear to see where that comes from. In @moonmurmur, I don’t tell people the source of that bot, and I think it’s sort of led to a lot of people seeing retweets of that bot and thinking that it’s just a person running a weird moon account. So that’s one where I want to keep the source hidden, for various reasons. I guess it’s up to the bot maker what they’re going for and whether exposing the text they’re working with will add to the bot or not.

Audience 2: @wikisext just kind of works out, but do you ever have a bot where maybe 20% of, after you go through trans­form­ing—

Thrice: I have like twen­ty bots that that hap­pens to.

Audience 2: [crosstalk; inaudible] If you just like, you know how you as a person can look at tweets from a bot that doesn’t quite work out and see this one’s good, these eight are bad. How about a way to have a program figure out “this worked out, and these are garbage.” That’d be awesome.

Thrice: Two ways. One of these is coming from me as a natural language processing person who works with machine learning a lot, which is you can just generate a few hundred of those and then manually go through and say “this one’s good, this one’s bad” and then use a machine learning package to see if you could actually get something to discriminate the two. But you’d have to design features for that and stuff. And then the other way would just be write a classifier yourself. By that I mean you could go through manually and mark ones [inaudible] again, and then maybe write a custom scoring function that takes in some string and then says “Is this grammatical enough? Is it using interesting words?” and stuff like that. I’ve never tried that personally, but I think I’ve wanted to do stuff like me ranking tweets or potential tweets [inaudible]

Greg: I do some­thing a lot like that with my Uncanny X‑Bot which is like [?] sheet sum­maries of X‑Men comics. I was find­ing that exact same prob­lem that one in every ten is real­ly good, so I wrote a heuris­tic which is like it gets some points for being short, it gets some points for includ­ing better-known char­ac­ters, that kind of thing. Then I fil­ter ones that make it past a cer­tain num­ber and it got a lot bet­ter.
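A sketch of the kind of hand-written scoring heuristic discussed in this exchange: points for being short, points for mentioning well-known names, then a threshold filter. The weights, the length cut-offs, the name list, and the threshold are all invented for illustration and are not taken from either speaker's bot.

```python
# Hypothetical weights and word list, chosen only for illustration.
KNOWN_NAMES = {"wolverine", "storm", "cyclops"}
MAX_LENGTH = 140


def score(candidate):
    """Give points for brevity and for mentioning well-known names."""
    points = 0
    if len(candidate) <= 80:
        points += 2
    elif len(candidate) <= MAX_LENGTH:
        points += 1
    # Compare lowercase words with surrounding punctuation stripped.
    words = {w.strip(".,!?;:\"'").lower() for w in candidate.split()}
    points += 2 * len(words & KNOWN_NAMES)
    return points


def keep_good(candidates, threshold=3):
    """Keep only candidates whose score reaches the threshold."""
    return [c for c in candidates if score(c) >= threshold]
```

The same `score` function could feed the machine-learning route instead: each rule becomes a feature, and hand-labelled good and bad tweets decide the weights.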

Further Reference

Thrice main­tains a Twitter list of their bots.

Darius Kazemi’s home page for Bot Summit 2014, with YouTube links to indi­vid­ual ses­sions, and a log of the IRC chan­nel.

