Carl Malamud: Internet Talk Radio, flame of the Internet.

Malamud: We’re talking to Mike Schwartz, who’s a professor at the University of Colorado in Boulder, and Mike is the author of a resource discovery tool called netfind. Why don’t you tell us what netfind does, Mike.

Mike Schwartz: Well netfind is a user directory service on the Internet, and basically what it does is let you search for people wherever they happen to be on the Internet by specifying basically who you’re looking for and something about where they are—maybe the name of their company or the city they live in.

Malamud: So this sounds like X.500.

Schwartz: Well, X.500 also addresses this problem. Actually X.500 addresses a much larger problem than just searching for people on the Internet. Netfind takes sort of a more practical but more short-term approach to the problem. And in fact it was never intended to sort of provide global ubiquitous directory service in the long term. I’ve always seen sort of the user searching part of it as a short-term solution. However the ideas behind netfind I think have much broader impact. And one of the things that’s missing from X.500 that’s present in netfind is basically a data-gathering architecture, a way to collect information from lots of different sites so that you can provide a much more ubiquitous service, something that lets people find other people all over the place instead of just at a few sites that happen to be running servers.

Malamud: So as a user I might log into netfind and type the words “Malamud radio Washington.”

Schwartz: That’s right.

Malamud: What happens behind the scenes? How does netfind turn around and feed me back an electronic mail address?

Schwartz: Well what happens is actually two phases. One part is collecting information about all the different sites that can be found on the Internet through a variety of different means. And this sort of happens in the background continuously, monitoring lots of different sources of information and building up a database that I call the “seed database.” That’s intended to imply seeding a search. Okay, so that happens continuously, and then when the user specifies a search, the description of where the user you’re trying to find is located is used to probe the seed database to find some promising sites to search, and then you go off and contact some remote services there—finger, SMTP, DNS—and try to locate the user using whatever existing information happens to be at those sites.

Malamud: So what is the seed database? How do you collect that?

Schwartz: I collect it from a number of different sources. Originally it was just collected from Usenet news headers. So I could essentially build a mapping between keywords that describe sites and the domain names for those sites. So for example “Boulder, Colorado University, computer science,” all those keywords will appear on a record that when you search will then let you select cs.colorado.edu, the domain name for the University of Colorado computer science department, which then you can go off at runtime and try to contact some of the services there and find people.
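The keyword-to-domain mapping Schwartz describes can be sketched in a few lines. Everything here is hypothetical illustration, not netfind’s actual code: each record pairs a domain with the keywords from a descriptive line such as a Usenet Organization header, and a search returns the domains whose records contain every query keyword.

```python
# Sketch of a netfind-style "seed database": keyword records -> site domains.
# Hypothetical structures; netfind's real implementation differs.

def make_record(organization, domain):
    """Turn an 'Organization:' header line into a keyword record for one domain."""
    keywords = {w.strip(",.").lower() for w in organization.split()}
    return {"keywords": keywords, "domain": domain}

def search(seed_db, query_words):
    """Return domains whose records contain every query keyword."""
    wanted = {w.lower() for w in query_words}
    return [r["domain"] for r in seed_db if wanted <= r["keywords"]]

seed_db = [
    make_record("University of Colorado, Computer Science, Boulder", "cs.colorado.edu"),
    make_record("Massachusetts Institute of Technology", "mit.edu"),
]

print(search(seed_db, ["Boulder", "Colorado"]))   # -> ['cs.colorado.edu']
```

The result of a search is only a list of promising domains; the actual lookup of a person then happens at run time against services at those domains.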

Malamud: So you just grab the “organization” line out of the Usenet posting and stick it in a database?

Schwartz: That’s right. The organization line from Usenet, and ever since that time I’ve actually added a lot of other techniques. I gather information from UUCP maps, from various network information centers, whois databases, from various service logs so I can—you know, whenever someone for example uses netfind I’ll discover their site exists because there’s a netfind usage log that I can feed back into the database. Not who used it—I don’t actually record that, I just record the name of the site so that I can actually continue to grow the database.

Malamud: How big is the seed database?

Schwartz: It’s about 55,000 sites, or actually domains in there. So cs.colorado.edu is an example of one domain. It spans I think about seventy countries, the number of different places that you can reach people.

Malamud: Do you have any idea of how many people you could find given the right keywords?

Schwartz: My estimates— These are real rough because it’s hard to know exactly how many users are at any site, you know, so these are just real rough estimates based on some measurements and then some guesstimation; around 10 million.

Malamud: Around 10 million. Now…that’s fairly large. If I remember my numbers for X.500 we’re looking at more like a million to two million people that are contained there.

Schwartz: That’s right. X.500 can’t find nearly as many people. On the other hand X.500 will find a lot more detailed, more structured information. Netfind oftentimes will just essentially return pretty simple information like finger, which is unstructured textual output that might have a phone number and it might have other things. And pretty much the only thing that you always are going to get back is an email address.

Malamud: Is this a transition path to X.500? Will netfind go away then as X.500 becomes a global ubiquitous single directory service?

Schwartz: Well it depends on whether you believe X.500 actually is going to become a global ubiquitous service. I think it’s going to take its place as one of the services. There’s a working group within the IETF that’s working on WHOIS++ and various other contenders that’re looking at basically providing directory service. And what I see the ideas behind netfind providing is essentially an interoperation framework. I think the current user search phase of netfind will go away as people put up firewall gateways, etc., and you can’t finger people anymore. But the whole idea of gathering a database of all the different sites that can be searched and essentially making use of whatever service they decide to export, I think that’ll be around.

Malamud: So when netfind does a search, it essentially says “Okay, I’m pretty sure I’m looking at cs.colorado.edu, and I’m looking for ‘Schwartz,’ and so maybe I’ll go into the DNS tables and find a bunch of hosts and then go to each of those and try to connect in and ask whether ‘Schwartz’ exists and then I’ll try to finger ‘Schwartz.’ ” It looks like when you’re doing a netfind search there’s a whole barrage of queries going into a site. Is that impacting their resources? Is that a security risk? Might they think that they’re being invaded, for example?

Schwartz: Well first of all there’s a measured approach that netfind takes. It starts off with a fairly non-invasive approach, where we first look for mail forwarding information about the user, using SMTP EXPN queries to locate a mailbox for the user, and if that’s successful then go try to finger that host. So, there’s actually a number of steps that netfind takes to try to make use of common ways that people set up their sites to try to figure out the right places to search. If those fail then we fall back on things like a number of parallel fingers, but there’s also a limit on how many of those it’ll use.
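The escalation just described can be sketched as a small decision procedure. The network calls below are stubs standing in for the real SMTP EXPN and finger queries, and the cap constant is invented for illustration; the point is the ordering and the bound on fallback probes.

```python
# Sketch of a netfind-style "measured" probe order (stubbed network calls).
MAX_PARALLEL_FINGERS = 8   # hypothetical cap on fallback finger probes

def smtp_expn(domain, user):
    """Stub: would ask the domain's mail server to expand `user` to a mailbox host."""
    return None  # pretend EXPN is unsupported or disabled at this site

def finger(host, user):
    """Stub: would finger user@host and return whatever text came back."""
    return f"no result for {user}@{host}"

def locate(domain, user, hosts):
    # Step 1: cheap and non-invasive -- ask SMTP where mail for the user goes.
    mailbox_host = smtp_expn(domain, user)
    if mailbox_host:
        return [finger(mailbox_host, user)]
    # Step 2: fall back to fingering a *bounded* number of candidate hosts.
    return [finger(h, user) for h in hosts[:MAX_PARALLEL_FINGERS]]
```

Even when the fallback fires, the number of probes stays fixed regardless of how many hosts the site has, which is what keeps the approach "measured."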

So there’s really two issues that you brought up now that I want to address. One is how much load does it impose on the remote site. And the second one is security.

In terms of the first issue, load, the amount of load that netfind generates is small compared to a lot of different other types of services. For example if you look at the number of connections that you get at a site per day from netfind, for people all over the place searching for people at your site, it might be a couple of dozen fingers or something like that, which is tiny compared with how much you know, news and mail and other sorts of traffic.

In terms of the security implications, netfind’s making use of publicly-deployed and easily disabled services. If people don’t want you to search their finger servers, then it’s up to them to not run those services. But netfind’s doing no more than making use of those—you know, at the time when these services were defined, people didn’t imagine that they would be used in this particular way, but netfind’s making use of exactly the information that was decided to be put forward in those services.

Malamud: You’re deciding to make that search. And you’re imposing that load. You’re saying that it’s a negligible load and it’s not a big deal. But is that your decision to make? Shouldn’t it be up to the target site to decide whether or not you can use their resources?

Schwartz: Well, in fact I would say it is up to the target site. They can always run gateways to decide—you know, security gateways to decide which packets are allowed in and out of their site, just as they can restrict anybody from telneting into their site or any other service that’s being used by remote users.

I also provide a mechanism in netfind in the client so that you can disable probes to any site. You know, if a site really decides they don’t want to be searched by netfind I can list them in my config file and then it won’t be searched from my server.

Malamud: But there’s other netfind servers out there. In fact this brings up the scaling issue: if there’s one netfind the load obviously is not very large. Is netfind going to be able to scale? Can there be 10,000 netfinds out there all looking for information?

Schwartz: Well like I said, I’ve seen the user search part of netfind, which is not the seed database collection but the part of actually going out and doing fingering and such, as being just sort of a short-term solution, and really what it was originally defined as was a research prototype just experimenting with some of these ideas. And it became as successful and popular as it is because it provides a useful service. And I think in time, that part of netfind’s gonna be replaced by better mechanisms. For example just recently I was talking with some of the people at the IETF here about a way to essentially, instead of just using finger, etc., let a site register in the DNS tables for their site a set of records that would basically say what the services are that they’re exporting, whether it’s whois or X.500 or whatever, to say what are the services that I’m willing to export user information from, and netfind could contact that site, grab that information to decide then to search certain services and skip the user fingering, etc.
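Since the DNS registration idea was only a hallway discussion at the IETF, the record format below is entirely invented for illustration: a site publishes which user-lookup services it is willing to export, and a client consults that advertisement instead of blindly probing.

```python
# Invented advertisement format: site -> "service port" strings the site exports.
# (Not a real DNS record type; the registration scheme was only a proposal.)
ADVERTISED = {
    "cs.colorado.edu": ["whois 43", "finger 79"],
    "locked-down.example.com": [],   # exports nothing: a client skips probing entirely
}

def services_to_try(site, preferred=("x500", "whois", "finger")):
    """Pick the advertised services in the client's order of preference."""
    offered = {entry.split()[0] for entry in ADVERTISED.get(site, [])}
    return [s for s in preferred if s in offered]
```

The win is that a site with an empty advertisement never gets fingered at all, which replaces the current guess-and-probe behavior with explicit consent.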

Malamud: Mike Schwartz, you’ve been active in a research project called Essence. Maybe you can tell us a little bit about what that is.

Schwartz: Sure. Essence is a file system indexing tool. And basically what it does is it extracts information from files in a fashion that’s specific to the type of file. So for example it knows about the structure of troff source documents so that it can find authors and titles and abstracts, and extract those keywords rather than for example extracting every keyword in the entire document, which is what something like WAIS, the Wide Area Information Server system, does.

So, what it’s trying to do is extract a small subset to save space. And also it’s trying to do it in a fashion that is smart about the structure of the document so that we’ll hopefully get more precise answers to queries rather than every single document that happens to have the keyword in it.

Malamud: This sounds like Archie on steroids. It sounds like an Archie that knows more about the inside of the data.

Schwartz: You might say that. I think what Archie and netfind and Essence and a lot of these things have in common is this basic notion of gathering together useful data into a centralized or, not necessarily centralized, but into a common repository that can then be searched with much more powerful tools than you could previously when the information was all sort of scattered wherever it happened to be. And in fact the Archie guys are looking more— They just recently announced on one of their mailing lists a technique that would allow them to gather more structured information and more detailed information than just filenames, which is what Archie currently gathers.

Malamud: Is it good to have multiple efforts out there? Essence is a research prototype, but it gets deployed. And Archie was a little research prototype and it got deployed. Should there be some kind of a unified effort on file locators?

Schwartz: I think in time there should be. I think right now we’re a little bit too early to really conclusively close the book and say this is the one way to do it. I think these are all different. I think what’s happened the last few years is that Gopher, Archie, WAIS, World Wide Web—lots of different systems have been deployed and each has demonstrated a certain set of ideas. And now what you see going on is people starting to converge on what’re good ideas and what’re bad ideas and starting to throw things away.

Malamud: In the WAIS system, you basically take every word and index it. With CPUs getting cheap and disk drives getting cheap, is it worth the hassle of going into a troff document and looking for the .author field? Why don’t you just put every field in there and let everybody search on everything?

Schwartz: Well that’s a good question. I think a couple answers. One is it’s not just the end site that has a CPU etc. being loaded down with lots of extra work to try to index everything. If you’re going to start talking about building an indexing service that collects information from lots and lots of sites, all of a sudden you’re going to want to start passing lots of data. And the smaller the… You know, the whole name behind Essence is you boil something down to the littlest piece you can and send that across the network. And if you can put lots and lots of those together, you can provide a much more useful indexing service. So, that’s one part of the reason why you don’t want to do it.

The other part of the reason is that I think you’re going to get better precision if you can just select the keywords that are really in the fields that people want rather than every single keyword in the document. And you know, this remains to be proven. One of the things I’m currently working on: I have a student who’s looking at actually doing some measurements of precision and recall, comparing these different systems, trying to get a little bit better handle on you know, does full-text indexing provide the best possible precision and recall, or would a selected-text indexing system like Essence do it.

One other comment I want to make, by the way. The way Essence does it is not— Essence isn’t the only system that does this. MIT had the semantic file system that they built actually a little bit before us, and so it has some similarities and some differences. But I just want to point out that this isn’t only my idea.

Malamud: Essence and the MIT system require some semantic knowledge. They know that this is a troff file, they know that this is a C program. Is that gonna scale as the network gets bigger and we get many many more file types? Is a tool like Essence gonna be able to keep up with the new kinds of information out there?

Schwartz: I think the way to answer that question is to look at the kinds of people who’re going to be building these sorts of tools that could you know, look for the semantics of a particular file type. If a file is really really uncommon and you know, it just happens to be my own local incantation that I happen to use and no one else uses it, unless I go to the trouble of doing it it’s obviously not going to get done.

On the other hand if there’s a really widely-popular type of file and it’s got some real value, somebody’s going to build this thing. And you know, who builds it depends on your model of how the Internet moves forward. There’s a commercial model in which you know, if it’s useful enough functionality somebody might build this thing and sell it. There’s also a research model that hey, this is a neat thing, let me try it out and see if it works. There’s a variety of models under which developing these things could happen.

Malamud: And which is going to happen?

Schwartz: Um… Boy, it’s hard to know exactly how things are going to evolve. I think some combination of all those, but I really don’t know.

Malamud: That’s an…answer from a researcher.

Schwartz: [laughs] Better than an answer from a politician, right?

Malamud: Mike Schwartz, you’ve been heavily involved in measuring the Internet. One of your studies was the great disconnect study. Maybe you could tell us about that.

Schwartz: Sure. So part of this was motivated by the observation that netfind gets less and less useful as people turn off finger or put security gateways in place. And more generally I became interested in the question of how much are sites becoming more and more concerned about security, to the extent that they’re shutting off useful connectivity in order to achieve the security. And the whole name for that study derived from a term that Dave Clark coined back in a public session on security at an Interop a couple years ago, where he basically referred to this as “The Great Disconnection.”

So I was interested in measuring the extent to which the Internet’s shutting itself off from useful connectivity because of security concerns. And at the same time, as I started doing that I became interested in essentially what’re the right metrics for figuring out how big the Internet is and how fast it’s growing. One of the most common ones is Mark Lottor’s measurements of host counts on the Internet. And I believe that’s a misleading estimate because for example, just to take one specific example, sun.com is on the Internet and it has something like on the order of 10,000—I don’t know exactly how many—10 to 100,000 hosts, but only one of those hosts is actually directly connected to the Internet; the other ones are all essentially on an internal corporate network that’s shielded from the Internet by a corporate gateway.

Malamud: So what Mark does with his study is look at the hosts that are reachable in the DNS tables on the Internet. He basically counts everyone that’s got an A address?

Schwartz: That’s right. 

Malamud: Now, what do you do to count how many hosts there are on the net? Are there better ways of doing this?

Schwartz: Yes, I think there are. Essentially what I’m looking at is what sites are reachable by common TCP services—and what I should say is what I was doing. I finished the study; it’s no longer going on. So for example how many sites can I telnet to, how many sites can I FTP to, etc.—I had actually a list of about a dozen services. And I had a large list of different sites. I would try to probe each of those services at a small number of hosts at each of 12,000 sites around the world and basically come up with— And I did this four times over the course of 1992 and came up with measures of trends of how sites are disconnecting as a function of whether they’re commercial, or government, or educational, as a function of international distinctions, as a function of what kinds of services they’re running, etc.
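A reachability study of this sort boils down to timed TCP connect attempts against well-known ports. A minimal sketch, with an illustrative subset of the roughly dozen services and an assumed timeout:

```python
import socket

# A few of the probed services; the study's full list had about a dozen.
SERVICES = {"ftp": 21, "telnet": 23, "smtp": 25, "finger": 79}

def reachable(host, port, timeout=5.0):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def survey(host):
    """Map service name -> reachability for one representative host at a site."""
    return {name: reachable(host, port) for name, port in SERVICES.items()}
```

A firewalled site shows up in such a survey as a host whose DNS entry exists but whose services all time out or refuse, which is exactly the host-count-versus-reachability gap discussed below.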

And I also came up with essentially models of growth. That you could use mathematical models to say how fast is this domain growing, how fast is that domain growing. And those might be of interest for example to commercial service providers who’re interested in seeing you know, where’s the market and when is it going to be big enough for me to be interested, etc.

Malamud: So are people disconnecting from the net? Are they putting firewalls in?

Schwartz: Yeah, they are. In the short term it’s not going to be a big impact on the set of reachable sites because the Internet’s growing much faster than the rate at which sites are disconnecting, but at some point exponential growth we hope has to stop, otherwise the world runs out of oxygen for all the people on the planet. But, in the meantime the network’s growing quite a bit faster than the rate at which people are disconnecting. The point when the growth slows down, that’s when the security measures are gonna have a bigger impact.

One comment I want to make, by the way, is that I completely understand and sympathize with the reasons why people have put security gateways in. I’m not just saying that nobody should have security. I understand the implications of having security violations. I was simply interested in measuring the extent of the phenomenon.

Malamud: You were talking about growth rates and your models for that. How fast is the Internet growing? Or pieces of it.

Schwartz: Um, well, the numbers that you see people citing, about 20% per month growth, in the number of hosts, the number of—you see this for a lot of different aspects of it. I think those global numbers are actually correct. What I’m saying is that looking at host counts rather than reachable services skews where the growth is happening and what kinds of functionality can be deployed there. There’s no doubt that when a site doubles its number of hosts in some sense the network becomes bigger even if you can’t directly telnet into those hosts. At the same time, however, you know, none of the hosts behind that firewall gateway can deploy their own World Wide Web server and let people connect in and get their information. You end up having a situation where all their interesting services, from the outside world’s perspective, are lined up on the security gateway, and the only way to get to those and the only way to deploy them is to contact that security gateway. So you’ve essentially decreased the network’s useful reach from the global collaboration perspective by a hop, or several hops in some cases.
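The cited 20% per month compounds quickly; a quick back-of-the-envelope check of what that rate implies:

```python
import math

monthly_growth = 0.20   # the commonly cited figure from the conversation

# Doubling time under steady compound growth: ln(2) / ln(1 + r)
doubling_time = math.log(2) / math.log(1 + monthly_growth)
print(f"doubling time: {doubling_time:.1f} months")   # about 3.8 months

# One year of compounding at that rate:
yearly_factor = (1 + monthly_growth) ** 12
print(f"growth per year: {yearly_factor:.1f}x")       # roughly 8.9x
```

So 20% per month means the network nearly ninefolds in a year, which is why even a growing number of firewalled sites barely dents the reachable total in the short term.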

Malamud: You looked at the question of what kind of data is on the network, and more specifically you looked at FTP traffic on the backbone. Could you tell us a little bit about that study?

Schwartz: Yeah. So what I was interested in there is there’s this phenomenon of when some popular piece of information becomes available, or some useful piece of information becomes available, it can become so popular that everybody tries to FTP it and grab it at once and can saturate network links. And we’ve seen this happen a few times. You know, for example when MIT released its window system X11R5, all of a sudden everybody wanted to grab that software. Now, they actually went to some trouble to predistribute it to try to balance the load around the network, put it in twenty sites around the net. But for example, a couple years ago you made the ITU standards documents available on bruno.cs.colorado.edu. And when that happened essentially we clogged the transatlantic and South American links trying to get into our site.

Malamud: How significant was that? I mean was it really— Did it take all the bandwidth on those links, or was it just a couple blips of a percent or…

Schwartz: Um… The numbers that I heard, or the people that I heard talk about it, basically it sucked up a lot of the bandwidth for a significant period of time. I don’t know exactly how much for how long. But for the heavily-overloaded intercontinental links, at least those got pretty heavily loaded. And also I saw some measurements that NSFNET for example put together, and you know, Colorado’s contribution to global Internet load that month was clearly higher than it had been by quite a bit.

And you know, to some extent you could claim that the problem was that that information, the ITU documents, weren’t predistributed like MIT had done. On the other hand, what I would claim is if you’re going to be able to publish really widely-useful information what you really want is a mechanism that will automatically sorta distribute the information and let it be cached around the network, okay. So I did a measurement study where I was interested in seeing how much this sort of thing happens with maybe not quite so popular data. I mean, for example netfind, the source code is available on my machine at Colorado. And I see every day you know, forty or fifty times somebody’ll grab a copy of the database or of the source code. Okay, so this is a smaller scale. But if you multiply this by the number of different sites that’re deploying some piece of source code—traceroute, whatever—it starts adding up. So I actually wanted to do a measurement study to see how much the backbone traffic for example is being wasted by duplicate transmissions that could be easily cached, and then in fact how much could you cache and reduce the load.

Malamud: And?

Schwartz: And the result was that something like 45% of all the FTPs that’re going on are transmitting data multiple times that had already been sent before, and if you had cached it, the overall number is that I could reduce the total backbone traffic, not just FTP but factoring in what proportion of the traffic is FTP, by one fifth, okay. So I could get rid of one fifth of all the backbone traffic just by putting caches around the periphery of the backbone.
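The two figures are mutually consistent: a 45% duplicate rate within FTP can cut total backbone traffic by a fifth only if FTP is a large share of that traffic. The share below is back-calculated from the quoted numbers, not a figure from the study:

```python
duplicate_rate = 0.45   # fraction of FTP transfers repeating earlier transmissions
total_savings = 0.20    # claimed reduction in *all* backbone traffic

# Implied share of backbone traffic that is FTP: savings = duplicate_rate * ftp_share
ftp_share = total_savings / duplicate_rate
print(f"implied FTP share of backbone traffic: {ftp_share:.0%}")   # ~44%
```

That FTP would carry roughly 44% of backbone bytes is plausible for the bulk-transfer-dominated traffic of the era, which is why the cache savings land at about one fifth overall.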

Malamud: So our 45-megabit backbone would now be 20% faster, in effect.

Schwartz: 20% less loaded. Making it faster depends on the time of day— I mean it’s…pretty complicated, but yeah.

Malamud: So how do you go about doing that? How would you get everyone to do…legitimate FTP caching? Do you come up with a new FTP RFC? Do you write some source code? How do we get that from research into operation?

Schwartz: Well, the research study was simply a trace-based simulation. So there’s no code that you could actually deploy to solve this problem; it basically just took traces and said if we were to put caches at various places what would it do to load. Right now, another student of mine is working on designing and trying to build a first shot at a generic object caching architecture. So we’re looking at more than just FTP; we’d like to be able to cache any object that’s retrieved. And one of the ideas that kind of intrigues me at this point is, it seems to me it’d be pretty difficult to deploy this into FTP clients or servers because it’s so widely distributed now, and essentially every vendor who produces the software modifies it, etc.

Not only that, but I see FTP as a user access method going away. I think it’s just going to be a transport protocol. Nobody’s gonna type “ftp blahblahblah hostname” anymore; instead what they’re gonna do is open up a nice XMosaic window or whatever—Mosaic happens to be sort of the prettiest application out there right now, or user interface to all these things—and retrieve data, some of which is going to come from FTP, some of which is going to come from World Wide Web hypertext information. And what I think would be a nice way of doing it is to retrofit those clients, the Mosaic clients, with knowledge of a generic object caching mechanism that’s in the Internet and let them retrieve the information through those caches.
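A generic object cache of the kind described is essentially a bounded, least-recently-used map from a retrieval URL to the object’s bytes, sitting between clients like Mosaic and the origin server. A toy sketch, where the fetch function and the example URL are stand-ins, not anything from the actual student project:

```python
from collections import OrderedDict

class ObjectCache:
    """Toy LRU object cache keyed by URL; `fetch` stands in for FTP/HTTP retrieval."""
    def __init__(self, fetch, capacity=100):
        self.fetch = fetch
        self.capacity = capacity
        self.store = OrderedDict()
        self.hits = self.misses = 0

    def get(self, url):
        if url in self.store:
            self.store.move_to_end(url)      # mark as most recently used
            self.hits += 1
            return self.store[url]
        data = self.fetch(url)               # miss: go to the origin server
        self.misses += 1
        self.store[url] = data
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)   # evict the least recently used object
        return data

cache = ObjectCache(fetch=lambda url: f"<contents of {url}>".encode())
cache.get("ftp://example.edu/pub/netfind.tar.Z")   # hypothetical URL: first fetch is a miss
cache.get("ftp://example.edu/pub/netfind.tar.Z")   # repeat fetch is served from the cache
print(cache.hits, cache.misses)   # 1 1
```

Because the cache is keyed by URL rather than by protocol, the same mechanism serves FTP, Web, or any other retrieval, which is the "more than just FTP" point above.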

Malamud: As a researcher measuring the Internet, you have to do things like look at duplicate FTP traffic, and in doing so you’re essentially peeking inside files. How do you balance the privacy needs of an operational Internet with the legitimate needs of the scientific community to measure and change the Internet?

Schwartz: That’s a really good question. Privacy is always a problem in any of these sorts of measurement studies, and also in these resource discovery and information systems. There’s lots of ways that privacy comes up, anywhere from looking at log files to you know, what sorts of information you can find about people, to peeking inside packets in order to do these measurement studies.

I think essentially you need to…in order to strike that balance, you know, be able to get good information back and at the same time have regard for privacy needs, you have to— I think it’s a case-by-case sort of basis. You need to try to establish procedures by which you’re not violating privacy too much. For example, in the FTP study we did a few things to guard privacy. One is we only collected… We looked at only a sampling of the packets in order to determine file identity, rather than recording all the file packets that were going by. Second of all, once we actually got the information we needed, we essentially did a transformation to generate numbers from the data that we couldn’t map back into IP addresses of any sort, and threw away the original data. So we actually have no way of saying who’s FTPing what data. And in fact when we put together the procedures for collecting this data we passed it by a number of different people at the National Science Foundation and our local regional networks at the National Center for Atmospheric Research, which is our connection to the NSFNET backbone. We talked to a lotta people about it, and I think a big key regarding privacy issues is basically to get a lot of inputs from people who know about a lot of different aspects of the problem.
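The irreversible transformation described can be done with a keyed one-way hash: generate a random key for the run, hash each address, and discard the key, after which the IDs cannot be mapped back. This is a modern sketch of the idea, not the study’s actual procedure (the example address is made up):

```python
import hashlib, hmac, os

def make_anonymizer():
    """Return a function mapping IP addresses to stable but opaque IDs.
    The random key never leaves this closure; once the process exits,
    the address-to-ID mapping cannot be reconstructed."""
    key = os.urandom(32)
    def anonymize(ip_address):
        return hmac.new(key, ip_address.encode(), hashlib.sha256).hexdigest()[:16]
    return anonymize

anon = make_anonymizer()
# Same host maps to the same ID within a run, so duplicate transfers
# can still be counted, while distinct hosts stay distinguishable.
assert anon("192.0.2.7") == anon("192.0.2.7")
assert anon("192.0.2.7") != anon("192.0.2.8")
```

The keyed hash preserves exactly what the duplicate-transfer analysis needs, equality of endpoints, and nothing else.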

Malamud: Should there be some kind of a peer review panel, a formal Internet Society group that looks at your research and decides whether you should be able to do it or not?

Schwartz: Um, well—

Malamud: Is this like experimenting on humans?

Schwartz: [chuckles] Gee. I feel a little bit unqualified to say this because I’m not a legal expert and I’m not all these sorts of other things. I can give you my basic gut-level hunch on it, which is anytime you insert a big formal process, you slow things down a lot and essentially stop some things from happening. At the same time you know, if you don’t have a formal process on things you can get into trouble.

I think the Internet’s gonna get to the point where it’s going to be an important enough, formal enough piece of infrastructure where some sort of panel like this will come into being. I mean, the Internet Architecture Board already put together a—you know, a recommendation for how you should carry out these measurement studies that talked about some of these sorts of problems. But it’s not nearly so formal as to say that there’s this panel you have to go through and get official approval, etc.

Malamud: This is Internet Talk Radio, flame of the Internet. You’ve been listening to Geek of the Week. You may copy this program to any medium and change the encoding, but may not alter the data or sell the contents. To purchase an audio cassette of this program, send mail to radio@ora.com.

Support for Geek of the Week comes from Sun Microsystems. Sun, The Network is the Computer. Support for Geek of the Week also comes from O’Reilly & Associates, publishers of the Global Network Navigator, your online hypertext magazine. For more information, send mail to info@gnn.com. Network connectivity for the Internet Multicasting Service is provided by MFS DataNet and by UUNET Technologies.

Executive producer for Geek of the Week is Martin Lucas. Production Manager is James Roland. Rick Dunbar and Curtis Generous are the sysadmins. This is Carl Malamud for the Internet Multicasting Service, town crier to the global village.