Carl Malamud: Internet Talk Radio, flame of the Internet.
Malamud: We’re talking to Mike Schwartz, who’s a professor at the University of Colorado in Boulder, and Mike is the author of a resource discovery tool called netfind. Why don’t you tell us what netfind does, Mike.
Mike Schwartz: Well netfind is a user directory service on the Internet, and basically what it does is let you search for people wherever they happen to be on the Internet by specifying basically who you’re looking for and something about where they are—maybe the name of their company or the city they live in.
Malamud: So this sounds like X.500.
Schwartz: Well, X.500 also addresses this problem. Actually, X.500 addresses a much larger problem than just searching for people on the Internet. Netfind takes sort of a more practical but more short-term approach to the problem. And in fact it was never intended to sort of provide global ubiquitous directory service in the long term. I’ve always seen sort of the user searching part of it as a short-term solution. However the ideas behind netfind I think have much broader impact. And one of the things that’s missing from X.500 that’s present in netfind is basically a data-gathering architecture, a way to collect information from lots of different sites so that you can provide a much more ubiquitous service, something that lets people find other people all over the place instead of just a few sites that happen to be running servers.
Malamud: So as a user I might log into netfind and type the words “Malamud radio Washington.”
Schwartz: That’s right.
Malamud: What happens behind the scenes? How does netfind turn around and feed me back an electronic mail address?
Schwartz: Well what happens is actually two phases. One part is collecting information about all the different sites that can be found on the Internet through a variety of different means. And this sort of happens in the background continuously, monitoring lots of different sources of information and building up a database that I call the “seed database.” That’s intended to imply seeding a search. Okay, so that happens continuously, and then when the user specifies a search, the description of where the user you’re trying to find is located is used to probe the seed database to find some promising sites to search, and then you go off and contact some remote services there—finger, SMTP, DNS—and try to locate the user using whatever existing information happens to be at those sites.
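For illustration, a minimal Python sketch of the two-phase structure Schwartz describes: a keyword search of a seed database to pick promising domains, followed by probes of those sites. The seed data, names, and the print-only “probe” step here are invented; real netfind’s database and probing are far more involved.

    # Phase 1 data: the "seed database" maps domain names to keywords that
    # describe the site.  This tiny in-memory table is purely illustrative.
    SEED_DB = {
        "cs.colorado.edu": {"boulder", "colorado", "university", "computer", "science"},
        "ora.com": {"sebastopol", "california", "oreilly", "associates", "publishing"},
    }

    def candidate_domains(location_keywords):
        """Return domains whose descriptive keywords overlap the location hints."""
        hints = {k.lower() for k in location_keywords}
        return [dom for dom, keys in SEED_DB.items() if hints & keys]

    def search(person, location_keywords):
        # Phase 2: probe the promising sites using whatever services they
        # already run (SMTP, finger, DNS).  Only named here, not implemented.
        for domain in candidate_domains(location_keywords):
            print(f"would probe {domain} for '{person}' via SMTP/finger/DNS")

    search("schwartz", ["Boulder", "Colorado", "computer", "science"])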
Malamud: So what is the seed database? How do you collect that?
Schwartz: I collect it from a number of different sources. Originally it was just collected from Usenet news headers. So I could essentially build a mapping between keywords that describe sites and the domain names for those sites. So for example “Boulder, Colorado University, computer science,” all those keywords will appear on a record that when you search will then let you select cs.colorado.edu, the domain name for the University of Colorado computer science department, which then you can go off at runtime and try to contact some of the services there and find people.
Malamud: So you just grab the “organization” line out of the Usenet posting and stick it in a database?
Schwartz: That’s right. The organization line from Usenet, and ever since that time I’ve actually added a lot of other techniques. I gather information from UUCP maps, from various network information centers, whois databases, from various service logs, so I can—you know, whenever someone for example uses netfind I’ll discover their site exists because there’s a record in my netfind usage log that I can feed back into the database. Not who used it—I don’t actually record that, I just record the name of the site so that I can continue to grow the database.
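As a rough sketch of how Usenet headers could feed such a database: extract the posting site’s domain from the From header and index it under the words of the Organization line. The header parsing and example record below are simplified and hypothetical.

    import re

    def add_article(seed_db, headers):
        """Fold one article's headers into the keyword -> domain seed database."""
        org = headers.get("Organization", "")
        m = re.search(r"@([\w.-]+)", headers.get("From", ""))   # e.g. user@cs.colorado.edu
        if not (org and m):
            return
        domain = m.group(1).lower()
        keywords = {w.lower() for w in re.findall(r"[A-Za-z]+", org)}
        seed_db.setdefault(domain, set()).update(keywords)

    db = {}
    add_article(db, {"From": "schwartz@cs.colorado.edu",
                     "Organization": "University of Colorado, Boulder - Computer Science"})
    print(db)   # {'cs.colorado.edu': {'university', 'of', 'colorado', 'boulder', ...}}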
Malamud: How big is the seed database?
Schwartz: It’s about 55,000 sites, or actually domains in there. So cs.colorado.edu is an example of one domain. It spans I think about seventy countries, the number of different places that you can reach people.
Malamud: Do you have any idea of how many people you could find given the right keywords?
Schwartz: My estimates— These are real rough because it’s hard to know exactly how many users are at any site, you know, so these are just real rough estimates based on some measurements and then some guesstimation; around 10 million.
Malamud: Around 10 million. Now…that’s fairly large. If I remember my numbers for X.500 we’re looking at more like a million to two million people that are contained there.
Schwartz: That’s right. X.500 can’t find nearly as many people. On the other hand X.500 will find a lot more detailed, more structured information. Netfind oftentimes will just essentially return pretty simple information like finger, which is unstructured textual output that might have a phone number and it might have other things. And pretty much the only thing that you always are going to get back is an email address.
Malamud: Is this a transition path to X.500? Will netfind go away then as X.500 becomes a global ubiquitous single directory service?
Schwartz: Well it depends on whether you believe X.500 actually is going to become a global ubiquitous service. I think it’s going to take its place as one of the services. There’s a working group within the IETF that’s working on WHOIS++ and various other contenders that’re looking at basically providing directory service. And what I see the ideas behind netfind providing is essentially an interoperation framework. I think the current user search phase of netfind will go away as people put up firewall gateways, etc., and you can’t finger people anymore. But the whole idea of gathering a database of all the different sites that can be searched and essentially making use of whatever service they decide to export, I think that’ll be around.
Malamud: So when netfind does a search, it essentially says “Okay I’m pretty sure I’m looking at cs.colorado.edu, and I’m looking for ‘Schwartz,’ and so maybe I’ll go into the DNS tables and find a bunch of hosts and then go to each of those and try to connect in and ask whether ‘Schwartz’ exists and then I’ll try to ‘finger Schwartz.’ ” It looks like when you’re doing a netfind search there’s a whole barrage of queries going into a site. Is that impacting their resources? Is that a security risk? Might they think that they’re being invaded, for example?
Schwartz: Well first of all there’s a measured approach that netfind takes. It starts off with a fairly non-invasive approach, where we first look for mail forwarding information about the user by using SMTP EXPN queries to locate a mailbox, and if that’s successful we then go try to finger that host. So, there’s actually a number of steps that netfind takes to try to make use of common ways that people set up their sites to figure out the right places to search. If those fail then we fall back on things like a number of parallel fingers, but there’s also a limit on how many of those it’ll use.
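A minimal sketch of the two probe types just described, using Python’s standard library; hosts, timeouts, and error handling are simplified, and this is not netfind’s actual code.

    import smtplib
    import socket

    def smtp_expn(mailhost, user):
        """Ask a site's mail host to expand an address (SMTP EXPN); None on failure."""
        try:
            with smtplib.SMTP(mailhost, timeout=10) as s:
                code, reply = s.expn(user)
                return reply.decode() if code == 250 else None
        except (OSError, smtplib.SMTPException):
            return None

    def finger(host, user):
        """Tiny finger client: send the user name to TCP port 79 and read the reply."""
        try:
            with socket.create_connection((host, 79), timeout=10) as sock:
                sock.sendall((user + "\r\n").encode())
                chunks = []
                while data := sock.recv(4096):
                    chunks.append(data)
                return b"".join(chunks).decode(errors="replace")
        except OSError:
            return None

    # A client in this spirit would try EXPN first to find a forwarding address,
    # finger the host it points to, and cap how many hosts it probes in parallel.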
So there’s really two issues that you brought up now that I want to address. One is how much load does it impose on the remote site. And the second one is security.
In terms of the first issue, load, the amount of load that netfind generates is small compared to a lot of other types of services. For example if you look at the number of connections that you get at a site per day from netfind—people all over the place searching for people at your site—it might be a couple of dozen fingers or something like that, which is tiny compared with, you know, how much news and mail and other sorts of traffic there is.
In terms of the security implications, netfind’s making use of publicly-deployed and easily disabled services. If people don’t want you to search their finger servers, then it’s up to them to not run those services. But netfind’s doing no more than making use of those. Albeit—you know, at the time when these services were defined, people didn’t imagine that they would be used in this particular way, but netfind’s making use of exactly the information that those sites decided to put forward in those services.
Malamud: You’re deciding to make that search. And you’re imposing that load. You’re saying that it’s a negligible load and it’s not a big deal. But is that your decision to make? Shouldn’t it be up to the target site to decide whether or not you can use their resources?
Schwartz: Well, in fact I would say it is up to the target site. They can always run gateways to decide—you know, security gateways to decide which packets are allowed in and out of their site, just as they can restrict anybody from telnetting into their site or any other service that’s being used by remote users.
I also provide a mechanism in the netfind client so that you can disable probes to any site. You know, if a site really decides they don’t want to be searched by netfind I can list them in my config file and then they won’t be searched from my server.
Malamud: But there are other netfind servers out there. In fact this brings up the scaling issue: if there’s one netfind, the load obviously is not very large. Is netfind going to be able to scale? Can there be 10,000 netfinds out there all looking for information?
Schwartz: Well like I said, I’ve seen the user search part of netfind—which is not the seed database collection but the part of actually going out and doing fingering and such—as being just sort of a short-term solution, and really what it was originally defined as was a research prototype experimenting with some of these ideas. And it became as successful and popular as it is because it provides a useful service. And I think in time, that part of netfind’s gonna be replaced by better mechanisms. For example just recently I was talking with some of the people at the IETF here about a way to essentially, instead of just using finger, etc., let a site register in the DNS tables for their site a set of records that would basically say what the services are that they’re exporting, whether it’s whois or X.500 or whatever—to say what are the services that I’m willing to export user information from—and netfind could contact that site, grab that information, decide which services to search, and skip the user fingering, etc.
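To illustrate the DNS idea Schwartz mentions (still just a discussion at the time), a site might publish a record listing the directory services it exports, and a client could check it before probing. The record name “_directory”, the TXT format, and the use of the third-party dnspython package are all assumptions made for this sketch.

    # Hypothetical: a site advertises its exported directory services in DNS,
    # e.g. a TXT record at _directory.cs.colorado.edu containing "whois x500 finger".
    import dns.resolver   # third-party dnspython package

    def advertised_directory_services(domain):
        try:
            answers = dns.resolver.resolve("_directory." + domain, "TXT")
        except Exception:
            return []          # no advertisement: fall back to probing, or give up
        services = []
        for rdata in answers:
            for txt in rdata.strings:
                services.extend(txt.decode().split())
        return services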
Malamud: Mike Schwartz, you’ve been active in a research project called Essence. Maybe you can tell us a little bit about what that is.
Schwartz: Sure. Essence is a file system indexing tool. And basically what it does is it extracts information from files in a fashion that’s specific to the type of file. So for example it knows about the structure of troff source documents so that it can find authors and titles and abstracts, and extract those keywords rather than for example extracting every keyword in the entire document, which is what something like WAIS, the Wide Area Information Server system does.
So, what it’s trying to do is extract a small subset to save space. And also it’s trying to do it in a fashion that is smart about the structure of the document so that we’ll hopefully get more precise answers to queries rather than every single document that happens to have the keyword in it.
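As a toy example of that kind of type-specific extraction, here is a sketch for troff -ms documents that keeps only the title, author, and abstract text; real Essence handles many more file types and is not limited to this macro set.

    def extract_troff_ms(text):
        """Pull title/author/abstract text out of a troff -ms document."""
        fields = {"title": [], "author": [], "abstract": []}
        current = None
        for line in text.splitlines():
            if line.startswith(".TL"):
                current = "title"
            elif line.startswith(".AU"):
                current = "author"
            elif line.startswith(".AB"):
                current = "abstract"
            elif line.startswith(".AE") or line.startswith(".PP"):
                current = None                      # end of the fields we care about
            elif current and not line.startswith("."):
                fields[current].append(line.strip())
        return {k: " ".join(v) for k, v in fields.items()}

    doc = ".TL\nResource Discovery in the Internet\n.AU\nM. F. Schwartz\n" \
          ".AB\nA survey of discovery tools.\n.AE\n.PP\nBody text never indexed..."
    print(extract_troff_ms(doc))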
Malamud: This sounds like Archie on steroids. It sounds like an Archie that knows more about the inside of the data.
Schwartz: You might say that. I think what Archie and netfind and Essence and a lot of these things have in common is this basic notion of gathering together useful data into a centralized or, not necessarily centralized, but into a common repository that can then be searched with much more powerful tools than you could previously when the information was all sort of scattered wherever it happened to be. And in fact the Archie guys are looking more— They just recently announced on one of their mailing lists a technique that would allow them to gather more structured information and more detailed information than just filenames, which is what Archie currently gathers.
Malamud: Is it good to have multiple efforts out there? Essence is a research prototype, but it gets deployed. And Archie was a little research prototype and it got deployed. Should there be some kind of a unified effort on file locators?
Schwartz: I think in time there should be. I think right now we’re a little bit too early to really conclusively close the book and say this is the one way to do it. I think these are all different. I think what’s happened the last few years is that Gopher, Archie, WAIS, World Wide Web—lots of different systems have been deployed and each has demonstrated a certain set of ideas. And now what you see going on is people starting to converge on what’re good ideas and what’re bad ideas and starting to throw things away.
Malamud: In the WAIS system, you basically take every word and index it. With CPUs getting cheap and disk drives getting cheap, is it worth the hassle of going into a troff document and looking for a .author field? Why don’t you just put every field in there and let everybody search on everything?
Schwartz: Well that’s a good question. I think there are a couple answers. One is it’s not just the end site that has a CPU etc. being loaded down with lots of extra work to try to index everything. If you’re going to start talking about building an indexing service that collects information from lots and lots of sites, all of a sudden you’re going to want to start passing lots of data. And the smaller the… You know, the whole idea behind the name Essence is you boil something down to the littlest piece you can and send that across the network. And if you can put lots and lots of those together, you can provide a much more useful indexing service. So, that’s one part of the reason why you don’t want to do it.
The other part of the reason is that I think you’re going to get better precision if you can just select the keywords that are really in the fields that people want rather than every single keyword in the document. And you know, this remains to be proven. One of the things I’m currently working on—I have a student who’s looking at actually doing some measurements of precision and recall, comparing these different systems, trying to get a little bit better handle on, you know, does full-text indexing provide the best possible precision and recall, or would a selected-text indexing system like Essence do as well.
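For reference, the two measures being compared, in a minimal form; the document sets below are invented.

    def precision_recall(retrieved, relevant):
        """precision: fraction of retrieved docs that are relevant;
           recall: fraction of relevant docs that were retrieved."""
        retrieved, relevant = set(retrieved), set(relevant)
        hits = retrieved & relevant
        precision = len(hits) / len(retrieved) if retrieved else 0.0
        recall = len(hits) / len(relevant) if relevant else 0.0
        return precision, recall

    # A full-text index tends to retrieve more documents (better recall, worse
    # precision); a selective index like Essence may do the opposite.
    print(precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d5"}))   # (0.25, 0.5)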
One other comment I want to make, by the way. The way Essence does it is not— Essence isn’t the only system that does this. MIT had the semantic file system that they built actually a little bit before us and so it has some similarities and some differences. But I just want to point out that this isn’t only my idea.
Malamud: Essence and the MIT system require some semantic knowledge. They know that this is a troff file, they know that this is a C program. Is that gonna scale as the network gets bigger and we get many many more file types? Is a tool like Essence gonna be able to keep up with the new kinds of information out there?
Schwartz: I think the way to answer that question is to look at the kinds of people who’re going to be building these sorts of tools that could, you know, look for the semantics of a particular file type. If a file is really really uncommon and, you know, it just happens to be my own local incantation that I happen to use and no one else uses it, unless I go to the trouble of doing it, it’s obviously not going to get done.
On the other hand if there’s a really widely-popular type of file and it’s got some real value, somebody’s going to build this thing. And you know, who builds it depends on your model of how the Internet moves forward. There’s a commercial model in which you know, if it’s useful enough functionality somebody might build this thing and sell it. There’s also a research model that hey, this is a neat thing, let me try it out and see if it works. There’s a variety of models under which developing these things could happen.
Malamud: And which is going to happen?
Schwartz: Um… Boy, it’s hard to know exactly how things are going to evolve. I think some combination of all those, but I really don’t know.
Malamud: That’s an…answer from a researcher.
Schwartz: [laughs] Better than an answer from a politician, right?
Malamud: Mike Schwartz, you’ve been heavily involved in measuring the Internet. One of your studies was the great disconnect study. Maybe you could tell us about that.
Schwartz: Sure. So part of this was motivated by the observation that netfind gets less and less useful as people turn off finger or put security gateways in place. And more generally I became interested in the question of how much sites are becoming more and more concerned about security, to the extent that they’re shutting off useful connectivity in order to achieve the security. And the whole name for that study derived from a term that Dave Clark coined back in a public session on security at an Interop a couple years ago, where he basically referred to this as The Great Disconnection.
So I was interested in measuring the extent to which the Internet’s shutting itself off from useful connectivity because of security concerns. And at the same time, as I started doing that I became interested in essentially what are the right metrics for figuring out how big the Internet is and how fast it’s growing. One of the most common ones is Mark Lottor’s measurements of host counts on the Internet. And I believe that’s a misleading estimate because, just to take one specific example, sun.com is on the Internet and it has something like on the order of 10,000—I don’t know exactly how many—10 to 100,000 hosts, but only one of those hosts is actually directly connected to the Internet; the other ones are all essentially on an internal corporate network that’s shielded from the Internet by a corporate gateway.
Malamud: So what Mark does with his study is he looks at the hosts that are reachable in the DNS tables on the Internet. He basically counts every host that’s got an A record?
Schwartz: That’s right.
Malamud: Now, what do you do to count how many hosts there are on the net? Are there better ways of doing this?
Schwartz: Yes, I think there are. Essentially what I’m looking at is what sites are reachable by common TCP services—and what I should say is what I was doing. I finished the study; it’s no longer going on. So for example how many sites can I telnet to, how many sites can I FTP to, etc.—I had actually a list of about a dozen services. And I had a large list of different sites. I would try to probe each of those services at a small number of hosts at each of 12,000 sites around the world and basically come up with— And I did this four times over the course of 1992 and came up with measures of trends of how sites are disconnecting as a function of whether they’re commercial, or government, or educational, as a function of international distinctions, as a function of what kinds of services they’re running, etc.
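A sketch of the kind of reachability probe such a study relies on: for each site, try to open TCP connections to a handful of well-known service ports. The service list and any hosts you feed it are placeholders; only probe sites you have permission to probe.

    import socket

    SERVICES = {"ftp": 21, "telnet": 23, "smtp": 25, "finger": 79, "nntp": 119}

    def probe_site(host, timeout=5):
        """Return a dict of service name -> whether a TCP connection succeeded."""
        reachable = {}
        for name, port in SERVICES.items():
            try:
                with socket.create_connection((host, port), timeout=timeout):
                    reachable[name] = True
            except OSError:
                reachable[name] = False
        return reachable

    # e.g. run probe_site() over a list of sites several times a year and
    # compare how the reachable-service counts change.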
And I also came up with essentially models of growth. That is, you could use mathematical models to say how fast is this domain growing, how fast is that domain growing. And those might be of interest for example to commercial service providers who’re interested in seeing, you know, where’s the market and when is it going to get big enough for me to be interested, etc.
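A toy version of that kind of growth model: treat periodic counts as exponential growth, which is a straight line in log space. The numbers here are invented, not the study’s data; this assumes NumPy is available.

    import numpy as np

    months = np.array([0.0, 3.0, 6.0, 9.0])        # four measurements over a year
    counts = np.array([1000, 1750, 3000, 5200])    # e.g. reachable sites in one domain

    # counts ~ a * exp(b * months)  =>  log(counts) is linear in months
    b, log_a = np.polyfit(months, np.log(counts), 1)
    print(f"estimated growth: {np.exp(b) - 1:.1%} per month")   # roughly 20% per month here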
Malamud: So are people disconnecting from the net? Are they putting firewalls in?
Schwartz: Yeah, they are. In the short term it’s not going to have a big impact on the set of reachable sites, because the Internet’s growing much faster than the rate at which sites are disconnecting—but at some point exponential growth, we hope, has to stop, otherwise the world runs out of oxygen for all the people on the planet. But in the meantime the network’s growing quite a bit faster than the rate at which people are disconnecting. The point when the growth slows down, that’s when the security measures are gonna have a bigger impact.
One comment I want to make, by the way, is that I completely understand and sympathize with the reasons why people have put security gateways in. I’m not saying that nobody should have security. I understand the implications of having security violations. I was simply interested in measuring the extent of the phenomenon.
Malamud: You were talking about growth rates and your models for that. How fast is the Internet growing? Or pieces of it.
Schwartz: Um, well, the numbers that you see people citing, about 20% per month growth in the number of hosts, the number of—you see this for a lot of different aspects of it. I think those global numbers are actually correct. What I’m saying is that there’s a skew in looking at host counts versus reachable services: where is the growth happening, and what kinds of functionality can be deployed there. There’s no doubt that when a site doubles its number of hosts, in some sense the network becomes bigger even if you can’t directly telnet into those hosts. At the same time, however, you know, none of the hosts behind that firewall gateway can deploy their own World Wide Web server and let people connect in and get their information. You end up having a situation where all their interesting services, from the outside world’s perspective, are lined up on the security gateway, and the only way to get to those and the only way to deploy them is to contact that security gateway. So you’ve essentially decreased the network’s useful reach, from the global collaboration perspective, by a hop, or several hops in some cases.
Malamud: You looked at the question of what kind of data is on the network, and more specifically you looked at FTP traffic on the backbone. Could you tell us a little bit about that study?
Schwartz: Yeah. So what I was interested in there is this phenomenon where, when some popular piece of information becomes available, or some useful piece of information becomes available, it can become so popular that everybody tries to FTP it and grab it at once and can saturate network links. And we’ve seen this happen a few times. You know, for example when MIT released its window system, X11R5, all of a sudden everybody wanted to grab that software. Now, they actually went to some trouble to predistribute it to try to balance the load around the network, putting it at twenty sites around the net. But for example, a couple years ago you made the ITU standards documents available on bruno.cs.colorado.edu. And when that happened, the people trying to get into our site essentially clogged the transatlantic and South American links.
Malamud: How significant was that? I mean was it really— Did it take all the bandwidth on those links, or was it just a couple blips of a percent or…
Schwartz: Um… The numbers that I heard, or the people that I heard talk about it, basically it sucked up a lot of the bandwidth for a significant period of time. I don’t know exactly how much for how long. But for the heavily-overloaded intercontinental links, at least those got pretty heavily loaded. And also I saw some measurements that NSFNET for example put together and you know, Colorado’s contribution to global Internet load that month was clearly higher than it had been by quite a bit.
And you know, to some extent you could claim that the problem was that that information, the ITU documents, weren’t predistributed like MIT had done. On the other hand, what I would claim is if you’re going to be able to publish really widely-useful information, what you really want is a mechanism that will automatically sorta distribute the information and let it be cached around the network, okay. So I did a measurement study where I was interested in seeing how much this sort of thing happens with maybe not quite so popular data. I mean, for example netfind—the source code is available on my machine at Colorado. And I see every day, you know, forty or fifty times somebody’ll grab a copy of the database or of the source code. Okay, so this is a smaller scale. But if you multiply this by the number of different sites that’re deploying some piece of source code—traceroute, whatever—it starts adding up. So I actually wanted to do a measurement study to see how much of the backbone traffic, for example, is being wasted by duplicate transmissions that could be easily cached, and then in fact how much could you cache and reduce the load.
Malamud: And?
Schwartz: And the result was that something like 45% of all the FTPs that’re going on are retransmitting data that had already been sent before, and if you had cached it—the overall number is that I could reduce the total backbone traffic, not just FTP but factoring in what proportion of the traffic is FTP—I could reduce that by one fifth, okay. So I could get rid of one fifth of all the backbone traffic just by putting caches around the periphery of the backbone.
Malamud: So our 45-megabit backbone would now be 20% faster, in effect.
Schwartz: 20% less loaded. Making it faster depends on the time of day— I mean it’s…pretty complicated, but yeah.
Malamud: So how do you go about doing that? How would you get everyone to do…legitimate FTP caching? Do you come up with a new FTP RFC? Do you write some source code? How do we get that from research into operation?
Schwartz: Well, the research study was simply a trace-based simulation. So there’s no code that you could actually deploy to solve this problem; it basically just took traces and said if we were to put caches at various places, what would it do to load. Right now, another student of mine is working on designing and trying to build a first shot at a generic object caching architecture. So we’re looking at more than just FTP—we’d like to be able to cache any object that’s retrieved. And one of the ideas that kind of intrigues me at this point is that it seems to me it’d be pretty difficult to deploy this into FTP clients or servers, because FTP is so widely distributed now and essentially every vendor who produces the software modifies it, etc.
Not only that, but I see FTP as a user access method going away. I think it’s just going to be a transport protocol. Nobody’s gonna type “ftp blahblahblah hostname” anymore; instead what they’re gonna do is open up a nice XMosaic window or whatever—Mosaic happens to be sort of the prettiest application out there right now, or user interface to all these things—and retrieve data, some of which is going to come from FTP, some of which is going to come from World Wide Web hypertext information. And what I think would be a nice way of doing it is to retrofit those clients, the Mosaic clients, with knowledge of a generic object caching mechanism that’s in the Internet and let them retrieve the information through those caches.
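A trace-driven sketch in the spirit of that study: given a log of transfers, count how many bytes an idealized, unbounded cache near each client region would have absorbed. The trace below is made up; the real study worked from sampled backbone traffic traces.

    from collections import defaultdict

    # (client region, object id, size in bytes)
    trace = [
        ("europe", "X11R5.tar", 20_000_000),
        ("europe", "X11R5.tar", 20_000_000),           # repeat: a cache would absorb this
        ("s-america", "itu-docs.tar", 5_000_000),
        ("s-america", "itu-docs.tar", 5_000_000),
        ("s-america", "itu-docs.tar", 5_000_000),
    ]

    seen = defaultdict(set)
    total = saved = 0
    for region, obj, size in trace:
        total += size
        if obj in seen[region]:
            saved += size                               # would have been a cache hit
        seen[region].add(obj)

    print(f"{saved / total:.0%} of transferred bytes were repeats")   # 55% in this toy trace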
Malamud: As a researcher measuring the Internet, you have to do things like look at duplicate FTP traffic, and in doing so you’re essentially peeking inside files. How do you balance the privacy needs of an operational Internet with the legitimate needs of the scientific community to measure and change the Internet?
Schwartz: That’s a really good question. Privacy is always a problem in any of these sorts of measurement studies, and also in these resource discovery and information systems. There’s lots of ways that privacy comes up, anywhere from looking at log files, to, you know, what sorts of information you can find about people, to peeking inside packets in order to do these measurement studies.
I think essentially, in order to strike that balance—you know, be able to get good information back and at the same time have regard for privacy needs—you have to… I think it’s a case-by-case sort of basis. You need to try to establish procedures by which you’re not violating privacy too much. For example, in the FTP study we did a few things to guard privacy. One is we only collected… We looked at only a sampling of the packets in order to determine file identity, rather than recording all the file packets that were going by. Second of all, once we actually got the information we needed, we essentially did a transformation to generate numbers from the data that we couldn’t map back into IP addresses of any sort, and threw away the original data. So we actually have no way of saying who’s FTPing what data. And in fact when we put together the procedures for collecting this data we passed it by a number of different people at the National Science Foundation and our local regional networks at the National Center for Atmospheric Research, which is a connection to the NSFNET backbone. We talked to a lotta people about it, and I think a big key regarding privacy issues is basically to get a lot of input from people who know about a lot of different aspects of the problem.
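A sketch of the kind of one-way transformation described: replace addresses with keyed digests so duplicates can still be counted within a run, then discard the key so nothing maps back to real addresses. This illustrates the principle only; it is not the study’s actual procedure.

    import hashlib
    import os

    _salt = os.urandom(16)        # discarded when the analysis is finished

    def anonymize(value):
        """Stable within one run, irreversible once _salt is thrown away."""
        return hashlib.sha256(_salt + value.encode()).hexdigest()[:12]

    record = {"client": anonymize("10.1.2.3"),           # placeholder address
              "file_id": anonymize("/pub/netfind.tar.Z"),
              "bytes": 1_204_224}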
Malamud: Should there be some kind of a peer review panel, a formal Internet Society group that looks at your research and decides whether you should be able to do it or not?
Schwartz: Um, well—
Malamud: Is this like experimenting on humans?
Schwartz: [chuckles] Gee. I feel a little bit unqualified to say this because I’m not a legal expert and I’m not all these sorts of other things. I can give you my basic gut-level hunch on it, which is anytime you insert a big formal process, you slow things down a lot and essentially stop some things from happening. At the same time you know, if you don’t have a formal process on things you can get into trouble.
I think the Internet’s gonna get to the point where it’s going to be an important enough, formal enough piece of infrastructure where some sort of panel like this will come into being. I mean, the Internet Architecture Board already put together a—you know, a recommendation for how you should carry out these measurement studies that talked about some of these sorts of problems. But it’s not nearly so formal as to say that there’s this panel you have to go through and get official approval, etc.
Malamud: This is Internet Talk Radio, flame of the Internet. You’ve been listening to Geek of the Week. You may copy this program to any medium and change the encoding, but may not alter the data or sell the contents. To purchase an audio cassette of this program, send mail to radio@ora.com.
Support for Geek of the Week comes from Sun Microsystems. Sun, The Network is the Computer. Support for Geek of the Week also comes from O’Reilly & Associates, publishers of the Global Network Navigator, your online hypertext magazine. For more information, send mail to info@gnn.com. Network connectivity for the Internet Multicasting Service is provided by MFS DataNet and by UUNET Technologies.
Executive producer for Geek of the Week is Martin Lucas. Production Manager is James Roland. Rick Dunbar and Curtis Generous are the sysadmins. This is Carl Malamud for the Internet Multicasting Service, town crier to the global village.