Carl Malamud: Internet Talk Radio, flame of the Internet.
This is Geek of the Week. We’re talking to Tim Berners-Lee, who’s the originator of the World Wide Web, one of the most exciting resource discovery systems out there. It’s a hypertext-based system, a way of navigating the network. Welcome to Geek of the Week, Tim.
Tim Berners-Lee: Thanks. It’s so great to be a geek.
Malamud: We all wish we could be, right. Maybe you can start by telling us a little bit about the World Wide Web. What is it? What’s it do?
Berners-Lee: Okay. I’ll tell you two things. I’ll tell you what it is at the moment, and what it originally was supposed to be when it started off about two or three years ago.
The World Wide Web initially was designed to be a collaborative system. It was supposed to be a collaborative hypertext system to allow a group of people to work together without having to be in the same office. In fact I started originally getting interested in hypertext when I arrived at CERN and I found that the place, full of creative people, was quite a mess, quite a web of people, programs, people who had written programs, who used programs, programmers who used programs. All sorts of relationships there. And when I was first there for six months thirteen years ago, I just had six months to find out about all this, all these relationships. So I wrote a little program to do that. And later I realized that if other people could access the same information, this would save me telling everybody about it.
So the idea was that it should be what has now been termed computer-supported cooperative work, CSCW. And that was the original idea. And then when we produced it, the first thing which we produced for general consumption was a browser which would allow you to look at this information, and as a result it has become an information system which has a lot of people browsing and very few people disseminating information. So now it is, as you described it, an information system, resource discovery system which allows a lot of people to get the information but only a few people to produce it. So a bit more like a radio station. So, we still plan for it to become a collaborative system.
Malamud: And what does the browser look like? Do you have to be running on a Sun, or does this work on many different platforms?
Berners-Lee: Well originally when we started, the problem was we had a prototype on the NeXTSTEP, which was fun to build and very quick to build. But now, on pretty much any platform—you have it on Sun for example on X you have maybe five browsers now. XMosaic is the most popular one which a lot of people have heard of.
Malamud: That comes from University of Illinois?
Berners-Lee: That’s from NCSA, yeah. The same people that produced NCSA Telnet. They have a very strong team producing not only Mosaic for X, but also they’re coming up with the same thing for PC with Windows, and the same thing for the Mac. Meanwhile on the Mac, at CERN there’s a fairly straightforward, simple browser available for the Mac from CERN. And there’s something called Cello which is available from the Legal Information Institute at Cornell, who have produced that for Windows.
So, Windows, Mac and X have got graphic user interfaces. There are also a couple of browsers which are very much…much used in fact, although they’re not so exciting. There’s a very plain Line Mode Browser which you can use from a hard-copy teletype which we distribute from CERN. And there is a browser called Lynx which comes from the University of Kansas. Lou Montulli produced that, and that uses a full-screen VT100 emulation. These are in fact pretty useful because we’re interested in getting to everybody. We’re interested in getting to everybody who’s got just a VT100, or whatever terminal they have.
Malamud: So what kind of information do you get out of the Web? What might I be browsing when I’m sitting here in one of these interfaces? You’re a physics laboratory, CERN. Is this a bunch of nuclear physics information, for example?
Berners-Lee: We’re a high-energy physics lab, yeah. So, my salary is paid to make high-energy physics available to people who are working at CERN, or who are collaborating with CERN from a distance. But in fact the system is usable for all kinds of data, and absolutely all kinds of data. One of the things we discovered very early on is that the hypertext model we have, that just by clicking on highlighted words you can jump to something else, allows you in fact to present a hypertext view of existing databases, existing information systems. So whereas we started off inventing it as a hypertext system, we realized that we could incorporate all sorts of other information.
Malamud: In what sense? When I click on a word I’m actually FTPing a file in? Is that what you mean? Or…
Berners-Lee: Well, first of all there’s a problem when you start a new system, is if you say to everybody, “Hey, why don’t you put your data in here,” they’ll say, “Well who’s reading it?” And you say well nobody yet because there’s no data in it. And conversely, nobody wants to read it because there’s no data in it, nobody wants to put data in it because there’s nobody reading it.
So we realized we’d have to bootstrap ourselves off the ground. And the way that we did this was to allow the W3 clients as well as talking the W3 protocol to W3 servers, to talk to FTP servers, as you say. They can read news articles using the regular NNTP protocol. They can talk to Gopher servers using the Gopher protocol. And when they do this, the sort of thing that they put on the screen, when you put up a news article, or a newsgroup, is in fact just hypertext. So if you think about it, when you’re using the clients for all these individual protocols, what you’re seeing on the screen is just sensitive things that you can click on; titles of news articles.
So in fact, with hypertext capability we could interface to all these things, and we could produce just one interface to all these things. And then we could go out and say well, everybody’s using this because they’re using it to read their news, they’re using it to read FTP sites, they’re using it to read Gopher. And so there is already a public out there, an audience out there. So you can put up on the W3 servers. And people now more and more are putting up— We have…I think it went up last week from when I counted, from something like a hundred to a hundred and fifty servers out there with all kinds of information, all kinds of topics. We have nice multimedia stuff—XMosaic handles embedded graphics very nicely and that’s been used for some beautiful work by for example the people who put the Vatican exhibition of Renaissance culture online. Beautiful pictures of Medieval manuscripts, with a text associated with them, a complete introduction to Roman life in the 13th century. At one end. At the other end there are stellar spectra for astrophysicists. There’s no end to what you can put on there, it’s multimedia.
Malamud: What’s it take to make a W3 server? What’s it take to take an archive of information we want to put online and make it accessible to the W3 world?
Berners-Lee: At the base level, if you have a directory and you have some files in it, you run the W3 server—httpd. “HTTP” is Hypertext Transfer Protocol in the same way that FTP is File Transfer Protocol. And the httpd is just like the ftpd. It’s just a program. You run it under the inet daemon, for example, and you point it at a directory. You say my /pub, which I’m serving at the moment, I would like to be available using HTTP. This gives you the immediate advantage that people can pick things up more quickly. Because HTTP is faster. It’s a stateless protocol which doesn’t involve all the logon that you get with FTP. So people can browse through your directories, they can follow a link from somebody else’s directory into your directory quickly.
And what the httpd daemon does in this case is it builds a little hypertext web, on the fly. So that when people look at your directory what they see is a list of your files with, at the top if you have a readme file it’ll be inserted at the top or the bottom depending on how you select the option flags. And they’ll see the title of the directory will be used for the title of the document. And it’ll do its best to make— or just a very straight directory system full of plaintext files, or image, or graphic files.
Malamud: So if one of the files is a subdirectory you click on it and then the next page is a list of the files in that subdirectory.
Berners-Lee: That’s right. In fact what you’ve got is a web— it’s a hypertext web—but in fact it’s been built just out of your directory tree. That is the simplest way to do it. So, using that, you can put up anything which you’ve already got in an FTP directory.
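Editor’s note: the on-the-fly directory web described here can be sketched in a few lines. This is only an illustration in Python, not the CERN httpd code; the function name `directory_as_html`, the `readme_at_top` flag, and the file name `README` are assumptions made for the example.

```python
import html
import os
from urllib.parse import quote


def directory_as_html(path, title=None, readme_at_top=True):
    """Return a small hypertext page listing the entries of `path` as links."""
    # The directory name doubles as the document title unless one is given.
    title = title or os.path.basename(os.path.abspath(path))
    header = [f"<title>{html.escape(title)}</title>", f"<h1>{html.escape(title)}</h1>"]

    readme = []
    readme_path = os.path.join(path, "README")
    if os.path.isfile(readme_path):
        with open(readme_path, encoding="utf-8") as handle:
            readme = [f"<pre>{html.escape(handle.read())}</pre>"]

    listing = ["<ul>"]
    for name in sorted(os.listdir(path)):
        # A trailing slash marks a subdirectory; following it lists that directory in turn.
        label = name + "/" if os.path.isdir(os.path.join(path, name)) else name
        listing.append(f'<li><a href="{quote(label)}">{html.escape(label)}</a></li>')
    listing.append("</ul>")

    body = readme + listing if readme_at_top else listing + readme
    return "\n".join(header + body)
```

A server would call something like this once per request, which fits the stateless model: nothing is remembered between one click and the next.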
Malamud: So those are automatic links. Can I decide that I want something more sophisticated? Can I look inside of a file and say I want this word, let’s say the word “First Amendment” linked over to Cornell’s version of the law library that has more information on the First Amendment? Can I begin tailoring my system?
Berners-Lee: You bet. [crosstalk] You bet.
Malamud: What’s it take to do that?
Berners-Lee: To do that is fairly straightforward, starting at the base level. You can add pieces very incrementally. So for example, let’s suppose that you do have a large directory space, which is tree-structured, and you’re pointing a W3 server at it, and you haven’t done anything else. Then of course the thing that people read most often is the root-level directory. And if you look at the average FTP server’s root-level directory, it’s pretty dry. So as that’s the introduction page, that’s what people see of your institute, that’s the first thing you replace with a hypertext document. So what you do typically is you pick up the Line Mode Browser, for example, and you use it to read that directory, and you output the result in hypertext markup format. This is a markup. You output it onto a file, and then you play with that file.
Now, the Hypertext Markup Language, HTML is at Internet Draft stage, or it will be at Internet Draft stage very soon. And the documentation of course for all of this is available on the Web. So if I miss things out, just go onto the Web—get XMosaic, go onto the Web, get all the information. What you do typically, though, is you go and find something on the Web which you like, and using XMosaic you can just look at the source. You pull down the file menu, say “look at the source,” and you’ll see what it looks like marked up. And in there you’ll see the little angle brackets and the format for writing out a link to another document.
Malamud: What is that language? Is that SGML that you’re using or…
Berners-Lee: HTML is the markup language. SGML is a meta-markup language. SGML describes how you define a markup language, HTML is defined using SGML. So for SGML buffs, HTML is an SGML application. It has a DTD. And the DTD is—and the spec, is all available on the Web, of course. So for example, if you are an SGML person, then you can take an HTML file and you can parse it using the public domain SGML parsers, SGMLS for example, by prefixing it with the DTD.
Malamud: Does that mean for example when the IEEE has advertised that they’re taking all their documents and they’re marking ’em up in SGML format internally, if they decide they wanna join the Web, is that gonna be a no-brainer for them? They just point to this directory of SGML-marked-up documents and say there they are? Or is it gonna be— Is your HGML [sic] gonna be compatible with the other versions of SGML that publishers are beginning to use?
Berners-Lee: Well, to put it very simply—and any SGML buffs will…flay me for this—SGML basically says that you should put the control information in angle brackets in a concrete syntax, and you should use…you should do things in certain ways. But it doesn’t actually say what your control words are.
Malamud: Mm hm.
Berners-Lee: So, HTML for example has an A tag meaning “this is an anchor which is one end of a link.” And you use the “href” attribute of that tag to say where it’s going. If other people use SGML typically they’ll use completely different tags, completely different element names. So, it won’t be a no-brainer. You won’t just be able to point an HTTP server at the data and have everybody read it.
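Editor’s note: as a rough illustration of the A tag and its “href” attribute, the Python sketch below writes out one anchor. The helper name `make_anchor` and the target address are hypothetical and chosen only for the example; they are not taken from the interview.

```python
import html


def make_anchor(text, href):
    """Wrap `text` in an A element whose href attribute names the link target."""
    return f'<a href="{html.escape(href, quote=True)}">{html.escape(text)}</a>'


# Hypothetical target address, used only to show the shape of the markup.
print(make_anchor("First Amendment", "http://www.law.example.edu/first-amendment.html"))
```

Pasting the resulting string around a phrase in a document is, in this simplified view, all it takes to turn that phrase into one end of a link.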
We’ll talk about format negotiation later. I think. But, basically if it is marked up in SGML or anything which is basically structured, like LaTeX[?], then because you’ve got the information there, it’s that you need a very small [indistinct] to actually convert it into HTML. So typically for example, you’re going to want to preserve the information about headings, and about different paragraph styles. And all that information can be probably very easily converted into HTML.
Malamud: It sounds like HGML has a lot more emphasis on the network, on links and things like that than a typical SGML DDT [sic] would.
Berners-Lee: Well, in fact it only has two elements. It has a link element which is a link from the document to another document, and it has an anchor which defines a part of the document, which is the beginning or endpoint of a link. And those are the only two. Then there are attributes of those which describe where they go to and what relationship is involved, where there is a semantic relationship between the two things which are linked.
Malamud: Let’s think about that link a second. You said that W3 is able to link out to Gopher servers and link out to FTP servers. And you’re able to do all that with one basic “here’s what I’m pointing to” command?
Berners-Lee: Right. And you’re leading to one of the fundamental things which W3 hangs on. In fact W3 is often described as being a system which is based on hypertext, and which is hypertext hypertext hypertext. In fact hypertext is not the most important thing behind W3, it’s not the most important concept.
Perhaps the most important concept is that any object out there on the network should be addressable. This implies that there should be some universal addressing scheme. Now, we called these things initially Universal Document Identifiers. And then when we brought it to the IETF, the “universal” became “uniform,” the “document” became “resource,” and the “identifier” became, in the case of the actual identifiers we’re using at the moment, “locator.” So we now have URLs. URLs are things which are discussed at IETF and there’s a spec about them. And the URL is basically the address of a network object.
Malamud: Mm hm.
Berners-Lee: The nice thing about a URL is it starts off with a prefix which defines what sort of a URL it is, and we can add later extra prefixes to define all sorts of other URLs. So typical URLs are FTP URLs which contain all the information you need for extracting something by FTP.
Malamud: Which is…“ftp”…the domain name…the pathname…and the filename.
Berners-Lee: That’s it. In fact, rather than separating it with some chat such as, “What you need to do is FTP to…ftp.whatever.edu, and then ‘cd’ to…and then ‘get file dah dah dah’ ” we then use a little bit of punctuation and we say it’s “ftp colon slash slash, domain name, slash pathname.” And similarly for Gopher, it looks very similar. “gopher colon, slash slash domain name, slash selector string.” And for HTTP we have “http colon, slash slash domain name, slash” and then some opaque string which could be anything in fact which the server understands, as defined by the server.
So, those are three important types of URL. And we can extend that. We have a few other things. We can put pointers to telnet sessions, for example, so that if there is a site which is only accessible through telnet, it’s very nice to be able to include it in the Web so that within a document you can say, “Hey,” for example, “for more information see our library system.” And you would link the words “our library system” to a telnet session to the library system because all you’ve got from your automated library system is a telnet session. You haven’t yet got a World Wide Web server.
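Editor’s note: the “prefix, colon, slash slash, domain name, slash, the rest” pattern can be taken apart with Python’s standard `urllib.parse`. The sketch below is an illustration only; the `PROTOCOLS` table and the `describe_url` helper are names invented for the example, and the path in the usage line is made up.

```python
from urllib.parse import urlsplit

# Which kind of server a client would talk to for each prefix (illustrative table).
PROTOCOLS = {
    "http": "W3 server (HTTP)",
    "ftp": "FTP server",
    "gopher": "Gopher server",
    "telnet": "interactive telnet session",
}


def describe_url(url):
    """Split a URL into its prefix, domain name, and the remaining string."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "rest": parts.path.lstrip("/"),
        "handled_by": PROTOCOLS.get(parts.scheme, "unknown scheme"),
    }


# The path below is hypothetical; only the host name appears in the interview.
print(describe_url("ftp://ftp.ncsa.uiuc.edu/some/path/file.txt"))
```

Adding a new kind of URL later amounts to adding one more prefix to the table, which is the extensibility point being made in the conversation.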
Malamud: So what happens is your user goes to the edge of the Web and then escapes into telnet land and uses whatever syntax they have for that library system, and when they’re done they’re back in the Web.
Berners-Lee: Right. And that’s very suboptimal, obviously, because the user interface changes. Suddenly, when you’re in telnet land and you have to suddenly learn what sort of library system is this, how do I get out, what’s the “quit” command, how do I find my way around.
Malamud: Couldn’t you do that for the user? Why do they have to go out into telnet land? Why doesn’t the Web do that negotiation for you and bring the information back?
Berners-Lee: Why doesn’t the Web bring the information back. Why can’t we make something to automatically run a telnet server. The problem is that—
Malamud: And make it look like hypertext.
Berners-Lee: The answer is for a general interactive session, you can’t. It changes. In principle you could. In principle you can write a script which will attack a human interface as a machine and extract information, but in practice of course this is very very hairy and horrible. It would also have to be done individually for every system.
What is very very much easier is if you actually have the program there, is if you have the sources of the program, then you can write— Or even if you have the binary of the program—you have some library access program, for example, which has a shell command for accessing data. You can write a W3 server which when it gets a request for “show me about the library” because of your little help file with a few pointers to some things, and then one of those pointers is run as the program and get the list of sections of the library for example out of it. And it reformats that as HTML.
So the thing to do in this case is for the guy who runs the library system to write what is very often just a ten-line Perl script, typically, which runs the library software whenever somebody comes in with a W3 request. Now, what he’s producing is a gateway between W3 world and his own database. So he’s adding a whole new world which was previously inaccessible to the Web. And people can be incredibly creative about that. There are some beautiful examples of virtual worlds which have been created from relational databases, from bunches of files…
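Editor’s note: a gateway of the kind described runs an existing command and reformats its output as hypertext. The interview mentions Perl; the sketch below uses Python instead, and the function name `gateway_page` is an assumption. The command in the usage example is a stand-in that just prints two lines, since the real library software is not specified.

```python
import html
import subprocess
import sys


def gateway_page(command, title):
    """Run `command` and present each non-empty output line as an HTML list item."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    items = [
        f"<li>{html.escape(line)}</li>"
        for line in result.stdout.splitlines()
        if line.strip()
    ]
    return f"<title>{html.escape(title)}</title>\n<ul>\n" + "\n".join(items) + "\n</ul>"


if __name__ == "__main__":
    # Stand-in for a library access program: prints two section names.
    fake_library = [sys.executable, "-c", "print('Physics'); print('Computing & Networks')"]
    print(gateway_page(fake_library, "Library sections"))
```

A fuller gateway would also turn each line into a link back to itself, so that choosing a section triggers the next run of the program.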
Malamud: What kind of worlds? Give me an example.
Berners-Lee: Well there are a number of nice ones. For example at CERN, Mike Sendall has put together a database about software technology in general. And you can throw any World Wide Web index— Some documents are indexed and some aren’t. [indistinct phrase] You can throw it at some words to do a text search. So, in this case you’re doing something very much like WAIS, you know, WAIS functionality. So when you’re browsing around the Web and you come to the Software Technology Interest Group (STING) page, you notice that this index function is enabled—You can throw some words at it. And it will do a search in its glossary, and it will do a search in some news articles, and it’ll do a search in some documents. And it’ll produce you a little executive report saying “Well, we found some information in the glossary. Would you like to see or would you like to see some news articles?” And these things are linked, so you click on “I’ll see the stuff in the glossary.”
And all this is being generated completely by a little program. I’d like to see the things in the glossary, so you click on a link which leads to a virtual document whose name is sting-slash-glossary-slash…whatever it was you asked for. I asked for “objective” and I got back “Objective C” and “Objective Pascal.”
And the glossary of course is hypertext. So it said “objective languages.” Objective C is an object-oriented language, “object-oriented” is a link. You click on it. Now, I may be talking rubbish with the particular links, but you click on “object-oriented,” you get a definition of object-oriented. You’re browsing around the glossary recursively, and sometimes you find links which take you into documents. And the whole thing, all the links have been added automatically by a program.
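Editor’s note: links “added automatically by a program” can be approximated by scanning text for known glossary terms. The Python sketch below is one possible approach, not a description of the STING software; the function name `link_glossary_terms` and the base path `/sting/glossary/` are assumptions for illustration.

```python
import html
import re
from urllib.parse import quote


def link_glossary_terms(text, terms, base="/sting/glossary/"):
    """Return `text` as HTML with every whole-word glossary term turned into a link."""
    if not terms:
        return html.escape(text)
    # Longest terms first, so "object-oriented" is preferred over "object".
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in ordered)
    pattern = re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)", re.IGNORECASE)

    pieces, position = [], 0
    for match in pattern.finditer(text):
        pieces.append(html.escape(text[position:match.start()]))
        term = match.group(0)
        pieces.append(f'<a href="{base}{quote(term.lower())}">{html.escape(term)}</a>')
        position = match.end()
    pieces.append(html.escape(text[position:]))
    return "".join(pieces)
```

Because each glossary entry is itself run through the same function, a reader can keep following links from one definition to the next, which is the recursive browsing described above.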
Malamud: Now, if I’m doing that from the United States and your W3 server is in Geneva, for example, we’re bringing an awful lot of documents back to the US. Each screen this is a document, that document internally has been marked up. What kind of bandwidth do I need to play in this world effectively? What do I need on my end of the network pipe to be able to do W3 work?
Berners-Lee: When it comes to bandwidth, then 14.4 kilobaud is fine. In fact we’ve done a demonstration very nicely at a conference on the end of an ISDN line. And with an ISDN 64-kilobit connection you really didn’t notice too much the delay when picking up hypertext. For hypertext, excluding videos, then you’re not in fact transferring very much data. Not only that but you’re not keeping connections up so the load in general all round is very small. At CERN we happen to have a T1 to the States. So, we’re lucky. But in general in Europe, if you’re fairly close to a major center, you’re not on the end of a piece of wet string, then people are generally amazed by the speed.
What we’re talking about, if you’re looking at an XMosaic document and you click on it, if it’s local it should come back within a few hundred milliseconds. And otherwise within a second or two. Internationally. And we really want to keep these response times down below the second if we can. Because that’s the way people work most efficiently. When they can follow a link, have a look at it, hit the back button because it’s not what they wanted, and go some other way.
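Editor’s note: the response times mentioned can be sanity-checked with simple arithmetic. The Python sketch below ignores latency and protocol overhead, treats the modem figure as 14,400 bits per second, and assumes a 4 KB hypertext page; the page size and the T1 rate of 1,544,000 bits per second are assumptions, not figures from the interview.

```python
def transfer_seconds(size_bytes, bits_per_second):
    """Time to move `size_bytes` over a link, ignoring latency and overhead."""
    return size_bytes * 8 / bits_per_second


PAGE_BYTES = 4096  # assumed size of a typical hypertext page

for label, rate in [("14.4k modem", 14_400), ("ISDN 64k", 64_000), ("T1", 1_544_000)]:
    print(f"{label}: {transfer_seconds(PAGE_BYTES, rate):.3f} s")
```

Under these assumptions a page takes a little over two seconds on the modem and about half a second over ISDN, which is consistent with the delay being barely noticeable on the ISDN demonstration.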
Malamud: Is there a provision in the World Wide Web for data replication or data caching so that information doesn’t have to go over long, thin pipes many different times? I’m thinking for example the Internet Talk Radio.
Berners-Lee: I bet you are, Carl. I think this is a very interesting area because it’s not just radio. It’s gonna be all kinds of files. I ha—
Malamud: IETF archives, for example, same thing.
Berners-Lee: Yes. It’s large files and it’s also files which are just in use all over. For example we have a file we call the virtual library. It’s a subject catalog. We keep pointers in it, in a subject tree, a little bit like the Dewey system. We keep the pointers to everything that we found which has a specific subject matter interests, a particular subject like astronomy, astrophysics, high-energy physics, biology.
Now, the top of this tree, obviously, is one single file. We keep a copy at CERN. If everybody in the world wants to read that, and that’s a very good place to start looking for things, then it’d be very nice to have replicated copies, so replication is certainly an issue. We haven’t put anything in place. We’ve looked at a few interesting things. Obviously the two things that we need to have here is one, we need to have good mirroring software so that we can make sure that updates get propagated rapidly. And when we have mirror sites, then we need to be able to rapidly from a client find them. And we’ve played with the idea of having for example a dummy domain, but .web, so that if you look at “library.web,” this is a virtual machine which has a number of IP addresses which are in fact on different continents. This can be regarded as an abuse of the Domain Name System, or it could be regarded as a neat use of the [crosstalk] Domain Name System.
Malamud: Oh I’m a big fan of abusing the Domain Name System.
Berners-Lee: [indistinct; “So I’ve heard.”?] And that sort of thing combined initially with something very simple such as a ping…if you get back five IP addresses, ping them all and see what you get back, may mean that we’ll have something which is fairly scalable which we can use for the large central—I hate to use the word central—but large heavily-used collections of data. Though I think it’s got to be very flexible as well, because things become heavily-used very suddenly when people find out about them. I don’t know whether you’ve found this, that you have a particular program, you put it out, and then for some reason a pointer to it gets put on some newsgroup everybody’s very excited about it. Suddenly they dive in there.
Malamud: Oh absolutely, and in fact they’ll all dive to the same place, which is why I’m curious about that. Even though we’ll have thirty copies of a file around the world, if one newsgroup says go to UUNET and grab the file…
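Editor’s note: the “get back five IP addresses, probe them all, use the quickest” idea can be sketched as follows. This is a Python illustration under stated assumptions, not an existing mechanism: it times a TCP connection instead of an ICMP ping (which normally needs special privileges), and the names `tcp_connect_time` and `pick_nearest` are invented for the example.

```python
import socket
import time


def tcp_connect_time(address, port=80, timeout=2.0):
    """Seconds taken to open a TCP connection, or None if the host is unreachable."""
    start = time.monotonic()
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None


def pick_nearest(addresses, probe=tcp_connect_time):
    """Probe every mirror address and return the one that answered fastest."""
    timings = [(probe(address), address) for address in addresses]
    reachable = [(elapsed, address) for elapsed, address in timings if elapsed is not None]
    return min(reachable)[1] if reachable else None
```

The list of addresses would come from a name lookup that returns several records for one name; the selection step itself does not care where the list came from, which keeps it open to the network-manager control discussed next.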
Malamud: Everybody goes to UUNET and grabs that file.
Berners-Lee: Yeah. If they’ve got one pointer they don’t want to go out and use Archie to find out where the nearest copy is, and of course when they’ve used Archie they might not even know which of those copies is the nearest one in network terms. So it would be very nice to have something which is flexible—automatic, maybe a little bit controllable by network managers, and will allow people to optimize it. I’d like to see it be very very automatic, in fact, myself.
Malamud: What other things would you like to see the Web evolve to? Are there other things you’d like to see brought into that system?
Berners-Lee: Well as I said when we first started, at the moment in fact it is practically used for dissemination. And one of the reasons is that the people who have actually put in the development work very often are systems managers. They’re very often people who have got information to disseminate, and they’ve realized that if they disseminate they’ll save the phone ringing. This also gives them a high profile.
But, we really want it to become a collaborative system. The XMosaic team have been playing with this with a very interesting group annotation server. There’s a trial running at the moment whereby you can set Mosaic to point to a particular server which just stores a list of comments on individual documents. So, when you’re reading a document anywhere in the world, you’re hooked up to a group annotation server. And if you happen to be reading a document somebody else in the same group has commented on, at the bottom appears a little link to his contribution. Now this I think is a stage in making something which blurs all the boundaries between data retrieval, data retrieval with front end update, news, whereby information is spread around in a sort of flooding algorithm. And mail where it’s sent to specific people. We need to use all these different protocols for different situations but they’ve all got to merge together. So that from the user’s point of view when he’s reading a piece of information, whether he’s retrieved it from a W3 server, or he’s reading a news article, or he’s reading something which went to a mailing list, when he hits the respond button, he gets a little window and he can type stuff in, and it’ll be processed accordingly.
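Editor’s note: a server that “just stores a list of comments on individual documents” can be modelled very simply. The Python sketch below is a toy in-memory version for illustration; the class name `AnnotationStore`, its methods, and the addresses in the usage example are assumptions, and nothing here reflects how the Mosaic trial was actually built.

```python
import html
from collections import defaultdict


class AnnotationStore:
    """Keeps, for each document address, the comments a group has made on it."""

    def __init__(self):
        self._by_document = defaultdict(list)

    def annotate(self, document_url, author, annotation_url):
        """Record that `author` wrote a comment, stored at `annotation_url`."""
        self._by_document[document_url].append((author, annotation_url))

    def footer(self, document_url):
        """Links a browser could append to the bottom of the document."""
        return "\n".join(
            f'<a href="{html.escape(url, quote=True)}">Comment by {html.escape(author)}</a>'
            for author, url in self._by_document.get(document_url, [])
        )


store = AnnotationStore()
# Both addresses are hypothetical.
store.annotate("http://example.org/paper.html", "Carl", "http://notes.example.org/1.html")
print(store.footer("http://example.org/paper.html"))
```

The document itself is never modified; the comments live on a separate server and are merged in at reading time, which is what lets the scheme work on documents anywhere in the world.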
Malamud: If we want to find out more about W3, is there an email address people can send to? Is there someplace they should be FTPing into? How do people learn more?
Berners-Lee: Basic way to start—please don’t send email until you’ve done this. The simplest thing to do is you telnet to info.cern.ch. And then you get the very very simplest browser. And it’ll give you some information about CERN, and it’ll give you some information about the World Wide Web.
Now this browser is very very crude. Do not be put off by this. To follow hypertext links you type in the number of the link, and number’s there in little square brackets after the terms. But you can in fact access everything, everything which is available through Gopher, and news, and the World Wide Web, from that. And in particular you can find information about the World Wide Web, information about the client, information on the client that you choose, about how to FTP it.
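Editor’s note: the numbered-link convention of the line-mode interface can be imitated with Python’s standard `html.parser`. This sketch is only an approximation of the behaviour described (a number in square brackets after each linked term); the class and function names are invented for the example.

```python
from html.parser import HTMLParser


class NumberedLinkRenderer(HTMLParser):
    """Turns hypertext into plain text, tagging each link with [n]."""

    def __init__(self):
        super().__init__()
        self.text = []      # pieces of output text
        self.targets = []   # targets[n - 1] is where link number n leads
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href:
            self.targets.append(href)
            self._in_link = True

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self.text.append(f"[{len(self.targets)}]")
            self._in_link = False

    def handle_data(self, data):
        self.text.append(data)


def render(markup):
    """Return the plain-text rendering and the list of link targets."""
    renderer = NumberedLinkRenderer()
    renderer.feed(markup)
    renderer.close()
    return "".join(renderer.text), renderer.targets
```

Typing a number then means looking up `targets[number - 1]` and fetching that address, so even a plain terminal can follow every link.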
Now, I say use this because you will then get the most recent information. I can tell you some FTP sites—I will tell you some FTP sites—but obviously, if you go use the Web in some way you’ll get the up-to-date information.
Malamud: What’s the main FTP site for getting the programs and the documentation and things like that?
Berners-Lee: Two important things. I guess the bulk of listeners are gonna have workstations where they can run Mosaic. If you can run Mosaic you ought to have XMosaic on your—that is, they call it NCSA’s Mosaic for X on your workstation. You can get it by FTP to ftp.ncsa.uiuc.edu. And—
Malamud: And that’s mirrored all over the world, too. You could go to UUNET or many other sites [crosstalk] and get the software.
Berners-Lee: I guess— All kinds of places. That’s the central site where you get the most recent version. We get new versions out perhaps every couple of weeks. It went from beta to 1.0 a few months ago. That is currently the preferred version for X for a lot of people. There’s another very exciting client for X, which is tkWWW. It runs using the Tk/Tcl toolkit which you also have to get. The interesting thing about that is it’s a hypertext editor. So you can make your own hypertext files.
If you’re running a NeXT, then there’s a hypertext editor you can pick up from CERN. There’s the Line Mode Browser you can get from CERN. The daemon stuff you get from CERN. So the other important FTP site is info.cern.ch.
Malamud: Okay, great.
Berners-Lee: Same thing, if you telnet to it, you can ftp into it.
Malamud: Okay, thank you very much. This is Carl Malamud. We’ve been talking to Tim Berners-Lee, the originator of the World Wide Web, and this has been Geek of the Week.
This is Internet Talk Radio, flame of the Internet. You’ve been listening to Geek of the Week. You may copy this program to any medium and change the encoding, but may not alter the data or sell the contents. To purchase an audio cassette of this program, send mail to firstname.lastname@example.org.
Support for Geek of the Week comes from Sun Microsystems. Sun, The Network is the Computer. Support for Geek of the Week also comes from O’Reilly & Associates, publishers of the Global Network Navigator, your online hypertext magazine. For more information, send email to email@example.com. Network connectivity for the Internet Multicasting Service is provided by MFS DataNet and by UUNET Technologies.
Executive producer for Geek of the Week is Martin Lucas. Production Manager is James Roland. Rick Dunbar and Curtis Generous are the sysadmins. This is Carl Malamud for the Internet Multicasting Service, town crier to the global village.