- Introduction
- Reasons to Learn SGML
- References Online
- Reference Books
- SGML/HTML Tools
Introduction
"SGML is a system for defining markup languages. Authors mark up their documents by representing structural, presentational, and semantic information alongside content. HTML is one example of a markup language." (On SGML and HTML, W3C)
Most people know that HTML is an application of SGML, the Standard Generalized Markup Language. Sadly, too few see the benefit in learning SGML, when they only care about HTML authoring for the World Wide Web. I feel that such an attitude is unfortunate, because it closes authors to a world of options to fully utilize their work.
I wrote the previous paragraph almost two years ago. Time has only vindicated my words! Two new ways of browsing the Web have come along: handheld computers (like the Palm Pilot from 3Com), and cellular phones. If you use one of these little toys for web browsing, then I think you're nuts. But be that as it may, you will notice that you can't read most web pages with handheld devices! You can only go to certain web sites which are "palm enabled" or "cell-phone enabled".
If only we'd learn the lesson of history! HTML was invented precisely to make the choice of viewer a non-issue. Back then most people used text-only dumb terminals, or even teletypes, to view HTML documents. Other people used graphics-enabled browsers. Either way, it didn't matter: HTML was designed to be completely browser independent--it was equally adaptable to graphical displays, dumb terminals, teletypes, whatever.
If we had stuck to the original standard, web pages today would still be just as useful, whether we used a dumb terminal, Netscape, or Internet Explorer--or a Palm Pilot, cell-phone, or a voice-synthesizer for the blind. All without sacrificing the "bells and whistles" for those who could appreciate them.
So what happened? Lazy authors made smug comments like, "Who uses a dumb terminal anymore? Nobody I give a darn about!" Well, guess what! A cell-phone or Palm Pilot is essentially identical to a dumb terminal. And guess what kind of person browses the web using a cell phone? The kind of person who signs paychecks, that's who! If we'd stuck to the standards, then our paycheck-signing, cell-phone-browsing friend could have viewed almost every document on the entire World Wide Web, without anyone making any special effort. Instead he's stuck with the pitiful few pages which are "cellphone enabled".
What is SGML?
SGML was invented to provide a method of marking up documents to be shared by humans and computers. SGML markup is strange to humans, but can be read easily enough. It can also be read easily by computers! So SGML applications can do amazing things like "Find all documents with markup language in the title," or, "Jump to the section about penguins." SGML makes document structure transparent to the computer, so that it can use documents as "intelligently" as a human.
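To make the "Jump to the section about penguins" idea concrete, here is a minimal sketch in Python. It uses the standard library's html.parser module as a stand-in for a real SGML parser, and the sample document is invented for illustration. Note that it matches only the heading that mentions penguins, not the paragraph that merely contains the word, because it looks at structure rather than raw text.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingCollector(HTMLParser):
    """Records the text of every heading element, in document order."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self.current = None  # text of the heading being read, if any

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self.current = ""

    def handle_endtag(self, tag):
        if tag in HEADINGS and self.current is not None:
            self.headings.append(" ".join(self.current.split()))
            self.current = None

    def handle_data(self, data):
        if self.current is not None:
            self.current += data


def find_sections(document, word):
    """Return the headings that mention the given word."""
    collector = HeadingCollector()
    collector.feed(document)
    collector.close()
    return [h for h in collector.headings if word.lower() in h.lower()]


# Hypothetical sample document, for illustration only.
sample = """
<H1>Antarctic Wildlife</H1>
<P>Seals, penguins, and petrels all live here.</P>
<H2>All About Penguins</H2>
<P>They cannot fly.</P>
"""
print(find_sections(sample, "penguins"))  # prints ['All About Penguins']
```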
HTML, as an application of SGML, was picked for the World Wide Web exactly for these features. The original problem was to provide information, regardless of the access method. It was a tremendous success. Today, many authors deride this ideal--and publish web pages with prominent disclaimers like, "If you aren't using Internet Exploder 9.4, go away! Come back with a real browser!" Once I was browsing the Internet from a public library, using Netscape 3.0. When I went to Netscape's "What's Cool" page, I found a message telling me to get a real browser. Hrmph.
This essay is about the payoff for the author from learning and using SGML. As I mentioned above, the first big payoff is that properly authored documents will automatically be readable by new browsers, of whatever type they may be. Think! Vast, untapped audiences using who-cares-what fantastic new technology, automatically being able to use your site fully!
On top of that there are two major benefits in using SGML. First, since SGML focuses on correct document structure, SGML authoring tools are the best. Second, SGML lives for document interchange, where HTML worries about document viewing. As a result, HTML browsers may be better, but SGML converters and translators are simply amazing.
Better Authoring Tools
Unfortunately, neither SGML nor HTML has good WYSIWYG editors. Add-ons for Word, and the editing function in Netscape, do a very poor job. They don't really help the author produce good documents. Although the situation is changing rapidly, for now we will not consider WYSIWYG editors.
It's a crying shame that so many Web authors still work in notepad. In notepad every tag, every character must be typed in, one by one. Needless to say, notepad provides no assistance in producing correct HTML. Most HTML editors are no better--even ones which automatically insert tags for the author. So-called "HTML editors" often insert tags using hot keys or a menu, but they will gladly insert tags in places where they don't belong.
SGML editors, on the other hand, help and support your authoring. An SGML-aware editor will only let you insert tags where they are permitted. It will tell you, in any context, what tags you are allowed to use. When it inserts tags for you, it will automatically prompt you for required attributes. It will let you edit attributes using a form-like interface, which shows you all the attributes you are allowed to use. These features combine to free you from worry about HTML syntax, and allow you to focus on your message. Notepad makes it hard for you to write correct HTML. An SGML editor makes your work easy. In fact, it makes it hard for you to break the rules.
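The idea behind such an editor can be sketched in a few lines of Python. This is only an illustration, not how PSGML or any other editor is actually implemented: the two tables below are hand-written, heavily simplified assumptions, whereas a real SGML-aware editor derives the same information from the DTD.

```python
# A hand-written, heavily simplified excerpt of HTML content rules.
# A real SGML-aware editor reads these rules from the DTD itself.
ALLOWED_CHILDREN = {
    "html": ["head", "body"],
    "head": ["title"],
    "body": ["h1", "p", "ul", "form"],
    "ul": ["li"],
    "li": ["p", "ul", "a", "em"],
    "p": ["a", "em", "img"],
}

# Attributes the editor must prompt for (simplified; loosely after HTML 4).
REQUIRED_ATTRIBUTES = {
    "img": ["src", "alt"],
    "form": ["action"],
}


def permitted_tags(context):
    """List the tags that may be inserted directly inside `context`."""
    return ALLOWED_CHILDREN.get(context, [])


def insert_tag(context, tag):
    """Refuse misplaced tags; otherwise report the attributes to prompt for."""
    if tag not in permitted_tags(context):
        raise ValueError(f"<{tag}> is not permitted inside <{context}>")
    return REQUIRED_ATTRIBUTES.get(tag, [])


print(permitted_tags("ul"))    # the only choice offered inside a list
print(insert_tag("p", "img"))  # attributes the author is prompted for
```

Asking this sketch to put a paragraph directly inside a list, as in insert_tag("ul", "p"), raises an error instead of quietly producing invalid markup, which is the behavior described above.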
Of today's choices, I recommend Emacs as the best SGML editor, using the optional PSGML package. Emacs is extremely powerful and flexible; it provides extras like spell-checking and transparent validation of your HTML syntax. Both Emacs and PSGML are available for Windows as well as UNIX. The SGML/HTML Tools section of this document provides pointers to these tools.
Syntax Checking and Validation
This is an arena where HTML and SGML are nearly tied. SGML offers much better tools for checking syntax, but HTML offers better tools for checking style. Each type of tool finds problems the other is likely to miss. A thorough author should probably use at least one of each.
I recommend James Clark's nsgmls parser for checking your syntax, and Neil Bowers's weblint for checking your style. You should also check out the many online validators which operate free of charge.
Information Reuse
My job includes web authoring for coworkers as well as customers. For coworkers, I write things like "How To" documents, project plans, and policy documents. Many of these documents have rich applications other than as web pages. Supervisors ask me to email them documents, expecting to read them with Microsoft Word. Outside people need hard copies of documents. My documents have even ended up in employee handbooks.
How do I deal with these demands? By remembering that HTML is good for browsing, but SGML is good for publishing. When I need a quick hardcopy to email somewhere, I can type lynx -dump http://intranet/doc.html and be done. When my boss first demanded "Word documents" from me, I spent a couple of days (some years ago, now) writing a DSSSL style sheet with a pleasing layout. Now I author documents in HTML, and quickly create a copy in Rich Text Format, suitable for loading into Word, for my boss. Plus PostScript for printing. Plus plain text for "normal" emails. Everybody is happy.
When I want hardcopy of a document, Netscape's "print" button just won't cut it. A Netscape printout has no page numbers; it has atrocious page breaks; and worst of all, it has no URLs! Most hyperlinks appear as the words "click here," underlined and bold. Where is the URL? It's gone. The solution is simple: another DSSSL style sheet, which formats HTML more suitably for hard copy. Of course, the style sheet won't work on invalid HTML...which is where SGML validators come in!
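The missing-URL problem is easy to illustrate. The sketch below is not a DSSSL style sheet; it swaps in Python's standard html.parser module to show the same idea on a small scale: flatten a document to text and write each link's URL after its label. The sample markup and its URL are made up for the example.

```python
from html.parser import HTMLParser


class PrintableText(HTMLParser):
    """Flattens HTML to text, writing each link's URL after its label."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.hrefs = []  # HREF values of the A elements currently open

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag == "a" and self.hrefs:
            href = self.hrefs.pop()
            if href:
                self.parts.append(f" <{href}>")

    def handle_data(self, data):
        self.parts.append(data)


def printable(html):
    """Return the document as one line of text with URLs made visible."""
    parser = PrintableText()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


# Hypothetical sample markup, for illustration only.
sample = '<P>For the parser, <A HREF="http://example.org/sp/">click here</A>.</P>'
print(printable(sample))
# prints: For the parser, click here <http://example.org/sp/>.
```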
What about you? Do you have a big, complicated site that changes often? Do you offer a site map to assist your visitors? Are you tired of maintaining it? Well, have no fear! In a couple of days, a trusty Perl programmer can whip up a program which examines your site and produces a site map, just like the one you are sick of maintaining. Run it once or twice a week, and your worries are over. It will only work, of course, if your HTML is of high quality and well organized...which is where SGML validators come in!
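As a rough idea of what such a program does, here is a minimal sketch, written in Python rather than Perl. It assumes the pages have already been read into a dictionary keyed by URL path (the paths and titles below are hypothetical), pulls the TITLE out of each page, and emits a list of links. A real tool would also walk the directory tree and group pages by section.

```python
from html import escape
from html.parser import HTMLParser


class TitleFinder(HTMLParser):
    """Collects the text of a page's TITLE element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def site_map(pages):
    """Build an HTML list of links from a {path: html} dictionary."""
    items = []
    for path in sorted(pages):
        finder = TitleFinder()
        finder.feed(pages[path])
        finder.close()
        # Fall back to the path when a page has no usable title.
        title = " ".join(finder.title.split()) or path
        items.append(f'<LI><A HREF="{escape(path)}">{escape(title)}</A></LI>')
    return "<UL>\n" + "\n".join(items) + "\n</UL>"


# Hypothetical pages, for illustration only.
pages = {
    "/tools.html": "<HTML><HEAD><TITLE>SGML Tools</TITLE></HEAD><BODY></BODY></HTML>",
    "/intro.html": "<HTML><HEAD><TITLE>Introduction</TITLE></HEAD><BODY></BODY></HTML>",
}
print(site_map(pages))
```

A page with a missing or misplaced TITLE shows up in the map under its bare path, which is one small example of why the quality of the markup matters.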
In all of these ways, and more, you can use and reuse the information in your HTML documents. To do so, you will need to learn about SGML, because SGML tools make all this possible.
Better Web Applications
There are dozens of packages out there which promise to help you build web applications. They connect a set of web pages to a database, using HTML forms and some bits of Java and JavaScript code. Every package I have seen works in one of two ways. Some embed queries and other code in HTML comments, and process them while they serve up the page. Others implement some proprietary macro language, and authors stick macros in their HTML documents. As the pages are served up, the macros are expanded dynamically.
Why do people keep writing such ad hoc tools, which usually don't even work? SGML already lets you do all of these things, and more! Simply modify the Document Type Definition (DTD) for HTML, and add an element structure to support your desired functions. Then build a simple SGML translator (using any of a number of free packages) which responds to extended markup by performing database queries, or whatever. That's it! The result is usually easier to use and maintain than any monolithic IDE.
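Here is a small sketch of that approach, with several assumptions stated up front. The QUERY element and its NAME attribute are invented for this example, imagined as an empty element added to the HTML DTD. Python's html.parser stands in for a real SGML parser, and a plain dictionary stands in for the database. The translator copies the document through unchanged, except that each QUERY tag is replaced by one list item per result row.

```python
from html import escape
from html.parser import HTMLParser

# Stand-in for a database: query name -> result rows (hypothetical data).
FAKE_DATABASE = {"staff": ["Ada", "Grace"]}


def run_query(name):
    """Return the rows for a named query, or nothing if it is unknown."""
    return FAKE_DATABASE.get(name, [])


class QueryExpander(HTMLParser):
    """Copies a document through, expanding each empty QUERY element."""

    def __init__(self):
        # Keep entity references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "query":
            for row in run_query(dict(attrs).get("name", "")):
                self.out.append(f"<LI>{escape(row)}</LI>")
        else:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Tag names are case-insensitive, so upper case is safe here.
        self.out.append(f"</{tag.upper()}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def expand(document):
    expander = QueryExpander()
    expander.feed(document)
    expander.close()
    return "".join(expander.out)


print(expand('<UL><QUERY NAME="staff"></UL>'))
```

Because the extension is ordinary markup, the unexpanded page can still be checked by a validating parser against the modified DTD, which comment-embedded code and macro languages do not allow.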
Why don't IDE builders know all this? Beats me. Not only are they unaware of the power of SGML, but typically they don't even know how to write correct HTML. I've seen packages which produce web pages containing empty tables, empty paragraphs, long strings of <BR> tags, and many other abominations. The result is unreadable, except in one or two lucky browsers. It's also unreadable to computer programs--like search engines, spiders, translators, indexing tools... Incompetent programming like this obstructs communication, as surely as if they encrypted their pages to lock you out.
An Example
SGML can help you even without fancy tricks like extended markup. I once built an application for collecting survey data. Users would fill out, review, and edit a large and changing collection of surveys. My application let you fill out a survey, edit your answers, view your answers, and print them out. Best of all, I could create dozens of new surveys in a single day, and deploy them in seconds!
What amazing tool was I using? Nothing but ordinary SGML utilities--just good twenty-year-old technology. To create a new survey, I authored an HTML form using Emacs and PSGML. Then I used nsgmls to make sure that the syntax of my form was completely valid. Finally, I dropped it into the "surveys" folder. That's it!
The application took the HTML form and served it to users for them to fill out. If a user wanted to edit his filled-out survey, a simple SGML translator took the HTML form and the user's answers, and merged them to create a pre-filled-out HTML form for editing. If the user wanted to view his survey for printing, another simple translator took the form and answers, and converted them into a pleasing HTML page. The result was an infinitely extensible application, which could handle changing customer demands, with almost no maintenance, for years.
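The merge step might look something like the following. This is a reconstruction of the idea, not the original translator: it uses Python's html.parser instead of nsgmls-based tools, handles only text INPUT fields, and the form and the saved answers are made up for the example.

```python
from html import escape
from html.parser import HTMLParser


class FormFiller(HTMLParser):
    """Copies a form through, adding saved answers to text INPUT fields."""

    def __init__(self, answers):
        # Keep entity references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.answers = answers
        self.out = []

    def handle_starttag(self, tag, attrs):
        fields = dict(attrs)
        kind = (fields.get("type") or "text").lower()
        name = fields.get("name")
        if tag == "input" and kind == "text" and name in self.answers:
            # Rebuild the tag with the user's answer as its VALUE.
            fields["value"] = self.answers[name]
            rendered = "".join(
                ' {}="{}"'.format(key.upper(), escape(value or ""))
                for key, value in fields.items()
            )
            self.out.append(f"<INPUT{rendered}>")
        else:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag.upper()}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def prefill(form, answers):
    """Merge a blank form with saved answers into a form ready to edit."""
    filler = FormFiller(answers)
    filler.feed(form)
    filler.close()
    return "".join(filler.out)


# Hypothetical survey form and saved answers, for illustration only.
blank = '<FORM ACTION="/save" METHOD="POST"><P>Name: <INPUT TYPE="text" NAME="who"></P></FORM>'
print(prefill(blank, {"who": "Ann & Bob"}))
```

The translator knows nothing about any particular survey. It relies only on the form being valid markup, which is why new surveys could be dropped in without touching the code.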
Summary
Using SGML, an IDE, a smart editor, notepad, databases, rendering engines and typesetting engines can all work together from a single document source. That's what SGML was designed for: a valid document is a valid document is a valid document, and anyone (computer programs included) can read valid SGML with ease. Most integrated development environments are not like that; they require specialized input, and produce specialized output, which other tools do not understand. With non-SGML tools, the incompatibility is built in.
SGML applications can be made robust, flexible and featureful, quickly and at low cost. The reason is simple: SGML was invented to be understood easily by computers and humans alike. A simple toolkit can do magic with any conceivable SGML document, both now and for all time to come.
References Online
- James Clark's DSSSL Page contains many useful pointers for writing DSSSL style sheets.
- On SGML and HTML is a reasonable, technically-oriented discussion of HTML as an SGML application.
- The SGML/XML Web Page at the Summer Institute of Linguistics provides an encyclopedic collection of SGML pointers.
- The Web Accessibility Initiative at W3C
- What You See Is Not What Others Get on the Web, by Stephen Traub, discusses browser-independent design.
Reference Books
- SGML CD by Bob Ducharme and Mark Taub provides a complete toolkit for starting with SGML. It is written for Windows NT users, but every tool discussed also has a UNIX version.
- SGML: The Billion Dollar Secret by Chet Ensign. Show this one to your pointy-haired boss. It answers the question, "What do Grolier encyclopedias and Sikorsky helicopters have in common?"
SGML/HTML Tools
- The NT Emacs FAQ includes pointers to a free Windows NT port of the Emacs editor.
- Jade, James Clark's implementation of the DSSSL style sheet language
- Earl Hood's perlSGML package provides facilities for handling DTDs as well as documents, without nsgmls.
- The PSGML Emacs editing mode for SGML by Lennart Staflin. This is now included in most Emacs distributions by default.
- David Megginson's SGMLSpm package is an excellent Perl library for writing translators using nsgmls. It has been largely supplanted by XML tools since then, but is still excellent.
- James Clark's SP system, including the parser nsgmls
- Weblint, an HTML verifier which also checks some stylistic matters.
