January 1999
Bringing Docs to the Web with... XML
Harvey Spencer
So you want to post docs on the Web. What's the best approach? XML is a breakthrough technology that's adding flexibility, versatility and intelligence across media.
There is currently a lot of discussion about the future of the Internet, and much of it revolves around XML (Extensible Markup Language). XML is a subset of SGML (Standard Generalized Markup Language), a markup language defined some years ago to promote the portability of printed documents. SGML has been used extensively by the government to publish its manuals and similar documents.
Mark-up languages are used to define layouts, fonts and structures of documents. To understand a simple mark-up language, type the following in Word:
This is a demonstration of a mark-up language using RTF.
Save this as an RTF file. Next, open the file using Notepad. You will see the following:
Default Paragraph Font;}}{\info {\title This is a demonstration of a mark-up language using RTF}{\author your name}{\operator your name}{\creatim\yr1999\mo1\dyNN\hrNN\minNN}
What is happening here is that Word is automatically tagging the document with style information. XML works similarly but uses a textual syntax that closely resembles HTML (Hypertext Markup Language). HTML is a simple formatting syntax that is used throughout the World Wide Web and is embedded in each document. These embedded rules, which your browser decodes, tell the computer how to format the page and provide links to other pages or sites.
Using these commands, you can easily display text documents and embed or connect to other objects, such as video or voice. Each tag (or label) is sandwiched between < and > signs, as in <title>. The contents of that tag are then sandwiched between a "start of" label, as in <title>, and an "end of" label, which adds a / symbol, as in </title>. A <p> tag defines a paragraph. For example, this article published on the Internet might be coded as <head>Bringing Docs to the Web With XML</head><p>By <author>Harvey Spencer</author></p>.
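To see the start and end labels the way a browser does, here is a minimal sketch in Python using the standard html.parser module. The class name TagLogger and the sample string are illustrative assumptions, not part of any browser or of the original page.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Records each start tag, end tag and run of text in the order seen."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only runs between tags.
        if data.strip():
            self.events.append(("text", data.strip()))


# Illustrative sample markup (an assumption, not the article's real page).
SAMPLE = "<title>Bringing Docs to the Web with XML</title><p>By Harvey Spencer</p>"

parser = TagLogger()
parser.feed(SAMPLE)
parser.close()

for kind, value in parser.events:
    print(kind, value)
```

Each start label is matched by an end label carrying the / symbol, and the text in between is what the tag describes.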
While HTML embeds its formatting information in the document that is being displayed, XML adds the capability to define hierarchical blocks of information. XML also moves that information one level above the document.
The heart of this system is known as the "element." The document pictured (on the previous page) is a simple newsletter containing a number of textual and graphical elements. XML provides the ability to label the document and tag the individual fields or blocks of fields. These tag elements can contain information such as fonts, colors, location and size of specific fields, blocks of data or paragraphs. They can also carry information regarding when the information was created and by whom.
An overall controlling area known as a DTD (or Document Type Definition) defines the fields and element rules associated with the document and can be internal or external.
In simple terms, the document type is a newsletter; the elements in that newsletter include a headline, various sub-heads, a table of stocks, an industry focus article, an advertisement block and a table of contents.
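The newsletter hierarchy can be sketched in a few lines of Python with the standard xml.etree.ElementTree module. The element names, attribute names and placeholder values below are assumptions made up for illustration; they do not come from any published newsletter DTD.

```python
import xml.etree.ElementTree as ET

# All names and values here are hypothetical placeholders.
newsletter = ET.Element("newsletter", {"created": "1999-01", "creator": "placeholder name"})
ET.SubElement(newsletter, "headline").text = "Placeholder headline"
ET.SubElement(newsletter, "subhead").text = "Placeholder sub-head"
stocks = ET.SubElement(newsletter, "stock-table")
ET.SubElement(stocks, "row", {"symbol": "PLACEHOLDER"})
ET.SubElement(newsletter, "focus-article")
ET.SubElement(newsletter, "advertisement")
ET.SubElement(newsletter, "contents")


def outline(element, depth=0):
    """Return the element hierarchy as a list of indented tag names."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


print("\n".join(outline(newsletter)))
print(ET.tostring(newsletter, encoding="unicode"))
```

The outline shows the nesting (the stock row sits inside the stock table, which sits inside the newsletter), while the attributes on the top element carry the "when and by whom" information.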
Each of these elements can have multiple characteristics associated with it. To convert an HTML page to XML, you add an XML declaration, usually followed by a document type declaration. In this case, you would do this by adding a <!DOCTYPE News SYSTEM "newsletter.dtd"> statement, which would provide a reference to an external DTD. The newsletter.dtd is a file that contains the overall rules of the newsletter.
Under XML, you will also be able to carry information regarding the format using "style sheets." This formatting information is separate from the main XML effort and uses the name XSL (Extensible Stylesheet Language).
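As a rough illustration, the Python sketch below parses a small document that opens with an XML declaration and a document type declaration. To stay self-contained it uses an internal DTD rather than an external newsletter.dtd file, and the element names headline and subhead are assumptions. Note that Python's standard parser checks only that the document is well-formed; it does not validate the document against the DTD rules.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: XML declaration first, then an internal DTD.
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE News [
  <!ELEMENT News (headline, subhead*)>
  <!ELEMENT headline (#PCDATA)>
  <!ELEMENT subhead (#PCDATA)>
]>
<News>
  <headline>Bringing Docs to the Web with XML</headline>
  <subhead>What's the Benefit?</subhead>
</News>"""

# Well-formedness is checked here; the DTD rules are read but not enforced.
root = ET.fromstring(DOCUMENT)

print(root.tag)
for child in root:
    print(child.tag, "->", child.text)
```

The name after DOCTYPE has to match the outermost element, which is why both are written as News here.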
Anybody who has used Word is familiar with style sheets, which are labeled as .dot files. These files, which provide an overall formatting set of rules, allow the user to tag the paragraphs with a name, which then relates to formatting information such as font type, font size, spacing, indents etc.
Although the data on individual pages can carry its own style information embedded in the document as with HTML today, XSL lets you separate this information. Each paragraph can then be given a style name (e.g. title). Then this label can have an external file associated with it that carries the format information. It's a quick and efficient way to carry soft formatting information. Note, though, that XSL has yet to be finalized by the XML standards committee.
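The idea of a style name pointing at externally held formatting can be sketched in Python. This is not XSL itself, which is still in draft; a plain dictionary stands in for the external style file, and the style names, fonts and sizes are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Stand-in for an external style file: style name -> formatting rules.
# All values are hypothetical.
STYLE_SHEET = {
    "title": {"font": "Helvetica", "size": 24, "bold": True},
    "body": {"font": "Times", "size": 11, "bold": False},
}

# The document carries only style names, never the formatting itself.
doc = ET.fromstring(
    "<article>"
    "<para style='title'>Bringing Docs to the Web with XML</para>"
    "<para style='body'>So you want to post docs on the Web.</para>"
    "</article>"
)


def resolve_styles(root, sheet):
    """Pair each paragraph's text with the formatting its style name maps to."""
    return [(para.text, sheet[para.get("style")]) for para in root.iter("para")]


resolved = resolve_styles(doc, STYLE_SHEET)
for text, formatting in resolved:
    print(formatting["font"], formatting["size"], "|", text)
```

Changing a font or size in the dictionary restyles every paragraph that uses that name without touching the document.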
What's the Benefit?
Using XML as a base, great things can be achieved:
Using the content information:
A researcher can create intelligent queries, such as "Please search the complete Internet for all statistics or market size information on data entry services and put them into this spreadsheet or database." Alternatively, you could look for all articles across the Internet authored by a certain person.
Documents can be controlled and routed based on their content, date or source.
-- Document Management. Data such as COLD/ERM reports can be coded and manipulated together with other Internet data.
-- Forms Processing. Forms can be understood and users can build mathematical models, validations and/or import capabilities.
-- Data Management. Databases can be structured and queried on the basis of XML tags.
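A query by author, for instance, might look like the following Python sketch. The collection, its element names and the placeholder entries are assumptions; a real search would run across many sites rather than one in-memory string.

```python
import xml.etree.ElementTree as ET

# Hypothetical collection of tagged articles; the second entry is a placeholder.
COLLECTION = """<articles>
  <article>
    <title>Bringing Docs to the Web with XML</title>
    <author>Harvey Spencer</author>
  </article>
  <article>
    <title>Placeholder article</title>
    <author>Placeholder Author</author>
  </article>
</articles>"""

library = ET.fromstring(COLLECTION)


def titles_by_author(root, author):
    """Return the titles of all articles whose <author> tag matches exactly."""
    return [
        article.findtext("title")
        for article in root.iter("article")
        if article.findtext("author") == author
    ]


print(titles_by_author(library, "Harvey Spencer"))
```

Because the author is tagged as an author, the query does not have to guess whether a name in the text is a byline or merely a mention.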
Using the style information:
A document format can be dynamically changed (without changing the core document) to reflect its environment, whether it be
-- Published on screen, on paper or on CD or DVD.
-- Displayed multi-nationally in different countries using different languages.
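Here is a small Python sketch of one document presented two ways without changing the document itself. The two render functions and the "screen" and "paper" names are illustrative assumptions, not part of XML or XSL.

```python
import html
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<article>"
    "<title>Bringing Docs to the Web with XML</title>"
    "<para>XML separates content from presentation.</para>"
    "</article>"
)


def render_screen(root):
    """Render as simple HTML for on-screen display."""
    parts = []
    for element in root:
        tag = "h1" if element.tag == "title" else "p"
        parts.append(f"<{tag}>{html.escape(element.text)}</{tag}>")
    return "".join(parts)


def render_paper(root):
    """Render as plain text for print, with an underlined upper-case title."""
    lines = []
    for element in root:
        if element.tag == "title":
            lines.append(element.text.upper())
            lines.append("=" * len(element.text))
        else:
            lines.append(element.text)
    return "\n".join(lines)


# The environment picks the renderer; the document stays the same.
RENDERERS = {"screen": render_screen, "paper": render_paper}

for medium, render in RENDERERS.items():
    print(medium)
    print(render(doc))
```

Supporting another medium or language edition means adding another renderer, not another copy of the document.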
Of course, all these items will depend on sites coding their pages in a meaningful way. But as with HTML, high-level tools are appearing to assist with this. Microsoft, too, seems committed to providing XML formatting in its future Office releases. Also, IBM has announced that it will support XML in future releases of its database software.
Current browsers do not support XML, although you can download a beta of Microsoft's Internet Explorer 5.0 to try it out. (Note, though, that you first have to upgrade to Internet Explorer 4.01, an 11MB file that took me 3 hours to download, and then download again to get the 5.0 beta.)
The leaders in XML software tools, such as Aeneid, Interleaf, Hynet Technologies, Enigma, Chrystal Software and Reachcast, mostly come from the electronic publishing and engineering world (since many used SGML to encode manuals and other documents).
Because XML is a subset of SGML, these companies have an enormous advantage. But look to the document management world to embrace this technology enthusiastically in the next 12 months.
Harvey Spencer is a document technologies consultant based in East Northport, NY. He can be reached at 516-368-8393 or at harvey@harveyspencer.com