This is a W3C Working Draft for review by W3C members and other interested parties. It is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to use W3C Working Drafts as reference material or to cite them as other than "work in progress". A list of current W3C tech reports can be found at: http://www.w3.org/pub/WWW/TR/.
Please direct comments and questions to [email protected], an open discussion forum. Include the keyword "sgml-lex" in the subject.
The Standard Generalized Markup Language (SGML) is a complex system for developing markup languages. It is used to define the Hypertext Markup Language (HTML) used in the World Wide Web, as well as several other hypermedia document representations.
Systems with interactive performance constraints use only the simplest features of SGML. Unfortunately, the specification of those features is subtly mixed into the specification of SGML in all its generality. As a result, a number of ad-hoc SGML lexical analyzers have been developed and deployed on the Internet, and reliability has suffered.
We present a self-contained specification of a lexical analyzer that uses automated parsing techniques to handle SGML document types limited to a tractable set of SGML features. An implementation is available as well.
The hypertext markup language is an SGML format. --Tim Berners-Lee, in "About HTML"
The result of that design decision is something of a collision between the World Wide Web development community and the SGML community -- between the quick-and-dirty software community and the formal ISO standards community. It also creates a collision between the interactive, online hypermedia technology and the bulk, batch print publications technology.
SGML, Standard Generalized Markup Language, is a complex, mature, stable technology. The international standard, ISO 8879:1986[SGML], is nearly ten years old, and GML-based systems pre-date the standard by years. On the other hand, HTML, Hypertext Markup Language, is a relatively simple, new and rapidly evolving technology.
SGML has a number of degrees of freedom which are bound in HTML. SGML is a system for defining markup languages, and HTML is one such language; in standard terminology, HTML is an SGML application.
The degrees of freedom in SGML which the HTML 2.0 specification[HTML2.0] binds can be separated into high-level, document structure considerations on the one hand, and low-level, lexical details on the other. The document structure issues are specific to the domain of application of HTML, and they are evolving rapidly to reflect new features in the web.
The lexical properties of HTML 2.0 are very stable by comparison. HTML documents fit into a category termed basic SGML documents in the SGML standard, with a few exceptions (see below). These properties are independent of the domain of application of HTML. They are shared by a number of contemporary SGML applications, such as TEI[TEI], DocBook[DocBook], HTF[HTF], and IBM-IDDOC[IBM-IDDOC].
The specification of this straightforward category of SGML documents is, unfortunately, subtly mixed into the specification of SGML in all its generality.
An unfortunate result is that a number of lexically incompatible HTML parser implementations have been developed and deployed.[REF! Mosaic 2.4, Cern libwww parser].
The objectives of the document are to:
While this report focuses on the SGML features necessary for HTML 2.0 user agents, it should be applicable to future HTML versions and to extensions of the HTML standard[HTMLDIALECT], as well as other SGML applications used on the internet[SGMLMEDIA]. See the "Future Work" section for discussion.
An SGML document is a sequence of characters organized as one or more entities for storage and transmission, with a logical hierarchy of elements imposed.
The organization of an SGML document into entities is analogous to the organization of a C program into source files[KnR2]. This report does not formally address entity structure. We restrict our discussion to documents consisting of a single entity.
The element hierarchy of an SGML document is actually the last of three parts. The first two are the SGML declaration and the prologue.
The SGML declaration binds certain variables such as the character strings that serve delimiter roles, and the optional features used. The SGML declaration also specifies the document character set -- the set of characters allowed in the document and their corresponding character numbers. For a discussion of the SGML declaration, see [SGMLDECL].
The prologue, or DTD, declares the element types allowed in the document, along with their attributes and content models. The content models express the order and occurrence of elements in the hierarchy.
SGML facilitates the development of document types, or specialized markup languages. An SGML application is a set of rules for using one or more document types. Typically, a community such as an industry segment, after identifying a need to interchange data in a rigorous method, develops an SGML application suited to their practices.
The document type definition includes two parts: a formal part, expressed in SGML, called a document type declaration or DTD, and a set of application conventions. An overview of the syntax of a DTD follows. For a more complete discussion, see [SGMLINTRO].
The DTD essentially gives a grammar for the element structure of the specialized markup language: the start symbol is the document element name; the productions are specified in element declarations, and the terminal symbols are start-tags, end-tags, and data characters. For example:
<!doctype Memo [
<!element Memo - - (Salutation, P*, Closing?)>
<!element Salutation O O (Date & To & Address?)>
<!element (P|Closing|To|Address) - O (#PCDATA)>
<!element Date - O EMPTY>
<!attlist Date numeric CDATA #REQUIRED>
]>
These four element declarations specify that a Memo consists of a Salutation, zero or more P elements, and an optional Closing. The Salutation is a Date, To, and optionally, an Address.
The notation "- -" specifies that both start and end tags are required; "O O" specifies both are optional, and "- O" specifies that the start tag is required, but the end tag is optional. The notation #PCDATA refers to parsed character data -- data characters with auxiliary markup such as comments mixed in. An element declared EMPTY has no content and no end-tag.
The ATTLIST declaration specifies that the Date element has an attribute called numeric. The #REQUIRED notation says that each Date start-tag must specify a value for the numeric attribute.
The following is a sample instance of the memo document type:
<!doctype memo system>
<Memo>
<Date numeric="1994-06-12">
<To>Third Floor
<p>Please limit coffee breaks to 10 minutes.
<Closing>The Management
</Memo>
The following left-derivation shows the nearly self-evident structure of SGML documents when viewed at this level:
Memo -> <Memo>, Salutation, P, Closing, </Memo>
Salutation -> Date, To
Date -> <Date numeric="1994-06-12">
To -> <To>, "Third Floor"
P -> <P>, "Please limit coffee breaks to 10 minutes."
Closing -> <Closing>, "The Management"
The lexical analyzer in this report reports events at this level: start-tags, end-tags, and data.
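The grammar view of the Memo DTD can be made concrete. The following is a minimal sketch in Python, not part of the sgml-lex distribution: it writes the content model (Salutation, P*, Closing?) as a regular expression over child element names. The names MEMO_MODEL and fits_memo_model are invented for illustration, and inferring omitted tags is out of scope.

```python
import re

# Hypothetical illustration: the content model (Salutation, P*, Closing?)
# written as a regular expression over space-separated child element names.
MEMO_MODEL = re.compile(r"Salutation( P)*( Closing)?")


def fits_memo_model(children):
    """Return True if the list of child element names satisfies the model."""
    return MEMO_MODEL.fullmatch(" ".join(children)) is not None


print(fits_memo_model(["Salutation", "P", "Closing"]))  # True
print(fits_memo_model(["P", "Salutation"]))             # False
```

A structural parser would do this kind of check on top of the start-tag, end-tag, and data events; the lexical analyzer itself never consults the content model.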
Basic SGML documents are like ordinary text files, but the text is enhanced with certain constructs called markup. The markup constructs add structure to documents.
The lexical analyzer separates the characters of a document into markup and data characters. Markup is separated from data characters by delimiters. The SGML delimiter recognition rules include a certain amount of context information. For example, the delimiter string "</" is only recognized as markup when it is followed by a letter.
For a formal specification of the language constructs, see the lex specification (which is part of the implementation source distribution[DIST]). The following is an informal overview.
Each SGML document begins with a document type declaration. Comment declarations and marked section declarations are other types of markup declarations.
The string <! followed by a name begins a markup declaration. The name is followed by parameters and a >. A [ in the parameters opens a declaration subset, which is a construct prohibited by this report.
The string <!-- begins a comment declaration. The -- begins a comment, which continues until the next occurrence of --. A comment declaration can contain zero or more comments. The string <!> is an empty comment declaration.
The string <![ begins a marked section declaration, which is prohibited by this report.
For example:
<!doctype foo>
<!DOCTYPE foo SYSTEM>
<!doctype bar system "abcdef">
<!doctype BaZ public "-//owner//DTD description//EN">
<!doctype BAZ Public "-//owner//DTD desc//EN" "sysid">
<!>
another way to escape < and &: <<!>xxx &<!>abc;
<!-- xyz -->
<!-- xyz -- --def-->
<!---- ---- ---->
<!------------>
<!doctype foo --my document type-- system "abc">
The following examples contain no markup. They illustrate that "<!" does not always signal markup.
<! doctype> <!,doctype> <!23> <!- xxx -> <!-> <!-!>
The following are errors:
<!doctype xxx,yyy> <!usemap map1> <!-- comment-- xxx> <!-- comment -- -> <!----->
The following are errors, but they are not reported by this lexical analyzer.
<!doctype foo foo foo> <!doctype foo 23 17> <!junk decl>
The following are valid SGML constructs that are prohibited by this report:
<!doctype doc [ <!element doc - - ANY> ]>
<![ IGNORE [ lkjsdflkj sdflkj sdflkj ]]>
<![ CDATA [ lskdjf lskdjf lksjdf ]]>
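The recognition rules for "<!" can be sketched with regular expressions. This is an illustrative Python sketch under the rules stated in this section, not the flex specification from the distribution; the helper names classify_bang and comments_in are hypothetical.

```python
import re

# One comment: "--", any text not containing "--", then "--".
COMMENT = r"--(?:(?!--).)*--"
# A comment declaration: "<!", zero or more comments (each optionally
# followed by whitespace), then ">".
COMMENT_DECL = re.compile(rf"<!(?:{COMMENT}\s*)*>", re.DOTALL)


def classify_bang(text):
    """Classify text that starts with '<!' (hypothetical helper)."""
    if text.startswith("<!--") or text.startswith("<!>"):
        return "comment declaration"
    if text.startswith("<!["):
        return "marked section (prohibited)"
    if re.match(r"<![A-Za-z]", text):
        return "markup declaration"
    return "data"  # "<!" not followed by a letter, "--", "[", or ">"


def comments_in(decl):
    """Return the comments in a comment declaration, or None if malformed."""
    if COMMENT_DECL.fullmatch(decl) is None:
        return None
    return re.findall(COMMENT, decl[2:-1], re.DOTALL)


print(comments_in("<!-- xyz -- --def-->"))  # ['-- xyz --', '--def--']
print(comments_in("<!----->"))              # None
```

Note that classify_bang only inspects the opening delimiter in context; whether the rest of the declaration is well formed is a separate check, as comments_in shows for comment declarations.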
Tags are used to delimit elements. Most elements have a start-tag, some content, and end-tag. Empty elements have only a start-tag. For some elements, the start-tag and/or end-tag are optional. Empty elements and optional tags are structural constructs specified in the DTD, not lexical issues.
A start-tag begins with < followed by a name, and ends with >. The name refers to an element declaration in the DTD. An end-tag is similar, but begins with </.
For example:
<x> yyy </X> <abc.DEF > ggg </abc.def > <abc123.-23> <A>abc def <b>xxx</b>def</a> <A>abc def <b>xxxdef</a>
The following examples contain no markup. They illustrate that the < and </ strings do not always signal markup.
< x > <324 </234> <==> < b> <%%%> <---> <...> <--->
The following examples are errors:
<xyz!> <abc/> </xxx/> <xyz&def> <abc_def>
These last few examples illustrate valid SGML constructs that are prohibited in the languages described by this report:
<> xyz </> <xxx<yyy> </yyy</xxx> <xxx/content/
A name is a name-start character -- a letter -- followed by any number of name characters -- letters, digits, periods, or hyphens. Entity names are case sensitive, but all other names are not.
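The name syntax and the contextual recognition of < and </ can be sketched as follows. This is a minimal Python illustration of the rules as stated in this section; the names NAME, TAG_OPEN, and tag_open are hypothetical and do not come from the sgml-lex library.

```python
import re

# A name: a letter followed by letters, digits, periods, or hyphens.
NAME = r"[A-Za-z][A-Za-z0-9.-]*"
# "<" or "</" counts as a tag-open delimiter only when a name follows.
TAG_OPEN = re.compile(rf"<(/?)({NAME})")


def tag_open(text):
    """Return (kind, folded_name) if text starts a tag, else None."""
    m = TAG_OPEN.match(text)
    if m is None:
        return None
    kind = "end" if m.group(1) else "start"
    # Element names are not case sensitive, so fold them to lower case.
    return kind, m.group(2).lower()


print(tag_open("</X>"))   # ('end', 'x')
print(tag_open("< x >"))  # None
```

The sketch only recognizes the opening of a tag. A string such as <xyz!> still opens a start-tag named xyz; the error is detected later, when the characters after the name are scanned.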
Start tags may contain attribute specifications. An attribute specification consists of a name, an "=" and a value specification. The name refers to an item in an ATTLIST declaration.
The value can be a name token or an attribute value literal. A name token is one or more name characters. An attribute value literal is a string delimited by double-quotes (") or a string delimited by single-quotes ('). Interpretation of attribute value literals is covered in the discussion of the lexical analyzer API.
If the ATTLIST declaration specifies an enumerated list of names, and the value specification is one of those names, the attribute name and "=" may be omitted.
For example:
<x attr="val"> <x ATTR ="val" val> <y aTTr1= "val1"> <yy attr1='xyz' attr2="def" attr3='xy"z' attr4="abc'def"> <xx abc='abc"def'> <xx aBC="fred & barney"> <z attr1 = val1 attr2 = 23 attr3 = 'abc'> <xx val1 val2 attr3=.76meters> <a href=foo.html> ..</a> <a href=foo-bar.html>..</a>
The following examples illustrate errors:
<x attr = abc$#@> <y attr1,attr2> <tt =xyz> <z attr += 2> <xx attr=50%> <a href=http://foo/bar/> <a href="http://foo/bar/> ... </a> ... <a href="xyz">...</a> <xx "abc"> <xxx abc=>
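The attribute rules above can be sketched as a small tokenizer. This is an illustrative Python sketch, not the library's implementation; parse_attributes and the error messages are hypothetical, and attribute names are folded to lower case on the assumption that case folding is wanted.

```python
import re

NAME = r"[A-Za-z][A-Za-z0-9.-]*"             # letter, then name characters
NMTOKEN = r"[A-Za-z0-9.-]+"                  # one or more name characters
LITERAL = r'"[^"]*"' + "|" + r"'[^']*'"      # value literal, quotes kept
# An attribute specification: optional "name =", then a literal or name token.
ATTR = re.compile(rf"(?:({NAME})\s*=\s*)?({LITERAL}|{NMTOKEN})")
TAG_NAME = re.compile(rf"<{NAME}")
SPACE = re.compile(r"\s*")


def parse_attributes(tag):
    """Return [(name or None, value)] for a start-tag; raise ValueError on errors."""
    start = TAG_NAME.match(tag)
    if start is None or not tag.endswith(">"):
        raise ValueError("not a start-tag")
    body = tag[start.end():-1]
    pairs = []
    pos = SPACE.match(body).end()
    while pos < len(body):
        m = ATTR.match(body, pos)
        if m is None:
            raise ValueError(f"bad character in tag: {body[pos]!r}")
        name, value = m.group(1), m.group(2)
        if name is None and value[0] in "\"'":
            # Only a name token may appear with its attribute name omitted.
            raise ValueError("attribute value literal requires a name")
        pairs.append((name.lower() if name else None, value))
        pos = SPACE.match(body, m.end()).end()
    return pairs


print(parse_attributes("<xx val1 val2 attr3=.76meters>"))
# [(None, 'val1'), (None, 'val2'), ('attr3', '.76meters')]
```

Whether an omitted name is legitimate depends on the ATTLIST declaration, which this sketch, like the lexical analyzer, does not consult.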
Characters in the document character set can be referred to by numeric character references. Entities declared in the DTD can be referred to by entity references.
An entity reference begins with "&" followed by a name, followed by an optional semicolon.
A numeric character reference begins with "&#" followed by a number followed by an optional semicolon. (The string "&#" followed by a name is a construct prohibited by this report.) A number is a sequence of digits.
The following examples illustrate character references and entity references:
&#38; &#200;
&amp; &ouml;
&#38 &#200,xxx
&amp &abc() &xy12/..\
To illustrate the X tag, write &lt;X&gt;
These examples contain no markup. They illustrate that "&" does not always signal markup.
a & b, a &# b a &, b &. c a &#-xx &100
These examples are errors:
� .7 -35 x;
The following are valid SGML, but prohibited by this report:
&#SPACE; &#RE;
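The recognition of "&" in context can be sketched with a single regular expression. This is a minimal Python illustration of the rules in this section; REFERENCE and find_references are hypothetical names, and the prohibited "&#" followed by a name is simply not matched here rather than reported as a limitation.

```python
import re

NAME = r"[A-Za-z][A-Za-z0-9.-]*"
# "&#" + number, or "&" + name; the trailing semicolon is optional.
REFERENCE = re.compile(rf"&(?:#([0-9]+)|({NAME}));?")


def find_references(text):
    """Return [(kind, matched_text)] for each reference found in text."""
    found = []
    for m in REFERENCE.finditer(text):
        kind = "character" if m.group(1) is not None else "entity"
        found.append((kind, m.group(0)))
    return found


print(find_references("&#38 &#200,xxx"))
# [('character', '&#38'), ('character', '&#200')]
print(find_references("a & b, a &# b a &, b &. c a &#-xx &100"))
# []
```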
Processing instructions are a mechanism to capture platform-specific idioms. A processing instruction begins with <? and ends with >.
For example:
<?>
<?style tt = font courier>
<?page break>
<?experiment> ... <?/experiment>
An implementation of this specification is available[DIST], in the form of an ANSI C library. This section documents the API to the library. Note that the library is undergoing testing and revision. The API is expected to change.
The client of the lexical analyzer creates a data structure to hold the state of the lexical analyzer with a call to SGML_newLexer, and uses calls to SGML_lex to scan the data. Constructs are reported to the caller via three callback functions. SGML_lexNorm is used to set case folding of names and whitespace normalization, and SGML_lexLine can be used to get the number of lines the lexer has encountered.
The output of the lexical analyzer, for each construct, is an array of strings, and an array of enumerated types in one-to-one correspondence with the strings.
Data characters are passed to the primary callback function as an array of one single string containing the data characters and SGML_DATA as the type.
Note that the output contains all newlines (record end characters) from the input verbatim. Implementing the rules for ignoring record end characters as per section 7.6.1 of SGML is left to the client.
Start-tags and end-tags are also passed to the primary callback function.
For a start-tag, the first element of the output array is a string of the form <name with SGML_START as the corresponding type. If requested (via SGML_lexNorm), the name is folded to lower-case. The remaining elements of the array give the attributes; see below. For an end tag, the first element of the array is a case-folded string of the form </name with SGML_END as the type.
The output for attributes is included with the tag in which they appear. Attributes are reported as name/value pairs. The attribute name is output as a string of the form name and SGML_ATTRNAME as the type. An omitted name is reported as NULL.
An attribute value literal is output as a string of the form "xxx" or 'xxx' including the quotes, with SGML_LITERAL as the type. Other attribute values are returned as a string with SGML_NMTOKEN as the type. For example:
<xX val1 val2 aTTr3=".76meters">
is passed as an array of eight strings:
[Tag/Data]
  Start Tag: `<xx'
  Attr Name: `'
  Name: `val1'
  Attr Name: `'
  Name: `val2'
  Attr Name: `attr3'
  Name Token: `.76meters'
  Tag Close: `>'
Note that attribute value literals are output verbatim. Interpretation is left to the client. Section 7.9.3 of SGML says that an attribute value literal is interpreted as an attribute value by:
A character reference refers to the character in the document character set whose number it specifies. For example, if the document character set is ISO 646 IRV (aka ASCII), then &#65; is another way to write "A".
A numeric character reference is passed to the primary callback as an event whose first token type is SGML_NUMCHARREF and whose string takes the form &#999. The second token, if present, has type SGML_REFC, and consists of a ; or a newline.
A general entity reference is passed as an event whose first token is of the form &name with SGML_GEREF as its type. The second token, if present, has type SGML_REFC, and consists of a semicolon or a newline.
The client should check the reference against the entities declared in the DTD.
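The shape of a reference event, as described above, can be sketched as a pair of parallel lists. This Python sketch only mimics the description in this section; it does not call the C library, the function name reference_event is hypothetical, and the type names are plain strings standing in for the library's enumerated types.

```python
import re

NAME = r"[A-Za-z][A-Za-z0-9.-]*"
# The reference itself, then an optional reference close (";" or newline).
REFERENCE = re.compile(rf"(&#[0-9]+|&{NAME})([;\n]?)")


def reference_event(text):
    """Return (strings, types) for a reference at the start of text, or None."""
    m = REFERENCE.match(text)
    if m is None:
        return None
    kind = "SGML_NUMCHARREF" if m.group(1).startswith("&#") else "SGML_GEREF"
    strings, types = [m.group(1)], [kind]
    if m.group(2):  # the optional second token
        strings.append(m.group(2))
        types.append("SGML_REFC")
    return strings, types


print(reference_event("&#65;"))
# (['&#65', ';'], ['SGML_NUMCHARREF', 'SGML_REFC'])
```

Keeping the strings and types in one-to-one correspondence lets the client walk both lists with a single index, which is how the output of the lexical analyzer is described.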
Other markup is passed to the second callback function.
A comment declaration is reported as the string <! with type SGML_MARKUP_DECL, followed by zero or more strings of the form -- comment -- with SGML_COMMENT as the type, followed by > with type MDC.
Other markup declarations are output as a string of the form <!doctype followed by strings of type SGML_NAME, SGML_NUMBER, SGML_LITERAL, and/or SGML_COMMENT, followed by TAGC.
For example:
<!Doctype Foo --my document type-- System "abc">
is reported as
[Aux Markup]
  Markup Decl: `<!doctype'
  Name: `foo'
  Comment: `--my document type--'
  Name: `system'
  Literal: `"abc"'
  Tag Close: `>'
A processing instruction is passed as a string of the form <?pi stuff> with type SGML_PI.
Errors are passed to the third callback function. Two strings and two types are passed. For errors, the first string is a descriptive message, and the type is SGML_ERROR. The second string is the offending data, and the type is SGML_DATA.
Limitations imposed in this report are output similarly, but with type SGML_LIMITATION instead of SGML_ERROR. The lexical analyzer skips to a likely end of the error construct before continuing.
For example:
<tag xxx=yyy ?>xxx <![IGNORE[ a<b>c]]> zzz
causes six callbacks:
[Err/Lim]
  !!Error!!: `bad character in tag'
  Data: `?'
[Tag/Data]
  Start Tag: `<tag'
  Attr Name: `xxx'
  Name Token: `yyy'
  Tag Close: `>'
[Tag/Data]
  Data: `xxx '
[Err/Lim]
  !!Limitation!!: `marked sections not supported'
  Data: `<!['
[Err/Lim]
  !!Limitation!!: `declaration subset: skipping'
  Data: `IGNORE[ a<b>c'
[Tag/Data]
  Data: ` zzz'
In section 15.1.1 of the SGML standard, a Basic SGML document is defined as an SGML document that uses the reference concrete syntax and the SHORTTAG and OMITTAG features. A concrete syntax is a binding of the SGML abstract syntax to concrete values. The reference concrete syntax binds the delimiter role stago to the string <, the role of etago to </, and so on. The OMITTAG feature allows documents to omit tags in certain cases that do not introduce ambiguity -- without OMITTAG, every element's start and end tags must occur in the document. The SHORTTAG feature allows for some short-hand syntax in attributes and tags.
Some of these exceptions are likely to be reflected in the ongoing revision of SGML [SGMLREV].
The reference concrete syntax includes certain limitations (capacities and quantities, in the language of the standard). For most purposes, these limitations are unnecessary. We remove them:
We require the SGML declaration to be implicit and the DTD to be included by reference only:
Named Character Reference &#SPACE; &#RS &#RE;
Some constructs save typing, but add no expressive capability to the languages. And while they technically introduce no ambiguity, they reduce the robustness of documents, especially when the language is enhanced to include new elements. The SHORTTAG constructs related to attributes are widely used and implemented, but those related to tags are not.
These are relatively straightforward to support, but they are not widely deployed. While documents that use them are conforming SGML documents, they will not work with the deployed HTML tools. This lexical analyzer signals a limitation when encountering these constructs.
NET tags: <name/.../
Unclosed Start Tag: <name1<name2>
Empty Start Tag: <>
Empty End Tag: </>
In addition, the lexical analyzer assumes no short references are used.
This report presents technology that is usable, but not complete. Work is ongoing in the following areas. Contributions are welcome. Send a note to [email protected] with "sgml-lex" in the subject.
Support for marked sections is an integral part of a strategy for interoperability among HTML user agents supporting different HTML dialects[HTMLDIALECT]. It has other valuable applications, and it is a straightforward addition to the lexical analyzer in this report.
Support for character encodings and coded character sets other than ASCII is a requirement for production use. Support for the X Windows compound text encoding (related to ISO-2022) and the UTF-8 or perhaps UCS-2 encoding of Unicode (ISO-10646), with extensibility for other character encodings seems most desirable.
Internal declaration subsets are not expected to become a part of HTML. But the technology in this report is applicable to other SGML applications, and internal declaration subsets are a straightforward addition to this lexical analyzer. Relevant mechanisms include:
While they may increase the complexity of the lexical analyzer, short references may be necessary to support math markup in HTML. Empty end-tags are not likely to be used in HTML, as they interact badly with conventions for handling undeclared element tags. But in other SGML applications, they are a useful feature.
A formal specification of the lexical analyzer discussed in this report is given in the form of a [flex] input file.
The flex input file is part of the sgml-lex source distribution, which contains an implementation of the API discussed above, and some test materials.
The source distribution is provided under the W3C copyright, which allows unlimited redistribution for any purpose.
MD5 Checksum                      Filename
21f7b70ec7135531bc84fd4c5e3cdf3d  sgml-lex-19960207.tar.gz (pgp sig)
083e21759d223b1005402120cdbf8169  sgml-lex-19960207.zip (pgp sig)
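A downloaded archive can be compared against the listed checksum. The following is a small Python sketch using the standard hashlib module; the function names md5_hex and matches_checksum are hypothetical, and the archive is assumed to have been saved locally under the file name shown in the table.

```python
import hashlib


def md5_hex(path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of the file at path."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so large archives need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Compare a file's MD5 digest with a published hexadecimal checksum."""
    return md5_hex(path) == expected_hex.lower()
```

For example, matches_checksum("sgml-lex-19960207.tar.gz", "21f7b70ec7135531bc84fd4c5e3cdf3d") would be expected to return True for an intact copy of the first archive. MD5 detects accidental corruption only; the PGP signatures are the stronger check.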
Message-Id: <[email protected]>
Date: Mon, 20 Mar 1995 18:21:23 -0500
To: [email protected], [email protected]
From: [email protected] (Steven J. DeRose)
Subject: SGML Open recommendations on HTML 3