Specifying XML Structures Using Schema
OVERVIEW
We've already seen that XML documents can be described using Document Type Definitions, DTDs. DTDs originated with SGML and show those origins all too visibly. XML documents are far more complex and varied than their SGML cousins because XML is used in far more ways than SGML was. This creates a problem. While DTDs are perfectly suitable for SGML, where they have been used successfully for many years, they are a poor fit for the newer technology of XML. DTDs cannot be processed by XML-only applications: developers need to learn two relatively complex languages to use them, and DTDs cannot be checked using XML validators. XML also needs more data types than DTDs can express, and is generally far richer. In short, DTDs cannot adequately describe many XML documents.
To remedy this situation, W3C has created a language called XML Schema which can be used to define XML structures. A number of different schema languages exist. In this chapter I will be writing specifically about XML Schema because it is a Recommendation of W3C. I'll be using the terms XML Schema and schema interchangeably – my choice being based purely upon which reads better in a given context. If I wanted to be precise all of the time I would use XML Schema when referring to the language and Recommendation, and schema when referring to a particular document that uses the language.
As I write this, far more tools exist to handle DTDs than XML Schema. This situation is changing rapidly since everyone sees the advantages of using schemas. DTDs are really a technical dead-end, although understanding them will remain important since so many exist. It's likely that when you are using older documents, they'll continue to be described using DTDs. New documents should always be described using XML Schema.[1]
The most important omission in the DTD is the idea of a data type. SGML documents tend to contain mostly plain text. Almost all data in an SGML application can be treated as strings of characters in definitions and applications. XML documents require a far richer set of data types, including strings of characters, numbers, both whole and decimal, and complex types such as dates and times. XML Schema introduces data types which, in turn, leads to more tightly defined XML structures which can be used with current database technologies or in conventional applications written in general-purpose programming languages. Other new, and useful, features in the XML Schema Recommendation include:
a simple pattern matching grammar which might be used, for example, to define the structure of an order code,
defined ordering of subelements so that document structure can be tightly controlled,
selection between different elements so that documents can share a schema without having identical structure.
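As a taste of the language, here is a sketch of a schema that uses all three features; the element names, the order-code pattern and the types are invented for illustration:

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="order">
    <xs:complexType>
      <xs:sequence>                                    <!-- defined ordering -->
        <xs:element name="code">
          <xs:simpleType>
            <xs:restriction base="xs:string">
              <xs:pattern value="[A-Z]{2}-[0-9]{4}"/>  <!-- pattern grammar -->
            </xs:restriction>
          </xs:simpleType>
        </xs:element>
        <xs:choice>                                    <!-- selection between elements -->
          <xs:element name="quantity" type="xs:positiveInteger"/>
          <xs:element name="weight" type="xs:decimal"/>
        </xs:choice>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

Notice that the schema is itself an XML document, and that the quantity and weight elements are given data types far more precise than anything a DTD could express.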
DTDs are described using their own, unique, syntax. Using them means having to learn, and apply, two sets of syntactic rules in one application. While DTDs are not the most complex documents imaginable, it is vital that developers define them correctly. Equally important, parsing and manipulating DTDs within applications requires special libraries. XML Schema documents can be handled much more easily because they are fully compliant XML documents in their own right. What does this mean in practice? The tools that you use to develop, parse and manipulate your XML can also be used for your schemas. Developers need learn only one set of rules for schema and document, and both can be created using the same pieces of editing software.
Using XML Schema requires an understanding of namespaces. Schema definitions always use namespaces, so much so that namespaces are one of the cornerstones of schema technology. I've mentioned namespaces before; now is the time to examine them in detail and learn how to use them.
[1]Although pragmatic realities such as organizational politics, historical preferences or the tools you have available may force you to use DTDs.
STRUCTURE
The DTD is a series of declarations. Each declaration takes the form:

<!KEYWORD definition>
and contains one of four keywords. These are:
ELEMENT which defines a tag,
ATTLIST which defines the attributes of an ELEMENT,
ENTITY which defines a reusable piece of content,
NOTATION which identifies the format of non-XML data.
The easiest way to understand the structure of a DTD is to look at a simplified one. Rather than create a novel structure, I'm going to use part of the DTD for the Business Letter. This is shown in Listing 3.1.
Listing 3.1: Partial DTD for the Business Letter
<!DOCTYPE letter [
<!ELEMENT letter (address, greeting, body, signature)>
<!ELEMENT address (name, street, town, county?, country?, code?)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT street (#PCDATA)>
<!ELEMENT town (#PCDATA)>
<!ELEMENT county (#PCDATA)>
<!ELEMENT country (#PCDATA)>
<!ELEMENT code (#PCDATA)>
<!ELEMENT greeting (#PCDATA)>
<!ELEMENT body (#PCDATA)>
<!ELEMENT signature (#PCDATA)>
]>
Applications that manipulate XML need to be able to move through the data structure, finding elements, tags and content. Processing data to extract meaning from it is called parsing in computing. The same term is used to describe the processing of sentences in human languages to extract their meaning. The idea, in both cases, is the same. Few developers choose to write their own XML parsers. Although the rules of the grammar are relatively simple, writing fast and accurate parsers is a difficult task. Most people use a parser written by someone else. Many XML parsers are freely available; the choice of which you use tends to depend upon your system and the language that you are developing in. Two popular choices are MSXML from Microsoft, which can be programmed using C++ or Visual Basic, and Xerces from the Apache Foundation. Xerces comes in Java, C++ and Perl versions and can be used on many different operating systems. Both these parsers can be used directly from the command line or called from within applications.
Once you have installed MSXML on your system, it is automatically available within Internet Explorer. This means that you can, for instance, open XML files in Explorer and view them as tree structures.
As you read through this book, you'll find that XML parsers can do lots of interesting things with your XML. One of the most useful is to check if the XML you have written is correct, and if it adheres to the rules set out in the DTD or schema for that particular document.
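Checking well-formedness from your own code is straightforward. Here is a minimal sketch using the parser in Python's standard library; the function name is my own:

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError

def is_well_formed(xml_text):
    """Return True if the text parses as well-formed XML."""
    try:
        parseString(xml_text)
        return True
    except ExpatError:
        return False

print(is_well_formed("<greeting>Hello</greeting>"))  # True
print(is_well_formed("<greeting>Hello</GREETING>"))  # False: case mismatch
```

The parser does the hard work; the application simply asks whether parsing succeeded.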
2.4.1 Valid or well-formed?
XML documents may be either valid or well-formed. The two terms denote differing levels of conformance: well-formedness concerns the basic structure of the XML itself, while validity adds conformance to a DTD or schema. All XML documents must be well-formed. Tags must be paired, elements must be properly nested, and entities must be properly formed; the document should also begin with an XML declaration, although strictly the Recommendation makes the declaration optional. Any application which can handle XML will be able to cope with a well-formed document. A valid document takes conformance rather further. To be valid, a DTD or schema must be identified for the XML data, and the data must meet the rules set out in that document.
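A document identifies the DTD it should be validated against in its document type declaration. For example (the filename and root element here are illustrative):

```xml
<?xml version="1.0"?>
<!DOCTYPE letter SYSTEM "letter.dtd">
<letter>
  ...
</letter>
```

A validating parser fetches letter.dtd and checks the document's content against the rules it finds there.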
All XML parsers are able to check that a document is well-formed. For some such as MSXML, this is where their capabilities end. Other parsers such as Xerces are able to validate an XML document against a DTD. At the time of writing, Schema support in Xerces is in the alpha stage of development. That means it's far from ready for the big time – but it is being implemented. XML is a new technology, it's evolving rapidly and tool support does tend to lag slightly behind. In the near future, though, the tools will be available to use XML Schema as well as DTDs. It's at that stage that we'll start to see DTDs becoming less popular with developers.
Unparsed Character Data
Most of the content in an XML file will be handled by the parser. Generally elements and entities contain text that has some meaning. The content will not usually include characters such as < which have special meaning to the parser, and when it does contain them, those characters are usually entered as character entities. Sometimes a document will include large numbers of these characters. In such cases using entities may be impractical. The XML standard allows for this. Your document can include sections of CDATA, unparsed character data. All characters inside a CDATA section are assumed to be content, rather than markup. A section of CDATA is started with the string <![CDATA[ and ended with the string ]]>, as shown in Listing 2.6. You'll meet CDATA again in the discussion of DTDs in Chapter 3.
Listing 2.6: CDATA Sections
<![CDATA[
characters until the end of the section is reached
]]>
THE XML RULES
Computer languages need to be formally defined in some way. Developers need to know what facilities are available in a language and that those facilities will work in the same way in all implementations. Languages are usually standardized by an international body such as the International Organization for Standardization, ISO, or the Institute of Electrical and Electronics Engineers, IEEE. For those languages that have defined standards, all compilers or interpreters must adhere to the standard: if a C++ compiler doesn't work according to the ANSI/ISO C++ standard then it really isn't a C++ compiler. Often these standards are minimum requirements which will be available in all products and on all platforms. Manufacturers of compilers are free to extend the language by adding their own proprietary features, although this does mean that the extended version will no longer be standard. Often large or powerful companies try to force their extensions into the standards. This can be extremely beneficial when it leads to improvements – too often standardized languages are developed by committees and become lowest common denominator languages. New extensions may only be available on one platform. If developers wish to write code on a Linux box but later compile and execute it on an Apple Macintosh, they can only do this if no extensions have been used. Problems like this tend to force people either to adhere rigidly to the standard or to work exclusively with a subset of the available platforms. When developing for heterogeneous systems such as the Web, adherence to the standard is clearly the preferred option.
XML requires a common set of rules. In fact, since any Web technology must work on every platform in a plethora of software applications, standardization is even more important than for programming languages. Perhaps surprisingly, XML, like HTML, isn't actually an international standard. It's a Recommendation of the World Wide Web Consortium (W3C). W3C Recommendations have much of the force of international standards but the process of creating them is far more flexible and far faster than standardization.
The current XML Recommendation is Version 1.0 (second edition). It can be viewed online at http://www.w3.org/TR/2000/REC-xml-20001006 or downloaded in a variety of formats. The second edition makes no major changes to the first edition of the Recommendation but does incorporate all of its errata. Most standards documents are necessarily complex. They don't make for an easy read, and the XML Recommendation is no exception. If you want to know just how much thought went into the design of XML, download a copy of the Recommendation and spend a few minutes leafing through it.
2.3.1 XML Tags
XML documents are composed of elements. An element has three parts: a start tag, an end tag and, usually, some content. Elements are arranged in a hierarchical structure, similar to a tree, which reflects both the logical structure of the data and its storage structure. A tag is a pair of angled brackets, <… >, containing the name of the element and, possibly, some attribute and value pairs. An end tag is denoted by a slash, /, placed before the element name. Here are some XML elements:
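The element names and content below are invented for illustration:

```xml
<name>Kate Andrews</name>
<street>1 High Street</street>
<county>Kent</county>
```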
XML elements must obey some simple rules:
An element must have both a start tag and an end tag unless it is an empty element.
Start tags and end tags must form a matched pair.
XML is case-sensitive so that name does not match nAme. You can, though, use both upper and lower-case letters inside your XML markup.
Tag names cannot include whitespace.
Here are those same elements with introduced errors:
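Each of the following (again with invented elements) breaks one of the rules above:

```xml
<name>Kate Andrews</street>      <!-- start and end tags do not match -->
<county>Kent</County>            <!-- case mismatch: County is not county -->
<street>1 High Street            <!-- missing end tag -->
<post code>ME10 4QX</post code>  <!-- whitespace in a tag name -->
```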
2.3.1.1 Nesting Tags
Even very simple documents have some elements nested inside others. In fact, if your document is going to be XML it has to have a root element which contains the rest of the document. Tags must pair up inside XML so that they are closed in the reverse order to that in which they were opened.
The code in the left column of Table 2.2 is not well-formed XML since the ordering of the start and end tags has become confused. The correct version is shown on the right side of the same table.
Table 2.2: Nesting Elements

Incorrect:
<to><greeting>Hi, how're ya doin'?</to></greeting>

Correct:
<to><greeting>Hi, how're ya doin'?</greeting></to>
2.3.1.2 Empty Tags
Sometimes an element that could contain text happens not to. There may be many reasons for this – the attributes of the element may contain all the necessary information, or the element may be required if the document is to be valid. These empty elements can be represented in two ways:
The empty element can be included by placing an end tag immediately after the start tag. More simply, a tag containing the name of the element followed by a slash can be used.
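Both of the following forms represent the same empty element (the element name is invented):

```xml
<pagebreak></pagebreak>
<pagebreak/>
```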
2.3.1.3 Characters in XML
When the XML Recommendation talks about characters, it means characters from the Unicode and ISO 10646 character sets. Until relatively recently most computing applications used a relatively small set of characters, typically the 128 characters of the ASCII character set, which could be represented using seven bits. The ASCII character set, defined in ISO/IEC 646, only allowed users to enter the letters, digits and punctuation typically used in English.
In a multilingual world this is clearly an impractical limitation, and it led to the development of many alternative character sets. Web applications typically use ISO 8859, which uses 8 bits for each character and which defines a number of alphabets. These include the standard Latin alphabet used as the default by most Web browsers. Unicode goes further and uses two bytes to represent each character. This means that Unicode includes 65,536 different characters, insufficient for the full range of Chinese characters but suitable for most uses. ISO 10646 extends the Unicode idea by using four bytes for each character, giving approximately 2 billion possible characters. Unicode is implemented as the default encoding in Microsoft Windows and the Java programming language, among others. But it clearly needs extending to access those extra characters, and has been. Version 2.1 of Unicode includes some facilities that give access to the ISO 10646 character set.
Using ISO 10646 to represent ASCII data is highly inefficient – effectively three bytes of memory are wasted. Even though computer memory and storage are extremely cheap today, such inefficiency is expensive if an application is handling gigabytes of data. Therefore applications use encoding schemes to store data more efficiently. Applications that process XML must support two of these: UTF-8 and UTF-16. UTF-8, for instance, uses a single byte for ASCII data and two to six bytes for extended characters.
Note XML applications support extended character sets. These allow up to 2 billion different characters. When you develop using XML you can use any language and character set that you need to in your applications. You are not restricted to the English language or to the set of languages supported on a particular operating system.
It's worth noting that everything in an XML document that is not markup is considered to be character data. Markup[4] consists of:
start tags,
end tags,
empty tags,
entity references,
character references,
comments,
delimiters for CDATA sections,
document type declarations,
processing instructions,
XML declarations,
text declarations.
The final, important thing about characters is that some of them have special meaning or cannot be easily represented in your source text using a conventional keyboard. Most of the characters in ISO 10646 clearly fall into this category. Some mechanism is therefore required to permit the full range of characters to be included in documents. This is done through character references. To demonstrate the use of character references, I'll look at those characters that can have special meaning inside markup. Characters such as <, >, &, ' and " are used as part of the markup of the document. If they're encountered by the parser inside an XML file, it assumes that they are control characters which have special meaning to it, and it then acts accordingly. The obvious example of this behavior is found in handling attributes. The following two examples would be illegal in XML:
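The attribute values here are invented for illustration:

```xml
<image src='picture 'front' view.gif'/>
<image src="picture "front" view.gif"/>
```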
In each case, the parser will assume that the content of the src attribute starts at the first apostrophe or quotation mark and stops at the second. The attribute content following that point cannot be parsed since it is not well-formed XML.
What happens when the file should legitimately contain < as part of its character data? The appropriate character reference is entered instead.[5] Table 2.3 shows the references which must be entered in an XML document if you want a particular character. Here's the previous example reworked to be valid XML:
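With the quotation marks written as character references, the attribute becomes legal:

```xml
<image src="picture &quot;front&quot; view.gif"/>
```

XML predefines five such references: &lt; for <, &gt; for >, &amp; for &, &apos; for ', and &quot; for ".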
The Disadvantages of XML Searching
Despite the optimistic view of many people in the XML community, the XML searching problem is complicated, from both technical and business perspectives. In some situations, XML-based contextual searching can be a major advantage; in others, it can be an unnecessary cost; in yet others, it can make the search engine's results worse. This section introduces some of the problems with the very idea of XML-based searching.
XML searching may be too complex for most users.
Documents on the Web can use deceptive markup to raise their ranking in a search.
XML documents are generally not interoperable in the same search environment, because of all the different, incompatible vocabularies.
6.2.1. Usability
Tim Bray, cofounder of Open Text, which ran an early Web search engine, and coauthor of the original XML specification [XML], wrote the following passage in a Web log (http://www.tbray.org/ongoing/When/200x/2003/06/17/SearchUsers):
Nobody Uses Advanced Search...
Every search engine has an "advanced search" screen, and nobody (quantitatively, less than 0.5% of users) ever goes there. This drove us nuts back at Open Text, because our engine was very structurally savvy and could do compound/boolean queries that look like what today we'd call XPath. But nobody used it.
What most people want is to have a nice simple field into which they will type on average 1.3 words and hit Enter, and have the result come back to them. So anyone who's building search needs to focus almost all their energy on doing an as-good-as-possible job given those 1.3 words and no other inputs.
This observation does not bode well for XML searching. If users are unwilling to use even relatively simple full-text techniques, such as Boolean or proximity searches, how much hope is there that they will be willing to formulate the complex queries that can take advantage of XML markup? Fortunately for the future of XML searching, Bray does go on to qualify that observation:
...Except the People who Do
Of course, the people who do use Advanced Search are your most fanatical users, the professional librarians, spooks, and private investigators. And the ones who will do what it takes to find out everything about research on the rare disease their child just got diagnosed with. These people tend to be loud-mouthed and aggressive and will get in your face if you don't have advanced search or it's not real good.
Presumably, these same kinds of people would be the ones using XML context in their searches. Others of Bray's "fanatical users" might be academics preparing papers, journalists researching news stories, and software agents collecting and amalgamating information for politicians and managers. This last example, in fact, may point to the real potential users of XML searching: not people but software. People other than governors and CEOs need to make decisions in their own lives, from changing jobs to buying new clothes, and software agents that find information for people (say, for price comparison) could benefit greatly from the extra information provided by XML markup, assuming, of course, that vendors were willing to encode their pricing information in a standard format and accept the transparency that comes with that.
And that leads to another usability problem: XML searching requires people or software to know a lot about the structure of the documents they're searching. If all XML documents shared a single, global vocabulary, searching would be relatively straightforward, at least for power users: Every price would appear inside a price element, every bar code would appear inside a upc element, every person's name would appear inside a person element, and so on. This is unlikely ever to happen, for two reasons:
No single, accepted authority could impose a common vocabulary on all users.
XML documents can encode a potentially infinite variety of information, so a common vocabulary would always be incomplete.
Some XML-related specifications are designed to work around these problems, at least partly (see Section 6.3.3 for more information), but in reality, if a large amount of XML markup did appear on the Web today, generalized XML searching would be almost useless, given the hundreds of incompatible XML-based vocabularies. The best people can hope for is specialized searching inside repositories or across Web collections in which all XML documents share a common type: Conceptually, this is the equivalent of a site search engine rather than a Web search engine, and it falls far short of a revolution in Web searching. Even then, searching will be more complex than the most difficult "advanced search" page currently available on full-text Web search engines. Either users will have to become experts in XML structure, or they will have to limit themselves to a few precooked searches, such as "Search for a person" or "Search for a part number," through Web sites that can construct an XML query for them.
In the end, XML searching may be useful for specific project applications. But usability issues alone make it seem unlikely that XML will ever cause the social revolution in Web searching that some supporters hoped for when the specification first appeared.
The Advantages of XML Searching
XML markup makes searching smarter by adding contextual information and makes it possible to correlate information from more than one document. Any serious Internet user is familiar with searches that do not work: The words are too common or have too many different meanings for any search engine to return useful results, or perhaps the information is spread among several pages. Solving these problems was one of the initial goals of XML's creators.
6.1.1. Context
Much of the time, full-text search engines do a good job, but they sometimes fall flat. Consider, for example, the difference between Bush the U.S. president and bush the shrub, or Washington the U.S. state and Washington the U.S. city. If you were trying to find information on bush pilots flying out of Washington State, you might try the search "bush pilot washington." In late 2003, Google's first ten results were as follows:
A site selling a book about a Canadian bush pilot (no connection to Washington)
Two newspaper stories about the U.S. Navy naming an aircraft carrier after President Bush Sr.
A 2002 USA Today story about a small plane violating airspace near President Bush Jr.
A 2000 Washington Post story about President Bush Jr.'s service in the Texas Air National Guard
A 2001 Pravda story critical of President Bush after the midair collision between a Chinese fighter jet and a U.S. surveillance plane
The Amazon.com page for a biography of an Alaskan bush pilot, published by University of Washington Press
Two pages from Fly Rod & Reel magazine: one stating that it has no listings for Washington and another listing angling retailers in the state
A 2003 news story from the Washington Times about a U.S. Navy pilot being held by the Iraqi government
I chose Google for this example precisely because it is a very good full-text search engine: It infers the relevance of information on a Web page not only from the text on the page but also from the other pages that link to it, the pages that link to those pages, and so on. With this difficult example, the first slightly relevant result is the twenty-third, which mentions a bush pilot who did fly once in Washington State; after that, the matches revert mainly to politics.
An experienced search-engine user could work around the problem by adding more words to the query. For example, bush pilots in Washington State have to deal with a lot of mountains, and the query string "bush pilot washington mountains" returns fewer political hits. Even better, a search string that contains specific aircraft types used in bush plane flying, such as "bush pilot washington cessna 180" returns almost all relevant matches. Most of the population is not that adept with search-engine query strings, however, and would likely give up on the whole thing; furthermore, these more specific queries would miss pages that do not happen to mention mountains or Cessna 180 airplanes.
Although homographs, such as Bush and bush, can make full-text searches difficult, an even trickier problem arises when the search results depend more on context than on the individual words. For example, consider trying to find Web pages that discuss the history of the word sex, without hitting thousands of pornography sites. Unless the decades-old dream of full machine artificial intelligence finally shakes off its dust and comes true, these are searches that will continue to flummox traditional full-text search engines.
As long as artificial intelligence remains a distant dream, we need to concentrate on getting plain old human intelligence into our XML documents, and that is precisely what XML-based markup languages do. The following News Industry Text Format [NITF] fragment shows how news providers can tag articles to avoid confusing search engines.
Today in <location><region>Washington</region></location>, President
<person>Bush</person> ...
The markup represents added human intelligence from the news reporter or editor: Washington represents a state, not a city, and Bush represents a person, not a shrub. The XML document contains not only the news story itself but also what the author knew about the news story. As this information survives all the way to the final document, search engines require no special artificial intelligence to use it.
Similarly, several markup languages, including DocBook [DOCBOOK] and the Text Encoding Initiative [TEI], define markup for talking about the history of words. Following is a DocBook example:

<para>Over the centuries, English speakers came to use the word <wordasword>sex</wordasword>
to mean not only gender, but the physical act of procreation, and,
eventually, all physically-intimate acts.</para>
The wordasword element makes it clear that this paragraph discusses the word sex rather than the act: A search engine could easily pick out this contextual information to return exactly the pages the user wanted.
XML markup is information that a document's creator knew but could not put in the main text. Because this information is available, search tools could potentially use it to return far more accurate results.
6.1.2. Correlation
In addition to basic context, XML markup also makes it possible to correlate information, matching instances of the same thing described in different ways and converting among different representations. To start with a simple example, consider only a few of the many ways documents might refer to British Prime Minister Tony Blair:
Tony Blair
Prime Minister Blair
the British Prime Minister
the prime minister
the P.M.
Mr. Blair
Blair
Given this variety, searching for information about Prime Minister Blair is difficult, and the work is made even worse by the fact that many of these phrases can apply to other people: for example, in a different context, "the prime minister" could refer to the prime minister of Australia, Canada, India, or many other countries.
How is it possible to define a Web in which people can easily search for information about Prime Minister Blair no matter how he is described? One possibility is always to normalize the name when it appears; unfortunately, if people are forced always to write "British Prime Minister Tony Blair," text will become awkward and unnatural, and even that might not be sufficient if another person named "Tony Blair" became British prime minister in the future.[1]
[1] This risk is not far fetched: Consider the phrase "U.S. President George Bush."
Using XML markup, however, it is a simple matter to attach a unique identifier to every location that mentions Prime Minister Blair. As long as the identifier is well known, search engines can look for it rather than for the text it contains, as shown by the following markup fragments:
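The element name and identifier scheme below are invented for illustration; the point is that every mention carries the same well-known identifier, whatever the visible text:

```xml
<person id="gb-pm-blair-tony">Tony Blair</person>
<person id="gb-pm-blair-tony">the prime minister</person>
<person id="gb-pm-blair-tony">Mr. Blair</person>
```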
If in the future, another person shared the same name and title, that person would have a different identifier, so search engines would not return false hits.
Most things in the world (people, concepts, historical periods, and so on) do not yet have standard, universally accepted identifiers, so this is more than a markup problem. However, many identification schemes do exist, such as stock market symbols; publication identifiers, such as ISBNs; social security numbers, phone numbers, postal codes, and country, language, and currency codes. The following example shows the use of a stock ticker symbol for identifying a company:
Today, <org idsrc="NASDAQ" value="SUNW">Sun Microsystems</org>
announced a new software strategy.
Even limiting searching to current widely accepted identifiers, XML markup can make it significantly easier to correlate information described in various ways.
Now, consider a more difficult problem than simple identification: a search for world government spending programs that cost more than USD 1 billion. This kind of a search is far beyond the capabilities of current full-text search engines, but XML markup can add hints to help future search engines do the work. The following example uses News Industry Text Format [NITF] markup once again:
Today, Congress approved an additional
<money unit="USD">3 billion</money> ...
The money element makes it clear that this article is referring to "3 billion" in currency, and the unit attribute indicates that the currency is U.S. dollars, using a code from the ISO 4217 standard for identifying world currencies [ISO-4217]. A search engine would still use full-text searching algorithms to determine that the article dealt with government spending, but then the tagging would help it determine the amount. Even more interestingly, the money element adds enough intelligence that the search engine could return correct results for pages using entirely different currencies; at the time of writing, the money in the next example is less than USD 1 billion, so it should not return a hit:
The Canadian federal government committed an additional
<money unit="CAD">1.2 billion</money> ...
Automatic currency conversion during searching, based on intelligent tagging like this, could be especially useful for financial institutions and others mining large international document repositories for information. Many other types of conversion and substitution are also possible with markup, including dates and times, language conversion and recognition of synonyms, subsets, and supersets. (For example, a search for information about New England should return pages that mention Vermont.)
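A sketch of how a search tool might use such tags, in Python; the exchange rate, the function name and the one-element document are invented for illustration:

```python
import xml.etree.ElementTree as ET

# Illustrative exchange rates to USD; a real engine would use live data.
RATES_TO_USD = {"USD": 1.0, "CAD": 0.75}

def spending_in_usd(fragment):
    """Find the first <money> element and convert its amount to US dollars."""
    root = ET.fromstring(fragment)
    money = root.find(".//money")
    # Assumes amounts written as "N billion", as in the examples above.
    amount = float(money.text.split()[0]) * 1e9
    return amount * RATES_TO_USD[money.get("unit")]

doc = '<p>Congress approved <money unit="USD">3 billion</money> ...</p>'
print(spending_in_usd(doc) > 1e9)  # more than USD 1 billion: True
```

The full-text part of the search is unchanged; the tagging simply lets the engine compare amounts across currencies.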
Obviously, a lot of infrastructure is required before search engines can work like this. But it does provide an intriguing view of a future that markup might help to enable, where people and programs can search for and find the precise information they need, relying on intelligence encoded in XML markup.
Network Resources
People have concerns about how XML networking will perform once it is in widespread use, but XML networking brings big performance advantages in one area: Because it can contain arbitrarily complex structure, an XML document can batch up information and reduce the number of network transactions required. For example, a hypothetical accounting server with an XML networking interface might allow a client to request information about multiple accounts with a single XML document sent over HTTP, as in Listing 5-1.
Listing 5-1. XML Batch Request
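Such a batch request might look like the following sketch (the vocabulary is invented for illustration):

```xml
<account-request>
  <account id="acct-1001"/>
  <account id="acct-1002"/>
  <account id="acct-1003"/>
</account-request>
```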
The server could respond with all the information also in a single XML document, as in Listing 5-2.
Listing 5-2. XML Batch Response
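A matching batch response, again with invented element names, might be:

```xml
<account-response>
  <account id="acct-1001">
    <balance currency="USD">1500.00</balance>
  </account>
  <account id="acct-1002">
    <balance currency="USD">-250.75</balance>
  </account>
  <account id="acct-1003">
    <balance currency="USD">0.00</balance>
  </account>
</account-response>
```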
In a non-XML system, the same information could require many request/response exchanges, and the extra latency would create major slowdowns for an application. More advanced distributed-computing protocols have mechanisms for batching information on the fly (called marshaling), but they are complex to implement and have proved less than impressive in the field. Perhaps XML's simple approach will turn out to be more robust and effective.