hot info

THE XML RULES

on Saturday, 20 June 2009

Computer languages need to be formally defined in some way. Developers need to know what facilities are available in a language and that those facilities will work in the same way in all implementations. Languages are usually standardized by an international body such as the International Organization for Standardization (ISO) or the Institute of Electrical and Electronics Engineers (IEEE). For those languages that have defined standards, all compilers or interpreters must adhere to the standard: if a C++ compiler doesn't work according to the ANSI/ISO C++ standard then it really isn't a C++ compiler. Often these standards are minimum requirements which will be available in all products and on all platforms. Manufacturers of compilers are free to extend the language by adding their own proprietary features, although this does mean that the extended version will no longer be standard. Often large or powerful companies try to force their extensions into the standards. This can be extremely beneficial when it leads to improvements, because standardized languages are too often developed by committees and become lowest common denominator languages. However, new extensions may only be available on one platform. If developers wish to write code on a Linux box but later compile and execute it on an Apple Macintosh, they can only do this if no extensions have been used. Problems like this tend to force people either to adhere rigidly to the standard or to work exclusively for a subset of all available platforms. When developing for heterogeneous systems such as the Web, adherence to the standard is clearly the preferred option.

XML requires a common set of rules. In fact, since any Web technology must work on every platform in a plethora of software applications, standardization is even more important than for programming languages. Perhaps surprisingly, XML, like HTML, isn't actually an international standard. It's a Recommendation of the World Wide Web Consortium (W3C). W3C Recommendations have much of the force of international standards but the process of creating them is far more flexible and far faster than standardization.

The current XML Recommendation is Version 1.0 (second edition). It can be viewed online at http://www.w3.org/TR/2000/REC-xml-20001006 or downloaded in a variety of formats. The second edition makes no major changes to the first edition of the Recommendation but does incorporate all of its errata. Most standards documents are necessarily complex. They don't make for an easy read, and the XML Recommendation is no exception. If you want to know just how much thought went into the design of XML, download a copy of the Recommendation and spend a few minutes leafing through it.

2.3.1 XML Tags
XML documents are composed of elements. An element has three parts: a start tag, an end tag and, usually, some content. Elements are arranged in a hierarchical structure, similar to a tree, which reflects both the logical structure of the data and its storage structure. A tag is a pair of angle brackets, <… >, containing the name of the element and, in a start tag, any pairs of attributes and values. An end tag is denoted by a slash, /, placed before the element name. Here are some XML elements:

The Lord Of The Rings
Helm's Deep
Professor J. R. R. Tolkien
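The parts of an element described above (name, attributes and content) can be inspected with a parser. The following is a minimal sketch in Python using the standard library's xml.etree.ElementTree; the element and attribute names (book, author, title, chapter) are assumptions made for illustration and are not taken from the original listing.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the element and attribute names are illustrative only.
SOURCE = (
    '<book author="Professor J. R. R. Tolkien">'
    "<title>The Lord Of The Rings</title>"
    "<chapter>Helm's Deep</chapter>"
    "</book>"
)


def describe(xml_text):
    """Return (name, attributes, text) for the root element and its descendants."""
    root = ET.fromstring(xml_text)
    return [(el.tag, dict(el.attrib), el.text) for el in root.iter()]


for name, attributes, text in describe(SOURCE):
    print(name, attributes, text)
```

The root element appears first, followed by its children in document order, which mirrors the tree structure of the document.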

XML elements must obey some simple rules:

An element must have both a start tag and an end tag unless it is an empty element.

Start tags and end tags must form a matched pair.

XML is case-sensitive so that name does not match nAme. You can, though, use both upper and lower-case letters inside your XML markup.

Tag names cannot include whitespace.

Here are those same elements with introduced errors:

The Lord Of The Rings
Helm's Deep
Professor J. R. R. Tolkien
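The rules above can be checked mechanically by handing text to an XML parser and seeing whether it is accepted. This is a minimal sketch in Python; the sample strings and the element names title and book are hypothetical, chosen only to break one rule at a time.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False if it rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical samples, each violating at most one of the rules listed above.
SAMPLES = {
    "matched pair": "<title>The Lord Of The Rings</title>",
    "missing end tag": "<title>The Lord Of The Rings",
    "case mismatch": "<title>The Lord Of The Rings</Title>",
    "whitespace in name": "<book title>The Lord Of The Rings</book title>",
}

for label, text in SAMPLES.items():
    print(label, "->", is_well_formed(text))
```

Only the first sample should be accepted; the others fail because of a missing end tag, a start and end tag that differ in case, and a name containing whitespace.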

2.3.1.1 Nesting Tags
Even very simple documents have some elements nested inside others. In fact, if your document is going to be XML it has to have a root element which contains the rest of the document. Tags must pair up inside XML so that they are closed in the reverse order to that in which they were opened.

The code in the left column of Table 2.2 is not well-formed XML since the ordering of the start and end tags has become confused. The correct version is shown on the right side of the same table.

Table 2.2: Nesting Elements

Incorrect                 Correct
Chris Bates               Chris Bates
Mr. M. Mouse              Mr. M. Mouse
Hi, how're ya doin'?      Hi, how're ya doin'?
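The nesting rule can be illustrated by parsing two versions of the same data. This is a sketch only: the element names (message, from, to, body) are assumptions, since the markup of the table is not shown here, and the error text returned comes from Python's parser rather than from the XML Recommendation.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names. In INCORRECT, <body> is opened inside <to>
# but <to> is closed first, so the tags overlap instead of nesting.
INCORRECT = (
    "<message><from>Chris Bates</from>"
    "<to>Mr. M. Mouse<body></to>"
    "Hi, how're ya doin'?</body></message>"
)
CORRECT = (
    "<message><from>Chris Bates</from>"
    "<to>Mr. M. Mouse</to>"
    "<body>Hi, how're ya doin'?</body></message>"
)


def nesting_error(xml_text):
    """Return None if the text parses, otherwise the parser's error message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return str(exc)
    return None


print("incorrect:", nesting_error(INCORRECT))
print("correct:", nesting_error(CORRECT))
```

In the corrected version each element is closed before its parent, so the tags close in the reverse of the order in which they were opened.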






2.3.1.2 Empty Tags
Sometimes an element that could contain text happens not to. There may be many reasons for this – the attributes of the element may contain all the necessary information, or the element may be required if the document is to be valid. These empty elements can be represented in two ways:

The Lord Of The Rings



The empty element can be included by placing an end tag immediately after the start tag. More simply, a tag containing the name of the element followed by a slash can be used.
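The two ways of writing an empty element are equivalent to a parser. The sketch below compares them in Python; the element name cover and its image attribute are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical empty element written in both permitted forms.
LONG_FORM = '<cover image="rings.png"></cover>'
SHORT_FORM = '<cover image="rings.png"/>'


def summarize(xml_text):
    """Return the name, attributes, text and child count of the root element."""
    el = ET.fromstring(xml_text)
    return (el.tag, dict(el.attrib), el.text, len(el))


print(summarize(LONG_FORM))
print(summarize(SHORT_FORM))
print(summarize(LONG_FORM) == summarize(SHORT_FORM))
```

Both forms yield an element with the same name and attributes, no text and no children.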

2.3.1.3 Characters in XML
When the XML Recommendation talks about characters, it means characters from the Unicode and ISO 10646 character sets. Until relatively recently most computing applications used a small set of characters, typically the 128 characters of the ASCII character set, which could be represented using seven bits. The ASCII character set, defined in ISO/IEC 646, only allowed users to enter those letters typically found in the English language.

In a multilingual world this is clearly an impractical limitation, and it led to the development of many alternative character sets. Web applications typically use ISO 8859, which uses 8 bits for each character and defines a number of alphabets. These include the standard Latin alphabet used as the default by most Web browsers. Unicode goes further and, as originally designed, uses two bytes to represent each character. This means that it includes 65,536 different characters, insufficient for Chinese but suitable for most uses. ISO 10646 extends the Unicode idea by using four bytes for each character, giving approximately 2 billion possible characters. Unicode is implemented as the default encoding in Microsoft Windows and the Java programming language, among others. But it clearly needs extending to access those extra characters, and has been. Version 2.1 of Unicode includes some facilities that give access to the ISO 10646 character set.

Using the four-byte form of ISO 10646 to represent ASCII data is highly inefficient: three of every four bytes are wasted. Even though computer memory and storage are extremely cheap today, such inefficiency is expensive if an application is handling gigabytes of data. Therefore applications use encoding schemes to store data more efficiently. Applications that process XML must support two of these: UTF-8 and UTF-16. UTF-8, for instance, uses a single byte for ASCII data and two or more bytes for other characters.
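The storage cost of each scheme can be measured directly. The sketch below uses Python's built-in codecs and takes UTF-32 as a stand-in for the fixed four-byte form of ISO 10646, which is an assumption made for illustration; the big-endian variants are used so that no byte order mark is added to the counts.

```python
ENCODINGS = ("utf-8", "utf-16-be", "utf-32-be")


def encoded_sizes(text):
    """Return the number of bytes the text occupies in each encoding."""
    return {name: len(text.encode(name)) for name in ENCODINGS}


# Illustrative characters: an ASCII letter, an accented Latin letter
# (U+00E9) and a Chinese character (U+4E2D).
for sample in ("A", "\u00e9", "\u4e2d"):
    print(repr(sample), encoded_sizes(sample))
```

An ASCII letter takes one byte in UTF-8 but four in the fixed-width form, which is the waste described above.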

Note: XML applications support extended character sets. These allow up to 2 billion different characters. When you develop using XML you can use any language and character set that you need to in your applications. You are not restricted to the English language or to the set of languages supported on a particular operating system.


It's worth noting that everything in an XML document that is not markup is considered to be character data. Markup[4] consists of:

start tag,

end tag,

empty tag,

entity reference,

character reference,

comments,

delimiters for CDATA sections,

document type declarations,

processing instructions,

XML declarations,

text declarations.

The final, important thing about characters is that some of them have special meaning or cannot be easily represented in your source text using a conventional keyboard. Most of the characters in ISO 10646 clearly fall into this category. Some mechanism is therefore required to permit the full range of characters to be included in documents. This is done through character references. To demonstrate the use of character references, I'll look at those characters that can have special meaning inside markup. Characters such as <, >, ', " are used as part of the markup of the document. If they're encountered by the parser inside an XML file, it assumes that they are control characters which have special meaning to it, and it then acts accordingly. The obvious example of this behavior is found in handling attributes. The following two examples would be illegal in XML:




In each case, the parser will assume that the content of the src attribute starts at the first apostrophe or set of quotation marks, and stops at the second. Attribute content following this point cannot be parsed since it is not valid XML.
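This behavior can be reproduced with a parser. The sketch below is an illustration only: the img element and the file names are hypothetical, standing in for any src attribute whose value contains the same kind of quote that delimits it.

```python
import xml.etree.ElementTree as ET

# Hypothetical attribute values containing their own delimiter.
BAD_SINGLE = "<img src='Fred's picture.png'/>"
BAD_DOUBLE = '<img src="the "best" picture.png"/>'


def parse_failure(xml_text):
    """Return the parser's error message, or None if the text is accepted."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return str(exc)
    return None


print(parse_failure(BAD_SINGLE))
print(parse_failure(BAD_DOUBLE))
```

In both cases the parser treats the second quote as the end of the attribute value and rejects the text that follows it.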

What happens when the file should legitimately contain < as part of its character data? The appropriate character reference is entered instead.[5] Table 2.3 shows the references which must be entered in an XML document if you want a particular character. Here's the previous example reworked to be valid XML:
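A minimal sketch of this kind of reworking in Python, assuming a hypothetical img element: the five references predefined by XML (&lt;, &gt;, &amp;, &apos; and &quot;) stand in for the special characters, and a numeric character reference such as &#233; stands in for a character that is hard to type.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical element: the apostrophe and the <, & and > characters are
# written as references so they are not mistaken for markup.
ESCAPED = "<img src='Fred&apos;s picture.png' alt='1 &lt; 2 &amp; 3 &gt; 2'/>"


def read_attributes(xml_text):
    """Parse the text and return the root element's attributes as a dict."""
    return dict(ET.fromstring(xml_text).attrib)


def read_text(xml_text):
    """Parse the text and return the root element's character data."""
    return ET.fromstring(xml_text).text


print(read_attributes(ESCAPED))
# A numeric character reference for U+00E9 (e with an acute accent).
print(read_text("<name>Caf&#233;</name>"))
# escape() from the standard library replaces &, < and > in character data.
print(escape("Fish & Chips < 5"))
```

The parser hands back the original characters, so the application never sees the references themselves.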



