hot info

Components of an XML

on Monday, 15 June 2009

Components of an XML Project
XML is nothing more than a way of adding structure to information, so you can use XML for almost any purpose; in that sense, there is no such thing as a typical XML project. XML can show up in technical publishing, networked games, spreadsheets, air traffic control, news publishing, blogging, or just about anything you can imagine that involves passing information from one system to another.

Still, many XML projects involve performing similar operations on XML information, even if the final result is different. The operations described in this section and illustrated in Figure 2.1 are not low-level libraries and tools, such as parsers, as important as those are, but high-level stages in the life cycle of an XML document:

Creation

Storage

Search

Archiving

Transformation

Rendering

Transport


Figure 2.1. Components of an XML project





To illustrate these stages, this section describes a hypothetical production system for a retail catalog. The designers chose XML because the company needs to publish the catalog in print for mailing to telephone customers and online for use by Web customers. (For more information on single-source publishing, see Section 3.1.1.)

For creation, the developers build a Web application with forms that authors can use to enter information directly into a database. In this system, the authors deal with only a tiny amount of XML. Photographers upload digital photographs of products, filling in metadata fields with basic information about each picture: date, product number, and so on. Writers write product information in various fields of a Web form, including product number, colors, and styles, and a short description, which allows a few simple types of in-line markup. When it is time to generate the complete XML master catalog, a script issues SQL database queries to collect information and then assembles it into an XML document, matching photos with descriptions and extracting current pricing and shipping information from other data tables.
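The assembly script described above might look roughly like the following sketch. It is only an illustration: the table names (products, photos), the column names, and the element names (catalog, product, photo) are all assumptions, SQLite stands in for whatever database the catalog company uses, and the description is treated as plain text rather than in-line markup.

```python
import sqlite3
import xml.etree.ElementTree as ET


def build_catalog(conn):
    """Assemble a <catalog> element from the hypothetical products and photos tables."""
    catalog = ET.Element("catalog")
    rows = conn.execute(
        "SELECT p.product_number, p.description, p.price, ph.filename "
        "FROM products AS p LEFT JOIN photos AS ph "
        "ON ph.product_number = p.product_number "
        "ORDER BY p.product_number"
    )
    for number, description, price, filename in rows:
        product = ET.SubElement(catalog, "product", {"number": number})
        ET.SubElement(product, "description").text = description
        ET.SubElement(product, "price").text = f"{price:.2f}"
        # Products without an uploaded photograph simply get no <photo> element.
        if filename is not None:
            ET.SubElement(product, "photo", {"href": filename})
    return catalog


def demo_database():
    """Create a tiny in-memory database with made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE products (product_number TEXT PRIMARY KEY, description TEXT, price REAL);
        CREATE TABLE photos (product_number TEXT, filename TEXT);
        INSERT INTO products VALUES ('A100', 'Wool sweater', 59.5);
        INSERT INTO products VALUES ('B200', 'Canvas tote', 18.0);
        INSERT INTO photos VALUES ('A100', 'a100-front.jpg');
    """)
    return conn
```

Calling ET.tostring(build_catalog(demo_database()), encoding="unicode") would then serialize the master catalog for the later stages.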

The storage is the relational database, which holds the product photographs as binary large objects (BLOBs), and puts textual information directly into relational tables. The database also contains other product information, such as price, size, and weight, all keyed on the product bar code.

Search takes place through the standard database query language [SQL]. For information already in database tables, a specialized XML search engine is not needed. A separate Web application translates user search criteria into SQL database queries, runs them against the database, and then formats the results as HTML pages with links into the catalog.
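As a rough sketch of that search application, the code below turns a dictionary of user criteria into a parameterized SQL query and formats the result rows as an HTML list. The searchable fields, the products table layout, and the /catalog/ link pattern are hypothetical; the one design point worth keeping is that user input is passed as query parameters and never pasted into the SQL string.

```python
import sqlite3
from html import escape

# Hypothetical whitelist: search field -> SQL condition with a placeholder.
SEARCHABLE = {"color": "color = ?", "style": "style = ?", "max_price": "price <= ?"}


def build_query(criteria):
    """Return (sql, params) for a dict of user search criteria."""
    clauses, params = [], []
    for field, value in criteria.items():
        if field not in SEARCHABLE:
            raise ValueError(f"unsupported search field: {field}")
        clauses.append(SEARCHABLE[field])
        params.append(value)
    sql = "SELECT product_number, description FROM products"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql + " ORDER BY product_number", params


def results_to_html(rows):
    """Format (product_number, description) rows as a list of catalog links."""
    items = "".join(
        f'<li><a href="/catalog/{escape(number)}">{escape(description)}</a></li>'
        for number, description in rows
    )
    return f"<ul>{items}</ul>"
```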

Because the catalog is updated and published relatively infrequently, XSLT is adequate for rendering, despite its performance problems in high-speed environments (see Section 8.3 for more information). A series of XSLT templates generates HTML and print renditions from the XML master file exported from the database during the creation stage.

Transport takes place in various ways, depending on the catalog media. For the printed version, the transport is nothing more than regular mail; for the Web version, the transport is a Web server using HTTP. The catalog company also sends the raw XML version of the catalog to sales partners through a secure FTP server so that they can customize it and then generate their own formatted output, using their own XML systems.

As XML systems go, this one is fairly straightforward. Authors can use simple forms-based interfaces rather than unfamiliar XML editing tools, and searching and storage use standard relational database facilities. This kind of approach does not always work, however, particularly for less structured information, such as reports or news stories. The following subsections discuss the various approaches people can take for each stage.

2.1.1. Creation
Normally, an XML system starts with an XML document, which has to come from somewhere. Two common ways of creating the starting XML document are to

Have authors create it directly, using a text editor or a custom XML authoring tool

Have software assemble it automatically from other sources, such as database tables, non-XML data files, or even other XML documents (see Section 2.1.4).

The second approach will not always work, but where it does, as in the retail catalog example earlier, it will be significantly cheaper and easier than using XML authoring tools. Automatic software assembly generally works for data-oriented XML (see Chapter 4) but not for document-oriented XML (see Chapter 3). Direct XML authoring allows for richer information and works well with document-oriented XML, but it comes with higher ongoing costs for training, technical support, and staff time, as well as a higher probability of resistance from users.

Larger XML projects sometimes combine the two approaches. Authors write basic in-line content and possibly skeleton structures in XML; then automated processes flesh out the document with automatically generated boilerplate text, tables, figures, and other data. A project producing maintenance manuals for large machinery might follow this approach, using the database to hold part numbers, standard procedures, warnings and cautions, diagrams, and other reusable information. Changes to the database will automatically appear in the XML document without requiring human editing.
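A minimal sketch of the combined approach, assuming authors mark reusable items with an empty partref element and that a dictionary stands in for the parts database (both are inventions for this example):

```python
import xml.etree.ElementTree as ET

# Stand-in for the parts database (hypothetical data).
PARTS = {"P-17": "Hydraulic pump seal", "P-42": "Main drive belt"}


def expand_part_refs(root, parts):
    """Fill each empty <partref number="..."/> with the part name, in place."""
    for ref in root.iter("partref"):
        number = ref.get("number")
        if number not in parts:
            # Fail loudly rather than publish a manual with a dangling reference.
            raise KeyError(f"unknown part number: {number}")
        ref.text = parts[number]
```

Because the author's file stores only the part number, rerunning the expansion after a database change picks up the new name without anyone editing the XML.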

2.1.2. Storage and Archiving
Now that you have created an XML document, either by hand or through scripts, you might need somewhere to keep it. That is not always the case, though; in XML networking (see Chapter 5), your system might simply generate the XML, blast it out over the network, and then forget about it. Even if you need to keep the XML around, simply saving it to the hard drive or LAN, the same way you would with a spreadsheet or word-processing file, might be sufficient. You can get a little fancier by keeping the XML in a revision-control system, such as Concurrent Versions System (CVS) or Microsoft's Visual SourceSafe, without having to buy or build any specialized XML software.

You cannot always get away with the easy solutions, however. You might need to allow several authors to work on different parts of the same XML document simultaneously, be able to maintain snapshots of hundreds of documents in a consistent state, or automate workflow through the authoring and editorial processes. For the first requirement, vendors sell custom XML databases that can manage each element in a document as if it were a separate file, but these databases have not had good results in the field. More typically, people will store XML documents in relational databases, either by decomposing them into data tables or by storing them as BLOBs or character large objects (CLOBs).

The major database vendors, such as Oracle and IBM, provide special support for working with XML in their products. Normally, even large projects can avoid the need for simultaneous authoring by dividing documents into small files. For example, a system could store an XML manual as 500 separate files, one for each task, rather than as a single, large file; that way, it is easy for different authors to work on different tasks without conflict. Larger repositories will almost always require some search ability: see Section 2.1.3 and Chapter 6 for more information.

Archiving is a special case of storage. One of the major selling points of XML is future proofing: In 50 or 100 years, it may be difficult to read proprietary binary formats, but XML is designed to be easily accessible. Archiving may have special requirements, such as optical rather than magnetic media, and may also impose additional requirements on XML information, such as encryption, digital signatures, and metadata about when, how, why, and by whom each document was created. Archives typically also require an ability to search.

2.1.3. Search and Retrieval
Chapter 6 deals with the complex topic of XML searching. When an XML project contains dozens, hundreds, or even thousands of individual XML documents, authors and others working on the project will require some form of search and retrieval to find information. Following are several common approaches, from least to most complex, for searching XML documents:

Batch searching

Full-text indexing

Database metadata

Structural indexing

With batch searching, a program reads all the documents for every search, similar to the way the Unix grep command searches plaintext files. Batch searches can be relatively slow, taking anywhere from a few seconds to a few hours or more; however, because there is no preindexing, there are no built-in limitations about the kinds of searches people can make. Batch searching is most appropriate when searches are rare but possibly complex and delays are not a problem.
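A batch search can be sketched in a few lines, since there is no index to build or maintain. The version below is an assumption-laden illustration: it looks only at files ending in .xml, matches a case-insensitive substring in the text content (not the markup), and silently skips files that are not well-formed.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def batch_search(directory, term):
    """Parse every .xml file under directory; return names of files whose text contains term."""
    hits = []
    needle = term.lower()
    for path in sorted(Path(directory).rglob("*.xml")):
        try:
            root = ET.parse(path).getroot()
        except ET.ParseError:
            continue  # skip files that are not well-formed XML
        # Join the text of all elements; tags and attributes are ignored.
        text = " ".join(root.itertext()).lower()
        if needle in text:
            hits.append(path.name)
    return hits
```

Every call reparses every file, which is exactly why this approach gets slow as the collection grows.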

Full-text searching uses pregenerated indexes to speed up searching but simply treats XML documents like any other text documents, filtering out the markup and indexing the content and, possibly, attribute values. Although full-text searching is a blunt tool, it can be surprisingly effective, and many well-tested free and commercial indexing and retrieval tools are available off the shelf. Some full-text search engines allow labeled fields, so it is possible to add the name of the element containing text to the index, providing some simple structural search ability. Full-text indexing is most appropriate when content consists mainly of prose, such as novels, Web logs, or newspaper stories.
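To make the labeled-field idea concrete, here is a toy inverted index, not a substitute for an off-the-shelf engine. The class name and its methods are invented for this sketch; it indexes each word once globally and once under the name of the element that directly contains it.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

WORD = re.compile(r"\w+")


class FullTextIndex:
    """Toy inverted index over XML text, with element names as labeled fields."""

    def __init__(self):
        self.words = defaultdict(set)   # word -> document ids
        self.fields = defaultdict(set)  # (element name, word) -> document ids

    def add(self, doc_id, xml_text):
        root = ET.fromstring(xml_text)
        for element in root.iter():
            # Text directly inside this element: its own text plus its children's tails.
            chunks = [element.text or ""] + [child.tail or "" for child in element]
            for word in WORD.findall(" ".join(chunks).lower()):
                self.words[word].add(doc_id)
                self.fields[(element.tag, word)].add(doc_id)

    def search(self, word, field=None):
        """Return ids of documents containing word, optionally only inside a given element."""
        key = word.lower()
        if field is None:
            return set(self.words.get(key, ()))
        return set(self.fields.get((field, key), ()))
```

Element names never reach the word index, so a search for a tag name such as "story" finds nothing unless the word also occurs in the content.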

Database metadata is a useful approach for finding XML documents based on preselected criteria. When a user checks an XML document into the system, the system scans it once, extracting predetermined information, such as names, organizations, country codes, dates, headlines, and so on, and stores that information in regular relational database tables. The system is then able to find XML documents using normal SQL database queries. This approach is most appropriate for documents that consist mainly of highly structured information, such as lists, tables, or fields, or for documents that include explicit metadata, such as news stories.
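The check-in step might be sketched as follows. The doc_metadata table, its columns, and the headline, country, and date element names are assumptions made for the example, and SQLite stands in for the project's relational database.

```python
import sqlite3
import xml.etree.ElementTree as ET


def open_metadata_db():
    """Create an in-memory database with a hypothetical metadata table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE doc_metadata ("
        "doc_id TEXT PRIMARY KEY, headline TEXT, country TEXT, published TEXT)"
    )
    return conn


def check_in(conn, doc_id, xml_text):
    """Scan the document once and store the predetermined fields in the table."""
    root = ET.fromstring(xml_text)
    conn.execute(
        "INSERT OR REPLACE INTO doc_metadata VALUES (?, ?, ?, ?)",
        (doc_id, root.findtext("headline"), root.findtext("country"), root.findtext("date")),
    )
```

After check-in, finding documents is ordinary SQL, for example SELECT doc_id FROM doc_metadata WHERE country = ? AND published >= ?, and the XML itself is never reparsed. The cost is that only the fields chosen in advance are searchable.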

Like full-text searching, structural searching uses indexes to speed up operations. However, instead of indexing only the text, the software also indexes the XML structure that goes with it. As a result, it is possible to formulate complex queries combining XML structure with text content. Both the XML Path Language [XPath] and the forthcoming XML Query Language [XQuery] can take advantage of structural search engines when they are available.
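The kind of query this enables, structure combined with text, can be illustrated with the limited XPath subset in Python's standard library. Note the swap: this sketch walks an in-memory tree rather than consulting a structural index, so it shows the shape of the query, not the speed of an indexed engine. The manual, section, title, and para element names are invented.

```python
import xml.etree.ElementTree as ET

MANUAL = """
<manual>
  <section><title>Safety</title><para>Wear gloves.</para><para>Disconnect power.</para></section>
  <section><title>Setup</title><para>Wear gloves when unpacking.</para></section>
</manual>
"""


def paragraphs_in_section(root, section_title, word):
    """Return paragraphs containing word, but only within the section with the given title."""
    # Structural part: restrict to <para> children of the matching <section>.
    # (The title is assumed not to contain a single quote.)
    path = f".//section[title='{section_title}']/para"
    # Text part: case-insensitive substring match on the paragraph text.
    return [p.text for p in root.findall(path) if word.lower() in (p.text or "").lower()]
```

A plain full-text search for "gloves" would return both sections; the structural condition narrows the result to one.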

2.1.4. Transformation
Many XML systems include a transformation pipeline. A preliminary, raw XML document starts out at one end of the system and moves down the pipeline like a virtual assembly line, going through various stages of transformation until a finished XML document emerges from the other end. Transformations may involve rearranging or removing information that is already in the XML document, adding information from external sources, merging several smaller XML documents into one larger one (such as assembling chapters into a book), or splitting a large XML document into several smaller ones (such as breaking a book up into smaller Web pages).
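One simple way to picture such a pipeline in custom code is as a list of functions, each taking an element tree and returning one. The stage names, the status="draft" convention, and the book and chapter elements below are all hypothetical.

```python
import xml.etree.ElementTree as ET


def merge_chapters(chapter_sources):
    """Merge several small chapter documents into one <book>."""
    book = ET.Element("book")
    for source in chapter_sources:
        book.append(ET.fromstring(source))
    return book


def drop_drafts(root):
    """Remove any element marked status="draft"."""
    # Take a snapshot of the elements first so removal does not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.get("status") == "draft":
                parent.remove(child)
    return root


def number_chapters(root):
    """Add a number attribute to each remaining chapter."""
    for n, chapter in enumerate(root.findall("chapter"), start=1):
        chapter.set("number", str(n))
    return root


def run_pipeline(root, stages):
    """Pass the document through each stage in order."""
    for stage in stages:
        root = stage(root)
    return root
```

Keeping each stage small and independent makes it easy to reorder stages or to prototype one of them in XSLT before rewriting it.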

Transformation components typically go through at least two iterations. First, developers prototype the transformations by using simple, template-based tools, such as XSLT processors; then, to improve efficiency and reduce memory requirements, developers rewrite the transformations in custom source code. In some cases, if speed is not essential and memory restrictions are not a problem, a system will continue to use XSLT right into production. One advantage of custom coding, however, is that it is easier to include information from non-XML sources, such as relational databases.

Typically, transformation tools require more custom coding than storage or searching, but they are not overly complex or expensive. See Section 8.3.4 for more information.

2.1.5. Rendering
Rendering is a specialized form of transformation (Section 2.1.4) intended for human consumption rather than machine use. Rendering is also a complex topic, and Chapter 3 examines it in more detail.

In practice, rendering components nearly always convert XML documents to HTML for online display and PDF or PostScript for printing. Normally, it is necessary to write separate code for rendering print and HTML, as the primitives are entirely different: HTML documents have tables, paragraphs, and links, whereas printed documents are usually formatted as a series of nested boxes on the page.
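For the HTML side, a renderer written as custom code (rather than XSLT) can be as small as the sketch below. The product, name, price, and spec elements are hypothetical, and a print renderer would need entirely separate code built around page boxes rather than HTML elements.

```python
import xml.etree.ElementTree as ET
from html import escape


def render_product_html(product):
    """Render one hypothetical <product> element as an HTML fragment."""
    name = escape(product.findtext("name", default=""))
    price = escape(product.findtext("price", default=""))
    # Each <spec label="..."> becomes one table row.
    rows = "".join(
        f"<tr><th>{escape(spec.get('label', ''))}</th><td>{escape(spec.text or '')}</td></tr>"
        for spec in product.findall("spec")
    )
    return f"<h2>{name}</h2><p>Price: {price}</p><table>{rows}</table>"
```

The parser decodes character references in the XML, so the content is escaped again on the way out to keep the generated HTML well-formed.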

Online rendering has some special possibilities. The simplest, most portable approach is to convert the XML to HTML in advance, but some Web sites store only the XML and generate HTML on the fly when requested; modern browsers can handle XML directly without a conversion step, a just-in-time rendering approach that allows the user to set preferences and customize the appearance or content of the rendered document. Both of the major browsers (Mozilla and Microsoft's Internet Explorer) support client-side XML rendering using XSLT or CSS, but very few sites take advantage of this capability.

2.1.6. Transport
The final major component is transport: Once the information is ready, it needs to get to the end user. Chapter 5 is devoted to XML networking and deals extensively with transport issues.

Very simple forms of transport include burning information onto a CD-ROM and mailing it, sending it as an e-mail attachment, or making it available through an FTP or HTTP server. More sophisticated projects may require scheduled, guaranteed delivery, publish-subscribe, and other features supported by advanced XML-related networking specifications.

For some projects, transport is the most important part. For example, financial information services, such as Bloomberg and Reuters, make their money from getting information to a customer as quickly as possible, and wire services add extensive metadata to their news stories to help customers process them automatically. The Web log movement is built almost entirely around the ability of RSS to make transport simple. Such specifications as NewsML [NEWSML], RSS [RSS], and Internet Content Exchange [ICE] deal with transport in great detail.
