learn xml script: Formatting XML

Formatting XML

Posted by saeful uyun on Monday, 15 June 2009

Formatting and Production
Once the higher-level issues are resolved and the authoring system is installed, it is time to turn to the nitty-gritty details of formatting. In many cases, there will be no problems at all; XML, together with transformation and formatting software, does a good job of handling the typical, routine tasks of formatting, particularly if the XML master document contains a single text flow continuing over several pages, such as a technical manual.

Unfortunately, things are sometimes not so simple. This section examines some of the physical aspects of printed documentation that can cause problems for XML publishing. Sometimes, these problems will not surface until late in a project, when there is not enough time or money left to fix them properly; learning to anticipate them can make a big difference to an XML publishing project's chance of success.

3.3.1. Change Markup
Technical publications often include various kinds of change information to make it easy for users to find differences between versions, and encoding this kind of information in XML markup probably represents the single biggest difficulty in XML publishing. The final change information in a printed text can take many forms:

Vertical bars in the margin beside changed text

Separate textual descriptions of changes made and the reasons for them

Different font combinations to show text removed and added

Differently formatted section headings to show sections that contain changes

Even finding the differences between two versions of the same XML document in the first place can be a problem, although more open-source and commercial software is becoming available. Some XML differencing algorithms scale badly with large documents, so it is worth load testing your intended differencing software early in any project. Assuming that you do have some mechanism in place for locating changes in an XML document (even manual identification by the author), this section examines some of the problems with inserting the change markup into XML documents for publication.

3.3.1.1 Markup Issues
Change markup in XML documents causes publishing difficulties on several levels. Most basically, changes do not tend to fit neatly into XML markup trees. Consider the following:

[...] There are 203 authorized service depots in Southeast
Asia.

The authorized service depots all provide [...]

A change in the company's technical-support structure could cause the content to change, as follows:

[...] There are 55 service partners in the Asia-Pacific
region.

The service partners all provide [...]

What kind of markup should a system add to this document to show where the changes are? The change begins in one paragraph and ends in the next one, but XML does not allow an element to start and end inside different parent elements, so it is not possible to tag the entire changed sequence as a normal XML element.

The first option is to put empty tags at the start and end of the changed text:

[...] There are 55 service partners in
the Asia-Pacific region.

The service partners all provide [...]
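A minimal sketch of the empty-tag approach, assuming hypothetical `<change-start/>` and `<change-end/>` marker elements inside a `<section>` wrapper (none of these names come from the text). The Python below uses the standard-library ElementTree module and hints at why this markup is awkward to process: recovering the changed span means walking the text in document order rather than selecting a single element.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the element names are assumptions for illustration.
DOC = """<section>
<p>[...] There are <change-start/>55 service partners in
the Asia-Pacific region.</p>
<p>The service partners<change-end/> all provide [...]</p>
</section>"""


def walk(elem):
    """Yield start-tag events and text chunks in document order."""
    yield ("start", elem)
    if elem.text:
        yield ("text", elem.text)
    for child in elem:
        yield from walk(child)
        if child.tail:
            yield ("text", child.tail)


def changed_text(root):
    """Collect the text lying between the two empty marker elements."""
    inside = False
    pieces = []
    for kind, value in walk(root):
        if kind == "start":
            if value.tag == "change-start":
                inside = True
            elif value.tag == "change-end":
                inside = False
        elif inside:
            pieces.append(value)
    # Normalize whitespace so line breaks in the source do not matter.
    return " ".join("".join(pieces).split())


root = ET.fromstring(DOC)
print(changed_text(root))
```

The span crosses a paragraph boundary, so no single element in the tree corresponds to it; the script has to track an "inside the change" state by hand.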

A variation on the same theme is the use of processing instructions, to avoid contaminating the main element tree:

[...] There are 55 service partners in
the Asia-Pacific region.

The service partners all provide [...]
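As a rough illustration of the processing-instruction variant, the sketch below assumes hypothetical `<?change start?>` and `<?change end?>` instructions (the target name `change` is an assumption). It shows what "not contaminating the element tree" means in practice: ElementTree's default parser drops the instructions entirely, while a parser configured to keep them (Python 3.8 or later) can still find the markers.

```python
import xml.etree.ElementTree as ET

# Hypothetical processing instructions; the target name is an assumption.
DOC = """<section>
<p>[...] There are <?change start?>55 service partners in
the Asia-Pacific region.</p>
<p>The service partners<?change end?> all provide [...]</p>
</section>"""

# The default parser discards processing instructions, so the element
# tree looks the same as it would without any change markup.
plain = ET.fromstring(DOC)
plain_tags = [elem.tag for elem in plain.iter()]

# A parser that keeps processing instructions in the tree.
parser = ET.XMLParser(target=ET.TreeBuilder(insert_pis=True))
parser.feed(DOC)
with_pis = parser.close()
markers = [
    node.text for node in with_pis.iter()
    if node.tag is ET.ProcessingInstruction
]

print(plain_tags)
print(markers)
```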

Unfortunately, XML publishing tools normally apply formatting based on element boundaries, and many of those tools are not capable of recognizing a span from one empty element to another. Custom-written Perl or Python scripts or very clever and complicated XSLT templates can handle this kind of markup in many cases, but developing them will use up a disproportionately large amount of time on any project.

A second option is to split up the change so that it falls into element boundaries:

[...] There are 55 service partners in
the Asia-Pacific region.

The service partners all provide [...]
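A sketch of the split approach, assuming a hypothetical `<change>` element that is repeated once per paragraph (the element name and the `{+ ... +}` output convention are assumptions). Because each change now sits inside one parent, a formatter can act on element context alone:

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: one <change> element per paragraph.
DOC = """<section>
<p>[...] There are <change>55 service partners in the Asia-Pacific region.</change></p>
<p><change>The service partners</change> all provide [...]</p>
</section>"""


def render(paragraph):
    """Render one paragraph as plain text, flagging <change> content."""
    out = [paragraph.text or ""]
    for child in paragraph:
        text = child.text or ""
        if child.tag == "change":
            text = "{+" + text + "+}"
        out.append(text + (child.tail or ""))
    return "".join(out)


root = ET.fromstring(DOC)
rendered = [render(p) for p in root.iter("p")]
for line in rendered:
    print(line)
```

No state has to be carried from one paragraph to the next, which is what makes this form easy for ordinary formatting tools.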

This approach is much more practical for working with formatting tools, as they can apply normal formatting based on element context, but can cause awkward problems when additional information is attached to the change markup. Consider the following:

[...] There are 55
service partners in the Asia-Pacific region.

The service
partners all provide [...]

If the publishing system is also generating a list of changes or is adding marginal notes or footnotes describing the changes, the change will show up twice. If authors add change markup by hand, splitting a long change (say, over several paragraphs or steps) will be tedious and could lead to errors.
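The duplication problem can be sketched as follows, assuming a hypothetical `note` attribute that carries the change description (the attribute name is an assumption, and the description text is borrowed purely as sample data). A naive change list built from the split markup reports one logical change twice:

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the same description is repeated on both halves
# of the split change.
DOC = """<section>
<p>[...] There are <change note="Change to new service system">55
service partners in the Asia-Pacific region.</change></p>
<p><change note="Change to new service system">The service
partners</change> all provide [...]</p>
</section>"""

root = ET.fromstring(DOC)

# A naive change list emits one entry per <change> element.
change_list = [change.get("note") for change in root.iter("change")]
print(change_list)
```

Collapsing identical descriptions is not a safe fix in general, since two genuinely separate changes could share the same wording.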

A third option is to use a single change element placed higher up in the document tree:

[...] There are 55 service partners in the Asia-Pacific
region.

The service partners all provide [...]
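A sketch of the third option, assuming a hypothetical `<change>` element that wraps both paragraphs (the element names are assumptions). Comparing the tagged text with the text that actually changed shows the overreach:

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: a single <change> element around both paragraphs.
DOC = """<section>
<change>
<p>[...] There are 55 service partners in the Asia-Pacific
region.</p>
<p>The service partners all provide [...]</p>
</change>
</section>"""

root = ET.fromstring(DOC)
change = root.find("change")

# Everything inside <change>, with whitespace normalized.
tagged = " ".join("".join(change.itertext()).split())

# The span that really differs between the two versions.
actually_changed = (
    "55 service partners in the Asia-Pacific region. The service partners"
)

print(len(tagged), len(actually_changed))
```

With only two short paragraphs the difference is small, but the same markup around a long procedure would flag every step as changed.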

This approach has the advantage of avoiding duplicate change elements, but it can end up tagging far more text than has changed. An even coarser variation on this approach is to mark changes only on the element level, using attributes:

[...] There are
55 service partners in the Asia-Pacific region.

The service
partners all provide [...]

This approach can be useful for specialized applications, such as legal texts, with individually numbered paragraphs or subparagraphs. In the general case, however, it has the disadvantages of both including too much and duplicating change information.
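The attribute-based variant might look like the sketch below, which assumes a hypothetical `changed="yes"` attribute on each affected paragraph (the attribute name and value are assumptions). Both drawbacks are visible at once: whole paragraphs are flagged, and one logical change yields two flags.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: changes are recorded only as an attribute on <p>.
DOC = """<section>
<p changed="yes">[...] There are 55 service partners in the
Asia-Pacific region.</p>
<p changed="yes">The service partners all provide [...]</p>
</section>"""

root = ET.fromstring(DOC)
flagged = [p for p in root.iter("p") if p.get("changed") == "yes"]
print(len(flagged))
```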

The last solution is both the most elegant and the most brittle: Track changes outside of the document by using, for example, XPointer expressions to describe the start and end of each change:

Change to new service system

//step[@id="foo"]/p[2]/text()/point()[position()=247]

//step[@id="foo"]/p[3]/text()/point()[position()=20]

Although this approach allows tracking the change precisely, without duplication, it also requires an enormous amount of coordination between the out-of-line index and the authoring system; if they are not kept perfectly synchronized, the whole thing will fall apart.
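The fragility of out-of-line tracking can be sketched in Python. ElementTree does not implement XPointer, so this example substitutes a simplified stand-in: an ElementTree path plus a character offset. The `Point` and `ChangeRecord` classes, the sample document, and the offsets are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Point:
    path: str    # ElementTree path, a simplified stand-in for XPointer
    offset: int  # character offset into that element's text


@dataclass
class ChangeRecord:
    note: str
    start: Point
    end: Point


DOC = (
    '<step id="foo"><p>[...]</p>'
    "<p>[...] There are 55 service partners in the Asia-Pacific region.</p>"
    "<p>The service partners all provide [...]</p></step>"
)

record = ChangeRecord(
    note="Change to new service system",
    start=Point("p[2]", 16),
    end=Point("p[3]", 20),
)


def resolve(xml_text, change):
    """Return the changed text at the start and end of the recorded span."""
    root = ET.fromstring(xml_text)
    head = root.find(change.start.path).text[change.start.offset:]
    tail = root.find(change.end.path).text[:change.end.offset]
    return head, tail


in_sync = resolve(DOC, record)

# An author edits the paragraph but the external index is not updated.
edited = DOC.replace("There are", "Currently there are")
out_of_sync = resolve(edited, record)

print(in_sync)
print(out_of_sync)
```

After the edit, the stored offset lands in the middle of unchanged text, and nothing in the document itself signals that the index has gone stale.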

So far, this section has not mentioned the problem of marking changes in attribute values. Because attribute values cannot contain tags or processing instructions, marking changed attributes is always awkward; therefore, tracking changes externally might be the best option in this case. XML projects sometimes ensure that all information that needs to be marked as changed appears within elements.

3.3.1.2 Custom Publishing Issues
Although the tagging issues for change markup can be tricky, the more serious problems come with custom publishing. The change information in the final published document has to represent changes visible to the reader, not necessarily changes visible to the author.

In custom publishing, documents are typically assembled from text objects that have rules governing when they should or should not appear. For example, a warning may apply only to aircraft that use a certain engine or to reactors that use a certain cooling process. If the rule for the text object changes, it may suddenly appear in one customer's document or disappear from another's. Consider this warning:

Using the wrong grade of
lubricant can cause engine failure.

The warning is applicable for serial numbers 0050 to 0200 of a product; a custom publishing system will include it in publications for customers owning products with those serial numbers and omit it for all other customers. Now, an author makes a couple of small changes to the warning, revising both its wording and its range of effective serial numbers:

Using the wrong grade of
lubricant can cause valve damage.

The customer with product serial number 0150 should see essentially the same warning, with the phrase "engine failure" changed to "valve damage":

Using the wrong grade of lubricant can cause valve damage.

The customer with product serial number 0225 previously did not see the warning at all, so the whole thing requires change markup:

Using the wrong grade of lubricant can cause valve damage.

The customer with product serial number 0075 previously had the warning, but the change in effective serial numbers means that it will no longer appear in that version of the manual, so the change in this case is a deletion:

[Deleted]

Any descriptions of the changes also need to make sense from the reader's perspective. Many projects do not need to report changes with this level of accuracy, but when they do, it can end up being a major project in itself.
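The three cases above can be expressed as a small decision function. In this sketch the old range 0050 to 0200 comes from the text, while the new range 0100 to 0250 is an assumption chosen only because it is consistent with the three customers described; the `Revision` class and its field names are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Revision:
    text: str
    first_serial: int
    last_serial: int

    def applies_to(self, serial):
        return self.first_serial <= serial <= self.last_serial


def reader_visible_change(old, new, serial):
    """Classify the change as one customer's reader would see it."""
    had = old.applies_to(serial)
    has = new.applies_to(serial)
    if had and has:
        return "unchanged" if old.text == new.text else "modified"
    if has:
        return "added"
    if had:
        return "deleted"
    return "not applicable"


old = Revision(
    "Using the wrong grade of lubricant can cause engine failure.", 50, 200
)
# Assumed new effectivity range; the text does not state it.
new = Revision(
    "Using the wrong grade of lubricant can cause valve damage.", 100, 250
)

for serial in (150, 225, 75):
    print(serial, reader_visible_change(old, new, serial))
```

The same edit by the author produces three different change reports, which is why the comparison has to run on each customer's assembled document rather than on the source.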

3.3.2. Looseleaf Publishing
Another challenging problem for any automated formatting system is page-based updates, otherwise known as looseleaf publishing. Some kinds of technical documents, such as maintenance manuals and regulatory documents, need to be updated frequently. A standard practice in the paper-based publishing world is to distribute the entire document once, in a binder, and then to send new or updated pages at regular intervals, perhaps every month or a few times a year, with instructions on where to add, remove, or replace pages in the current manual. The instructions, called change pages, might look like this:

Remove pages 1-3 to 1-5, 1-7, 1-18, 1-26 to 1-44

Add pages 1-3 to 1-5, 1-7, 1-18, 1-26 to 1-48

To ensure that the publications do not fall out of sync, the publishers will periodically issue a list of effective pages (LEP) showing what pages should be in the binder. Normally, page numbering starts fresh in each section or chapter, so that a page inserted in one part of the publication will not force renumbering of all pages.
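One way to picture how change-page instructions and an LEP could be derived is to compare two formatted revisions page by page. The sketch below assumes each revision is available as a mapping from page label to some content fingerprint; the function name, the page labels, and the fingerprint values are all hypothetical.

```python
def change_pages(old_pages, new_pages):
    """Compare two revisions, each a dict of page label -> fingerprint.

    Returns (remove, add, lep): pages to pull from the binder, pages to
    insert, and the resulting list of effective pages.
    """
    remove = [
        page for page in old_pages
        if new_pages.get(page) != old_pages[page]
    ]
    add = [
        page for page in new_pages
        if old_pages.get(page) != new_pages[page]
    ]
    return remove, add, list(new_pages)


# Hypothetical fingerprints for two revisions of section 1.
old = {"1-1": "a41f", "1-2": "77c0", "1-3": "9e12"}
new = {"1-1": "a41f", "1-2": "d3b8", "1-3": "9e12", "1-4": "5f60"}

remove, add, lep = change_pages(old, new)
print("Remove pages", remove)
print("Add pages", add)
print("Effective pages", lep)
```

The comparison itself is simple; the hard part is producing a new revision whose unchanged pages really are identical to the old ones.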

The advantage of any automated publishing system is its ability to free authors from worrying about formatting details, such as pagination, but in this case, pagination matters quite a bit. For page-based updates, a publishing system has to be able to manage the following tasks:

Preserve page numbers from the last revision, whenever possible

Preserve page breaks from the last revision, whenever possible

Identify and print changed pages, with instructions for adding, removing, or replacing, as necessary

This process is not easy to automate, as a lot of judgment is involved: How much whitespace should the system allow at the bottom of a page before changing a page break, for example? Another problem is that formatting information, such as page breaks and numbering, has to be preserved somehow and kept in sync with the XML markup. One option is to design a system that will insert the information back into the XML document after each formatting run:

Airspace above FL180 is Class A, and is restricted to
aircraft flying IFR.

Some class E airspace may require a mode C
transponder.
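As a sketch of writing pagination back into the document, assume the formatter inserts a hypothetical empty `<pagebreak page="..."/>` element wherever a page ended in the last run (the element name, attribute name, and page labels are assumptions). A later run can then read back which page each paragraph started on:

```python
import xml.etree.ElementTree as ET

# Hypothetical markup recorded after a formatting run.
DOC = """<section>
<p>Airspace above FL180 is Class A, and is restricted to
aircraft flying IFR.</p>
<pagebreak page="2-8"/>
<p>Some class E airspace may require a mode C
transponder.</p>
</section>"""


def pages_for_paragraphs(root, first_page):
    """Pair each paragraph with the page it started on last time."""
    current = first_page
    result = []
    for child in root:
        if child.tag == "pagebreak":
            current = child.get("page")
        elif child.tag == "p":
            result.append((current, " ".join(child.text.split())))
    return result


root = ET.fromstring(DOC)
layout = pages_for_paragraphs(root, first_page="2-7")
for page, text in layout:
    print(page, text)
```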

As an alternative, a system could store pagination information externally as a set of pointers into the XML document. In that case, however, the authoring system will need to be able to update the pointers as authors make changes to the XML document, so a fairly elaborate technical infrastructure will be required.

Unfortunately, this problem has no simple, technically elegant solution. It blurs the boundary between content and presentation, a boundary that is usually very important for XML work. The best anyone can do is identify the requirement early and, once again, allow a lot of time and money for meeting it. In time, page-based updates will disappear as publishers distribute more and more information electronically; it is generally easier to redistribute an entire electronic document rather than only changed pages. Also, if sections are small, it is sometimes easier to redistribute an entire changed section rather than individual pages. (XML publishing systems will manage that task much more easily.) Note that documents that use page-based updates generally also require change markup, described in Section 3.3.1.

3.3.3. Multiple Text Flows
A text flow is a single sequence of text meant to be read from start to end. In a simple publication, all the text flows occur in sequence: for example, an introduction, followed by several chapters, followed by several appendixes. Multiple text flows in sequence are not much more difficult to work with than a single text flow.

Some types of publications, however, take advantage of both dimensions of the page to present text flows in parallel. One obvious example is a newspaper: Several stories can appear together on the same page, and some stories can continue on other pages, wherever space is available around paid advertisements.

Automating the layout of a newspaper from an XML master document would be a difficult task. Fitting stories together on a newspaper page is a bit like a jigsaw puzzle, except that it involves answering hundreds of subjective questions as well: Editors have to decide what stories are important, and marketable, enough to appear on the front page and, in a broadsheet, above the fold. A certain amount of variety in the story selection is needed: Unless something important had occurred, a newspaper editor would not want too many stories about the same person or event to appear together at the front, even if they would otherwise be the highest-priority stories. During an election, the newspaper editor may want to be careful not to appear to be biased by giving one candidate a disproportionately large amount of front-page coverage. (Or, on the other hand, the editor may indeed want to favor one candidate that way.)

Off the front page, the paper is, of course, divided into sections, and related stories tend to be grouped together. Advertising pays a big part of the newspaper's expenses, so ad placement is critical, and the editor also has to watch for conflicts; for example, the lawn-care company ad must not appear too close to the story on the danger of pesticides. Visual appearance is also important; some stories have pictures attached, and the stories must be arranged so that the pictures are spread out evenly among the pages. Without a lot of care, the newspaper could end up with five pictures on one page and solid text on the next.

Can all of this decision making and design be automated, with or without XML? Fortunately for the job security of newspaper editors, it does not appear so. At best, an XML-based publishing system can chip around the edges by adding metadata to each story, including its priority and subject codes, so that the computer system can help the editor find and organize the stories more easily.
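A small sketch of the metadata idea, assuming hypothetical `priority` and `subject` attributes on each `<story>` element (the attribute names, subject codes, and placeholder headlines are all invented for illustration). The code only sorts and groups; the layout decisions stay with the editor.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical story metadata; lower priority numbers rank higher.
DOC = """<stories>
<story priority="2" subject="sports"><headline>Story A</headline></story>
<story priority="1" subject="politics"><headline>Story B</headline></story>
<story priority="3" subject="politics"><headline>Story C</headline></story>
</stories>"""

root = ET.fromstring(DOC)

# Rank stories so the editor sees the highest-priority ones first.
ranked = sorted(root.iter("story"), key=lambda s: int(s.get("priority")))
ranked_headlines = [s.findtext("headline") for s in ranked]

# Group headlines by subject code to help spot clusters of related stories.
by_subject = defaultdict(list)
for story in ranked:
    by_subject[story.get("subject")].append(story.findtext("headline"))

print(ranked_headlines)
print(dict(by_subject))
```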

The newspaper is an extreme example, but the same problems arise in other, more routine kinds of publications. Footnotes, for example, are a separate kind of text flow, but one that many automated formatting systems, such as TeX, handle fairly well. Tables and illustrations are a little more difficult, as they need to be placed close to the text that references them without creating too many widow and orphan lines on the page, and sidebars make the problem a little more difficult yet. Automated systems are not yet ready to handle newspaper or glossy magazine layout at all. They can handle footnotes, sidebars, tables, and illustrations, but many publishers find the result a little sloppy and still employ human layout artists working with interactive programs, such as Adobe FrameMaker, rather than fully automated publishing systems.

So right now, automated formatting systems can handle some kinds of multiple text flows well enough for technical publications but not well enough for, say, glossy magazines or advertising material; those still require the services of human layout designers. In those cases, XML is most useful for the content of individual text flows (marking paragraphs, special text, and so on) rather than for the document as a whole.

In the future, this problem will solve itself as more and more publishing moves online. Most attempts to do newspaper- or magazine-style layout online look horrible and are awkward to use on current computers. Instead, a typical magazine or newspaper Web site consists of a list of headlines and, possibly, summaries, with links to individual stories on their own pages. Computers will still probably not be smart enough to lay out advertising material or glossy magazines on their own, but they will be able to handle the bulk of
