The Advantages of XML Searching

on Wednesday, 17 June 2009

XML markup makes searching smarter by adding contextual information and makes it possible to correlate information from more than one document. Any serious Internet user is familiar with searches that do not work: The words are too common or have too many different meanings for any search engine to return useful results, or perhaps the information is spread among several pages. Solving these problems was one of the initial goals of XML's creators.

6.1.1. Context
Much of the time, full-text search engines do a good job, but they sometimes fall flat. Consider, for example, the difference between Bush the U.S. president and bush the shrub, or Washington the U.S. state and Washington the U.S. city. If you were trying to find information on bush pilots flying out of Washington State, you might try the search "bush pilot washington." In late 2003, Google's first ten results were as follows:

A site selling a book about a Canadian bush pilot (no connection to Washington)

Two newspaper stories about the U.S. Navy naming an aircraft carrier after President Bush Sr.

A 2002 USA Today story about a small plane violating airspace near President Bush Jr.

A 2000 Washington Post story about President Bush Jr.'s service in the Texas Air National Guard

A 2001 Pravda story critical of President Bush after the midair collision between a Chinese fighter jet and a U.S. surveillance plane

The Amazon.com page for a biography of an Alaskan bush pilot, published by University of Washington Press

Two pages from FlyRod & Reel magazine: one stating that it has no listings for Washington and another listing angling retailers in the state

A 2003 news story from the Washington Times about a U.S. Navy pilot being held by the Iraqi government

I chose Google for this example precisely because it is a very good full-text search engine: It infers the relevance of information on a Web page not only from the text on the page but also from the other pages that link to it, the pages that link to those pages, and so on. With this difficult example, the first slightly relevant result is the twenty-third, which mentions a bush pilot who did fly once in Washington State; after that, the matches revert mainly to politics.

An experienced search-engine user could work around the problem by adding more words to the query. For example, bush pilots in Washington State have to deal with a lot of mountains, and the query string "bush pilot washington mountains" returns fewer political hits. Even better, a search string that contains specific aircraft types used in bush plane flying, such as "bush pilot washington cessna 180," returns almost all relevant matches. Most of the population is not that adept with search-engine query strings, however, and would likely give up on the whole thing; furthermore, these more specific queries would miss pages that do not happen to mention mountains or Cessna 180 airplanes.

Although homographs, such as Bush and bush, can make full-text searches difficult, an even trickier problem comes when the search results depend more on context than on the individual words. For example, consider trying to find Web pages that discuss the history of the word sex, without hitting thousands of pornography sites. Unless the decades-old dream of full machine artificial intelligence finally shakes off its dust and comes true, these are searches that will continue to flummox traditional full-text search engines.

As long as artificial intelligence remains a distant dream, we need to concentrate on getting plain old human intelligence into our XML documents, and that is precisely what XML-based markup languages do. The following News Industry Text Format [NITF] fragment shows how news providers can tag articles to avoid confusing search engines.

Today in Seattle, Washington, President Bush opened a new museum.

The markup represents added human intelligence from the news reporter or editor: Washington represents a state, not a city, and Bush represents a person, not a shrub. The XML document contains not only the news story itself but also what the author knew about the news story. As this information survives all the way to the final document, search engines require no special artificial intelligence to use it.
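
As a rough illustration of how a search tool might use this kind of markup, the following Python sketch matches a term only when it appears inside an element of the wanted type. The element names (state, city, person, occupation), the sample fragments, and the helper functions are assumptions made for this example; they are not taken from the NITF specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagged fragments; the element names are illustrative only.
DOCS = {
    "museum": "<p>Today in Seattle, <state>Washington</state>, President "
              "<person>Bush</person> opened a new museum.</p>",
    "flying": "<p>A <occupation>bush pilot</occupation> landed near Spokane, "
              "<state>Washington</state>.</p>",
    "capital": "<p>Lobbyists gathered in <city>Washington</city> today.</p>",
}


def mentions(xml_text, tag, term):
    """Return True if `term` occurs inside an element named `tag`."""
    root = ET.fromstring(xml_text)
    term = term.lower()
    return any(term in "".join(el.itertext()).lower() for el in root.iter(tag))


def search(docs, tag, term):
    """Return the sorted keys of documents where `term` appears within `tag`."""
    return sorted(key for key, text in docs.items() if mentions(text, tag, term))


print(search(DOCS, "state", "washington"))
print(search(DOCS, "person", "bush"))
```

Under these assumptions, a query for Washington the state skips the fragment that tags Washington as a city, and a query for Bush the person skips the fragment about a bush pilot, without any guessing from the surrounding words.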

Similarly, several markup languages, including DocBook [DOCBOOK] and the Text Encoding Initiative [TEI], define markup for talking about the history of words. Following is a DocBook example:

The word <wordasword>sex</wordasword> has gradually come
to mean not only gender, but the physical act of procreation, and,
eventually, all physically-intimate acts.

The wordasword element makes it clear that this paragraph discusses the word sex rather than the act: A search engine could easily pick out this contextual information to return exactly the pages the user wanted.
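
A minimal sketch of that idea in Python follows, assuming each page is available as a small DocBook-style fragment. The sample pages and the function names are hypothetical; only the wordasword element name comes from the text above.

```python
import xml.etree.ElementTree as ET

# Hypothetical pages: one discusses the word itself, the other merely uses it.
PAGES = {
    "etymology": "<para>The word <wordasword>sex</wordasword> has gradually "
                 "come to mean not only gender, but more.</para>",
    "other": "<para>This page uses the word sex in its ordinary sense.</para>",
}


def discusses_word(xml_text, word):
    """Return True if `word` is marked up as a word under discussion."""
    root = ET.fromstring(xml_text)
    return any((el.text or "").strip().lower() == word.lower()
               for el in root.iter("wordasword"))


def pages_about_word(pages, word):
    """Return the sorted names of pages that tag `word` with wordasword."""
    return sorted(name for name, text in pages.items()
                  if discusses_word(text, word))


print(pages_about_word(PAGES, "sex"))
```

Both sample pages contain the same string, so a plain full-text match cannot tell them apart; only the markup separates the page about the word from the page that simply uses it.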

XML markup is information that a document's creator knew but could not put in the main text. Because this information is available, search tools could potentially use it to return far more accurate results.

6.1.2. Correlation
In addition to basic context, XML markup also makes it possible to correlate information, matching instances of the same thing described in different ways and converting among different representations. To start with a simple example, consider only a few of the many ways documents might refer to British Prime Minister Tony Blair:

Tony Blair

Prime Minister Blair

the British Prime Minister

the prime minister

the P.M.

Mr. Blair

Blair

Given this variety, searching for information about Prime Minister Blair is difficult, and the work is made even worse by the fact that many of these phrases can apply to other people: for example, in a different context, "the prime minister" could refer to the prime minister of Australia, Canada, India, or many other countries.

How is it possible to define a Web in which people can easily search for information about Prime Minister Blair no matter how he is described? One possibility is always to normalize the name when it appears; unfortunately, if people are forced always to write "British Prime Minister Tony Blair," text will become awkward and unnatural, and even that might not be sufficient if another person named "Tony Blair" became British prime minister in the future.[1]

[1] This risk is not far-fetched: Consider the phrase "U.S. President George Bush."

Using XML markup, however, it is a simple matter to attach a unique identifier to every location that mentions Prime Minister Blair. As long as the identifier is well known, search engines can look for it rather than for the text it contains, as shown by the following markup fragments:

Tony Blair

Prime Minister Blair

the British Prime Minister

the Prime Minister

the P.M.

Mr. Blair

Blair



If, in the future, another person shared the same name and title, that person would have a different identifier, so search engines would not return false hits.
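
The identifier-based lookup can be sketched in Python as follows. The person element, its ref attribute, and the identifier values are all made up for this illustration; a real vocabulary would define its own element and attribute names.

```python
import xml.etree.ElementTree as ET

# Hypothetical article: the identifiers "person-0001" and "person-0002"
# are invented, as are the element and attribute names.
ARTICLE = """<article>
  <p><person ref="person-0001">Tony Blair</person> met reporters.</p>
  <p><person ref="person-0001">The Prime Minister</person> spoke first.</p>
  <p>Later, <person ref="person-0002">the prime minister</person>
     of another country replied.</p>
  <p><person ref="person-0001">Mr. Blair</person> then left.</p>
</article>"""


def find_mentions(xml_text, identifier):
    """Return the text of every person element carrying `identifier`."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter("person")
            if el.get("ref") == identifier]


print(find_mentions(ARTICLE, "person-0001"))
```

The search never inspects the wording of the mention, so "Mr. Blair" and "The Prime Minister" are found together, while the other country's prime minister is left out despite the near-identical phrase.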

Most things in the world (people, concepts, historical periods, and so on) do not yet have standard, universally accepted identifiers, so this is more than a markup problem. However, many identification schemes do exist, such as stock market symbols; publication identifiers, such as ISBNs; social security numbers; phone numbers; postal codes; and country, language, and currency codes. The following example shows the use of a stock ticker symbol for identifying a company:

Today, Microsoft announced a new software strategy.

Even limiting searching to current widely accepted identifiers, XML markup can make it significantly easier to correlate information described in various ways.

Now, consider a more difficult problem than simple identification: a search for world government spending programs that cost more than USD 1 billion. This kind of search is far beyond the capabilities of current full-text search engines, but XML markup can add hints to help future search engines do the work. The following example uses News Industry Text Format [NITF] markup once again:

Today, Congress approved an additional <money unit="USD">3 billion</money> in education spending.

The money element makes it clear that this article is referring to "3 billion" in currency, and the unit attribute indicates that the currency is U.S. dollars, using a code from the ISO 4217 standard for identifying world currencies [ISO-4217]. A search engine would still use full-text searching algorithms to determine that the article dealt with government spending, but then the tagging would help it determine the amount. Even more interestingly, the money element adds enough intelligence that the search engine could return correct results for pages using entirely different currencies; at the time of writing, the money in the next example is less than USD 1 billion, so it should not return a hit:

The Canadian federal government committed an additional <money unit="CAD">1.1 billion</money> in health-care spending.

Automatic currency conversion during searching, based on intelligent tagging like this, could be especially useful for financial institutions and others mining large international document repositories for information. Many other types of conversion and substitution are also possible with markup, including dates and times, language conversion, and recognition of synonyms, subsets, and supersets. (For example, a search for information about New England should return pages that mention Vermont.)
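
A minimal Python sketch of such a currency-aware search follows, assuming amounts are tagged with a money element and a unit attribute as described above. The exchange rates are placeholder values chosen for illustration, not real quotes, and the function names and the simple "number plus multiplier word" parsing are assumptions of this example.

```python
import xml.etree.ElementTree as ET

# Assumed exchange rates to USD, for illustration only (not real quotes).
RATES_TO_USD = {"USD": 1.0, "CAD": 0.75}
# Multiplier words this sketch knows how to read.
MULTIPLIERS = {"million": 1e6, "billion": 1e9}


def amount_in_usd(money_el):
    """Convert a money element such as '1.1 billion' to a USD amount."""
    parts = "".join(money_el.itertext()).split()
    value = float(parts[0])
    if len(parts) > 1:
        value *= MULTIPLIERS[parts[1].lower()]
    return value * RATES_TO_USD[money_el.get("unit")]


def exceeds(xml_text, threshold_usd):
    """Return True if any tagged amount is worth more than the threshold."""
    root = ET.fromstring(xml_text)
    return any(amount_in_usd(m) > threshold_usd for m in root.iter("money"))


US = ('<p>Today, Congress approved an additional '
      '<money unit="USD">3 billion</money> in education spending.</p>')
CA = ('<p>The Canadian federal government committed an additional '
      '<money unit="CAD">1.1 billion</money> in health-care spending.</p>')

print(exceeds(US, 1e9), exceeds(CA, 1e9))
```

With the assumed rate, the Canadian amount converts to less than USD 1 billion and is not reported as a hit, while the U.S. amount is. A real system would need current rates and far more robust parsing of amounts.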

Obviously, a lot of infrastructure is required before search engines can work like this. But it does provide an intriguing view of a future that markup might help to enable, where people and programs can search for and find the precise information they need, relying on intelligence encoded in XML markup.
