On Search: XML

Searching is all about text, and the proportion of all the world’s text that is XML keeps getting higher and higher. So if you’re going to do search, at some point you’re going to have to think about searching XML. Herewith a survey of some of the issues and problems (which, like other essays as we approach the end of On Search, contains opinions among the reportage).

Markup as Metadata · Back when people were doing the initial sales job for XML (and its predecessor SGML) one big part of the pitch was how this was going to make search so much better: “Searching in the context of a <title> or <product-name> or <metaphysical-paradigm> is going to be ever so much more precise and powerful than boring old brute-force full-text search.” And in principle, it should be.

But there are a couple of things wrong with this picture. First, people don’t want to compose queries and do flexible, powerful structure-sensitive searches. As I’ve written here previously, people in general want to type the minimal number of keystrokes into a search window and say Go, and have the system figure it out for them. Secondly, descriptive markup is a form of metadata, and there is no cheap metadata, and XML is no exception. If your text inventory is in Word or HTML, XMLifying it in any useful way is going to be very, very expensive. Which is to say, XML may not be cost-effective strictly in terms of making search run better.

On the other hand, if you’re a serious searcher (professional academic, sufferer from a rare disease, whatever) you’d probably be delighted to go the extra mile, first to get the markup into the text and then to learn how to use it to juice up your queries.

And there is no doubt that in principle all kinds of sophisticated search are enabled by XML that are just unthinkable without it; so let’s consider the “how” not the “why.”

But first, a side-trip into a couple of other issues that arise in searching XML.

Searching Around Markup · When you’re searching, there are times that you want to ignore all those tags. But it turns out to be pretty well impossible to predict when. It’s easiest to illustrate this by example. Suppose I wanted to search for the phrase “Joe is terrific”—here are four different kinds of XML markup that could get in the way:


Joe is <em>terrific</em>.
Joe is <em>not</em> terrific, said his wife.
Joe is <fnote>(according to his Mom)</fnote> terrific.
Joe is <note>terrific applause erupted</note> here!

The lesson is: don’t try to get too clever about combining markup with search. Probably the safest thing would be to count each start or end tag as a word, so that you’d get no phrase matches in that text above.
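Here is a minimal sketch of that "count each tag as a word" rule, assuming a naive regex tokenizer rather than a real XML parser. The names `tokens` and `phrase_match` are made up for illustration.

```python
import re

# A token is either one whole tag (start or end) or a run of word characters.
TOKEN = re.compile(r"<[^>]+>|\w+")

def tokens(xml_text):
    """Lower-cased tokens; every tag counts as a single word."""
    return [t.lower() for t in TOKEN.findall(xml_text)]

def phrase_match(xml_text, phrase):
    """True if the phrase's words appear consecutively, with tags counted as words."""
    haystack = tokens(xml_text)
    needle = phrase.lower().split()
    n = len(needle)
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))

samples = [
    "Joe is <em>terrific</em>.",
    "Joe is <em>not</em> terrific, said his wife.",
    "Joe is <fnote>(according to his Mom)</fnote> terrific.",
    "Joe is <note>terrific applause erupted</note> here!",
]

# One boolean per sample: does "Joe is terrific" match as a phrase?
results = [phrase_match(s, "Joe is terrific") for s in samples]
```

Because a tag always sits between "is" and "terrific" in those four samples, none of them produces a phrase match, while the untagged sentence does.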

The I18n Dividend · One of the big problems in search is figuring out what encoding internationalized text is in: ISO-this, Microsoft-that, UTF-the-other. Fortunately, if the text is in XML, this problem goes away, since an XML document knows what encoding it’s in, and you can run your search machinery in Unicode, the way search machinery should be run. So in this respect XML is significantly easier to search than ordinary text.
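A small sketch of that dividend, using Python's standard `xml.etree.ElementTree` parser: two byte streams in different encodings, each declaring its own encoding, come out as the same Unicode string. The sample documents are invented for the illustration.

```python
import xml.etree.ElementTree as ET

# The same word serialized in two encodings; each document declares its own.
latin1_doc = '<?xml version="1.0" encoding="ISO-8859-1"?><p>café</p>'.encode("iso-8859-1")
utf8_doc = '<?xml version="1.0" encoding="UTF-8"?><p>café</p>'.encode("utf-8")

def text_of(raw_bytes):
    """Parse raw bytes; the parser reads the encoding from the XML declaration."""
    return ET.fromstring(raw_bytes).text

latin1_text = text_of(latin1_doc)
utf8_text = text_of(utf8_doc)
```

The bytes on disk differ, but the search machinery downstream only ever sees Unicode strings and never has to guess at the encoding.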

How To Use That Structure? · To return to maybe the most interesting question: given that XML text contains a rich, nested, sequenced, labeled structure, how do we use that to add value to search?

The OED, Pat, and Element Sets · Back in 1987, I went to work at the University of Waterloo on the “New Oxford English Dictionary Project.” The text of the OED was marked up in what today we’d call XML, and the most successful piece of technology that came out of that project was a search engine called “Pat.” One of the interesting things about Pat was that it did a really good job of using the XML structure to drive searches. (There were some other interesting things about Pat. First, unlike every other successful search engine, its index wasn’t built around postings, but around a data structure called a “suffix array”; go look it up if you care. It was blindingly fast for phrase searching but I/O-bound for big datasets, and brutally difficult to update in place.)
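For readers who would rather not look it up, here is a toy suffix array in Python. This is not Pat's code and makes no claim about how Pat was built. It is a character-level sketch with a deliberately naive construction, and `build_suffix_array` and `find_phrase` are illustrative names.

```python
def build_suffix_array(text):
    """Start offsets of every suffix of text, ordered by the suffix itself.

    Naive construction (sorting full suffixes); fine for a demo, far too
    slow for a real corpus.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_phrase(text, sa, phrase):
    """Offsets where phrase occurs, located by two binary searches on sa."""
    n = len(phrase)

    def prefix(k):
        # First n characters of the k-th smallest suffix.
        return text[sa[k]:sa[k] + n]

    # Lower bound: first suffix whose prefix is >= phrase.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < phrase:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose prefix is > phrase.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= phrase:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

text = "to be or not to be"
sa = build_suffix_array(text)
hits = find_phrase(text, sa, "to be")
```

All suffixes that start with the phrase sit next to each other in the sorted array, which is why phrase lookup is a pair of binary searches. It also hints at why updating in place hurts: inserting text shifts offsets throughout the array.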

Pat got turned into a commercial product and served as the launching pad for Open Text, a company I co-founded that’s still there and very successful, although today search is just a small part of their product’s feature set, and the (very good) Open Text search module is not based on Pat.

I don’t have a pointer to documentation of the Pat query facility, but that’s OK, because I documented a slightly-modified version of it a few years ago, using the label Element Sets. You can follow that link to read up in detail, but I’ll extract a couple of examples from that paper that give a feel for how they’d work.

The following example finds the set of all paragraphs in the introduction that contain one or more footnotes and also one or more cross references.

set1 = Set('Paragraph') within Set('Introduction')
set2 = set1 including Set('footnote')
set3 = set2 including Set('xref')

The following example finds the set of paragraphs in the introduction that contain either a cross reference or a footnote but not both.

set1 = Set('Paragraph') within Set('Introduction')
set2 = set1 including Set('footnote')
set3 = set1 including Set('xref')
set4 = (set2 + set3) - (set2 ^ set3)

Of course, you can combine element sets with search:

set1 = Set('Title', contains="introduction")
set2 = Set('Title', attribute=("xml:lang", "en-US"))
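To make the first two examples concrete, here is a rough Python approximation built on the standard `xml.etree.ElementTree` module. It is a sketch, not the Pat or Element Sets implementation: the `ElementSets` class, its method names, and the sample document are all hypothetical. It also assumes that `+` in the notation above means union and `^` means intersection, as the prose descriptions imply, so the sketch uses Python's `|` and `&`.

```python
import xml.etree.ElementTree as ET

class ElementSets:
    """Hypothetical helper: element sets over one parsed document."""

    def __init__(self, root):
        self.root = root
        # ElementTree keeps no parent pointers, so record them once.
        self.parent = {child: p for p in root.iter() for child in p}

    def set(self, tag):
        """All elements with the given tag."""
        return set(self.root.iter(tag))

    def _ancestors(self, elem):
        while elem in self.parent:
            elem = self.parent[elem]
            yield elem

    def within(self, inner, outer):
        """Members of inner that have at least one ancestor in outer."""
        return {e for e in inner if any(a in outer for a in self._ancestors(e))}

    def including(self, outer, inner):
        """Members of outer that have at least one descendant in inner."""
        return {e for e in outer
                if any(d in inner for d in e.iter() if d is not e)}

doc = ET.fromstring("""
<doc>
 <Introduction>
  <Paragraph id="p1">a<footnote/>b<xref/></Paragraph>
  <Paragraph id="p2">a<footnote/></Paragraph>
  <Paragraph id="p3">a<xref/></Paragraph>
  <Paragraph id="p4">plain</Paragraph>
 </Introduction>
 <Body><Paragraph id="p5"><footnote/><xref/></Paragraph></Body>
</doc>""")

es = ElementSets(doc)
set1 = es.within(es.set("Paragraph"), es.set("Introduction"))
set2 = es.including(set1, es.set("footnote"))
set3 = es.including(set1, es.set("xref"))

def ids(elements):
    return sorted(e.get("id") for e in elements)

both = ids(set2 & set3)                                # footnote and xref
either_not_both = ids((set2 | set3) - (set2 & set3))   # exactly one of them
```

The point of the design is that every operator takes sets of elements and returns a set of elements, so queries compose by plain assignment.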

I think that this approach is important to think about because, unlike some of the other approaches below, it has been commercially implemented and has been proven to work just fine and be very useful for searching XML.

A Note on Typing · In many cases, when you’re talking about full-text search, you can ignore the labyrinth of issues around data types, because everything’s a string, right? But consider the following little chunk of HTML:

<p>The full-size picture 
 <img src="madonnaBig.jpg" width="500" />
and a smaller version: 
 <img src="madonnaTN.jpg" width="80" /></p>

When some of the things in your markup are numbers, you’d like to be able to treat them as numbers; in the preceding example, you’d like to be able to search for images larger than, say, 90 pixels in width, which you can’t do unless you treat the width= values as numbers, not strings.
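A quick Python sketch of the difference, run against a well-formed version of the snippet. The function name `images_wider_than` is invented for the example.

```python
import xml.etree.ElementTree as ET

snippet = """<p>The full-size picture
 <img src="madonnaBig.jpg" width="500" />
and a smaller version:
 <img src="madonnaTN.jpg" width="80" /></p>"""

root = ET.fromstring(snippet)

def images_wider_than(root, min_width):
    """src of every img whose width attribute, read as an integer, exceeds min_width."""
    hits = []
    for img in root.iter("img"):
        width = img.get("width")
        if width is not None and width.isdigit() and int(width) > min_width:
            hits.append(img.get("src"))
    return hits

# Typed comparison: 500 > 90, 80 is not.
typed = images_wider_than(root, 90)

# String comparison goes wrong: "500" sorts before "90" character by character.
stringly = [img.get("src") for img in root.iter("img") if img.get("width") > "90"]
```

The typed query finds the big picture; the string comparison finds nothing at all, because "5" sorts before "9".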

And if your XML isn’t HTML-like, but contains prices and dates and that kind of thing, the scope for type-sensitive querying obviously is much greater. Put another way, if your XML is less like a document and more like a database record, you’re probably less interested in full-text search and more interested in strongly-typed querying.

The W3C and XQuery · That paper I cited above is my contribution to a workshop the W3C held back in 1998 on the general subject of how to query XML. The Element Sets approach was quickly passed over as being too simple to be interesting, in favor of the approach that has led to W3C’s massive, years-long XML Query activity. On November 12th, this group went to “Last Call” on its draft specifications. They comprise eight documents, in aggregate many hundreds of pages long, and are remarkably ambitious and complex.

The basic idea, back when they got started, was to take XPath, a little language originally invented for XSLT to select pieces of XML documents, and use that as the basis for a general-purpose query facility.
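For a feel of what "select pieces of XML documents" looks like, here is a sketch using the limited XPath subset that Python's `xml.etree.ElementTree` supports. It is not full XPath, let alone XQuery, and the sample document and its element names are invented.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<book>
 <chapter id="intro"><title>Introduction</title><para>One</para><para>Two</para></chapter>
 <chapter id="body"><title>Body</title><para>Three</para></chapter>
</book>""")

# Every title, at any depth.
titles = [t.text for t in doc.findall(".//title")]

# Paragraphs that are children of the chapter whose id attribute is "intro".
intro_paras = [p.text for p in doc.findall("./chapter[@id='intro']/para")]

# Title of the second chapter (positions are 1-based).
second_title = doc.find("./chapter[2]/title").text
```

Each path expression walks down the tree step by step, optionally filtering on attributes or position along the way.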

Along the way, they decided to build in a generalized type query engine, not just for the kind of numbers and dates described above, but so you can do queries such as “Find me anything which matches the XML Schema type named extended-abstract or any type derived from that”—one consequence is that XQuery is joined at the hip to XML Schema.

I’m the wrong person to go to for an opinion on XQuery; sometime a year or more ago I became uncomfortable with several aspects of the project, including its sprawling scope and tight XML-Schema integration. I submitted a lengthy and detailed body of feedback on an earlier set of working drafts, with many specific suggestions for how they might address these issues. I will say nothing about the response to my input except that, all these months later, I am still deeply angry and have no intention of wasting any further time in reviewing or commenting on their drafts.

In fairness, many people are quite positive on XQuery, and building an XML-savvy query facility with XPath as a basis feels like it ought to be a good idea, and the group has lots of smart people in it and there are implementations; so maybe I’m wrong and everything’s just fine. The marketplace will eventually judge.

SQLX · In parallel to the W3C activity, and with some of the same people participating, the SQL community has been doing work on extensions for XML. It’s called SQL/XML and is being prepared by the “SQL X Group” and is intended for ratification by INCITS H2. Unfortunately, the SQL/X Website is down as I write this while they build defenses against a determined denial-of-service attack. Fortunately, there’s what seems to be a pretty good overview in Oracle-land, a page that discusses SQL/XML and XQuery.

I’ve been to one SQL/XML presentation by the redoubtable Jim Melton, who’s pretty central to SQL as a whole, and the whole effort seemed sensible for those whose worldview is SQL-centric (that’s a lot of people). As I said, the marketplace will sort all these experimental science projects out and, looking back in a few years, the right answer will always have been obvious.

We Don’t Know the Answer · In case it’s not obvious, we haven’t figured out what the right way to search XML is. It’s worse than that; here’s a list of the things that we don’t know:

Whether there’s going to be a lot of XML around in repositories to search. XML these days is used more in interchange than in archival applications.
Whether the rewards to be found in enhancing search based on XML’s flexible, dynamic structure are great enough to justify the cost of building search systems that can deal with XML’s flexible, dynamic structure.
If there is a lot of XML around to be searched, and if people actually want to make the effort to use the structure to support searching, which kind of approach (minimal like Element Sets, SQL-integrated, or the brave new world of XQuery) will prove to be the winner.

November 30, 2003

By Tim Bray.
