Most places in the world, when you talk about “Intel” you’re talking about CPU chips. In Washington DC, that word is universally an abbreviation for Intelligence, as in three-letter agencies, spooks, and so on. A couple of weeks ago I was invited to an all-day meeting of the Intelligence Community Metadata Working Group (ICMWG) out in Virginia, one of my more interesting outings of 2003. Herewith some personal notes on the notion of spying and a look at what may be the world’s biggest and hardest metadata problem, with some remarks on the application of RDF.
Spy Fancier · I’ve always had a fascination with intelligence, perhaps partly due to growing up in Beirut, where the major powers maintained a higher ratio of diplomatic staff to local population than anywhere else in the world, and everyone knew it was all about spying; in seventh grade, my best friend’s Dad was an obvious spook.
I admit to being fascinated by the glamour, but I’ve never made an effort to get involved, since a decade overseas and a youthful period as a long-haired commie pinko radical probably make me practically un-clearable. (But hey, if I’m wrong, and you’re a well-funded intelligence agency, give me a call.)
It’s not the cloak-and-dagger end of it that draws me (I’m a geek, after all); it’s the idea of applying large amounts of computing resources to chip away at the surfaces of opaque datasets and dig out Really Important Stuff.
Also, I believe that spying makes the world safer. There are bad people out there who are inclined to start trouble ranging from assassinations to 9/11 to land wars. The likelihood that they will do so is lessened to the extent that their victims find out in advance, and also to the extent that the bad guys think the victims might know what they’re planning. So the huge sums of money that get poured into intelligence around the world have always struck me as a sound investment in peace.
Metadata Writ Large · The ICMWG was well-attended; the organizations represented that are public enough to have web sites included the CIA, NSA, DIA, NIMA, NRO, and INSCOM.
The reason the ICMWG exists is obvious. The intel community is the world’s largest user of full-text-search software, but as anyone who lives by search knows, and as I’ve written here, search is in many ways ultimately unsatisfying. For information that you really care about searching, you need to gather, organize, and maintain metadata.
So the idea of getting together to talk about metadata standards is a no-brainer. The idea of using XML to facilitate this work is likewise not much of a stretch. Once you go that far, the problems get very tough very fast.
The most basic problem is that at the end of the day there is no cheap metadata; you can throw all the software and hardware in the world at the problem, but to generate useful metadata there is no substitute for having experts think about it and write it down, and this can never be made cheap.
You can provide your experts with infrastructural support (and you should), and you can look aggressively for places you can automate metadata gathering (and you should), and you can experiment with computational linguistics and pattern matching technologies (and you should). But when the balance of war or peace, or the prevention of the next 9/11, depends on finding correspondence on subject X written by persons with links to organization Y in the context of country Z, the metadata’s cheap at almost any price.
However, the cost of doing that work more than once because you don’t have facilities for sharing or re-using the metadata is just not bearable. Hence the ICMWG.
Hard Stuff · Given all that, the problems of organizing and sharing this information are horribly hard. Some of the hot points are taxonomies, ontologies, thesauri, lexica, character sets, human languages, graphics formats, user interfaces, query languages, authority files, schemas, you name it. And the datasets are big and the stakes are high.
And of course, this being an arena of human activity, the politics are intense and occasionally toxic.
RDF, Semantic Web, etc · It is no accident that these people are really interested in the promise of the Semantic Web, and they’ve voted with their wallets: a lot of the work in that space has Defense funding behind it.
After spending the day listening to these smart, motivated people talk about these hard, important problems, I stood up and told them I thought they should use RDF.
This may come as a surprise to some of the RDF tribe, since I’ve whined heavily and on the record about RDF’s shortcomings. But here you have a wildly heterogeneous set of communities with wildly heterogeneous technologies and wildly heterogeneous data architectures, and yet there’s huge overlap in their problem spaces. To the extent that RDF is going to force them to boil their data down to simple, cleanly-defined assertions, and be clear about what the properties are and the ranges of allowed values, they win. To the extent that RDF/XML forces them to get their Unicode right and isolate the sources of error, they win.
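To make "simple, cleanly-defined assertions" a little more concrete, here is a rough sketch in plain Python rather than real RDF tooling. Everything in it is made up for illustration: the `ex:` property names, the document and person identifiers, and the table of allowed values are assumptions, not anything the ICMWG actually uses. It also simplifies "range" to a fixed set of permitted values, which is looser than what RDF Schema means by the term.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One assertion: a subject has a property (predicate) with a value (obj)."""
    subject: str
    predicate: str
    obj: str


# Hypothetical vocabulary. Each property maps to the set of values it may
# take; None means any value is accepted.
ALLOWED_VALUES = {
    "ex:docType": {"letter", "memo", "report"},
    "ex:language": {"ar", "en", "fr"},
    "ex:author": None,
    "ex:subject": None,
}


def validate(triple):
    """Return an error message, or None if the assertion is acceptable."""
    if triple.predicate not in ALLOWED_VALUES:
        return f"unknown property: {triple.predicate}"
    allowed = ALLOWED_VALUES[triple.predicate]
    if allowed is not None and triple.obj not in allowed:
        return f"value {triple.obj!r} not allowed for {triple.predicate}"
    return None


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


assertions = [
    Triple("doc:1", "ex:docType", "letter"),
    Triple("doc:1", "ex:author", "person:a"),
    Triple("doc:1", "ex:subject", "topic:x"),
    Triple("doc:2", "ex:docType", "telegram"),  # outside the allowed values
]

# Separate the clean assertions from the ones that break the vocabulary.
errors = [(t, validate(t)) for t in assertions if validate(t) is not None]
valid = [t for t in assertions if validate(t) is None]

# A simple pattern query over the clean data: every document typed as a letter.
letters = match(valid, predicate="ex:docType", obj="letter")
```

The point of the sketch is only that once everything is a three-part assertion against an agreed vocabulary, the bad data isolates itself (the `telegram` entry lands in `errors`) and the same small query function works no matter which community produced the assertions.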
I think the benefits they’ll achieve from using RDF are independent of the progress (or lack of it) of the Semantic Web; just by reducing friction and simplifying data models, they’re going to save money and effort.
But this does touch the Semantic Web. Obviously, a precondition for the SW becoming interesting is that there be enough data out there to drive the assertion engines and knowledge distillers and linkage followers that we are told will change the world. Right now, getting people to pay the RDF Tax based on this future potential is a hard sell.
But the intelligence community might end up aggregating enough graph-structured data simply as a side-effect of wanting to save money and effort on tough national-security problems. The Semantic Web’s emergent properties might start to show up as an unintended side-effect.
Of course, because of the nature of the data, most of us would never know. In fact, it might be happening right now.