I’ve decided that mod_atom really needs to be a blog-publishing system, not just an Atom Store. And furthermore, based mostly on the comments to that Sanitation piece, I’ve made two design decisions. First, the sanitizing happens only on the HTML output; the Atom-store part will persist the data as close as possible to the way it was sent upstream. Second, I’m going to try using the TidyLib parser to pick apart type="html" text constructs so I can clean ’em up.
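To make the first decision concrete, here is a minimal sketch in C; it is not mod_atom's actual code. The `entry` struct, `entry_store`, `entry_render_html`, and the `sanitize_fn` hook are all hypothetical names, and `escape_markup` is only a stand-in that escapes markup rather than parsing it the way a TidyLib-based cleaner would.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical entry record: the store keeps the bytes exactly as received. */
typedef struct {
    char *raw_content;  /* type="html" text construct, unmodified */
} entry;

/* Hypothetical sanitizer hook: returns a newly allocated cleaned copy, or NULL. */
typedef char *(*sanitize_fn)(const char *html);

/* Store path: copy the body verbatim, with no cleaning. Returns 0 on success. */
int entry_store(entry *e, const char *body)
{
    assert(e != NULL && body != NULL);
    e->raw_content = malloc(strlen(body) + 1);
    if (e->raw_content == NULL)
        return -1;
    strcpy(e->raw_content, body);
    return 0;
}

/* Stand-in sanitizer (an assumption, not Tidy): escapes &, <, and >. */
char *escape_markup(const char *html)
{
    /* Worst case: every byte becomes the 5-byte "&amp;". */
    char *out = malloc(strlen(html) * 5 + 1);
    char *p = out;
    if (out == NULL)
        return NULL;
    for (; *html != '\0'; html++) {
        const char *rep = NULL;
        switch (*html) {
        case '&': rep = "&amp;"; break;
        case '<': rep = "&lt;";  break;
        case '>': rep = "&gt;";  break;
        }
        if (rep != NULL) {
            size_t n = strlen(rep);
            memcpy(p, rep, n);
            p += n;
        } else {
            *p++ = *html;
        }
    }
    *p = '\0';
    return out;
}

/* Render path: sanitizing happens only here; the stored bytes are untouched.
   The caller frees the returned string. */
char *entry_render_html(const entry *e, sanitize_fn clean)
{
    assert(e != NULL && clean != NULL);
    return clean(e->raw_content);
}
```

Because the cleaner is passed in as a function pointer, the escaping stand-in could later be swapped for a parser-backed one without touching the store path.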
Why Tidy? · The other candidate was libxml2, and online research failed to reveal any hands-on comparisons of the two, but it also failed to turn up anyone seriously dissing either HTML parser. So then I noticed that the libxml2 binary was like 3.8M, while TidyLib is under 400K. Of course, to be fair, libxml2 does tons of other useful stuff that I don’t care about.
So after a couple of days’ part-time poking around, I figured out how to compile TidyLib and mod_atom together and load the result into httpd.
Now let’s see how it goes. I must say that I’m a little intimidated by Tidy’s memory allocator. That’s extremely, uh, extreme. I suppose I can figure it out. Compare Genx’s. Am I too simple-minded?
As soon as I stop blogging I’m going to try to wire it up. Surely I have some big thick books or corporate strategies or social-software trends to review first?
From: Bob DuCharme (Aug 16 2007, at 19:43)
What about John Cowan's TagSoup (http://ccil.org/~cowan/XML/tagsoup/)?
From: Tim (Aug 16 2007, at 21:43)
Bob: TagSoup is in Java.
From: Aslak Raanes (Aug 17 2007, at 01:01)
I guess a plain C version of html5lib would be nice, but don't know if someone is working on that.
From: David Comay (Aug 27 2007, at 10:53)
Tim, you may be interested to know that Tidy has been integrated into build 71 of OpenSolaris so it's now part of Solaris Express.