Author: Tim Bray
Date: Jan-May, 2004
Locations: Vancouver, Melbourne, Mooloolaba
This document describes release beta5 of Genx.
Genx may not remain hosted at wherever you got this file from, and is quite likely to change and grow based on community feedback. You’ve been warned!
Genx is copyright © Tim Bray and Sun Microsystems, 2004. It is licensed for re-use under the terms described in the file COPYING.
Genx is a software library written in the C programming language. You can use it to generate XML output, suitable for saving into a file or sending in a message to another computer program. Genx does several things at once:
Takes care of escaping XML’s special characters for you.
Keeps you from generating text which isn’t well-formed.
Generates namespace prefixes so you don’t have to.
Produces documents which are Canonical XML, suitable for use with digital-signature technology.
Tries to do all this efficiently.
Here’s the program:
Compile it with something like
cc -o hello hello.c libgenx.a
and the output should look like this:
Of course, useful XML documents have attributes as well as elements, so let’s add one:
This generates:
Another common XML idiom is namespaces, so let’s put our element and attribute into two separate namespaces.
This makes the output quite a bit uglier:
Passing all these literal strings for element types and attribute names and so on is inefficient, particularly since they usually don’t change much. So if you wanted to generate a million random year/month combinations efficiently as in the example below, you’d use the predeclared versions of the Genx calls. Also, if something goes wrong, you’d like to hear about it before looping a million times uselessly; so this version has error-checking.
Also, I’ve put the root element in a namespace so you can see how that works.
Here are the first 10 lines of output:
Before you do anything, you need to create a genxWriter with genxNew. A genxWriter can be used to generate as many XML documents as you want (one at a time). It’s a bit expensive to create, so if you’re going to be writing multiple XML documents, particularly if they all have the same elements and attributes, do re-use a genxWriter.
Declaring your elements and attributes is much more efficient than using the Literal versions of the calls. This is because Genx only needs to check the names once for well-formedness, and because it can pre-arrange the sorting of attributes in canonical order. Also, Genx makes its own copy of the element, attribute and namespace names and prefixes and so on, so you don’t have to keep them around. For any production application, predeclaration is the way to go.
Once you’ve got a genxWriter, you set up to write a document either with genxStartDocFile or genxStartDocSender. The first is easiest to understand; you provide a FILE *, and Genx writes into it. Alternatively, you can provide your own set of routines to do output, for example into a relational database or a socket, in a package called a genxSender, and Genx uses that instead.
Once you’ve got your elements, attributes, and namespaces declared, you start new documents with genxStartDocFile or genxStartDocSender, then you can just bang away with genxStartElement, genxAddAttribute, genxAddText, genxEndElement, and so on, and end each document with genxEndDocument.
Genx expects you to provide all strings in UTF-8 format, and checks each one to make sure that it’s real UTF-8 and that each character is a legal XML character. It doesn’t know about &lt; and &amp; and so on; that is, it knows how to generate them, but it won’t interpret them in the input. So if you want to say if(a<b&&c<d), don’t fool with any escaping, just use genxAddText(w,"if(a<b&&c<d)") and Genx will sort it all out.
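To make the escaping concrete, here is a minimal sketch of the kind of transformation described above. It is not Genx’s implementation: escape_text is a hypothetical helper, and the &#xD; spelling for carriage-return is assumed from the Canonical XML convention. It covers the four characters this document says Genx escapes in text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of Genx: copies `in` to `out`, replacing
 * '<', '&', '>' and carriage-return with their escaped forms.
 * `out` must have room for 5 * strlen(in) + 1 bytes (worst case).
 * Returns the number of bytes written, not counting the terminator.
 */
size_t escape_text(const char * in, char * out)
{
  size_t n = 0;

  assert(in != NULL && out != NULL);
  for (; *in != '\0'; in++)
  {
    const char * rep = NULL;

    switch (*in)
    {
    case '<':  rep = "&lt;";  break;
    case '&':  rep = "&amp;"; break;
    case '>':  rep = "&gt;";  break;
    case '\r': rep = "&#xD;"; break;  /* assumed Canonical XML spelling */
    default:   break;
    }
    if (rep != NULL)
    {
      size_t len = strlen(rep);

      memcpy(out + n, rep, len);
      n += len;
    }
    else
      out[n++] = *in;
  }
  out[n] = '\0';
  return n;
}
```

Feeding it the string if(a<b&&c<d) yields if(a&lt;b&amp;&amp;c&lt;d), which is what a parser needs to see in order to hand the original text back.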
If there is some “difficult” character that you want to get into your XML output, say a mathematical integral symbol “∫”, and you’d really like the equivalent of &#8747; or &#x222b;, just use the Unicode value: genxAddCharacter(w,0x222b).
You can control your namespace prefixes if you use the predeclared version. But you can always leave out the prefix and Genx will generate one; the first will be g1:, the second g2:, and so on.
Genx provides a set of status codes in an enum called genxStatus. The value for success, GENX_SUCCESS, is guaranteed to be zero, so it’s easy to check errors in Genx calls along the lines of:
if (genxAddAttribute(id, idValue)) { /* oops! */ }
Well, except when it isn’t. The routines that declare things return the things they declare (NULL on error) and write the genxStatus into a variable whose address you provide, for example:
genxElement genxDeclareElement(genxWriter w, genxNamespace ns, constUtf8 type, genxStatus * statusP);
There are a couple of routines, genxGetErrorMessage and genxLastErrorMessage, which retrieve English-language descriptions of what went wrong.
There are three kinds of errors you can encounter with Genx.
We all have reduced-mental-function days, and Genx will sneer pityingly at you if you try to genxStartElement without having previously done a genxStartDocFile or genxStartDocSender call, or do a genxAddAttribute any time but after a genxStartElement. And so on.
This is the kind of problem that you’re most likely to run across. If you’re trying to wrap XML tags around input data you don’t control (common enough), Genx will be unhappy if the data has malformed UTF-8 or contains Unicode characters that XML doesn’t allow.
To help out with these situations, there are the genxCheckText and genxScrubText calls. Appropriate use of these ensures that you never hurt any feelings, either in the Genx software or, more important, with whoever’s going to be receiving your XML. See the write-up on utility routines for some specific suggestions.
Genx throws up its hands in despair if it can’t allocate memory or it gets an I/O error writing data. The first is unlikely to happen, since Genx doesn’t use much memory. However, it does store up attribute values per element, so if you did a thousand or so genxAddAttribute calls for a single element, each with an attribute value ten megabytes long, some pain would ensue.
To make sure you never hand Genx an illegal name or malformed XML, there are the handy utility routines genxCheckText and genxCheckName. If you’re including someone else’s data in your XML and you can’t control whether it contains proper XML characters properly UTF-8 encoded, give serious thought to using genxScrubText, which brutally discards any bytes that aren’t well-formed UTF-8 or don’t encode legal XML characters.
Since genxAddText does the checking anyhow, there’s no need for you to do it first. Consider an idiom like:
/* Add text safely */
status = genxAddText(w, text);
if (status == GENX_BAD_UTF8 || status == GENX_NON_XML_CHARACTER)
{
  /* Scrubbed output is never longer than the input, so this is big enough */
  utf8 newText = (utf8) alloca(strlen((const char *) text) + 1);
  genxScrubText(w, text, newText);
  status = genxAddText(w, newText); /* Can't fail */
}
if (status)
{
  /* something SERIOUSLY wrong */
}
There are a bunch of things that people often do in creating XML but that Genx doesn’t support. In some cases, Doing These Things Would Be Wrong. In others, they might be handy but don’t feel essential for this kind of a low-rent package.
The things that Genx can’t do include:
Generating output in anything but UTF-8.
Writing namespace-oblivious XML. That is to say, you can’t have an element or attribute named foo:bar unless foo is a prefix associated with some namespace.
Empty-element tags.
Writing XML or <!DOCTYPE> declarations. Of course, you could squeeze these into the output stream yourself before any Genx calls that generate output.
Pretty-printing. Of course, you can pretty-print yourself by putting the linebreaks in the right places and indenting appropriately, but Genx won’t do it for you. Someone might want to write a pretty-printer that sits on top of Genx.
By design, Genx writes Canonical XML. This means that there are no XML or <!DOCTYPE> declarations, that the attributes are sorted in a particular order, that all instances of > and carriage-return (U+000D) are escaped, and that there is no whitespace outside the root element except newlines that precede and follow comments and PIs.
Normally, this should cause no surprises or difficulties, except that Canonical XML documents don’t have a closing new-line character, which may irritate some applications such as text editors.
As noted above, if you want extra declarations or closing newlines, you can put them in yourself before and after doing your Genx calls; but be aware that your output will no longer be Canonical XML.
The design of Genx takes some care to achieve good performance. However, there are some things you can do to help, and others which will slow it down; one function in particular can be used in optimizing or pessimizing performance.
The genxAddNamespace call instructs Genx to insert a namespace declaration; it must be called after starting an element and before any genxAddAttribute calls. You don’t ever need to call it; Genx will figure out when it needs to add namespace declarations on its own. However, if you have a bunch of elements or attributes, all in the same namespace, scattered all around your document, and you do a genxAddNamespace for that namespace on the root element, Genx won’t ever have to add another declaration, and your document will end up smaller, more readable, and quicker to transmit and parse.
On the other hand, genxAddNamespace can be called with an extra argument, a prefix to use, which need not be the same as the default prefix for that namespace. If you do this, performance will suffer grievously, as it makes a bunch of internal optimizations impossible and Genx has to laboriously examine its whole internal stack any time you use that namespace again to make sure the right prefixes are in scope. (By the way, it’s good practice anyhow to use the same prefix for the same namespace throughout an XML document, so Genx rewards good practice with good performance.)
Genx also has a genxUnsetDefaultNamespace call, which does what its name suggests. If you use this, however, you will defeat a bunch of optimizations and make the namespace that used to be the default much slower to process.
This section documents all the datatypes that appear in Genx’s published interface, found in the file genx.h.
typedef enum
{
GENX_SUCCESS = 0,
GENX_BAD_UTF8,
GENX_NON_XML_CHARACTER,
GENX_BAD_NAME,
GENX_ALLOC_FAILED,
GENX_BAD_NAMESPACE_NAME,
GENX_INTERNAL_ERROR,
GENX_DUPLICATE_PREFIX,
GENX_SEQUENCE_ERROR,
GENX_NO_START_TAG,
GENX_IO_ERROR,
GENX_MISSING_VALUE,
GENX_MALFORMED_COMMENT,
GENX_XML_PI_TARGET,
GENX_MALFORMED_PI,
GENX_DUPLICATE_ATTRIBUTE,
GENX_ATTRIBUTE_IN_DEFAULT_NAMESPACE,
GENX_DUPLICATE_NAMESPACE,
GENX_BAD_DEFAULT_DECLARATION
} genxStatus;
This documents all the things that can go wrong. You can use the functions genxGetErrorMessage and genxLastErrorMessage to associate English-language messages with these codes. Here are some further notes on the ones that are actually used in the implementation:
GENX_BAD_UTF8: A violation of the UTF-8 encoding rules, as documented in Chapter 3.10 of The Unicode Specification. That’s the chapter reference for Version Four of Unicode, anyhow, which is what I used to help me write Genx. The explanation of UTF-8 in Version Four is quite a bit better than in any of the earlier releases.
GENX_NON_XML_CHARACTER: The rule for what characters are legal in XML comes from the production labeled Char in the XML 1.0 specification.
GENX_BAD_NAME: The rule that applies here is the production labeled NCName in Namespaces in XML. The bad name could be an element type, an attribute name, a PI target, or a namespace prefix.
GENX_ALLOC_FAILED: This means that Genx failed to allocate memory for some reason that it has no hope of understanding and you probably have no hope of fixing, but at least you know.
GENX_BAD_NAMESPACE_NAME: This means that you tried to genxDeclareNamespace and passed NULL as a namespace name, which pretty well defeats the purpose. Or, you passed the empty string "", which would undeclare a default namespace, except that Genx doesn’t do those.
GENX_INTERNAL_ERROR: Something is terribly wrong inside Genx; send mail to the bozo who wrote it, I think he’s named Ibrahim and lives in Singapore.
GENX_DUPLICATE_PREFIX: You tried to declare two namespaces with the same default prefix.
GENX_SEQUENCE_ERROR: Genx functions have to be called in a particular order, which is reasonably self-evident. You can only call genxAddNamespace and genxUnsetDefaultNamespace after a genxStartElement call and before any genxAddAttribute calls. Turning it around, genxAddAttribute can only be called after genxStartElement and possibly one or more genxAddNamespace/genxUnsetDefaultNamespace calls. This code means you got that order wrong.
GENX_NO_START_TAG: You called genxEndElement, but there was no corresponding genxStartElement call.
GENX_IO_ERROR: An I/O routine has complained to Genx, which is passing the complaint on to you, so it’s your problem now. If you used genxStartDocFile, the error comes from down in the stdio library, which probably means something is terribly wrong at a level too low for you to fix. If on the other hand you’re doing your own I/O via genxStartDocSender, you may be able to do something useful.
GENX_MISSING_VALUE: You called genxAddAttribute but used NULL for the attribute value; if you want it to be empty, use "" instead.
GENX_MALFORMED_COMMENT: A comment’s text isn’t allowed to either begin or end with -, nor is it allowed to contain --. You called genxComment with text exhibiting one of these problems.
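The comment rules above are simple enough to check ahead of time. What follows is a minimal sketch, not Genx’s own code: comment_text_ok is a hypothetical helper that tests only the hyphen rules, and it leaves UTF-8 and XML-character validation to Genx.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of Genx: returns 1 if `text` satisfies the
 * hyphen rules for comment text (no leading '-', no trailing '-', no "--"),
 * and 0 otherwise.
 */
int comment_text_ok(const char * text)
{
  size_t len;

  assert(text != NULL);
  len = strlen(text);
  if (len == 0)
    return 1;                          /* empty text breaks none of the rules */
  if (text[0] == '-' || text[len - 1] == '-')
    return 0;                          /* leading or trailing hyphen */
  if (strstr(text, "--") != NULL)
    return 0;                          /* embedded double hyphen */
  return 1;
}
```

A caller that assembles comments from data it doesn’t control could run this first and rewrite the text, rather than discover the problem from a failed genxComment call.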
GENX_XML_PI_TARGET: You tried to create a PI whose target was xml (in any combination of upper and lower case). XML 1.0 says you can’t do that.
GENX_MALFORMED_PI: You called genxPI with a body which included an illegal ?>.
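Both PI restrictions can be expressed in a few lines of C. This is a sketch under the rules as stated here, not Genx’s implementation; pi_target_is_reserved and pi_text_ok are hypothetical names, and the target’s NCName validity is not checked.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of Genx: returns 1 if `target` is "xml" in
 * any combination of upper and lower case, 0 otherwise.
 */
int pi_target_is_reserved(const char * target)
{
  assert(target != NULL);
  /* Each test short-circuits, so we never read past the terminator. */
  return tolower((unsigned char) target[0]) == 'x'
      && tolower((unsigned char) target[1]) == 'm'
      && tolower((unsigned char) target[2]) == 'l'
      && target[3] == '\0';
}

/*
 * Hypothetical helper, not part of Genx: returns 1 if `text` can serve as a
 * PI body, that is, it does not contain the closing delimiter "?>".
 */
int pi_text_ok(const char * text)
{
  assert(text != NULL);
  return strstr(text, "?>") == NULL;
}
```

Note that a target such as xml-stylesheet merely begins with xml, so it is not rejected by the first check.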
GENX_DUPLICATE_ATTRIBUTE: You tried to add the same attribute to some element more than once. There’s no check whether you provided the same value or not; this is evidence of breakage.
GENX_ATTRIBUTE_IN_DEFAULT_NAMESPACE: You either tried to declare an attribute in a namespace whose default prefix is empty (i.e. it’s the default namespace), or tried to add an attribute which is in a namespace, and the currently-effective declaration for that namespace has an empty prefix, i.e. it’s the default namespace.
GENX_DUPLICATE_NAMESPACE: You tried to add two namespace declarations for the same namespace on the same element, but with different prefixes.
GENX_BAD_DEFAULT_DECLARATION: You tried to declare some namespace to be the default on an element which is in no namespace.
#define GENX_XML_CHAR 1
#define GENX_LETTER 2
#define GENX_NAMECHAR 4
These are mostly used internally, but the utility function genxCharClass returns the OR of any that apply.
typedef unsigned char * utf8;
This is the flavor of text string that all Genx functions expect.
typedef const unsigned char * constUtf8;
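You’d think that this would be the same as const utf8, but it’s not: const applies to the typedef as a whole, so const utf8 is a constant pointer to writable bytes, whereas constUtf8 is a pointer to constant bytes.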
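The difference between the two spellings is easy to see in a small example. The two typedefs are copied from above; uppercase_first and count_bytes are hypothetical functions written only to show what each type permits.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char * utf8;
typedef const unsigned char * constUtf8;

/*
 * `const utf8` is `unsigned char * const`: the pointer cannot be changed,
 * but the bytes it points at can.
 */
void uppercase_first(const utf8 s)
{
  assert(s != NULL);
  if (s[0] >= 'a' && s[0] <= 'z')
    s[0] = (unsigned char) (s[0] - 'a' + 'A');  /* legal: bytes are writable */
  /* s++;  would not compile: the pointer itself is const */
}

/*
 * `constUtf8` is `const unsigned char *`: the pointer can move, but the
 * bytes are read-only through it.
 */
int count_bytes(constUtf8 s)
{
  int n = 0;

  assert(s != NULL);
  while (*s != 0)
  {
    s++;                                        /* legal: pointer may move */
    n++;
  }
  /* s[0] = 'x';  would not compile: the bytes are const */
  return n;
}
```

This is why the interface needs a separate constUtf8 typedef for read-only string arguments.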
genxWriter: Opaque pointer type which identifies a writer object and is the first argument to most Genx calls; created with genxNew.
genxNamespace: Opaque pointer identifying a namespace; created with genxDeclareNamespace.
genxElement: Opaque pointer identifying an element; created with genxDeclareElement.
genxAttribute: Opaque pointer identifying an attribute; created with genxDeclareAttribute.
typedef struct
{
genxStatus (* send)(void * userData, constUtf8 s);
genxStatus (* sendBounded)(void * userData, constUtf8 start, constUtf8 end);
genxStatus (* flush)(void * userData);
} genxSender;
A user-provided package of I/O routines, to be passed via genxStartDocSender. Their names should be self-explanatory; for sendBounded, if you have s = "abcdef"; and you want to send abc, you’d call sendBounded(userData, s, s + 3);
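As an illustration of the half-open [start, end) convention, here is a sketch of send and sendBounded routines that append to an in-memory buffer. To keep it compilable on its own it uses stand-ins: sketchStatus plays the role of genxStatus, and memBuffer is a hypothetical userData structure. None of these names come from genx.h.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef const unsigned char * constUtf8;               /* as declared above */
typedef enum { SKETCH_OK = 0, SKETCH_IO_ERROR } sketchStatus;  /* stand-in */

/* Hypothetical userData: a fixed-size, always NUL-terminated memory sink. */
typedef struct
{
  char data[256];
  size_t used;
} memBuffer;

/* Appends the bytes from start up to, but not including, end. */
sketchStatus bufSendBounded(void * userData, constUtf8 start, constUtf8 end)
{
  memBuffer * b = (memBuffer *) userData;
  size_t len;

  assert(b != NULL && start != NULL && end >= start);
  len = (size_t) (end - start);
  if (len >= sizeof b->data - b->used)   /* keep one byte for the terminator */
    return SKETCH_IO_ERROR;
  memcpy(b->data + b->used, start, len);
  b->used += len;
  b->data[b->used] = '\0';
  return SKETCH_OK;
}

/* The zero-terminated variant is the bounded one with end computed. */
sketchStatus bufSend(void * userData, constUtf8 s)
{
  assert(s != NULL);
  return bufSendBounded(userData, s, s + strlen((const char *) s));
}
```

With s = "abcdef", calling bufSendBounded(&buffer, s, s + 3) leaves abc in the buffer. A real sender would return genxStatus values such as GENX_IO_ERROR instead of the stand-ins.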
This section documents all the function calls that appear in Genx’s published interface, found in the file genx.h.
genxWriter genxNew(void * (*alloc)(void * userData, int bytes),
void (* dealloc)(void * userData, void * data),
void * userData);
Creates a new instance of genxWriter. The three arguments are a memory allocator and deallocator (see genxSetAlloc and genxSetDealloc), and a userData value (see genxSetUserData).
void genxDispose(genxWriter w);
Frees all the memory associated with a genxWriter.
void genxSetUserData(genxWriter w, void * userData);
The value passed in userData is passed as the first argument to memory-allocation (see genxSetAlloc) and I/O (see genxStartDocSender) callbacks. If not provided, NULL is passed.
void * genxGetUserData(genxWriter w);
Retrieves the value set with genxSetUserData, or NULL if none was set.
void genxSetAlloc(genxWriter w,
void * (* alloc)(void * userData, int bytes));
The subroutine identified by alloc is used by Genx to allocate memory. If no allocator is set, Genx uses malloc.
void genxSetDealloc(genxWriter w,
void (* dealloc)(void * userData, void * data));
The subroutine identified by dealloc is used by Genx to deallocate memory, but only if you called genxSetAlloc with a non-NULL argument. If you set a non-NULL allocator with genxSetAlloc but no deallocator, Genx will never deallocate memory.
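A matched allocator and deallocator pair might look like the following sketch, which wraps malloc and free and keeps counts in the userData structure. The function signatures follow the callback types shown above; allocStats, countingAlloc and countingDealloc are hypothetical names, not part of Genx.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userData: counters updated by the two callbacks below. */
typedef struct
{
  long allocations;   /* successful countingAlloc calls */
  long frees;         /* countingDealloc calls that released a block */
} allocStats;

/* Matches the callback type void * (*)(void * userData, int bytes). */
void * countingAlloc(void * userData, int bytes)
{
  allocStats * stats = (allocStats *) userData;
  void * p;

  assert(stats != NULL);
  if (bytes < 0)
    return NULL;                       /* a negative size is refused */
  p = malloc((size_t) bytes);
  if (p != NULL)
    stats->allocations++;
  return p;
}

/* Matches the callback type void (*)(void * userData, void * data). */
void countingDealloc(void * userData, void * data)
{
  allocStats * stats = (allocStats *) userData;

  assert(stats != NULL);
  if (data != NULL)
  {
    free(data);
    stats->frees++;
  }
}
```

Going by the genxNew signature above, such a pair would presumably be installed with something like genxNew(countingAlloc, countingDealloc, &stats), so that the same stats pointer arrives as userData in every callback.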
void * (* genxGetAlloc(genxWriter w))(void * userData, int bytes);
Retrieves the allocator routine pointer (if any) set with genxSetAlloc.
void (* genxGetDealloc(genxWriter w))(void * userData, void * data);
Retrieves the deallocator routine pointer (if any) set with genxSetDealloc.
genxNamespace genxDeclareNamespace(genxWriter w,
constUtf8 uri, constUtf8 prefix,
genxStatus * statusP);
Declares a namespace.
If successful, the genxNamespace object is returned and the genxStatus variable indicated by statusP is set to GENX_SUCCESS.
The prefix, if provided, is the default prefix which will be used when Genx has to insert its own xmlns:whatever attribute when you insert an element or attribute in a namespace that you haven’t previously done a genxAddNamespace call on; the default is also used when you call genxAddNamespace with a NULL second argument. You can use "" for the default prefix to make this default to being the default namespace (xmlns=).
If the prefix argument is NULL and you haven’t previously declared this namespace, Genx generates a default prefix; the first is g1:, the second g2:, and so on. If the prefix argument is NULL but you had previously declared a default prefix for this namespace, this is a no-op.
You can declare the same namespace multiple times with no ill effect.
Things can go wrong, signaled by a return value of NULL and a genxStatus code written into *statusP:
GENX_BAD_NAMESPACE_NAME: The namespace name URI is either NULL or an empty string.
GENX_BAD_UTF8 or GENX_NON_XML_CHARACTER: The namespace name contains broken UTF-8 or a non-XML character.
GENX_BAD_NAME: The namespace prefix (if provided) isn’t an NCName.
GENX_DUPLICATE_PREFIX: You declared two namespaces with the same default prefix.
utf8 genxGetNamespacePrefix(genxNamespace ns);
Returns the prefix associated with a namespace; particularly useful where the prefix has been generated for the caller by Genx.
genxElement genxDeclareElement(genxWriter w,
genxNamespace ns, constUtf8 type,
genxStatus * statusP);
Declares an element.
If successful, the genxElement object is returned and the genxStatus variable indicated by statusP is set to GENX_SUCCESS. You can declare the same element multiple times. If the ns is NULL, the element is not in any namespace. The only likely error is the type not being an NCName, in which case NULL is returned and *statusP is set appropriately.
genxAttribute genxDeclareAttribute(genxWriter w,
genxNamespace ns,
constUtf8 name, genxStatus * statusP);
Declares an attribute.
If successful, the genxAttribute object is returned and the genxStatus variable indicated by statusP is set to GENX_SUCCESS. You can declare the same attribute multiple times. If the ns is NULL, the attribute is not in any namespace. The only likely error is the name not being an NCName, in which case NULL is returned and *statusP is set appropriately.
genxStatus genxStartDocFile(genxWriter w, FILE * file);
Prepares to start writing an XML document, using the provided FILE * stream for output.
genxStatus genxStartDocSender(genxWriter w, genxSender * sender);
Prepares to start writing an XML document, using the provided genxSender structure for output.
genxStatus genxEndDocument(genxWriter w);
Signals the end of a document. Actually does very little aside from calling fflush if writing to a FILE *, or the flush method of the genxSender otherwise. Since Genx can detect when the root element has ended, perhaps this should be removed?
genxStatus genxComment(genxWriter w, constUtf8 text);
Inserts a comment with the text provided. Can provoke an error if the text fails to follow the XML 1.0 rules for comment text: no leading or trailing -, and no embedded --.
Per Canonical XML, if the comment appears before the root element, it will be followed by a newline; if after the root element, it will be preceded by a newline.
genxStatus genxPI(genxWriter w, constUtf8 target, constUtf8 text);
Inserts a Processing Instruction. Can provoke an error if the target is xml in any combination of upper and lower case, or if the text contains ?>.
PIs outside the root element are equipped with newlines exactly as with comments.
genxStatus genxStartElementLiteral(genxWriter w,
constUtf8 xmlns, constUtf8 type);
Start writing an element.
The xmlns argument, if non-NULL, is the namespace name, a URI. Genx generates a prefix. If xmlns is NULL, the element will be in no namespace.
If you have previously declared a namespace for the namespace name, the prefix associated with that declaration will be used.
Errors can occur if the xmlns contains broken UTF-8 or non-XML characters, or the type is not an NCName.
This call is much less efficient than genxStartElement.
genxStatus genxStartElement(genxElement e);
Start writing an element using a predeclared genxElement and (optionally) genxNamespace. There is very little that can go wrong with this call, unless you neglect to start the document or have already called genxEndDocument.
genxStatus genxAddAttributeLiteral(genxWriter w, constUtf8 xmlns,
constUtf8 name, constUtf8 value);
Adds an attribute to a just-opened element; i.e. it must be called immediately after one of the start-element calls.
The xmlns argument, if non-NULL, is the namespace name, a URI. Genx generates a prefix. If xmlns is NULL, the attribute will be in no namespace.
Errors can occur if the xmlns or value contains broken UTF-8 or non-XML characters, the name is not an NCName, or if you try to add the same attribute to an element more than once.
Since there is no DTD available, Genx does not do any attribute normalization. However, it does escape the characters <, &, >, carriage-return (U+000D), and " in the attribute value.
This call is much less efficient than genxAddAttribute.
genxStatus genxAddAttribute(genxAttribute a, constUtf8 value);
Adds a predeclared attribute with an (optional) predeclared namespace to a just-opened element; i.e. it must be called immediately after one of the start-element calls.
Errors can occur if the provided value contains broken UTF-8 or non-XML characters, or if you try to add the same attribute to an element more than once.
Since there is no DTD available, Genx does not do any attribute normalization. However, it does escape the characters <, &, >, carriage-return (U+000D), and " in the attribute value.
genxStatus genxAddNamespace(genxNamespace ns, constUtf8 prefix);
Inserts a declaration for a namespace, with the requested prefix, or with the default prefix if the second argument is NULL. If the requested prefix is not the default, this will have a significant impact on the performance of subsequent Genx calls involving this namespace. This is a no-op if a declaration of this namespace/prefix combination is already in effect.
You can’t use the same prefix for two different namespaces within a single start-tag, and you can’t use two different prefixes for the same namespace in the same scope.
This must be called after a genxStartElement call and before any genxAddAttribute calls or a GENX_SEQUENCE_ERROR will ensue.
genxStatus genxUnsetDefaultNamespace(genxWriter w);
Inserts an xmlns="" declaration to unset the default namespace declaration. This is a no-op if no default namespace is in effect.
genxStatus genxEndElement(genxWriter w);
Close an element, writing out its end-tag. The only error that can normally arise is if this is called without a corresponding start-element call.
genxStatus genxAddText(genxWriter w, constUtf8 start);
genxStatus genxAddCountedText(genxWriter w, constUtf8 start, int byteCount);
genxStatus genxAddBoundedText(genxWriter w, constUtf8 start, constUtf8 end);
Write some text into the XML document. This can only be called between start-element and end-element calls.
The text is processed by escaping <, &, >, and carriage-return (U+000D) characters. In the first version, the text is zero-terminated; the Counted and Bounded versions allow the caller to avoid the zero-termination.
genxStatus genxAddCharacter(genxWriter w, int c);
Add a single character to the XML document. The value passed is the Unicode scalar as normally expressed in the U+XXXX notation. Like genxAddText, this can only be called between start-element and end-element calls. This should not normally provoke an error unless the character provided is not a legal XML character.
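Since Genx output is UTF-8, a scalar value handed to a call like this has to be turned into bytes at some point. The following is a minimal sketch of standard UTF-8 encoding, not Genx’s code; encode_utf8 is a hypothetical helper, it assumes int is at least 32 bits wide, and it checks only that the value is a Unicode scalar, not that it is a legal XML character.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of Genx: writes the UTF-8 encoding of the
 * Unicode scalar value `c` into `out` (which needs room for 4 bytes) and
 * returns the number of bytes written, or 0 if `c` is not a scalar value.
 */
int encode_utf8(int c, unsigned char * out)
{
  assert(out != NULL);
  if (c < 0 || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
    return 0;                          /* out of range, or a surrogate */
  if (c < 0x80)
  {
    out[0] = (unsigned char) c;
    return 1;
  }
  if (c < 0x800)
  {
    out[0] = (unsigned char) (0xC0 | (c >> 6));
    out[1] = (unsigned char) (0x80 | (c & 0x3F));
    return 2;
  }
  if (c < 0x10000)
  {
    out[0] = (unsigned char) (0xE0 | (c >> 12));
    out[1] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
    out[2] = (unsigned char) (0x80 | (c & 0x3F));
    return 3;
  }
  out[0] = (unsigned char) (0xF0 | (c >> 18));
  out[1] = (unsigned char) (0x80 | ((c >> 12) & 0x3F));
  out[2] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
  out[3] = (unsigned char) (0x80 | (c & 0x3F));
  return 4;
}
```

For the integral sign U+222B, this produces the three bytes E2 88 AB.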
int genxNextUnicodeChar(utf8 * sp);
Returns the Unicode character encoded by the UTF-8 pointed-to by the argument, and advances the argument to point at the first byte past the encoding of the character. Returns -1 if the UTF-8 is malformed, in which case advances the argument to point at the first byte after the point where malformation was detected.
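For readers curious what such a decoder involves, here is a sketch with a similar shape. It is not the Genx implementation: next_char is a hypothetical function, and it deliberately differs in one detail. When a continuation byte is missing, it leaves the pointer at the offending byte rather than past it, so that a string terminator is never skipped. The caller is expected to check for the terminator before each call.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical decoder, not part of Genx: returns the scalar value encoded
 * at *sp and advances *sp past it, or returns -1 on malformed UTF-8.
 */
int next_char(const unsigned char ** sp)
{
  const unsigned char * s;
  int c, extra, min, i;

  assert(sp != NULL && *sp != NULL);
  s = *sp;
  c = *s++;

  /* Classify the lead byte: continuation count and smallest legal value. */
  if (c < 0x80)                    { extra = 0; min = 0; }
  else if (c >= 0xC2 && c <= 0xDF) { c &= 0x1F; extra = 1; min = 0x80; }
  else if (c >= 0xE0 && c <= 0xEF) { c &= 0x0F; extra = 2; min = 0x800; }
  else if (c >= 0xF0 && c <= 0xF4) { c &= 0x07; extra = 3; min = 0x10000; }
  else
  {
    *sp = s;                       /* stray continuation or invalid lead */
    return -1;
  }

  for (i = 0; i < extra; i++)
  {
    if ((*s & 0xC0) != 0x80)
    {
      *sp = s;                     /* stop AT the byte that broke the rule */
      return -1;
    }
    c = (c << 6) | (*s++ & 0x3F);
  }
  *sp = s;

  /* Reject overlong forms, surrogates, and values beyond U+10FFFF. */
  if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
    return -1;
  return c;
}
```

A typical loop would be while (*p != 0) { c = next_char(&p); ... }, treating -1 as a signal to drop or replace the bad bytes.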
genxStatus genxCheckText(genxWriter w, constUtf8 s);
This utility routine checks the null-terminated text provided and returns one of GENX_SUCCESS, GENX_BAD_UTF8, or GENX_NON_XML_CHARACTER.
int genxCharClass(genxWriter w, int c);
The argument is a single Unicode scalar character value. Returns an integer which is the OR of one or more of GENX_XML_CHAR, GENX_LETTER, and GENX_NAMECHAR.
int genxScrubText(genxWriter w, constUtf8 in, utf8 out);
Copies the zero-terminated text from in to out, removing any bytes which are not well-formed UTF-8 or which represent characters that are not legal in XML 1.0. The output length can never be greater than the input length. Returns a nonzero value if any changes were made while copying.
char * genxGetErrorMessage(genxWriter w, genxStatus status);
Returns an English string containing the error message corresponding to the provided genxStatus code.
char * genxLastErrorMessage(genxWriter w);
Returns an English string containing the error message corresponding to the last error Genx encountered.
char * genxGetVersion();
Returns a string representation of the current version of Genx.
For the package you are reading, it returns:
The design of Genx was substantially shaped by discussion in the XML-dev mailing list. Particular credit is due to John Cowan, David Tolpin, Rich Salz, Elliotte Rusty Harold, and Mark Lentczner; not that they or anyone but Tim Bray should be blamed for the inevitable infelicities and outright bugs herein.