A representation is data that represents the state of a resource. It consists of:
Web agents may use representations to modify as well as read resource state.
The Web can be used to interchange resource representations in any format. This is a good thing, since there is continuing progress in the development of new data formats for new applications and the refinement of existing ones.
Clearly, for a format to be usefully interoperable between two parties, they must have a shared understanding of its syntax and semantics. This is not to imply that a sender of data can count on constraining its treatment by a receiver; simply that making good use of electronic data usually requires knowledge of its designers' intentions.
For a format to be widely interoperable across the Web, the following must obtain:
It should be noted that the invention of new data formats is expensive, and the Web-wide deployment of software able to handle them is immensely expensive. Thus, before inventing a new data format, careful consideration should be given to re-using one that is already available. For example, if a format is required to contain human-readable text with embedded hyperlinks, it is almost certainly better to use HTML for this purpose than to invent a new format.
As noted above, the utility of data formats depends on an accessible normative specification. Some of the desirable characteristics of these specifications include:
This section discusses important characteristics of data formats which can together be used to describe and understand them.
A textual data format is one in which the data is specified as a linear sequence of characters. HTML, Internet e-mail, and all XML-based languages are textual. In modern textual data formats, the characters are usually taken from the Unicode repertoire.
Binary data formats are those in which portions of the data are encoded for direct use by computer processors, for example thirty-two-bit little-endian two's-complement integers and sixty-four-bit IEEE double-precision floating-point numbers. The portions of data so represented include numeric values, pointers, and compressed data of all sorts.
In principle, all data can be represented using textual formats.
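As a minimal sketch of the distinction, the same two values can be written in the binary encodings named above and as plain text. The values and the variable names are illustrative assumptions, not part of any particular format.

```python
import struct

# "<" selects little-endian byte order with standard sizes and no padding.
# "i" is a 32-bit two's-complement integer; "d" is a 64-bit IEEE 754 double.
record = struct.pack("<id", -2, 1.5)

print(len(record))    # 12 bytes: 4 for the integer, 8 for the double
print(record.hex())   # feffffff000000000000f83f

# A reader must know the exact layout in advance to recover the values.
count, ratio = struct.unpack("<id", record)

# The same information as text: readable in any text editor.
textual = "-2 1.5".encode("ascii")
print(textual)        # b'-2 1.5'
```

The binary record can be used with almost no conversion but is opaque without its layout; the textual form is self-evident to a human reader.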
The trade-offs between binary and textual data formats are complex and application-dependent. Binary formats can be substantially more compact, particularly for complex pointer-rich data structures. Also, they can be consumed more rapidly by software in those cases where they can be loaded into memory and used with little or no conversion.
Textual formats are often more portable and interoperable, since there are fewer choices for representation of the basic units (characters), and those choices are well-understood and widely implemented.
Textual formats also have the considerable advantage that they can be directly read and understood by human beings. This can simplify the tasks of creating and maintaining processing software, and allow the direct intervention of humans in the processing chain without recourse to tools any more complex than the ubiquitous text editor. Finally, it simplifies the necessary human task of learning about new data formats.
All things being equal (a rare state of affairs), textual formats are generally preferable to binary ones in Web applications.
It is important to emphasize that intuition as to such matters as data size and processing speed is not a reliable guide in data format design; quantitative studies are essential to a correct understanding of the trade-offs.
Final-form data formats are not designed to allow modification or uses other than those intended by their designers. An example would be PDF, which is designed to support the presentation of page images on either screen or paper, and is not readily used in any other way. XML Flow Objects share this characteristic.
XHTML, on the other hand, can be and is put to a variety of uses including direct display (with highly flexible display semantics), processing by network-sensitive Web spiders to support search and retrieval operations, and reprocessing into a variety of derivative forms.
In general, XML-based data formats are more re-usable and repurposable than the alternatives, although the example of XML-FO shows that this is not an absolute.
There are many cases where final-form is an application requirement; representations which embody legally-binding transactions are an obvious example. In such cases, the use of digital signatures may be appropriate to achieve immutability, whether the format is naturally final-form or some XML vocabulary.
On the other hand, where such requirements are not in play, representations that are reusable and repurposable are in general higher in value, particularly in the case where the information's utility may be long-lived.
Some data formats are explicitly designed to be used in combination with others, while some are designed for standalone use. An example of a standalone data format is PDF; it is typically neither embedded in representations encoded in other formats nor is data in other formats generally embeddable in it.
At the other extreme is SOAP, which is designed explicitly to contain a "payload" in some non-SOAP vocabulary. Another example is SVG, which is designed to be included in compound documents, and which may in turn contain information encoded in other XML vocabularies.
This characteristic is related to, but distinct from, the final-form/reusable distinction discussed above. For example, one can certainly imagine cases where it is useful for a representation to include data in multiple different formats, but be considered immutable and display-only.
In many cases, the information contained in a representation is logically separable from the choice of ways in which it may be presented to a human, and the modes of interaction it may support.
While such separation is, where possible, often advantageous, it is clearly not always possible and in some cases not desirable either.
More incoming from C. Lilley
The Web's vast network of hyperlinks is one of its defining characteristics, and resource representations are thus commonly required to contain embedded links to other resources.
This section assumes that the other resources identified by hyperlinks are represented by URI references, a basic requirement of Web Architecture. There are, however, many syntactic options available for embedding such URI-based hyperlinks in resource representations.
More incoming from N. Walsh
Many resource representations are encoded in formats which are XML vocabularies. This section discusses issues that are specific to such data formats.
Anyone seeking guidance in this area is urged to consult the IETF Best Common Practice guidelines for the use of XML in Internet Protocols. This document contains a very thorough discussion of the considerations that govern whether or not XML ought to be used, as well as specific guidelines on how it ought to be used. While it is directed at Internet applications with specific reference to protocols, the discussion is generally applicable to Web scenarios as well.
The discussion here should be seen as ancillary to the content of the IETF BCP.
XML defines textual data formats that are naturally suited to describing data objects which are hierarchical and processed in an in-order sequence. It is widely but not universally applicable for format specifications. For example, an audio or video format is unlikely to be well suited to representation in XML. Design constraints that would suggest the use of XML include:
It is often desired to place the markup in an XML vocabulary in one or more namespaces with names which by definition are URIs. These namespace names SHOULD be usable for retrieval of human-readable material aimed at meeting the needs of those who are going to be using the markup vocabulary. The simplest way to achieve this is for the namespace name to be an HTTP URI which may be dereferenced to access this material. The resource identified by such a URI is called a "namespace document".
Ideally, a namespace document ought to be usable in support of automatic retrieval of other Web resources useful in support of processing markup from this vocabulary. Such resources could include stylesheets, schemas, and executable code.
RDDL is a proposal under discussion in the community for a variant of XHTML optimized for the construction of namespace documents which meet the goals described in this section.
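The following sketch illustrates the namespace-name guidance above: it lists the namespace names used by a document and reports which of them are HTTP URIs and could therefore, in principle, be dereferenced to reach a namespace document. The sample document, its namespace names, and the helper functions are hypothetical; no retrieval is attempted.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Hypothetical document using one HTTP namespace name and one URN.
DOC = """<w:report xmlns:w="http://example.com/ns/weather"
                   xmlns:x="urn:example:extra">
  <w:city>Oaxaca</w:city>
  <x:note>hypothetical</x:note>
</w:report>"""

def namespace_names(xml_text):
    """Return the set of namespace names used by elements in xml_text."""
    names = set()
    for element in ET.fromstring(xml_text).iter():
        # ElementTree spells qualified tags as "{namespace-name}local-name".
        if element.tag.startswith("{"):
            names.add(element.tag[1:].split("}", 1)[0])
    return names

def is_http_uri(name):
    """True if the namespace name uses a scheme that HTTP clients can fetch."""
    return urlsplit(name).scheme in ("http", "https")

for name in sorted(namespace_names(DOC)):
    print(name, is_http_uri(name))
```

Only the first namespace name offers an obvious way to retrieve human-readable material; the URN identifies the vocabulary but gives a reader nowhere to look.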
Suppose that the URI http://example.com/oaxaca identifies a resource with representations encoded in XML. What, then, is the interpretation of the URI http://example.com/oaxaca#weather?
RFC 2396bis makes it clear that the interpretation depends on the media-type of the representation. It follows from this that designers of XML-based data formats SHOULD include the semantics of fragment identifiers in their designs. XPointer is a W3C Recommendation which provides a syntax designed for use in such fragment identifiers, and it SHOULD be used for this purpose.
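A short sketch of this dependency: the fragment can be split from the URI mechanically, but what it means has to be looked up per media-type. The lookup table and the function name below are hypothetical illustrations, not a registry defined by any specification.

```python
from urllib.parse import urldefrag

# Hypothetical table: what a fragment means for a given media-type.
# None records that no fragment semantics are defined for that type.
FRAGMENT_SEMANTICS = {
    "text/html": "the element named or identified by the fragment",
    "application/xml": None,
}

def interpret(uri, media_type):
    """Describe how the fragment of uri would be read for media_type."""
    resource_uri, fragment = urldefrag(uri)
    meaning = FRAGMENT_SEMANTICS.get(media_type)
    if not fragment:
        return f"{resource_uri}: no fragment"
    if meaning is None:
        return f"{resource_uri}: '{fragment}' has no defined meaning for {media_type}"
    return f"{resource_uri}: '{fragment}' selects {meaning}"

print(interpret("http://example.com/oaxaca#weather", "text/html"))
print(interpret("http://example.com/oaxaca#weather", "application/xml"))
```

The same URI yields two different answers, which is why a format's specification needs to state its fragment semantics explicitly.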
When a representation is provided whose media-type is application/xml, there are no semantics defined for fragment identifiers, and thus they SHOULD NOT be provided for such representations. This is also the case if the representation is known to be XML because the media type has a suffix of +xml as described in RFC 3023, but there is no normative specification of fragment semantics.
It is common practice to assume that when an element has an attribute that is declared in a DTD to be of type ID, then the fragment identifier #abc identifies the element which has an attribute of that type whose value is "abc". However, there is no normative support for this assumption and it is problematic in practice, since the only defined way to establish that an attribute is of type ID is via a DTD, which may not exist or may not be available.
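The difficulty can be sketched as follows. With no DTD available, a processor has to guess which attribute plays the role of an ID, and different guesses select different elements. The document, the attribute names, and the resolve_fragment helper are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with no DTD: nothing declares which attribute is an ID.
DOC = """<forecast>
  <day id="abc" ref="xyz">sunny</day>
  <day id="xyz" ref="abc">rain</day>
</forecast>"""

def resolve_fragment(xml_text, fragment, id_attribute):
    """Return the first element whose id_attribute equals fragment.

    id_attribute is the caller's guess; the document itself cannot
    confirm that the attribute is of type ID.
    """
    for element in ET.fromstring(xml_text).iter():
        if element.get(id_attribute) == fragment:
            return element
    return None

print(resolve_fragment(DOC, "abc", "id").text)    # sunny
print(resolve_fragment(DOC, "abc", "ref").text)   # rain
```

Both calls are plausible readings of #abc, and nothing in the representation says which one is right.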
RFC 3023 defines the media-types application/xml and text/xml, and describes a convention whereby XML-based data formats use media-types with a +xml suffix, for example image/svg+xml.
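One practical consequence of the suffix convention is that generic software can recognize an XML representation without knowing the specific format. A minimal sketch, with a hypothetical function name:

```python
def is_xml_media_type(media_type):
    """True for application/xml, text/xml, and any type with a +xml suffix."""
    # Drop parameters such as charset and normalize case before comparing.
    essence = media_type.split(";", 1)[0].strip().lower()
    return essence in ("application/xml", "text/xml") or essence.endswith("+xml")

print(is_xml_media_type("image/svg+xml"))                   # True
print(is_xml_media_type("application/xml; charset=utf-8"))  # True
print(is_xml_media_type("text/html"))                       # False
```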
In general, media-types beginning with text/ SHOULD NOT be used for XML representations.
They create two problems. First, intermediate agents in the Web are allowed to "transcode", i.e. convert one character encoding to another. Since XML documents are designed to be self-describing with respect to their character encoding, and since this is a good and widely-followed practice, any such transcoding will make the self-description false. Secondly, representations whose media-types begin with text/ are required, unless the charset parameter is specified, to be considered to be encoded in US-ASCII. In the case of XML, since it is self-describing, it is good practice to omit the charset parameter, and since XML is very often not encoded in US-ASCII, the use of text/ media-types effectively precludes this good practice.
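The second problem can be sketched in a few lines. The transport_charset function is a hypothetical helper that applies the rule stated above: an explicit charset parameter wins, a text/ type without one is treated as US-ASCII, and any other type leaves the decision to the document itself.

```python
from email.message import Message

def transport_charset(content_type):
    """Return the charset implied by a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")
    if charset is not None:
        return charset.lower()
    if msg.get_content_maintype() == "text":
        return "us-ascii"   # default imposed on text/ types
    return None             # defer to the XML encoding declaration

# A self-describing XML document that is not US-ASCII.
body = '<?xml version="1.0" encoding="utf-8"?><p>caf\u00e9</p>'.encode("utf-8")

for content_type in ("text/xml", "application/xml"):
    charset = transport_charset(content_type)
    if charset is None:
        print(content_type, "-> trust the encoding declaration")
        continue
    try:
        body.decode(charset)
        print(content_type, "-> decoded as", charset)
    except UnicodeDecodeError:
        print(content_type, "-> cannot be decoded as", charset)
```

Served as text/xml without a charset parameter, the document's own encoding declaration is overridden by the US-ASCII default and decoding fails; served as application/xml, the declaration is honored.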