
RE: [syndication] referencing a DTD

To: <syndication@yahoogroups.com>
Subject: RE: [syndication] referencing a DTD
From: "Leigh Dodds" <ldodds@ingenta.com>
Date: Wed, 14 Nov 2001 09:44:34 -0000
Importance: Normal
In-reply-to: <4.3.1.0.20011114100239.019f2810@127.0.0.1>

> -----Original Message-----
> From: Rod Davies [mailto:rdavies@orientpacific.com]
> Sent: 14 November 2001 03:08
> To: syndication@yahoogroups.com
> Subject: [syndication] referencing a DTD
>
[...]
> My question is, seeing Netscape does not have as much interest in RSS as
> previously is there another DTD we could use, which would be more
> permanent? Could anybody suggest some?
>
> We are not programmers or developers, just publishers, which
> might be very evident! We use 0.91

This is something that is worth discussing, and I'll throw in a little
history for good measure:

In the days of yore, when SGML had yet to beget XML, DTDs were
always required. If you had an SGML document, it had to have a
DTD. Otherwise your parser just spewed out a load of errors.

Obviously the SGML developers had to address this very issue: how
could they interchange SGML documents, or even move documents
from machine to machine and still have their DTD references be
correct?

Well, there are two ways to reference a DTD: using a Public Identifier,
or using a System Identifier.

A System Identifier is basically a file or URL reference to your
DTD. i.e. something that the system can retrieve for you. Easy.

A Public Identifier is simply a name, with special formatting rules, that
will uniquely identify your DTD. Here's an example of a public identifier:

"-//W3C//ENTITIES Latin 1 for XHTML//EN"

In this case it's an identifier for a set of entities. You can do that
as well as identify a full DTD.
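
To make the "special formatting rules" a bit more concrete, here is a
rough Python sketch that pulls the example identifier apart. The names
PublicId and parse_public_id are my own inventions for illustration, and
the sketch assumes the common four-field layout (prefix, owner, class plus
description, language). It does not handle ISO-owned identifiers, which
lay out their fields differently.

```python
from typing import NamedTuple


class PublicId(NamedTuple):
    """Fields of a public identifier (hypothetical helper type)."""
    registered: bool   # "+" prefix means a registered owner, "-" unregistered
    owner: str         # who maintains the thing being named
    text_class: str    # what kind of thing it is, e.g. DTD or ENTITIES
    description: str   # free-text name
    language: str      # language code of the text, e.g. EN


def parse_public_id(fpi: str) -> PublicId:
    """Split an identifier shaped like  -//Owner//CLASS Description//LANG."""
    prefix, owner, text, language = fpi.split("//")
    # The first word of the third field is the class, the rest is the name.
    text_class, _, description = text.partition(" ")
    return PublicId(prefix == "+", owner, text_class, description, language)


parsed = parse_public_id("-//W3C//ENTITIES Latin 1 for XHTML//EN")
print(parsed.owner, "|", parsed.text_class, "|", parsed.description)
```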

Of course, just knowing the name of your DTD isn't much good.
You need to know where it lives. For this SGML developers used
something called a Catalog. A Catalog is basically a lookup table
that says "The DTD with this Public Identifier...can be found here...".
SGML parsers then read these Catalogs to find DTDs.
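
Stripped of its file format, a Catalog is little more than a dictionary.
The following Python sketch shows the idea only. The local path is made
up, and CATALOG and locate are hypothetical names, not part of any real
SGML tool.

```python
# A catalog reduced to its essence: public identifier -> local location.
# The path on the right-hand side is an invented example.
CATALOG = {
    "-//W3C//ENTITIES Latin 1 for XHTML//EN":
        "/usr/local/share/xml/xhtml-lat1.ent",
}


def locate(public_id):
    """Return where this machine keeps the named resource, or None."""
    return CATALOG.get(public_id)


print(locate("-//W3C//ENTITIES Latin 1 for XHTML//EN"))
print(locate("-//Example//DTD Not In The Catalog//EN"))
```

Each machine keeps its own table, so the same document can be processed
anywhere without editing it.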

Public Identifiers are therefore far more useful because you can
name your DTD, and then let the person processing your SGML file
worry about where they want to keep their DTDs (they maintain their
own catalogs).

Moving forward in our history lesson, XML was born from SGML. Part
of XML's appeal was the removal of the need for DTDs. You can
still use them if you wish, but you can also have your XML be
wild and free and live naked without a DTD.

XML also didn't include Catalogs as part of the core standard. I have
no idea why. However, they *did* retain Public Identifiers. A bit odd
to keep a basic feature, and then throw away a standard means of
using it, but there we go.

So now, the primary way of referencing a DTD in XML is by using a
System Identifier:

e.g.:

<!DOCTYPE rss SYSTEM "http://my.netscape.com/publish/formats/rss-0.91.dtd">

This is obviously fragile if the DTD moves, or the URL isn't addressable.
If your System Identifier is actually a file, then it's potentially even
more fragile (particularly for a syndication format - after all, who has
access to your file-system?)

The good news is that you can still use a Public Identifier, as follows:

<!DOCTYPE rss PUBLIC
	"-//Netscape Communications//DTD RSS 0.91//EN"
	"http://my.netscape.com/publish/formats/rss-0.91.dtd">

You must still provide a System Identifier as well, because (remember)
XML parsers aren't required to understand Public Identifiers (D'oh!), so
you have to provide a fallback position in case the user's parser doesn't
understand them.
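
If you want to check what a parser actually sees in a declaration like
this, here is a small Python sketch using the standard library's
xml.dom.minidom. The one-element feed is a stand-in I made up so the
example is complete. As far as I know, minidom records both identifiers
without fetching the DTD.

```python
from xml.dom import minidom

# A minimal stand-in document carrying both identifiers.
FEED = """<?xml version="1.0"?>
<!DOCTYPE rss PUBLIC
    "-//Netscape Communications//DTD RSS 0.91//EN"
    "http://my.netscape.com/publish/formats/rss-0.91.dtd">
<rss version="0.91"/>
"""

doctype = minidom.parseString(FEED).doctype

print(doctype.name)      # the root element named in the declaration
print(doctype.publicId)  # the name a Catalog would look up
print(doctype.systemId)  # the fallback location
```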

Because Catalogs are unequivocally a Good Thing, a number of people
have been working on a specification for XML Catalogs [1] to allow
the benefits of Public Identifiers to be gained. A lot of code is now
available to allow parsers to be extended by 'plugging in' support for
Catalogs (it's relatively easy to do). So the situation is improving.

Now, to answer your question!

The Right Thing, in my opinion, is:

- for authors of DTDs and schemas to assign them Public Identifiers
- for authors/publishers of XML documents to reference the Public Identifier
  AND a suitable System Identifier (choose your own definition of 'suitable')
- for consumers of these documents to begin using Catalogs.

This will remove the fragility in the current situation. A publisher, using
a Public Identifier, can be happy in the knowledge that they are providing
all the required pieces of information for their document to be validated.

A programmer can be happy in the knowledge that with a quick update
to their Catalogs, they can begin using a local copy of a DTD and avoid
network overheads, downtime, etc. Also, because a Catalog can map
one System Identifier to another System Identifier (e.g. a URL to
a local file) they can be even happier in the knowledge that they
don't have to wait for the publishers to catch up and begin using
Public Identifiers.
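
Here is a rough sketch of what that looks like from the programmer's
side, in Python. Treat the details as assumptions: the namespace and the
public/system element names follow the OASIS XML Catalogs format as I
understand it (the draft at [1] may differ), the local file URI is
invented, and load_catalog and resolve are hypothetical helpers. The
lookup order is simplified to: system entries, then public entries, then
whatever System Identifier the document supplied.

```python
import xml.etree.ElementTree as ET

# Namespace assumed from the OASIS XML Catalogs format.
NS = "{urn:oasis:names:tc:entity:xmlns:xml:catalog}"

# Both entries point at the same (invented) local copy of the DTD.
CATALOG_XML = """<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <public publicId="-//Netscape Communications//DTD RSS 0.91//EN"
          uri="file:///usr/local/share/dtd/rss-0.91.dtd"/>
  <system systemId="http://my.netscape.com/publish/formats/rss-0.91.dtd"
          uri="file:///usr/local/share/dtd/rss-0.91.dtd"/>
</catalog>
"""


def load_catalog(text):
    """Return (public, system) lookup tables read from a catalog document."""
    root = ET.fromstring(text)
    public = {e.get("publicId"): e.get("uri") for e in root.iter(NS + "public")}
    system = {e.get("systemId"): e.get("uri") for e in root.iter(NS + "system")}
    return public, system


def resolve(public_id, system_id, public, system):
    """Pick a location: system entry, then public entry, then the original."""
    if system_id in system:
        return system[system_id]
    if public_id in public:
        return public[public_id]
    return system_id  # no catalog entry, so fall back to the document's own


public, system = load_catalog(CATALOG_XML)

# A publisher who only supplies a System Identifier still gets remapped.
print(resolve(None, "http://my.netscape.com/publish/formats/rss-0.91.dtd",
              public, system))
# An identifier the catalog has never heard of passes through untouched.
print(resolve(None, "http://example.org/other.dtd", public, system))
```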

One might counter the points raised here by saying: "can't we just
have multiple copies of the DTD?". Well, you can. But you're still
suffering unnecessary network overheads, you still might get
network time-outs, and can you trust all copies of that DTD to
be updated in step? Having a copy of your DTD on a webserver
somewhere is a good thing, but I'd say that Catalogs are the better option.

Sorry for the long rant, hope this helps.

[1]. http://www.oasis-open.org/committees/entity/spec.html


--
Leigh Dodds, Research Group, Ingenta | "Pluralitas non est ponenda
http://weblogs.userland.com/eclectic |    sine necessitate"
http://www.xml.com/pub/xmldeviant    |     -- William of Ockham

References:
- referencing a DTD
  - From: Rod Davies <rdavies@orientpacific.com>
