Re: [syndication] Blogger's Atom format(s)??



On Mon, 17 May 2004 08:22:59 -0400, Dave Winer <dave@userland.com> wrote:
> I'm seeing a fair amount of variety in how 
> entry-level content is expressed.
> 
> Some feeds have summary's, others have content. 
> 
> Some content is embedded in a div, others without 
> the div.

Look at it one way, and there are a whole bunch of content elements.
An entry can have a summary element, a content element, or both (or
multiple content elements of different types). Either sort of element
might have any MIME type, though for an entry-level aggregator you'll
probably only be looking for ones that are text/plain, text/html, or
application/xhtml+xml. Then, the mode might be escaped or (the
default) xml, or Ghu forbid, base64.

Look at it the way that's worked best for me so far (parse the XML to
an intermediate data structure, then pick out the parts I really
want), and you just have three questions: "what do I do to it?",
"what's it going to be?" and "how much of it?". If the mode is
escaped, then your XML parser (well, maybe not yours, but mine) will
have unescaped it already, and what you get is in the MIME type it
says. If the mode is xml (or there isn't a mode attribute), then the
inline XML is already that type: either you don't do any unescaping,
or, if your parser has already split it into separate elements and
content, you stitch it back together to get that type (they tell me
that makes life wonderful for people using things like XPath, and I
hope some day they'll actually show me rather than tell me). If it's
base64, you decode it and have the MIME type it claims. Then, if the
type isn't what suits you (they gave you text, you want HTML), convert
it to what you want (run it through your autoparagraphing routine),
and use whichever quantity, summary or content, suits your needs.
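
For what it's worth, here is a rough Python sketch of the "what do I
do to it?" step, using the standard library's ElementTree. The
function name decode_content is made up for illustration, the
defaults (type text/plain, mode xml) are how I read the Atom 0.3
draft, and treating base64 payloads as UTF-8 text is purely an
assumption of the sketch.

```python
import base64
import xml.etree.ElementTree as ET


def decode_content(elem):
    """Return (mime_type, payload) for a parsed summary or content element.

    Hypothetical helper, not part of any library.
    """
    mime_type = elem.get("type", "text/plain")  # assumed default type
    mode = elem.get("mode", "xml")              # xml is the default mode

    if mode == "escaped":
        # The XML parser has already turned &lt;p&gt; back into <p>.
        return mime_type, elem.text or ""

    if mode == "base64":
        # Assumption: the decoded bytes are UTF-8 text.
        raw = base64.b64decode(elem.text or "")
        return mime_type, raw.decode("utf-8")

    # mode == "xml": the parser split the inline markup into child
    # elements, so stitch it back together into one string.
    parts = [elem.text or ""]
    for child in elem:
        # tostring() includes the text that trails each child element.
        parts.append(ET.tostring(child, encoding="unicode"))
    return mime_type, "".join(parts)
```

One wrinkle: if the inline markup is in the XHTML namespace,
ElementTree will serialize it with a prefix and a namespace
declaration, so a real aggregator would still have some cleanup to do
at that point.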

You want as much content as you can get, in HTML? Look for a content
element; if there isn't one, settle for summary. If it's escaped,
unescape it; if not, just grab it. If it's text/plain, autoparagraph
it; if it's text/html, throw it in your source as-is after stripping
dangerous stuff; if it's application/xhtml+xml, maybe strip off the
namespace declaration on the outer element and do the same.
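
That recipe, sketched in Python against a made-up intermediate
structure: a dict whose "content" and "summary" values are
(MIME type, already-decoded text) pairs. All three function names are
hypothetical, and strip_dangerous here only drops script elements as
a stand-in; a real aggregator needs a much more thorough sanitizer.

```python
import html
import re


def autoparagraph(text):
    """Wrap blank-line-separated chunks of plain text in <p> elements."""
    chunks = [c.strip() for c in re.split(r"\n\s*\n", text) if c.strip()]
    return "\n".join("<p>%s</p>" % html.escape(c) for c in chunks)


def strip_dangerous(markup):
    """Placeholder sanitizer: removes script elements and nothing else."""
    return re.sub(r"(?is)<script\b.*?</script\s*>", "", markup)


def best_html(entry):
    """Prefer content, settle for summary, and hand back HTML."""
    chosen = entry.get("content") or entry.get("summary")
    if chosen is None:
        return ""
    mime_type, text = chosen
    if mime_type == "text/plain":
        return autoparagraph(text)
    if mime_type in ("text/html", "application/xhtml+xml"):
        # For XHTML you may also want to strip the namespace
        # declaration on the outer element; not done in this sketch.
        return strip_dangerous(text)
    return ""  # a type this sketch does not know how to present
```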

> If we're all going to have to parse Atom, as it 
> seems we will, wouldn't it make sense to try to get it to be one format, not 
> multiple formats?

My outsider's take on the format is that it's a producer's format, and
an end-consumer's format, more than an aggregator author's format. If
you have plain text you want to stick in a feed, you can do that. If
you have well-formed XHTML, you can do that, and what should come out
the other end of the pipe is however that can best be presented on
whatever you are using, cellphone, teletype, or Mozilla. In either
case, the only munging up it's going to get will be the aggregator
author's fault ;)


> Also, I'm making some judgement calls, for 
> example, about which date is the pub date. Are there any guidelines? Should 
> the creation date be considered the pub date? Or the modification date? Is there 
> the concept of a publication date? If not, why not?

I've never been quite sure what publication date really is, other than
for the New York Times. But, in theory there will always be an issued
and a modified, and issued is pretty much pubDate.
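
If it helps, picking the date could look something like the sketch
below. It assumes the dates have already been pulled into a dict of
strings in the usual W3C date-time form; the function names are made
up, and falling back to modified when issued is missing is my own
guess at a sensible policy, not something the draft tells you to do.

```python
from datetime import datetime


def parse_w3cdtf(value):
    """Parse a full date-time such as 2004-05-17T08:22:59-04:00."""
    value = value.strip()
    if value.endswith("Z"):
        # Older Pythons' fromisoformat() does not accept a trailing Z.
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def publication_date(entry):
    """Treat issued as the pub date; fall back to modified (assumption)."""
    for key in ("issued", "modified"):
        if entry.get(key):
            return parse_w3cdtf(entry[key])
    return None
```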

> Is there a list of design goals for Atom somewhere? 
> Is kindness to aggregator developers on the list? If not, it should 
> be.

Aggregator developers? Those people who won't let me use a
greater-than symbol in my titles, no matter what I do to it? There's
nothing for them but forty lashes, and a strict set of instructions to
behave themselves, and pay attention to what *I* say my content is,
not what they might guess I might have meant.

Phil Ringnalda