Re: O'Reilly's "Content Syndication with XML and RSS"
> > I think that it needs to be strongly worded that if you want to use RSS,
> > use an XML parser. Do not try to parse RSS by hand.
>
> Just out of curiosity - what is wrong with parsing RSS/RDF by hand?
> There is nothing particularly difficult or complex about parsing XML by hand,
> particularly RSS - one of the simpler formats. My own personal tests have
> shown several different "hand parsers" run faster than several premade
> ones - on a nanosecond level anyway.
What's "wrong" with not using an XML parser is you become overly susceptible to
variations in the stream. Almost as bad as the old days of using csv, tabbed or
placement-specific data streams. Should something "new" come along, like an
attribute added to an element, hand parsing is likely to fail. For example
(hypothetically) going from <item> to <item id="xyx123"> would break most hand
parsers. A legitimate XML parser would keep right on chugging along and ignore
that id element unless you added a clause to handle it. A hand parser wouldn't
see the <item> and freak. But regexp'ing for <item*> may work but does nothing
to support more flexible attributes. Using event parsing really works nicely
here.
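To make that concrete, here's a quick sketch in Python using the expat library
mentioned below (the feed snippets are invented for illustration). The event
handler never even notices the new attribute unless it asks for it:

    import xml.parsers.expat

    feed_old = '<channel><item><title>Hello</title></item></channel>'
    feed_new = '<channel><item id="xyx123"><title>Hello</title></item></channel>'

    def start_element(name, attrs):
        # Attributes arrive as a separate dict; an unexpected
        # id="xyx123" is simply there to be ignored or used.
        if name == 'item':
            print('got an item, attrs:', attrs)

    for doc in (feed_old, feed_new):
        p = xml.parsers.expat.ParserCreate()
        p.StartElementHandler = start_element
        p.Parse(doc, True)

Both feeds parse identically; a regexp hunting for a literal <item> only
survives the first one.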
I can confess, however, to cheating and using a Perl routine to slog through
the dmoz RDF export. The dataset was too large for the expat parser used inside
PHP to handle as quickly as I needed. But I ended up wasting a LOT of debugging
time handling special-case screw-ups in the exported text. The parser-based
code, which I re-ran later, worked through the file with NO issues. It just
took longer.
It's not always a matter of saving time "now" that's important. It is, of
course, important to be as efficient as possible. But not if it means having to
revisit code some weeks/months/years from now because the XML format has
changed slightly. Better to have used an XML parser and be able to swap out
just that piece, instead of slogging through ancient, hastily written code full
of so-called speed tweaks. I speak from personal experience here.
In looking at some of the code behind portals like PHP-Nuke and PostNuke, I
shudder at their pseudo-compliance with XML.
As for character encoding, there's nothing wrong with using proper XML
encoding. What IS wrong is the ambiguity about what's supposed to be included
within given elements. Is it just text? Is it HTML? What charset does it use?
Does it use the same charset as the entire file, or is this element using its
own? These are also really good reasons to use an XML parser instead of
hand-parsing. Better to tell your parser you want to see the contents of
element X in charset Y and to fault otherwise. Trying to regexp your way out of
it will drive you insane, not to mention being undoubtedly less efficient over
time.
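As a rough sketch of what the parser buys you here (the snippet is invented):
expat reads the encoding declaration and hands your handler properly decoded
text, whatever bytes were actually on the wire. A regexp over the raw byte
stream would see the 0xE9 byte, not the character:

    import xml.parsers.expat

    doc = '<?xml version="1.0" encoding="iso-8859-1"?><title>caf\xe9</title>'
    data = doc.encode('iso-8859-1')   # the bytes actually on the wire

    def char_data(text):
        print(repr(text))   # 'café' -- already decoded by the parser

    p = xml.parsers.expat.ParserCreate()
    p.CharacterDataHandler = char_data
    p.Parse(data, True)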
As for encoding, the world doesn't write only in single-byte text. Putting
effort into supporting multibyte encodings a la UTF-8 and UTF-16 will help
widen the audience of users. XML and decent parsers support this NOW. Using
them, while it may take a performance hit, opens a lot of doors that
hand-parsing may not handle properly. Yes, there's bloat and various hassles in
handling UTF-8 text. But when you consider opening the audience from a billion
or so single-byte-language speakers to the entire 4+ billion members of the
world population, it might be worth the effort.
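For what it's worth, the same few lines of parser code handle UTF-16 without
any changes, while a byte-oriented regexp would trip over the interleaved NUL
bytes. Again, a made-up snippet just to show the parser doing the work:

    import xml.parsers.expat

    # Japanese text, encoded as UTF-16 (with a byte-order mark)
    doc = '<?xml version="1.0" encoding="utf-16"?><title>\u65e5\u672c\u8a9e</title>'
    data = doc.encode('utf-16')

    def char_data(text):
        print(text)   # decoded correctly, no special handling needed

    p = xml.parsers.expat.ParserCreate()
    p.CharacterDataHandler = char_data
    p.Parse(data, True)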
-Bill Kearney