
Re: [syndication] XML Character encoding (again)



In <oAvFgWQg$Yn+EAs2@jblaptop.voidstar.com>, Julian Bond <julian_bond@voidstar.com> writes:
> I'm really tempted to just say "tough". If you don't like the character
> put in a "?". If your parser barfs on my feed, well don't read it.
> Programming hours are too short to start figuring out client browser
> capability, UTF-8 conversion from arbitrary encodings and so on.

For a browser-based application, if the page is UTF-8 encoded, you have a pretty
good chance of receiving UTF-8 encoded text in your application. You can then
transcode the UTF-8 to whatever character set you like.
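
As a rough illustration of that transcoding step, here is a minimal Python
sketch. The function name transcode_utf8 and the choice of ISO-8859-1 as the
default target are assumptions made for the example, not something the
application in question is known to use.

```python
def transcode_utf8(data: bytes, target: str = "iso-8859-1") -> bytes:
    """Decode UTF-8 bytes and re-encode them in the target character set.

    Characters that do not exist in the target are replaced with "?",
    which is the lossy fallback mentioned earlier in the thread.
    """
    text = data.decode("utf-8")
    return text.encode(target, errors="replace")


# "caf\u00e9" is "cafe" with an acute accent; "\u20ac" is the euro sign,
# which ISO-8859-1 does not contain.
latin1_bytes = transcode_utf8("caf\u00e9".encode("utf-8"))
lossy_bytes = transcode_utf8("\u20ac".encode("utf-8"))
```

If the incoming bytes are not valid UTF-8 after all, the decode call raises
UnicodeDecodeError, so real code would need to decide what to do in that case.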

In theory, RSS feeds in UTF-8 should be perfectly fine. Unfortunately, some RSS
readers cannot handle UTF-8 encoded text, so if the majority of your content is
ISO-8859-1, that may be your preferred choice. For browser-based RSS readers
that ignore the encoding completely, the best option is to actually use
ISO-8859-1 encoding, with numeric character references (such as &#8364;, which
name Unicode code points) for characters not available in the ISO-8859-1
character set. As long as the readers just pass the references through, the
browser will render the content correctly.
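
One way to produce such output, sketched in Python: the standard
"xmlcharrefreplace" error handler emits a decimal numeric character reference
for every character the target encoding cannot represent. The helper name
to_latin1_with_refs is made up for this example, and the sketch assumes the
text has already been XML-escaped (&, <, >) where needed.

```python
def to_latin1_with_refs(text: str) -> bytes:
    """Encode text as ISO-8859-1, using numeric character references
    for anything outside that character set."""
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


# The accented e fits in ISO-8859-1 and stays a single byte;
# the euro sign (U+20AC, decimal 8364) becomes a reference.
feed_bytes = to_latin1_with_refs("caf\u00e9 costs 3 \u20ac")
```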

> I'm genuinely puzzled that a CDATA block isn't enough to protect the
> text byte stream from aggressive parsers.

CDATA only stops the data from being parsed as markup. The parser still needs to
figure out where the CDATA section stops, which requires knowledge of the
encoding. In some encodings, Â]]> could be one character, represented by four
bytes. If the parser does not know how to consume that character, it would
wrongly assume the CDATA section ended.
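
A small Python sketch of the same effect, using a variation on the case above:
in Shift_JIS, the second byte of a two-byte character can have the same value
as an ASCII "]". The sample text and both helper functions are invented for the
illustration; a real XML parser is of course more involved.

```python
OPEN = "<![CDATA["
CLOSE = "]]>"

# U+30BE (katakana ZO) followed by "]>". As characters, this does not
# contain the CDATA terminator "]]>".
payload = "\u30be]> still inside"
doc = (OPEN + payload + CLOSE).encode("shift_jis")


def cdata_naive(raw: bytes) -> bytes:
    """Scan the raw bytes for the terminator, ignoring the encoding."""
    start = raw.index(OPEN.encode("ascii")) + len(OPEN)
    return raw[start:raw.index(CLOSE.encode("ascii"), start)]


def cdata_decoded(raw: bytes, encoding: str) -> str:
    """Decode first, then scan for the terminator as characters."""
    text = raw.decode(encoding)
    start = text.index(OPEN) + len(OPEN)
    return text[start:text.index(CLOSE, start)]


naive_result = cdata_naive(doc)
decoded_result = cdata_decoded(doc, "shift_jis")
```

The byte-oriented scan stops in the middle of the first character, because
that character's trailing byte plus the following "]>" looks like the
terminator. The scan over decoded text finds the real end of the section.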

-- 
Klaus Johannes Rusch
KlausRusch@atmedia.net
http://www.atmedia.net/KlausRusch/