Re: [syndication] XML Character encoding (again)
Sounds like it might be a parser bug and not your bug.
Try UTF-16 long enough to see if the problem goes away. Or latin-1 to
see if the foreign characters are successfully trashed.
Doug
Julian Bond wrote:
I feel like I should have solved this years ago. But it's still causing
me trouble.
I have a situation where a motley crew of users are using all sorts of
tools to enter blog text. Which means that they cut and paste in £ pound
signs, MS Smart quotes and the occasional foreign character (even
Euros). This text can also contain embedded html. I get this as POST
data and store it in a database. Later I read this out into the
<description> section of RSS. This is all done with PHP code.
At the moment, I'm wrapping this in a CDATA section with the whole XML
block using UTF-8 encoding. This appears to break. It's invalid in Mark
Pilgrim's validator. IE6 complains about "An invalid character was found
in text content". People tell me that other validating XML parsers
complain as well, including the one used by Livejournal. Which is
puzzling when Mark's help text advises this as a technique. I thought a
CDATA block would protect against this, and that's presumably why MT
started using it. But looking at the W3C comments on CDATA, it only
stops unescaped XML special characters from being treated as markup. It
doesn't appear to protect against bad character encoding.
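
To make the CDATA point concrete, here is a minimal sketch in Python (used instead of PHP only so it runs standalone). The sample text, the helper names, and the guess that the pasted text arrived as Windows-1252 bytes are all assumptions, not something confirmed above. It wraps the bytes in CDATA under a UTF-8 declaration and checks whether a parser accepts the result before and after transcoding.

```python
import xml.etree.ElementTree as ET

# Assumption: pasted text arrived as Windows-1252 bytes
# (pound sign plus MS smart quotes).
raw = "Cost: \u00a35 \u201cquoted\u201d".encode("cp1252")

def build_feed(payload: bytes) -> bytes:
    # Hypothetical helper: wrap the payload in CDATA inside a
    # document that declares itself to be UTF-8.
    return (b'<?xml version="1.0" encoding="UTF-8"?>'
            b"<description><![CDATA[" + payload + b"]]></description>")

def parses(doc: bytes) -> bool:
    # True if the XML parser accepts the document as well-formed.
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

# Raw Windows-1252 bytes are not valid UTF-8, CDATA or not.
broken = parses(build_feed(raw))

# Decoding from the real source encoding and re-encoding as UTF-8
# gives the parser bytes that match the declaration.
fixed = parses(build_feed(raw.decode("cp1252").encode("utf-8")))
```

If the guess about the source encoding is wrong, the transcoding step will produce the wrong characters, so the real fix depends on knowing what the browser actually sent.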
Previously, I've used UTF-8 with no CDATA but using the
htmlspecialchars() function in PHP to escape the five reserved XML
characters. This is otherwise fine, and deals with embedded HTML but
still fails with some invalid characters.
I've also tried using PHP's htmlentities() function to encode the text
and an ENTITY statement pointing at
http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent. This contains the same
entity translations as the original Netscape RSS DTD. Again this almost
works, but some PHP translations are slightly different. In particular
&apos; is missing in PHP and there may be others.
I've also tried alternate character sets, including more MS-friendly
ones, but some Mac and Linux users have managed to enter data that
breaks those (I think).
Next? Do I have to convert all high-order characters into numeric
character references? What will parsers make of this? Especially the
ultra-liberal Regex ones? (like mine...)
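
For what the numeric-reference route might look like, here is a rough Python sketch (again Python only so it can be run as-is; the function name and sample string are made up for illustration). It escapes the markup characters first, then turns everything outside ASCII into decimal character references, so the output is pure ASCII whatever encoding the feed declares.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

def to_ascii_xml(text: str) -> str:
    # Hypothetical helper. Escape &, < and > first, so the ampersands
    # introduced by the numeric references are not escaped again.
    escaped = escape(text)
    # Every non-ASCII character becomes a decimal reference (&#NNN;).
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")

sample = "\u00a35 & \u20ac10 <b>"
encoded = to_ascii_xml(sample)

# A conforming parser turns the references back into the original text.
round_trip = ET.fromstring("<description>" + encoded + "</description>").text
```

This assumes the text is already a correctly decoded string. If the bytes were decoded with the wrong character set, the references will faithfully encode the wrong characters.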
I really thought that UTF-8 would just treat single byte characters as
single bytes and not complain. But I'm not looking at the wire to see
what PHP and MySQL are actually passing.
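
A quick way to check the single-byte assumption is to look at the bytes directly. In UTF-8 only code points below 128 are single bytes; anything above that takes two or more. The Python sketch below (the helper name is an assumption for illustration) compares the pound sign in both encodings and tests whether a byte string is valid UTF-8.

```python
pound = "\u00a3"  # the pound sign

utf8_bytes = pound.encode("utf-8")      # two bytes in UTF-8
latin1_bytes = pound.encode("latin-1")  # one byte in Latin-1

def is_valid_utf8(data: bytes) -> bool:
    # Hypothetical helper: True if the bytes decode cleanly as UTF-8.
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# The lone Latin-1 byte is a UTF-8 continuation byte with no lead byte,
# so a strict UTF-8 decoder rejects it.
latin1_ok = is_valid_utf8(latin1_bytes)
ascii_ok = is_valid_utf8(b"plain ASCII text")
```

Running the same kind of check on what actually comes out of the database would show whether the stored bytes match the encoding the feed declares.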
There has to be a way to put arbitrary bytes into a defined block within
the <description> element without having to explicitly encode each one.
Hasn't there?
Aaaargh! If anyone has a real answer to this, I'd really appreciate a
fairly detailed recipe. I suspect I'm not alone.