Re: [syndication] Re: syndication and i18n
On Tue, May 22, 2001 at 05:01:31PM +0100, hpyle@agora.co.uk wrote:
>
> My take: use a decent XML parser and you'll have all the parse-side
> encoding issues completely handled for you, and your Python code will just
> see Unicode. It might mean you end up with a stricter aggregator than
> some (eg. you won't be able to accept <item>stuff<img
> src="something"></item> because it's badly formed), but IMHO that's not a
> bad thing.
That's what I'm already doing; unfortunately, it's not that easy in
practice, because Unicode handling (in Python, at least) isn't that
transparent. For example, there are non-ASCII characters in both the
Standard's and the W3C's RSS feeds right now, which cause Python to
raise an error unless I .encode('utf-8') them into strings.
The parse side isn't a problem; doing something with the output
is.
For those interested in the minute details...
In the W3C feed, the source HTML (the home page) is charset=us-ascii,
and the offending bit of markup is encoded:
Philippe Le Hégaret
which renders fine in Mozilla.
In the XML RSS file, the XML has an encoding of 'utf-8', and the
offending markup is:
Philippe Le Hégaret
So, PyXML will spit this out as unicode. If I try to print that to
anything, or combine it with other strings in certain ways, I get
UnicodeError: ASCII encoding error: ordinal not in range(128)
unless I .encode('utf-8') it, in which case I get something that
prints in ascii as
Philippe Le HÃ©garet
which seems to render correctly, as long as I set the charset to
utf-8. Fine.
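The W3C case can be reproduced with a minimal sketch. It assumes Python 3, where `str` is Unicode text and `bytes` is encoded data, and it uses the standard library's `xml.etree.ElementTree` as a stand-in for PyXML. The one-element `<title>` document is a hypothetical stand-in for the real feed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-element document; the real W3C feed is not reproduced here.
xml_text = ('<?xml version="1.0" encoding="utf-8"?>'
            '<title>Philippe Le H\u00e9garet</title>')
xml_bytes = xml_text.encode("utf-8")

# The parser honours the declared encoding and hands back Unicode text.
text = ET.fromstring(xml_bytes).text

# Pushing that text through an ASCII-only codec fails on the e-acute.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# An explicit UTF-8 encode works and yields two bytes for the e-acute.
utf8_bytes = text.encode("utf-8")

# Viewing those bytes as Latin-1 shows two characters in place of the e-acute,
# which is what an ASCII or Latin-1 terminal would display.
as_latin1 = utf8_bytes.decode("latin-1")
```

The bytes are correct UTF-8; they only look wrong when something downstream reads them with a different charset.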
The Standard's feed has encoding="ISO-8859-1". The offending markup
is
Net 21 <96> The Survivors
which, as a Python unicode string, looks like
u'Net 21 \x96 The Survivors'
If I .encode('utf-8') it, I get
'Net 21 \xc2\x96 The Survivors'
which doesn't look correct at all (it's supposed to be an en
dash) when rendered in Mozilla with utf-8. If I change the charset to
8859-1, the original renders correctly, but the UTF-8-encoded
string does not (it has an extra character prepended, understandably).
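The extra byte can be reproduced with a short sketch, again assuming Python 3. In ISO-8859-1, byte 0x96 maps to U+0096, a C1 control character, whereas in Windows-1252 the same byte is an en dash (U+2013). One possible explanation, which is an assumption and not something confirmed in this thread, is that the feed is really Windows-1252 labelled as ISO-8859-1.

```python
# The raw bytes as they appear in the feed (0x96 between the two phrases).
raw = b"Net 21 \x96 The Survivors"

# Decoding as declared gives U+0096, a control character, not a dash.
as_declared = raw.decode("iso-8859-1")
reencoded = as_declared.encode("utf-8")   # 0x96 becomes the pair 0xC2 0x96

# ASSUMPTION: the feed is actually Windows-1252. Decoding that way gives
# U+2013 (EN DASH), which then encodes to three UTF-8 bytes.
as_cp1252 = raw.decode("cp1252")
fixed = as_cp1252.encode("utf-8")
```

If the assumption holds, browsers render the original correctly under an 8859-1 charset only because they commonly treat that label as Windows-1252.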
I think the root of the problem is that I have no apparent way to
determine the encoding of a unicode string coming out of the XML
parser, or a way to consolidate several different encodings into one
document (although I thought this was what unicode was supposed to
enable).
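On the consolidation point, a hedged sketch (Python 3 assumed): once text has been decoded it no longer carries an encoding, so encodings only matter at the input and output edges. The `sources` list and its encoding labels are illustrative, and the `cp1252` label for the second item is an assumption about what the Standard's feed really uses.

```python
# Each item pairs raw bytes with the encoding used to decode them.
sources = [
    (b"Philippe Le H\xc3\xa9garet", "utf-8"),
    (b"Net 21 \x96 The Survivors", "cp1252"),  # ASSUMED actual encoding
]

# Decode every input with its own encoding; the results are plain text.
items = [raw.decode(encoding) for raw, encoding in sources]

# Combine as text, then encode exactly once for the output document.
merged = "\n".join(items)
page = merged.encode("utf-8")
```

The output would then be served with charset=utf-8, regardless of what each source feed declared.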
I should probably take this to the Python XML group...
--
Mark Nottingham
http://www.mnot.net/