
syndication and i18n



So, I've been spending spare moments here and there for a while now,
putting together a feed aggregator in Python. I've never written an
internationalized app before, and, wanting to do The Right Thing, I
thought I'd give it a try, especially seeing as how Python 2.x
supports Unicode.

I may have bitten off more than I can chew.

It seems that the permutations of:
 - source XML charset declaration,
 - actual character content of the XML, and
 - browser's desired charset
are overwhelming. 

Many feeds occasionally have characters that slip through unescaped,
such as Windows (cp1252) smart quotes.

Currently, my strategy is to .encode('utf-8') EVERYTHING that comes
in, and write that out (mix byte strings and Unicode strings
carelessly and Python raises UnicodeError). This works, but it
doesn't seem too friendly to double-byte feeds or their users, who I
assume would be out of luck.
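Concretely, the normalization step looks something like this (the
fallback chain is my own guess; windows-1252 is in there to catch
those stray smart quotes):

```python
def to_utf8(raw_bytes, declared_charset=None):
    """Decode feed bytes as best we can, then re-encode as UTF-8.

    The fallback order is an assumption: declared charset first,
    then UTF-8, then windows-1252 (which catches Windows quotes).
    """
    for charset in (declared_charset, "utf-8", "windows-1252"):
        if charset is None:
            continue
        try:
            return raw_bytes.decode(charset).encode("utf-8")
        except (UnicodeDecodeError, LookupError):
            continue
    # Last resort: replace undecodable bytes rather than blow up.
    return raw_bytes.decode("utf-8", "replace").encode("utf-8")
```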

Questions:
 - should I emit 'utf-8' in the appropriate HTTP headers to make
   browsers do the right thing?

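   To make that concrete, here's the sort of thing I have in mind
   (function name and framing are mine, just a sketch):

```python
def render_response(body):
    """Wrap a Unicode page body in a header that advertises UTF-8.

    A minimal sketch -- real code would go through CGI or a server
    framework, but the charset parameter is the part in question.
    """
    header = b"Content-Type: text/html; charset=utf-8\r\n\r\n"
    return header + body.encode("utf-8")
```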
 - In Python, are there ways to:
   - determine what encoding an XML document uses (from SAX)
   - determine what encoding an arbitrary string is in

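   For the first of those, the best I've come up with so far is
   sniffing the XML declaration by hand, since SAX doesn't seem to
   hand the encoding to you (regex and default are mine):

```python
import re

# Matches the encoding= pseudo-attribute of an XML declaration.
XML_DECL = re.compile(br'^<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']')

def sniff_xml_encoding(raw_bytes, default="utf-8"):
    """Pull the declared encoding out of the XML declaration, if any.

    Per the XML spec, an absent declaration means UTF-8 (or UTF-16
    with a BOM, which this sketch ignores).
    """
    m = XML_DECL.match(raw_bytes)
    return m.group(1).decode("ascii") if m else default
```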
 - Does the above strategy doom double-byte users?

 - How does one deal with creating an HTML page from XML feeds which 
   have potentially radically different charsets (e.g., ASCII and
   double-byte Chinese on the same page)?

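   My current theory on that one: if each feed is decoded to Unicode
   on the way in, mixing is just string concatenation, and a single
   .encode('utf-8') at the end covers both (titles below are made up):

```python
# Content from two feeds, already decoded to Unicode on the way in.
ascii_title = "Python News"          # from a plain-ASCII feed
chinese_title = "\u4e2d\u6587"       # from a Big5 feed, post-decode

# Once everything is Unicode, one page and one output encoding do.
page = "<li>%s</li><li>%s</li>" % (ascii_title, chinese_title)
html_bytes = page.encode("utf-8")
```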
 - Does anybody know of some Cantonese RSS feeds for testing? ;)

 - How does one catch and deal with illegal characters in the XML
   source (SAX2)?

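For what it's worth, the best I've managed on that last one is
trapping SAXParseException around the parse, so one bad feed doesn't
take the whole run down (the handler and the log-and-skip policy are
my own guesses at what "deal with" should mean):

```python
import xml.sax

class FeedHandler(xml.sax.handler.ContentHandler):
    """Toy handler that just collects character data."""
    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.text = []

    def characters(self, content):
        self.text.append(content)

def parse_feed(raw_bytes):
    """Return the feed's character data, or None if the XML is bad."""
    handler = FeedHandler()
    try:
        xml.sax.parseString(raw_bytes, handler)
    except xml.sax.SAXParseException:
        return None      # log-and-skip; this feed is broken
    return "".join(handler.text)
```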
Regards,

-- 
Mark Nottingham
http://www.mnot.net/