
Re: [syndication] XML Character encoding (again)




> The variety of input could be the cause of your problem. You've
> probably got some people pasting in latin-1 characters, others pasting
> in iso-8859-2 (central europe) etc. In this situation the bytes people
> are actually pasting represent different characters in different
> character sets. Without some sort of normalisation on input you're not
> going to be able to find a character encoding that's going to work.

> I can only think of two ways around it:
> 1) Get users to specify their character sets up front or sniff it from
> their browser headers and use this information to normalise the input
> to UTF-8 before saving it to the database.
> 2) Strip out anything that isn't UTF-8 when you output the XML.
>
> 1) seems impractical since what non-geek knows or cares what character
> set they use on their computer? 2) loses information but is guaranteed
> to 'work'.

Ian's right on all counts.  The only way to do this reliably is to rigorously
examine the content being input and preview it back to the users.   Convert it
all to UTF-8.  Look at this feed for an example:

http://www.syndic8.com/feedinfo.php?FeedID=22568&Section=xml

It's Bulgarian, apparently.  And it's completely valid.

To paraphrase Duke Nukem and Mr. T, the thoughts are "encode 'em all and let God
sort 'em out" and "I pity the parser that has to handle this feed".

It most certainly violates the naive attitude of being 'human readable'.  But
all too often the readability crowd seems to have an awful lot of eurocentric
bias.  Some languages /can't/ be encoded in 'readable' ASCII.

As an aside, notice who's contributed a fair amount to the locale code on
linux... IBM.  Coincidence here?

Ok, so when you get the data look at what encoding their browser said it would
accept (the Accept-Charset request header, if it sends one).  I'm not sure if
there's a way to more thoroughly interrogate the browser.  But that's a place to
start.  If you see they're coming in with an encoding that you /know/ you don't
support then you'd be wise to offer some sort of warning to them.  For example,
your site accepts a few Western European encodings like ISO-8859-1 and
ISO-8859-15.  A user shows up with a browser running in Shift_JIS.  You'd do well
to add in a little warning message that gives them a 'heads up' on your site's
known encodings.
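
A rough sketch of that check in Python, assuming the browser actually sends an
Accept-Charset header (not all of them do).  The SUPPORTED set and both function
names are made up for illustration; swap in whatever your site really handles.

```python
# Hypothetical list of the encodings this site knows how to handle.
SUPPORTED = {"utf-8", "iso-8859-1", "iso-8859-2"}


def parse_accept_charset(header):
    """Return the charsets named in an Accept-Charset header, best first."""
    entries = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, params = part.partition(";")
        quality = 1.0  # a charset with no q-value counts as q=1
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # unreadable q-value: treat as not acceptable
        if quality > 0:
            entries.append((quality, name.strip().lower()))
    entries.sort(key=lambda entry: -entry[0])  # stable, so ties keep header order
    return [name for _, name in entries]


def charset_warning(header):
    """Return a heads-up message, or None if no warning is needed."""
    offered = parse_accept_charset(header)
    if not offered or "*" in offered:
        return None  # no header, or the browser says it takes anything
    if any(name in SUPPORTED for name in offered):
        return None
    return "This site accepts text in: " + ", ".join(sorted(SUPPORTED))
```

A missing header returns no warning here on purpose: silence from the browser
tells you nothing, so there's no point scaring the user over it.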

Even so, when you get the data you'll need to iterate through it to make sure
it's not using characters that your encoding can't support.  Pushing them all
into UTF-8 is a safe bet for nearly all situations.  For the others, well, that's
a whole other layer of issues (like right-to-left display and such).
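
Something along these lines would do the checking and the conversion.  It's only
a sketch: it assumes you already have the raw bytes plus a charset name from
somewhere (the form, the headers, a user setting), and the function names are
mine, not from any library.

```python
def to_utf8(raw, declared="utf-8"):
    """Decode raw bytes with the declared charset and re-encode as UTF-8.

    Raises ValueError if the charset is unknown or the bytes are not
    valid in it, so bad input is rejected instead of silently stored.
    """
    try:
        text = raw.decode(declared, errors="strict")
    except (UnicodeDecodeError, LookupError) as exc:
        raise ValueError(f"input is not valid {declared}: {exc}") from exc
    return text.encode("utf-8")


def unsupported_chars(text, encoding):
    """List the characters in text that the given encoding cannot represent."""
    bad = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            bad.append(ch)
    return bad
```

Note that a strict decode only proves the bytes are /legal/ in the declared
charset.  Nearly any byte string is legal Latin-1, so mislabelled input can still
slip through, which is why previewing it back to the user matters.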

What you /may/ also want to consider is being able to offer them some way to
transcode the output back into something they can use.  As in, you got the data
and it was in ISO-8859-11, you stored the data in UTF-8 and now the user wants to
cut/paste it back out of your site into their local environment.  If that
environment doesn't support UTF-8 they're going to be mighty confused by all the
gibberish.  Thus having a page that 'transcodes' the stuff back into what their
native environment supports is an option worth considering.  It's tempting to
say you could dynamically adjust the entire site based on the detected browser
encoding.  This is a knot of Gordian complexity.  Yes, it's possible, but this
level of full-on internationalization is probably SERIOUS overkill.  That and if
you think /this/ is complicated, i18n is an order of magnitude more so.
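
The transcoding page itself needn't be much.  A minimal sketch, assuming the
stored bytes are valid UTF-8 and that the output ends up in HTML or XML, where a
numeric character reference is an acceptable stand-in for anything the target
charset lacks (the function name is hypothetical):

```python
def transcode_for_user(utf8_bytes, target):
    """Re-encode stored UTF-8 bytes into the user's charset.

    Characters the target charset cannot represent become numeric
    character references (&#NNNN;) rather than being dropped.
    """
    text = utf8_bytes.decode("utf-8")
    return text.encode(target, errors="xmlcharrefreplace")
```

For plain-text output the character references would just be more gibberish, so
there you'd want errors="replace" or a refusal instead.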

The other thing to consider is having some sort of detection filters for known
problematic data.  As in, detect that they tried pasting in some gawdawfully
formatted text from a word processor.  Especially something like a WordPerfect
document imported into Word, saved, opened in OpenOffice, cut and pasted into
Opera.  God help you.  I've found it easier to say "STOP, I can't take this
format of text, please use plain text or simple HTML instead."  In many cases
the users won't complain as much as you might think.  They want the text, not
the endlessly delayed implementation of a nightmarishly complex
internationalized web portal.

This is why I've been a stickler for encoding.  Besides the usual crankiness of
course.

-Bill Kearney