
Re: [syndication] XML Character encoding (again)



On Tuesday, 15 April 2003 at 23:05, Julian Bond wrote:
> I have a situation where a motley crew of users are using all sorts of
> tools to enter blog text. Which means that they cut and paste in £ pound 
> signs, MS Smart quotes and the occasional foreign character (even 
> Euros). This text can also contain embedded html. I get this as POST 
> data and store it in a database. Later I read this out into the 
> <description> section of RSS. This is all done with PHP code.
The variety of input is probably the cause of your problem. You've
probably got some people pasting in Latin-1 (ISO-8859-1) characters,
others pasting in ISO-8859-2 (Central European), and so on. In that
situation the same byte values stand for different characters depending
on which character set the user's tool was using. Without some sort of
normalisation on input you're not going to find a single character
encoding that works for everything in the database.
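
To make the ambiguity concrete, here is a small sketch. It is in Python
rather than the PHP you are using, and the helper name is only
illustrative. It takes one byte and decodes it under two of the
character sets mentioned above:

```python
# One pasted byte, two plausible readings.
raw = b"\xa3"

as_latin1 = raw.decode("iso-8859-1")  # pound sign, U+00A3
as_latin2 = raw.decode("iso-8859-2")  # L with stroke, U+0141


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes form a well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(repr(as_latin1), repr(as_latin2), is_valid_utf8(raw))
```

The same byte is a pound sign to one user and a Polish letter to
another, and on its own it is not valid UTF-8 at all, so an XML parser
expecting UTF-8 will reject it.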

I can only think of two ways around it:
1) Get users to specify their character set up front, or sniff it from
the headers their browser sends, and use this information to normalise
the input to UTF-8 before saving it to the database.
2) Strip out any byte sequences that aren't valid UTF-8 when you output
the XML.
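
A rough sketch of what option 1 could look like, in Python rather than
PHP. The function name, the `declared` parameter (whatever charset label
you managed to get from the user or the request) and the Windows-1252
fallback are all my assumptions, not something taken from your setup:

```python
from typing import Optional


def normalise_to_utf8(raw: bytes,
                      declared: Optional[str] = None,
                      fallback: str = "cp1252") -> bytes:
    """Re-encode posted bytes as UTF-8 before they are stored.

    `declared` is the charset label obtained from the user or the
    request, if any. `fallback` is a guess used when nothing else fits.
    """
    candidates = [name for name in (declared, "utf-8") if name]
    for name in candidates:
        try:
            return raw.decode(name).encode("utf-8")
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset label: try the next candidate.
            continue
    # Last resort: assume the fallback and substitute U+FFFD for any
    # byte it does not define, so the result is always valid UTF-8.
    return raw.decode(fallback, errors="replace").encode("utf-8")
```

For example, `normalise_to_utf8(b"\xa3", "iso-8859-1")` stores the
two-byte UTF-8 pound sign, while the same byte declared as ISO-8859-2 is
stored as a different character. The result is only as good as the
declared charset.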

1) seems impractical, since what non-geek knows or cares what character
set their computer uses? 2) loses information (the offending characters
simply disappear) but is guaranteed to 'work'.
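
For what it's worth, a sketch of the stripping approach, again in Python
rather than PHP and with made-up function names. Besides dropping
malformed UTF-8, it removes the few control characters that XML 1.0 does
not allow, and escapes the embedded HTML so it can sit inside
<description>:

```python
import re
from xml.sax.saxutils import escape

# Characters outside the XML 1.0 "Char" production.
_XML_ILLEGAL = re.compile(
    "[^\x09\x0a\x0d\x20-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]"
)


def strip_invalid_utf8(raw: bytes) -> str:
    """Drop byte sequences that are not valid UTF-8, then drop any
    characters that XML 1.0 forbids."""
    text = raw.decode("utf-8", errors="ignore")
    return _XML_ILLEGAL.sub("", text)


def description_element(raw: bytes) -> str:
    """Build an RSS <description> element from stored bytes."""
    return "<description>" + escape(strip_invalid_utf8(raw)) + "</description>"
```

Note the cost: a Latin-1 "£5" stored as the bytes A3 35 comes out as
just "5".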



- Ian <iand@internetalchemy.org>
"One never notices what has been done; one can only see what remains to be done."