Friday Fun: Percent Encoding
Friday, 30 June 2006
If you boil down the BNF in both RFC2396 and RFC3986, path segments can contain the following characters without percent-encoding them:
ALPHA DIGIT ! $ & ' ( ) * + , - . : ; = @ _ ~
Query components can contain these:
ALPHA DIGIT ! $ & ' ( ) * + , - . / : ; = ? @ _ ~
Which means that
" < > [ \ ] ^ ` { | }
should always be encoded in both (discounting non-ASCII characters, for now).
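As a rough cross-check of the lists above, here is a minimal Python 3 sketch that rebuilds both sets from the RFC 3986 productions (unreserved, sub-delims, pchar, query) and subtracts them from printable ASCII. The names `PCHAR`, `QUERY` and `must_encode` are made up for this illustration. It also reports "#" and "%", which the list above leaves out, presumably because "#" starts the fragment and "%" introduces an escape.

```python
import string

# RFC 3986 productions, written out as character sets.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
SUB_DELIMS = set("!$&'()*+,;=")
PCHAR = UNRESERVED | SUB_DELIMS | set(":@")   # allowed raw in a path segment
QUERY = PCHAR | set("/?")                     # allowed raw in a query

# Printable, non-space ASCII (0x21 to 0x7E).
PRINTABLE = {chr(c) for c in range(0x21, 0x7F)}

def must_encode(allowed):
    """Return the printable ASCII characters that are not in `allowed`."""
    return "".join(sorted(PRINTABLE - allowed))

print("path segment:", must_encode(PCHAR))
print("query:       ", must_encode(QUERY))
```

The path-segment result additionally contains "/" and "?", since those delimit segments and start the query rather than appearing inside a segment.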
If you’re specifying the format of an HTTP URI, this is important; you want to be able to tell people what characters have special meaning, and when to encode them if they’re part of content. When implementations automatically percent-encode some characters it can cause problems – especially when the behaviour differs from implementation to implementation.
Note that I’m not (necessarily) saying that the latter characters should always be escaped; Web servers seem to support them in their raw form just fine, and some less fastidious Web developers may forget to un-escape them. I’m more interested in those characters that are unnecessarily escaped, which would cause trouble in some situations.
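To illustrate what unnecessary escaping looks like, here is a small sketch using `urllib.parse.quote` from Python 3 (not the Python 2 `urllib` tested below). By default, `quote` leaves only letters, digits, a few unreserved marks and "/" alone, so it escapes characters that are legal in a path segment unless told otherwise via its `safe` parameter. The sample segment and the name `PATH_SAFE` are assumptions for this example.

```python
from urllib.parse import quote

# Sub-delims plus ":" and "@": legal in a path segment without escaping.
PATH_SAFE = "!$&'()*+,;=:@"

segment = "a=1;b=(x,y)@home"   # hypothetical path segment

default_encoded = quote(segment)               # escapes = ; ( , ) @
preserved = quote(segment, safe=PATH_SAFE)     # leaves them alone

print(default_encoded)
print(preserved)
```

If "=" or ";" carries meaning in your URI format, the first form changes what the server sees unless it decodes before interpreting those characters.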
The Test
Try using your favourite resolver to access this URL:
https://www.mnot.net/cgi-bin/echo-uri/!$&'()*+,-.:;=@_~"<>[\]^`{|}/?!$&'()*+,-./:;=?@_~"<>[\]^`{|}
and post the results in comments. I’m particularly interested in results from Java, .NET, Perl and Ruby libraries.
Here it is as a link, and using JavaScript (ditto).
Here are a few preliminary results:
Safari
Pasted into the location bar.
Safari will escape angle brackets (“<>”) in a followed link (e.g., a/@href, using XHR), but not if you paste it directly into the location bar.
User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-us) AppleWebKit/418.8 (KHTML, like Gecko) Safari/419.3
Request URI: /cgi-bin/echo-uri/!$&'()*+,-.:;=@_~"<>[\]^`{|}/?!$&'()*+,-./:;=?@_~"<>[\]^`{|}
Path
Encoded:
Unencoded: !"$&'()*+,-./:;<=>@[\]^_`bceghinoru{|}~
Query
Encoded:
Unencoded: !"$&'()*+,-./:;<=>?@[\]^_`{|}~
Firefox
Pasted into the location bar.
User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4
Request URI: /cgi-bin/echo-uri/!$&'()*+,-.:;=@_~%22%3C%3E%5B%5C%5D%5E%60%7B|%7D/?!$&'()*+,-./:;=?@_~%22%3C%3E[\]^%60{|}
Path
Encoded: "<>[\]^`{}
Unencoded: !$&'()*+,-./:;=@_bceghinoru|~
Query
Encoded: "<>`
Unencoded: !$&'()*+,-./:;=?@[\]^_{|}~
However, Firefox treats the last path segment differently (note the missing “/”):
User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4
Request URI: /cgi-bin/echo-uri/!$&'()*+,-.:;=@_~%22%3C%3E[\]^%60{|}?!$&'()*+,-./:;=?@_~%22%3C%3E[\]^%60{|}
Path
Encoded: "`
Unencoded: !$&'()*+,-./:;=@[\]^_bceghinoru{|}~
Query
Encoded: "`
Unencoded: !$&'()*+,-./:;=?@[\]^_{|}~
Opera
Pasted into the location bar.
Opera silently transforms backslashes (“\”) to forward slashes (“/”) in the path (but not the query).
User-Agent: Opera/9.00 (Macintosh; PPC Mac OS X; U; en)
Request URI: /cgi-bin/echo-uri/!$&'()*+,-.:;=@_~%22%3C%3E[/]^%60{|}/?!$&'()*+,-./:;=?@_~%22%3C%3E[\]^%60{|}
Path
Encoded: "<>`
Unencoded: !$&'()*+,-./:;=@[]^_bceghinoru{|}~
Query
Encoded: "<>`
Unencoded: !$&'()*+,-./:;=?@[\]^_{|}~
Curl
> curl -g --url `cat file.url`
User-Agent: curl/7.15.4 (powerpc-apple-darwin8.6.0) libcurl/7.15.4 OpenSSL/0.9.8b zlib/1.2.3
Request URI: /cgi-bin/echo-uri/!$&'()*+,-.:;=@_~"<>[\]^{|}?!$&'()*+,-./:;=/?@_~"<>[\]^{|}
Path
Encoded:
Unencoded: !"$&'()*+,-./:;<=>@[\]^_bceghinoru{|}~
Query
Encoded:
Unencoded: !"$&'()*+,-./:;<=>?@[\]^_{|}~
WGet
> wget -i file.url --output-document=-
User-Agent: Wget/1.10.2
Request URI: /cgi-bin/echo-uri/!$&'()*+,-.:;=@_~%22%3C%3E[%5C]%5E%7B%7C%7D?!$&'()*+,-./:;=/?@_~%22%3C%3E[%5C]%5E%7B%7C%7D
Path
Encoded: "<>\^{|}
Unencoded: !$&'()*+,-./:;=@[]_bceghinoru~
Query
Encoded: "<>\^{|}
Unencoded: !$&'()*+,-./:;=?@[]_~
Python
import urllib; print urllib.urlopen(url).read()
User-Agent: Python-urllib/1.16
Request URI: /cgi-bin/echo-uri/!$&'()*+,-.:;=@_~"<>[\]^{|}?!$&'()*+,-./:;=/?@_~"<>[\]^{|}
Path
Encoded:
Unencoded: !"$&'()*+,-./:;<=>@[\]^_bceghinoru{|}~
Query
Encoded:
Unencoded: !"$&'()*+,-./:;<=>?@[\]^_{|}~
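If you want to summarise your own results in the same Encoded/Unencoded form, something like the following Python 3 sketch should do it. It is not the echo-uri script itself, just a guess at equivalent logic: split the Request URI at the first "?", then sort the characters of each part by whether they arrived as a %XX escape or as a literal. The names `classify` and `summarize` are hypothetical.

```python
import re

# Matches one percent-encoded octet, e.g. %3C.
PCT = re.compile(r"%([0-9A-Fa-f]{2})")

def classify(component):
    """Return (encoded, unencoded) characters seen in one URI component."""
    encoded = {chr(int(h, 16)) for h in PCT.findall(component)}
    literal = set(PCT.sub("", component))   # what is left once escapes are removed
    return "".join(sorted(encoded)), "".join(sorted(literal))

def summarize(request_uri):
    """Split at the first '?' and classify the path and the query separately."""
    path, _, query = request_uri.partition("?")
    return {"path": classify(path), "query": classify(query)}

# The first Firefox request line from the results above.
firefox = r"/cgi-bin/echo-uri/!$&'()*+,-.:;=@_~%22%3C%3E%5B%5C%5D%5E%60%7B|%7D/?!$&'()*+,-./:;=?@_~%22%3C%3E[\]^%60{|}"
for part, (enc, raw) in summarize(firefox).items():
    print(part, "encoded:", enc, "unencoded:", raw)
```

As in the results above, the unencoded path list picks up the letters and "/" of "/cgi-bin/echo-uri/" along with the test characters.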