mnot’s blog

“Design depends largely on constraints.” - Charles Eames

Friday, 30 June 2006

Friday Fun: Percent Encoding

Filed under: Web

If you boil down the BNF in both RFC2396 and RFC3986, path segments can contain the following characters without percent-encoding them:

ALPHA DIGIT ! $ & ' ( ) * + , - . : ; = @ _ ~

Query components can contain these:

ALPHA DIGIT ! $ & ' ( ) * + , - . / : ; = ? @ _ ~

Which means that

" < > [ \ ] ^ ` { | }

should always be encoded in both (discounting non-ASCII characters, for now).
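To make the lists above concrete, here is a minimal Python 3 sketch that percent-encodes only what those lists say needs encoding. It leans on the standard library's `urllib.parse.quote`. The constant and function names are made up for illustration, and non-ASCII input is simply encoded as UTF-8 octets, which is an assumption on a topic the post deliberately sets aside.

```python
from urllib.parse import quote

# Punctuation listed above as legal, unescaped, in a path segment.
# quote() never escapes ALPHA or DIGIT, so only punctuation is listed.
PATH_SEGMENT_SAFE = "!$&'()*+,-.:;=@_~"

# A query component may additionally carry "/" and "?" unescaped.
QUERY_SAFE = PATH_SEGMENT_SAFE + "/?"


def encode_path_segment(text):
    """Percent-encode everything that may not appear raw in a path segment."""
    return quote(text, safe=PATH_SEGMENT_SAFE)


def encode_query(text):
    """Percent-encode everything that may not appear raw in a query."""
    return quote(text, safe=QUERY_SAFE)


if __name__ == "__main__":
    must_encode = '"<>[\\]^`{|}'
    print(encode_path_segment(must_encode))
    print(encode_path_segment("a/b?c"))  # "/" and "?" are escaped in a segment
    print(encode_query("a/b?c"))         # ...but are left alone in a query
```

Characters such as `&`, `=` and `+` pass through untouched because the grammar allows them. Whether they need escaping as content still depends on the meaning your own URI format gives them.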

If you’re specifying the format of an HTTP URI, this is important; you want to be able to tell people which characters have special meaning, and when to encode them if they’re part of content. When implementations automatically percent-encode some characters, it can cause problems – especially when the behaviour differs from implementation to implementation.

Note that I’m not (necessarily) saying that the latter characters should always be escaped; Web servers seem to support them in their raw form just fine, and some less fastidious Web developers may forget to un-escape them. I’m more interested in those characters that are unnecessarily escaped, which would cause trouble in some situations.
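One way to spot such unnecessary escapes in a captured request URI is sketched below in Python 3. The names are hypothetical, and the check is purely syntactic: it reports escapes whose decoded character is allowed raw in that component. A flagged `%26` is not necessarily a bug, since a format may give `&` special meaning and require escaping it in content; only escapes of unreserved characters (ALPHA, DIGIT, `-`, `.`, `_`, `~`) are never needed.

```python
import re
import string

ALNUM = string.ascii_letters + string.digits

# Characters allowed raw, per the lists at the top of the post.
PATH_SEGMENT_ALLOWED = set(ALNUM + "!$&'()*+,-.:;=@_~")
QUERY_ALLOWED = PATH_SEGMENT_ALLOWED | set("/?")

# Matches one percent-encoded octet such as %7E.
_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def unnecessary_escapes(component, allowed):
    """Return the sorted characters that arrived percent-encoded
    even though they are allowed raw in this component."""
    decoded = {chr(int(hexpair, 16)) for hexpair in _ESCAPE.findall(component)}
    return "".join(sorted(decoded & allowed))


if __name__ == "__main__":
    # Hypothetical path segment in which "~" and "-" were escaped for no reason.
    print(unnecessary_escapes("a%7Eb%2Dc%22", PATH_SEGMENT_ALLOWED))
```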

The Test

Try using your favourite resolver to access this URL:

http://www.mnot.net/cgi-bin/echo-uri/!$&'()*+,-.:;=@_~"<>[\]^`{|}/?!$&'()*+,-./:;=?@_~"<>[\]^`{|}

and post the results in comments. I’m particularly interested in results from Java, .NET, Perl and Ruby libraries.

Here it is as a link, and using JavaScript (ditto).
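The source of the echo-uri script isn't shown in this post. As a rough guess at the kind of breakdown it reports (an assumption, not the actual script), the Python 3 sketch below splits a request URI at the first "?" and lists, for each part, the distinct characters that arrived as %XX escapes and those that arrived raw. It treats each escape as a single ASCII character, in keeping with the post leaving non-ASCII aside.

```python
import re

# Matches one percent-encoded octet such as %3C.
_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def breakdown(component):
    """Return (encoded, unencoded) for one URI component.

    encoded:   distinct characters that arrived as %XX escapes
    unencoded: distinct characters that arrived raw
    Both are strings sorted by code point.
    """
    encoded = {chr(int(hexpair, 16)) for hexpair in _ESCAPE.findall(component)}
    raw = set(_ESCAPE.sub("", component))
    return "".join(sorted(encoded)), "".join(sorted(raw))


def report(request_uri):
    """Format a Path/Query report for a request URI."""
    path, _, query = request_uri.partition("?")  # split at the first "?"
    lines = []
    for label, part in (("Path", path), ("Query", query)):
        encoded, unencoded = breakdown(part)
        lines += [label, "    Encoded: " + encoded, "  Unencoded: " + unencoded]
    return "\n".join(lines)


if __name__ == "__main__":
    # The request URI that Firefox 1.5 sent, as listed in the results.
    firefox = ("/cgi-bin/echo-uri/!$&'()*+,-.:;=@_~%22%3C%3E%5B%5C%5D%5E%60%7B|%7D/"
               "?!$&'()*+,-./:;=?@_~%22%3C%3E[\\]^%60{|}")
    print(report(firefox))
```

Because the path is examined whole, the letters of "cgi-bin" and "echo-uri" show up in its unencoded list alongside the punctuation.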

Here are a few preliminary results:

Safari

Pasted into the location bar.

Safari will escape angle brackets (“<>”) in a followed link (e.g., a/@href, using XHR), but not if you paste it directly into the location bar.

User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-us) AppleWebKit/418.8 (KHTML, like Gecko) Safari/419.3
Request URI: /cgi-bin/echo-uri/!$&'()*+,-.:;=@_~"<>[\]^`{|}/?!$&'()*+,-./:;=?@_~"<>[\]^`{|}
Path
    Encoded: 
  Unencoded: !"$&'()*+,-./:;<=>@[\]^_`bceghinoru{|}~
Query
    Encoded: 
  Unencoded: !"$&'()*+,-./:;<=>?@[\]^_`{|}~

Firefox

Pasted into the location bar.

User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4
Request URI: /cgi-bin/echo-uri/!$&'()*+,-.:;=@_~%22%3C%3E%5B%5C%5D%5E%60%7B|%7D/?!$&'()*+,-./:;=?@_~%22%3C%3E[\]^%60{|}
Path
    Encoded: "<>[\]^`{}
  Unencoded: !$&'()*+,-./:;=@_bceghinoru|~
Query
    Encoded: "<>`
  Unencoded: !$&'()*+,-./:;=?@[\]^_{|}~

However, Firefox treats the last path segment differently (note the missing “/”):

User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4
Request URI: /cgi-bin/echo-uri/!$&'()*+,-.:;=@_~%22%3C%3E[\]^%60{|}?!$&'()*+,-./:;=?@_~%22%3C%3E[\]^%60{|}
Path
    Encoded: "<>`
  Unencoded: !$&'()*+,-./:;=@[\]^_bceghinoru{|}~
Query
    Encoded: "<>`
  Unencoded: !$&'()*+,-./:;=?@[\]^_{|}~

Opera

Pasted into the location bar.

Opera silently transforms backslashes (“\”) to forward slashes (“/”) in the path (but not the query).

User-Agent: Opera/9.00 (Macintosh; PPC Mac OS X; U; en)
Request URI: /cgi-bin/echo-uri/!$&'()*+,-.:;=@_~%22%3C%3E[/]^%60{|}/?!$&'()*+,-./:;=?@_~%22%3C%3E[\]^%60{|}
Path
    Encoded: "<>`
  Unencoded: !$&'()*+,-./:;=@[]^_bceghinoru{|}~
Query
    Encoded: "<>`
  Unencoded: !$&'()*+,-./:;=?@[\]^_{|}~

Curl

> curl -g --url `cat file.url`

User-Agent: curl/7.15.4 (powerpc-apple-darwin8.6.0) libcurl/7.15.4 OpenSSL/0.9.8b zlib/1.2.3
Request URI: /cgi-bin/echo-uri/!$&'()*+,-.:;=@_~"<>[\]^{|}?!$&'()*+,-./:;=/?@_~"<>[\]^{|}
Path
    Encoded: 
  Unencoded: !"$&'()*+,-./:;<=>@[\]^_bceghinoru{|}~
Query
    Encoded: 
  Unencoded: !"$&'()*+,-./:;<=>?@[\]^_{|}~

WGet

> wget -i file.url --output-document=-

User-Agent: Wget/1.10.2
Request URI: /cgi-bin/echo-uri/!$&'()*+,-.:;=@_~%22%3C%3E[%5C]%5E%7B%7C%7D?!$&'()*+,-./:;=/?@_~%22%3C%3E[%5C]%5E%7B%7C%7D
Path
    Encoded: "<>\^{|}
  Unencoded: !$&'()*+,-./:;=@[]_bceghinoru~
Query
    Encoded: "<>\^{|}
  Unencoded: !$&'()*+,-./:;=?@[]_~

Python

import urllib; print urllib.urlopen(url).read()

User-Agent: Python-urllib/1.16
Request URI: /cgi-bin/echo-uri/!$&'()*+,-.:;=@_~"<>[\]^{|}?!$&'()*+,-./:;=/?@_~"<>[\]^{|}
Path
    Encoded: 
  Unencoded: !"$&'()*+,-./:;<=>@[\]^_bceghinoru{|}~
Query
    Encoded: 
  Unencoded: !"$&'()*+,-./:;<=>?@[\]^_{|}~

12 Comments

Dilip said:

IE 6.0 (if you paste in address bar), you get:

User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)
Request URI: /cgi-bin/echo-uri/!$&'()*+,-.:;=@_~<>[/]^{|}/?!$&
Path
    Encoded:
  Unencoded: !$&'()*+,-./:;=@[]^_abceghilmnoprtu{|}~
Query
    Encoded:
  Unencoded: !$&

I tried issuing an HTTP request to this URL:

http://www.mnot.net/cgi-bin/echo-uri/!$&amp;'()*+,-.:;=@_~&lt;&gt;[/]^{ }/?!$&'()*+,-./:;=?@_~<>[]^`{ }

.NET 2.0 System.Net libraries give me a “too many automatic redirections attempted” error.

Friday, June 30 2006 at 1:22 AM

Tim Bray said:

Camino:

User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.8.0.3) Gecko/20060427 Camino/1.0.1
Request URI: /cgi-bin/echo-uri/!$&'()+,-.:;=@_~%22%3C%3E%5B%5C%5D%5E%60%7B|%7D/?!$&'()+,-./:;=?@~%22%3C%3E[]^%60{|}
Path
    Encoded: "<>[]^{}
  Unencoded: !$&'()*+,-./:;=@_bceghinoru|~
Query
    Encoded: "<>
  Unencoded: !$&'()*+,-./:;=?@[]^{|}~

Friday, June 30 2006 at 2:44 AM

Dilip said:

Mark, you are right. Sorry, I must have made a mistake.

Any thoughts on that redirection error? I tried issuing an HTTP request for this URL: http://www.mnot.net/cgi-bin/echo-uri/!$&'()*+,-.:;=@_~<>[/]^{|}/?!$&'()*+,-./:;=?@_~&lt;>[\]^{|}

Friday, June 30 2006 at 6:47 AM

James Holderness said:

Not that this is of any use to you, but I was curious what Snarfer’s socket library would do.

User-Agent: Snarfer/0.4.2 (http://www.snarfware.com/)
Request URI: /cgi-bin/echo-uri/!$&'()+,-.:;=@_~%22%3C%3E%5B%5C%5D%5E%60%7B%7C%7D/?!$&'()+,-./:;=?@~%22%3C%3E%5B%5C%5D%5E%60%7B%7C%7D
Path
    Encoded: "<>[]^{|}
  Unencoded: !$&'()*+,-./:;=@_bceghinoru~
Query
    Encoded: "<>[\]^{|}
  Unencoded: !$&'()*+,-./:;=?@~

Friday, June 30 2006 at 8:09 AM

Brendan Taylor said:

Ruby 1.8.4’s Net::HTTP responds:

Request URI: http://www.mnot.net/cgi-bin/echo-uri/!$&'()+,-.:;=@_~"<>[]^{|}/?!$&'()*+,-./:;=?@_~"<>[\]^{|}
Path
    Encoded:
  Unencoded: !"$&'()+,-./:;@[]^_bceghimnoprtuw{|}~
Query
    Encoded:
  Unencoded: !"$&'()*+,-./:;?@[\]^_{|}~

Interestingly, Ruby’s URI library refuses to parse it. I’ve had problems with it not properly escaping [ and ] in the past, but that doesn’t seem to be the problem here.

Friday, June 30 2006 at 11:25 AM

Chris Winters said:

LWP::Simple on Perl ActiveState/Win32 5.8.4 responds this way:

Content:
User-Agent: lwp-trivial/1.40
Request URI: /cgi-bin/echo-uri/!$&'()+,-.:;=@_~"<>[]^{|}/?!$&'()*+,-./:;=?@_~"<>[\]^{|}
Path
    Encoded:
  Unencoded: !"$&'()+,-./:;@[]^_bceghinoru{|}~
Query
    Encoded:
  Unencoded: !"$&'()*+,-./:;?@[\]^_{|}~

Friday, June 30 2006 at 12:10 PM

Stefan Eissing said:

.NET v2.0, self-written test code. Dilip, I did not have any redirection problems, but you can set the maximum number of redirects to follow on the HttpWebRequest object. Maybe someone sets that for you?

User-Agent: icings .netv2.0 tester
Request URI: /cgi-bin/echo-uri/!$&'()+,-.:;=@_~%22%3C%3E%5B/%5D%5E%60%7B%7C%7D/?!$&'()+,-./:;=?@~%22%3C%3E%5B%5C%5D%5E%60%7B%7C%7D
Path
    Encoded: "<>[]^{|}
  Unencoded: !$&'()*+,-./:;=@_bceghinoru~
Query
    Encoded: "<>[\]^{|}
  Unencoded: !$&'()*+,-./:;=?@~

Saturday, July 1 2006 at 4:32 AM

Ken Hirsch said:

Active Perl 5.8.8 was different:

User-Agent: LWP::Simple/5.805
Request URI: /cgi-bin/echo-uri/!$&'()+,-.:;=@_~%22%3C%3E[%5C]%5E%60%7B%7C%7D/?!$&'()+,-./:;=?@~%22%3C%3E[%5C]%5E%60%7B%7C%7D
Path
    Encoded: "<>\^{|}
  Unencoded: !$&'()*+,-./:;=@[]_bceghinoru~
Query
    Encoded: "<>\^{|}
  Unencoded: !$&'()*+,-./:;=?@[]~

Sunday, July 23 2006 at 10:57 AM

karl said:

User-Agent: Lynx/2.8.6dev.18 libwww-FM/2.14
Request URI: /cgi-bin/echo-uri/!$&'()+,-.:;=@_~<>[]^{|}/?!$&'()*+,-./:;=?@_~[\]^{|}
Path
    Encoded:
  Unencoded: !$&'()+,-./:;@[]^_bceghinoru{|}~
Query
    Encoded:
  Unencoded: !$&'()*+,-./:;?@[\]^_{|}~

Monday, July 24 2006 at 2:25 AM

karl said:

User-Agent: w3m/0.5.1+cvs-1.968
Request URI: /cgi-bin/echo-uri/!$&'()+,-.:;=@_~<>[]^{|}/?!$&'()*+,-./:;=?@_~@[\]^_bceghinoru{|}~
Query
    Encoded:
  Unencoded: !$&'()+,-./:;?@[]^_`{|}~

Monday, July 24 2006 at 2:27 AM

Olivier Mengué said:

It would be interesting to have the proxy header information in the output.

Proxies are another layer that could transform URLs.

Wednesday, January 31 2007 at 8:15 AM

Creative Commons