The data URL scheme

Latest version:
http://simonsapin.github.com/data-urls/
This version:
Last updated on 13 March 2013.
Participate:
File a bug
IRC: #whatwg on Freenode
Version History:
https://github.com/SimonSapin/data-urls/commits
Editor:
Simon Sapin

1 Introduction

The data URL scheme is defined by RFC 2397, which unfortunately is vague regarding many details of the syntax. This document describes a more precise parsing algorithm for data: URLs.

See also Bug 19494 on the W3C Bugzilla and the resources linked from there.

2 “Fetching” a data: URL

This algorithm returns either a failure or two byte strings: a MIME type with parameters (as it would appear in a Content-Type HTTP header) and the decoded data.

To obtain a resource from a parsed URL with the "data" scheme, run these steps:

  1. Let input be the URL’s scheme data.
  2. If the URL’s query is not null, append "?" and the query to input.
  3. If input does not contain a U+002C COMMA code point, return a failure and abort these steps.

    The comma can come either from the scheme data or the query.

  4. Split input at the first comma. Let mime_type and body be the parts before and after the comma, respectively.

    What if the comma is inside a MIME quoted string for a parameter value? Example: data:text/plain;foo="bar,baz";charset=utf8,body

  5. Let data be the result of running percent decode on body.
  6. If mime_type ends with ";base64" then:

    Match how strictly? Case sensitive or not? Allow whitespace? Percent-encoding?

    1. Remove the matched substring from mime_type.
    2. Set data to the result of decoding data with the Base 64 Encoding.

      Return a failure on "invalid" base64? What is invalid? Also accept the URL and Filename Safe Alphabet? Mixed alphabets in the same body? Ignore which non-alphabet bytes? Missing/too little/too much padding?

  7. Return mime_type and data.
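
The steps above can be sketched in Python as follows. This is a non-normative illustration, not a reference implementation. The function name fetch_data_url, the use of None to signal failure, the UTF-8 encoding of mime_type, and the use of urllib.parse.unquote_to_bytes as a stand-in for "percent decode" are all assumptions. The sketch also picks one possible answer to the open base64 questions (exact, case-sensitive ";base64" suffix; standard alphabet only; any invalid input is a failure), which this document does not settle.

```python
import base64
import binascii
from urllib.parse import unquote_to_bytes


def fetch_data_url(scheme_data, query=None):
    """Return (mime_type, data) as byte strings, or None on failure.

    `scheme_data` and `query` are assumed to be the already-parsed
    parts of a URL whose scheme is "data".
    """
    # Steps 1-2: rebuild the input from the scheme data and the query.
    input_ = scheme_data
    if query is not None:
        input_ += "?" + query

    # Step 3: without a comma there is nothing to split on.
    if "," not in input_:
        return None

    # Step 4: split at the FIRST comma, even inside a quoted string.
    mime_type, body = input_.split(",", 1)

    # Step 5: percent-decode the body into bytes.
    data = unquote_to_bytes(body)

    # Step 6: assumed strict, case-sensitive match on ";base64".
    if mime_type.endswith(";base64"):
        mime_type = mime_type[: -len(";base64")]
        try:
            # Assumption: reject non-alphabet bytes and bad padding.
            data = base64.b64decode(data, validate=True)
        except binascii.Error:
            return None

    # Step 7: return both values as byte strings.
    return mime_type.encode("utf-8"), data
```

With this sketch, the quoted-string example from step 4 is split at the first comma, so mime_type becomes text/plain;foo="bar and the rest of the input is treated as the body.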

TODO: The algorithm is missing this part of RFC2397: If <mediatype> is omitted, it defaults to text/plain;charset=US-ASCII. As a shorthand, "text/plain" can be omitted but the charset parameter supplied.
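
One way the missing RFC 2397 defaulting could be expressed is sketched below. The helper name apply_default_mediatype is hypothetical, and the sketch assumes it would run on mime_type after any ";base64" suffix has been removed; this document does not yet say where, or whether, such a step belongs in the algorithm.

```python
def apply_default_mediatype(mime_type):
    """Apply the RFC 2397 default to a possibly empty MIME type string.

    Hypothetical helper: not part of the algorithm as currently written.
    """
    # Omitted entirely: RFC 2397 default.
    if mime_type == "":
        return "text/plain;charset=US-ASCII"
    # Shorthand: parameters supplied but "text/plain" itself omitted.
    if mime_type.startswith(";"):
        return "text/plain" + mime_type
    # Anything else is passed through unchanged.
    return mime_type
```

For instance, under these assumptions ";charset=utf-8" would become "text/plain;charset=utf-8", while "image/png" would be left as it is.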

This definition does not impose any length limit on data: URLs.

When doing URL parsing followed by this algorithm, implementations are allowed to skip some intermediate steps in order to process large URLs efficiently, as long as the "black box" behavior is the same.