URL Specification

This document provides a concise formal definition of URLs and the language of URLs. It offers a rephrasing of the WHATWG URL standard in an effort to better reveal the internal structure of URLs and the operations on them. It provides a formal grammar and a set of elementary operations that are eventually combined together in such a way that the end result is equivalent with the WHATWG standard.

The goals of this project are:

  • To provide a modular, concise formal specification of URLs that is behaviourally equivalent with, and covers all of the WHATWG URL standard with the exception of the web API's setters and the url-encoded form format.
  • To provide a general model for URLs that can express relative references and to define reference resolution by means of a number of elementary operations, in such a way that the end result is in agreement with the WHATWG standard.
  • To provide a concise way to describe the differences between the WHATWG standard and RFC 3986 and RFC 3987 as part of a larger effort.
  • To enable and support work towards an update of RFC 3987 that resolves the incompatibilities with the WHATWG standard.
  • To enable and support editorial changes to the WHATWG standard in the hope to recover the structural framework for relative references, normalisation and the hierarchy of equivalence relations that was put forward in RFC 3986.

Status

Prelude

This section introduces basic concepts and notational conventions that are used throughout the rest of this document.

Components

The word component is used throughout this document to mean a tagged value, denoted by (type value), where type is taken from a predefined set of component-types.

The component-type is typeset in boldface and it may be used stand-alone as a component-type or in prefix position to denote a component.

For example, scheme denotes a component-type, whilst e.g. (scheme http) denotes a scheme-component that has the string http as its value.

When we are dealing with a collection of components, and the collection contains exactly one component with a given component-type, then we may use the type of the component to refer to its value directly.

For example, if S is a sequence of components (dir x・file y・query z) then the phrase “the query of S” is used to refer to the value z whereas “the query component of S” is used to refer to the component (query z) as a whole.

The value of a component may be a sequence of components itself, in which case the component may be referred to as a compound-component for clarity.

Sequences

This specification uses a notation for sequences that makes no distinction between one-element sequences and single elements. It uses the following notation:

  • The empty sequence is denoted by ε.
  • The concatenation of two sequences S and T is denoted by S・T.

Strings and Code-Points

For the purpose of this specification,

  • A string is a sequence of characters.
  • A character is a single Unicode code point.
  • A code point is a natural number n < 17 × 2¹⁶.
  • The empty string is a sequence of zero code points. It is denoted by ε.

Code points are denoted by a number in hexadecimal notation preceded by u+, in boldface. Code points that correspond to printable ASCII characters are often denoted by their corresponding glyph, typeset in monospace and on a screened background. For example, u+41 and A denote the same code point. Strings that contain only printable ASCII characters are often denoted as a connected sequence of glyphs, typeset likewise.

The printable ASCII characters are code points in the range u+20 to u+7E, inclusive. Note that this includes the space character u+20.

Character Sets

A character-set is a set of characters. A character-range is the largest character-set that includes a given least character c and a greatest character d. Such a character-range is denoted by { c–d }. The union of e.g. { a–b } and { c–d } is denoted by { a–b, c–d }, and this notation is generalised to n-ary unions. Common character-sets that are used throughout this document are defined below:


Grammars

The notation name ::= expression is used to define a production rule of a grammar, where the expression uses square brackets ( [ … ] ) for optional rules, a postfix star ( * ) for zero-or-more, a postfix plus ( + ) for one-or-more, an infix vertical line ( | ) for alternatives, monospaced type for literal strings and an epsilon ( ε ) for the empty string. Concatenation takes the highest precedence, followed by ( * ) and ( + ), and ( | ) is used with lowest operator precedence. Parentheses are used for grouping and disambiguation.

Pattern Testing

The shorthand notation string :: rule is used to express that a string string can be generated by the production rule rule of a given grammar.
Likewise, the notation string :: expression is used to express that string can be generated by expression.

Percent Coding

This subsection is analogous to the section Percent-Encoding of RFC 3986 and the section Percent-encoded bytes of the WHATWG standard.

Bytes

For the purpose of this specification,

  • A byte is a natural number n < 2⁸.
  • A byte-sequence is a sequence of bytes.

Percent Encoded Bytes

A non-empty byte-sequence may be percent-encoded as a string by rendering each individual byte in two-digit hexadecimal notation, prefixed by %.
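To illustrate, percent-encoding a byte-sequence can be sketched as follows (a minimal Python sketch; the function name is ours and not part of any standard):

```python
def percent_encode_bytes(bs: bytes) -> str:
    """Render each byte in two-digit hexadecimal notation, prefixed by %."""
    return "".join("%{:02X}".format(b) for b in bs)

# The UTF-8 bytes of "é" (0xC3 0xA9) render as "%C3%A9".
```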

Percent Encoded String

A percent-encoded–string is a string that may have zero or more percent-encoded–byte-sequences embedded within, as follows:

The grammar above may be relaxed in situations where error recovery is desired. This is achieved by adding a pct-invalid rule as follows:

There is an ambiguity in the loose grammar due to the overlap between pct-encoded-bytes and pct-invalid combined with uncoded. The ambiguity must be resolved by using the pct-encoded-bytes rule instead of the pct-invalid rule wherever possible.

Percent Encoding

To percent-encode a string string using a given character-set encode-set …

The only tricky bit here is the encoding of multi-byte UTF-8 sequences, and the optional encoding override that specifies a non-UTF-8 encoding for the query component.

Percent Decoding

Percent decoding of strings is stratified into the following phases:

  • Analyse the string into pct-encoded-bytes, pct-invalid and uncoded parts according to the pct-encoded-string grammar.
  • Convert each of the pct-encoded-bytes parts to a byte-sequence and decode that byte-sequence to a string, assuming it uses the Unicode UTF-8 encoding. Leave the pct-invalid and uncoded parts unmodified.
  • Recompose the string by concatenating each of the parts.
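The three phases can be sketched in Python as follows (a hedged sketch: maximal runs of %XX triples play the role of pct-encoded-bytes, everything else is left as uncoded or pct-invalid, and mapping invalid UTF-8 to replacement characters is an assumption on our part):

```python
import re

# A maximal run of %XX triples corresponds to a pct-encoded-bytes part.
PCT_BYTES = re.compile(r"(?:%[0-9A-Fa-f]{2})+")

def percent_decode(s: str) -> str:
    """Decode pct-encoded-bytes parts as UTF-8; leave pct-invalid and
    uncoded parts unmodified."""
    def decode(match):
        run = match.group(0)
        bs = bytes(int(run[i + 1:i + 3], 16) for i in range(0, len(run), 3))
        return bs.decode("utf-8", errors="replace")
    return PCT_BYTES.sub(decode, s)
```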

URL Model

An URL, or specifically an URL-structure, is conceptually distinct from an URL-string. An URL-structure is a collection of components that adheres to a number of constraints, whereas an URL-string is a string that represents an URL. An URL-string has an internal structure that can be parsed as an URL-structure. Conversely, an URL-structure may be converted to an URL-string.

URL

An URL is a sequence of components. The components are ordered by their type in an ascending order where the types are taken from the ordered set:

scheme  <  authority  < drive  <  path-root  <  dir  <  file  <  query  <  fragment.

An URL must contain at most one component of each type, except for dir components, of which it may have any finite number. It must uphold the path-root-constraint: if an URL has an authority or a drive component, and it also has a dir or a file component, then it must also have a path-root component. If present, the values of the components have the following structure:

The scheme is a string scheme :: alpha (alpha | digit | + | - | .)*.
The authority component is a compound-component and its value is an Authority.
The drive is a two-character string drive :: alpha (: | |).
The path-root is the single character string /.
The dir, file, query and fragment are percent-encoded-strings.
However, the file must be a nonempty percent-encoded-string, if present.

In an URL-string the component values are delineated from each other by means of sigils. Most component-types have an associated sigil that is used either in a prefix– or postfix position to its value. However, in the case of the path-root the sigil is the single / character itself, and the file, host and username components do not have an associated sigil at all. The components of an URL are paired with sigils as follows:

scheme scheme  ⟼  scheme :
authority auth  ⟼  // auth
drive x  ⟼  / x
path-root s  ⟼  /
dir name  ⟼  name /
file name  ⟼  name
query query  ⟼  ? query
fragment frag  ⟼  # frag

Note that the second character : or | of a drive component value is not considered to be its sigil, but is instead a part of its value.

Types and Shapes

There is a further categorisation of URLs that is used in several of the operations that will be defined later.

Web-URL

A web-URL is an URL that has a web-scheme. A web-scheme is a string scheme such that (lowercase scheme) :: http | https | ws | wss | ftp.

File-URL

A file-URL is an URL that has a file-scheme. A file-scheme is a string scheme such that (lowercase scheme) :: file.

Schemeless-URL

A schemeless-URL is an URL that does not have a scheme component.

Authority Model

The authority component of an URL is a compound-component and its value is an Authority structure:

Authority

An Authority is a sequence of components ordered by their type, taken from the ordered set:

userinfo  <  host  <  port.

An Authority must have at most one component per type, but if it does have a userinfo or a port component then it must also have a host component. If present, the values of the components have the following structure:

The userinfo is a Userinfo structure.
The password is a percent-encoded-string.
The host is a Host structure.
The port is a natural number n  <  2¹⁶ or the empty string ε.

In an URL-string, the components of an Authority are paired with sigils as follows:

userinfo info  ⟼  info @
host value  ⟼  value
port ε  ⟼  :
port n  ⟼  : n

Userinfo

The Userinfo is a sequence of components ordered by their type, taken from the ordered set:

username < password.

The Userinfo must have a single username component, and at most one password component.

Finally, the components of the Userinfo are paired with sigils as follows:

username name  ⟼  name
password pass  ⟼  : pass

Host

A Host is either:

  • An ipv6-address,
  • an opaque-host,
  • an ipv4-address, or
  • a domain-name.

The WHATWG standard specifies scheme-dependent expectations on the Host of an URL. For any generic URL it merely enforces that its opaque-host does not contain certain characters, as is specified by the URL-grammar. However, for file and web-URLs it attempts to parse their opaque-host as a domain or ipv4-address. This additional Host parsing is stratified into the following phases:

  • Percent decode.
  • Apply domain-to-ASCII
  • Err on 'forbidden domain codepoints'.
  • Detect and interpret IPv4 addresses.

IPv6 Address

IPv6 Address — Strict

An IPv6 address-string is a representation of a natural number n < 2¹²⁸ that is used as an identifier. It is accurately described by the production rule IPv6address in the Host section of RFC 3986.

The IPv6 address parser in the WHATWG standard however implies a more tolerant definition of IPv6 address-strings.

IPv6 Address — Loose

Domain and IPv4 Address

IPv4 Address

An IPv4 address-string consists of one to four dot-separated numbers with an optional trailing dot. The numbers may use decimal, octal or hexadecimal notation, as follows:

Note that 0x is parsed as a hexadecimal number; it is interpreted as 0.

There is an additional semantic constraint that can render invalid an ipv4-address that uses the loose grammar: it must not represent a number that exceeds the addressable range of 2³², where the number of segments determines the magnitude of each segment. …
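A Python sketch of the loose IPv4 interpretation, including the range constraint (function names are ours, and the error-recovery details of the WHATWG host parser are simplified):

```python
def parse_number(s: str):
    """Decimal, octal (leading 0) or hexadecimal (0x) notation; '0x' is 0."""
    if s[:2].lower() == "0x":
        digits = s[2:]
        if all(c in "0123456789abcdefABCDEF" for c in digits):
            return int(digits or "0", 16)
        return None
    if len(s) > 1 and s[0] == "0":
        return int(s, 8) if all(c in "01234567" for c in s[1:]) else None
    return int(s) if s.isdigit() else None

def parse_ipv4(s: str):
    """Return the address as a number n < 2**32, or None if invalid."""
    parts = s.split(".")
    if parts and parts[-1] == "":      # optional trailing dot
        parts.pop()
    if not 1 <= len(parts) <= 4:
        return None
    nums = [parse_number(p) for p in parts]
    if any(n is None for n in nums):
        return None
    *head, last = nums
    # each leading segment carries one byte; the last carries the rest
    if any(n > 255 for n in head) or last >= 256 ** (5 - len(nums)):
        return None
    return sum(n * 256 ** (3 - i) for i, n in enumerate(head)) + last
```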

IPv4 Address – Strict

A strict IPv4 address-string consists of four dot-separated numbers, without a trailing dot, where in addition each number must use the shortest possible decimal representation of a natural number n ≤ 255.

The use of the shortest decimal representation can be expressed as follows:
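As an illustration, the condition can be sketched in code rather than grammar notation (a hedged sketch; the function name is ours):

```python
def is_strict_ipv4(s: str) -> bool:
    """Four dot-separated decimal numbers n <= 255, each in its shortest
    decimal representation (no leading zeroes), and no trailing dot."""
    parts = s.split(".")
    return len(parts) == 4 and all(
        p.isdigit() and str(int(p)) == p and int(p) <= 255 for p in parts
    )
```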

Component Characters

When it comes to handling characters that may occur in URL-component values, we have ‘for historical reasons’ ended up in a situation where we distinguish, per component, characters that are:

  • v:  valid;
  • E:  valid but percent-encoded;
  • T:  invalid but tolerated;
  • F:  invalid and fixed by percent-encoding;
  • R:  invalid and rejected.

In order to specify how characters must be handled in components, it is useful to first divide the entire space of characters into non-overlapping character sets as follows:

This collection of character sets covers the entire character space. To exhaustively specify how characters in components should be handled, we divide each of these sets into further subdivisions and specify, per component, the status of the characters in each subdivision:

                              username  password  opaque-host  dir and file  opaque path  query  fragment
unreserved   - . _ ~          v         v         v            v             v            v      v
             alpha ∪ digit    v         v         v            v             v            v      v
             other-unicode    E         E         E            E             E            E      E
sub-delims   ! $ & ( ) * + ,  v         v         v            v             v            v      v
             '                v         v         v            v             v            E1     v
             ; =              E         E         v            v             v            v      v
gen-delims   @                F         F         R            v             v            v      v
             :                n/a       F         n/a          v             v            v      v
             /                n/a       n/a       n/a          n/a           T            v      v
             ?                n/a       n/a       n/a          n/a           n/a          v      v
             #                n/a       n/a       n/a          n/a           n/a          n/a    T
pct          %                n/a       n/a       n/a          n/a           n/a          n/a    n/a
invalid      u+9, u+A, u+D    n/a       n/a       n/a          n/a           n/a          n/a    n/a
             u+0              F         F         R            F             F            F      F
             other control    F         F         F            F             F            F      F
             u+20 (space)     F         F         R            F             T            F      F
             "                F         F         T            F             T            F      F
             < >              F         F         R            F             T            F      F
             `                F         F         T            F             T            T      F
             ^                F         F         R            F             T            T      T
             [ \ ] |          F         F         R            T             T            T      T
             { }              F         F         T            F             T            T      T

Note that applying the rules for encoding component characters does not necessarily produce valid URLs, specifically, if a component contains a character that is marked as T: invalid but tolerated.

Note that the apostrophe ' must instead be left untouched in the query of non-special URLs, and that the path of generic URLs that have no authority nor path-root is handled as per the opaque path column.

Reference Resolution

This section defines reference resolution operations that are analogous to the algorithms that are described in the chapter Reference Resolution of RFC 3986. The operations that are defined in this section are used in a subsequent section to define a parse-resolve-and-normalise operation that accurately describes the behaviour of the ‘basic URL parser’ as defined in the WHATWG URL standard.

Reference resolution as defined in this section does not involve URL-strings. It operates on URLs as defined in the URL Model section above. In contrast with RFC 3986 and with the WHATWG standard, it does not do additional normalisation, which is relegated to the section Equivalences and Normalisation instead.

Order and Prefix

A property that is particularly useful is the order of an URL. Colloquially, the order is the type of the first component of an URL. The order may be used as an argument to specify various prefixes of an URL.

The Order of an URL

The order of an URL (ord url) is defined to be:

  • fragment if url is the empty URL.
  • The type of its first component otherwise.

Order Prefix

The order-prefix (url upto order) is defined to be
the shortest prefix of url that contains:

  • all components of url with a type strictly smaller than order and
  • all dir components with a type weakly smaller than order.
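Modelling an URL as a sequence of (type, value) pairs, the order and the order-prefix can be sketched as follows (the representation is ours, chosen for illustration):

```python
ORDER = ["scheme", "authority", "drive", "path-root",
         "dir", "file", "query", "fragment"]

def ord_url(url):
    """The order of an URL: the type of its first component,
    or 'fragment' for the empty URL."""
    return url[0][0] if url else "fragment"

def upto(url, order):
    """The order-prefix: all components strictly below `order`,
    plus dir components weakly below it."""
    k = ORDER.index(order)
    return [(t, v) for (t, v) in url
            if ORDER.index(t) < k or (t == "dir" and ORDER.index(t) <= k)]
```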

Reference Transformation

Based on the order and the order-prefix we define a “rebase” operation that slightly generalises reference transformation as specified in the section Transform References of RFC 3986.

Rebase

The rebase operation (url onto base-url) is defined to return
the shortest URL that has base-url upto (ord url) as a prefix and url as a postfix.

Be aware that rebase does a bit more than simple sequence concatenation. It may add a path-root component to satisfy the path-root-constraint of the URL model. For example, if base is the URL represented by //host and input is the URL represented by foo/bar then (input onto base) is represented by //host/foo/bar, thus containing a path-root even though neither base nor input has one.
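A sketch of rebase over the same (type, value) representation, including the insertion of a path-root to satisfy the path-root-constraint:

```python
ORDER = ["scheme", "authority", "drive", "path-root",
         "dir", "file", "query", "fragment"]
IDX = {t: i for i, t in enumerate(ORDER)}

def rebase(url, base):
    """(url onto base): the order-prefix of base up to the order of url,
    followed by url; a path-root is inserted if the path-root-constraint
    requires one."""
    k = IDX[url[0][0]] if url else IDX["fragment"]
    prefix = [(t, v) for (t, v) in base
              if IDX[t] < k or (t == "dir" and IDX[t] <= k)]
    result = prefix + url
    types = {t for (t, _) in result}
    if ({"authority", "drive"} & types) and ({"dir", "file"} & types) \
            and "path-root" not in types:
        # insert a path-root before the first component above it in the order
        i = next(i for i, (t, _) in enumerate(result)
                 if IDX[t] > IDX["path-root"])
        result.insert(i, ("path-root", "/"))
    return result
```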

Rebase Properties

The rebase operation has a number of pleasing mathematical properties, as follows:

  • ord (url2 onto url1)  is the least type of  { ord url1, ord url2 }
  • (url3 onto url2) onto url1  =  url3 onto (url2 onto url1)
  • url1 onto ε  =  url1
  • ε onto url1  =  url1  — if url1 does not have a fragment.

Forcing

Forcing is used as a final step in the process of resolving a web-URL or a file-URL. It ensures that the forced URL has an authority component and a path-root component. Note that it is possible for the force operations to fail.

Forcing a File URL

To force a file-URL url:

  • If url does not have an authority then set its authority component to (authority ε).
  • If otherwise the authority of url has a username or a port then fail.
  • If url does not have a drive then set its path-root component to (path-root /).

Forcing a Web URL

To force a web-URL url:

  • Set the path-root component of url to (path-root /).
  • If url has a non-empty authority then return.
  • Otherwise let component be the first dir or file component whose value is not ε.
    If no such component exists, fail.
  • Remove all dir or file components that precede component and remove component as well.
  • Let auth be the value of component parsed as an Authority and set the authority of url to auth. If the value cannot be parsed as an Authority then fail.
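A simplified sketch of forcing a web-URL (the URL is modelled as a dict and the path as a list of segment values; the candidate authority is taken verbatim rather than parsed into userinfo, host and port, so the final failure case of the last step is elided):

```python
def force_web(url):
    """Force a web-URL; returns the forced url, or None on failure."""
    url = dict(url)
    url["path-root"] = "/"
    if url.get("authority"):
        return url                       # non-empty authority: done
    path = list(url.get("path", []))
    while path and path[0] == "":
        path.pop(0)                      # drop empty segments before the host
    if not path:
        return None                      # no candidate authority: fail
    url["authority"] = path.pop(0)       # simplified: host-only authority
    url["path"] = path
    return url
```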

Force Properties

The force operation does not have pleasing mathematical properties with respect to the rebase operation. The following equalities do not hold in general:

force ((force url1) onto (force url2)) ≠ force (url1 onto url2)
force ((force url1) onto url2) ≠ force (url1 onto url2)
force (url1 onto (force url2)) ≠ force (url1 onto url2)

Resolution

The subsection Transform References of RFC 3986 specifies two variants of reference resolution: a generic, strict variant and an alternative non-strict variant. I have chosen to rename the non-strict variant to legacy resolution. This specification adds a third variant that characterises the behaviour that is specified in the WHATWG standard.


Strict Resolution

The strict-resolution (strict-resolve url base) of an URL url onto an URL base is defined to be url onto base — if url has a scheme or base has a scheme. Otherwise resolution fails.

Legacy Resolution

The legacy-resolution (legacy-resolve url base) is defined to be the strict-resolution (strict-resolve ~url base) where ~url is:

  • url with its scheme removed if both url and base have a scheme and the value of their scheme components case-insensitively compare equal, or
  • url itself otherwise.
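The strict and legacy variants can be sketched together over the component-sequence model (a hedged sketch; the compact rebase here omits the path-root insertion described in the Rebase section):

```python
ORDER = {"scheme": 0, "authority": 1, "drive": 2, "path-root": 3,
         "dir": 4, "file": 5, "query": 6, "fragment": 7}

def rebase(url, base):
    # base upto (ord url), followed by url (path-root insertion omitted)
    k = ORDER[url[0][0]] if url else ORDER["fragment"]
    prefix = [c for c in base
              if ORDER[c[0]] < k or (c[0] == "dir" and ORDER[c[0]] <= k)]
    return prefix + url

def strict_resolve(url, base):
    if any(t == "scheme" for t, _ in url) or any(t == "scheme" for t, _ in base):
        return rebase(url, base)
    return None                              # resolution fails

def legacy_resolve(url, base):
    us = next((v for t, v in url if t == "scheme"), None)
    bs = next((v for t, v in base if t == "scheme"), None)
    if us is not None and bs is not None and us.lower() == bs.lower():
        url = [c for c in url if c[0] != "scheme"]   # drop the matching scheme
    return strict_resolve(url, base)
```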

WHATWG Resolution

The whatwg-resolution of url onto base is defined to be:

  • force (legacy-resolve url base)
    • if url is a web-URL or a file-URL, or
    • if url does not have a scheme and base is a web-URL or a file-URL;
  • strict-resolve url base
    • if otherwise the first component of url is a scheme or a fragment, or
    • if base has an authority or a path-root;
  • otherwise, the operation fails.

If force modifies its input then “applications are encouraged” to issue a validation warning. If in the process of whatwg-resolution either the force operation, or the internally used strict-resolution operation fails, then the whatwg-resolution fails as well.

Parsing

Parsing is the process of converting an URL-string to an URL.
Parsing is stratified into the following phases:

  1. Preprocessing.
  2. Selecting a parser mode.
  3. Parsing.
  4. Decoding and parsing the host.

Preprocessing

The appendix Delimiting a URI in Context of RFC 3986 states that surrounding white-space should be removed from an URI when it is extracted from its surrounding context. The WHATWG standard makes this more explicit and specifies a preprocessing step that removes specific control– and white-space–characters from the input string before parsing.

Preprocessing

Before parsing, the input string input must be preprocessed:

  1. Remove all leading and trailing c0-space characters from input.
  2. Remove all u+9 (tab), u+A (line-feed) and u+D (carriage-return) characters from input.
  3. If the result starts with a web-scheme followed by :, with a file-scheme followed by :, or if it does not start with a scheme followed by : at all, then furthermore replace with the / character all occurrences of \ that occur before the first character that is either an ? or an # (or anywhere in the string, if there is no such character).
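The three steps can be sketched as follows (the `special` flag stands for the scheme condition of step 3 and is assumed to be computed by the caller):

```python
import re

# The c0-space characters: the C0 controls and the space character.
C0_SPACE = "".join(chr(c) for c in range(0x21))

def preprocess(s: str, special: bool) -> str:
    """WHATWG-style preprocessing of an input string (sketch)."""
    s = s.strip(C0_SPACE)                              # step 1
    s = s.replace("\t", "").replace("\n", "").replace("\r", "")  # step 2
    if special:                                        # step 3
        m = re.search(r"[?#]", s)
        cut = m.start() if m else len(s)
        s = s[:cut].replace("\\", "/") + s[cut:]
    return s
```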

URL Grammar

We can now specify the full grammar for URL-strings, using an alternative version for the auth-path rule for file-URLs:

url ::= [ scheme : ] ( auth-path | path ) [ ? query ] [ # fragment ]
The general auth-path rule:
auth-path ::= // authority [ path-root  path-rel ]
The auth-path rule for file-URLs:
auth-path ::= auth-drive [ path-root  path-rel ]
auth-drive ::= auth-drive-invalid  |  // authority [ / drive ]  |  / drive
auth-drive-invalid ::= [ // authorityε ] drive
drive ::= alpha (: | |)

A forced file-URL however does not have a username, password or a port, and will have a path-root. Moreover if it has a non-empty authority then its host will not be an opaque host as it will have been processed and verified to be a domain.

Rules for the authority and the path:
authorityε ::= ε
authority ::= ε  |  [ userinfo @ ] host [ : port ]
userinfo ::= username [ : password ]
host ::= [ ip6-address ]  |  opaque-host
port ::= ε | digit+
path ::= [ path-root ]  path-rel
path-root ::= /
path-rel ::= ( dir / )* [ file ]
If the port is not the empty string, then it must in addition be a decimal representation of a natural number n < 2¹⁶ where leading zeroes are allowed.

A forced web-URL will have a non-empty authority and a path-root. Moreover if it has a host then it will not be an opaque host as it will have been processed and verified to be a domain.

Rules for the components:
scheme ::= alpha (alpha | digit | + | - | .)*
username ::= ( uchar | pct )*
password ::= ( pchar | pct )*
opaque-host ::= ( hchar | pct )+
dir ::= ( pchar | pct )*
file ::= ( pchar | pct )+
query ::= ( qchar | pct )*
fragment ::= ( fchar | pct )*
Using the following rule for valid– or invalid percent encoded bytes:
pct ::= pct-encoded-byte  |  pct-invalid
The rules are based on the following character-sets:
uchar := any \ { %, #, ?, /, : }
hchar := any \ { %, #, ?, /, :, @ }  \ { u+0, , <, >, [, \, ], ^, | }
pchar := any \ { %, #, ?, / }
qchar := any \ { %, # }
fchar := any \ { % }

This grammar provides a lenient and a strict description of URL strings at once, as follows: a strictly valid URL string does not use the pct-invalid nor the auth-drive-invalid rules, and its components must not contain any characters that are specified as R, T or F for that component-type as per the Component Characters section. Moreover, a strictly valid URL must not contain a Userinfo component in its authority.

Equivalences and Normalisation

This section is analogous to the section Normalization and Comparison of RFC 3986. The RFC, however, does not prescribe a particular normal form; the WHATWG standard does, albeit implicitly.

Path Segment Normalisation

Path segment normalisation involves the interpretation of dotted-segments. Colloquially, a single-dot segment has the meaning of “select the current directory” whereas a double-dot segment has the meaning of “select the parent directory”. Dotted segments are defined by the following rules, where the addition of %2e and %2E has been motivated by security concerns. Again this is in accordance with the WHATWG standard.


Path equivalence is defined by the following equations. The equations can be exhaustively applied from left-to-right to normalise an URL.

drive  x | ≈ drive  x :
dir x ≈ ε — if x :: dot
dir x・dir y ≈ ε — if y :: dots and not x :: dots
path-root /・dir y ≈ path-root / — if y :: dots
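The equations can be applied with a single left-to-right pass over the path components, as in this sketch (dot and dots follow the rules above, including the %2e forms):

```python
def is_dot(s):
    return s.lower() in (".", "%2e")

def is_dots(s):
    return s.lower() in ("..", ".%2e", "%2e.", "%2e%2e")

def normalise_path(components):
    """Exhaustively apply the path-equivalence equations left-to-right.
    Components are (type, value) pairs of types path-root, dir and file."""
    out = []
    for t, v in components:
        if t == "dir" and is_dot(v):
            continue                                # dir x ≈ ε  if x :: dot
        if t == "dir" and is_dots(v):
            if out and out[-1][0] == "dir" and not is_dots(out[-1][1]):
                out.pop()                           # dir x・dir y ≈ ε
                continue
            if out and out[-1][0] == "path-root":
                continue                            # path-root・dir y ≈ path-root
        out.append((t, v))
    return out
```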

Authority Normalisation

Authority equivalence is defined by the following equations. Like path normalisation, the equations must be exhaustively applied from left-to-right to normalise an URL. This has the same effect as removing any empty port or password component, and if the URL does not have a password after that, to also remove any empty username component.

password ε ≈ ε
userinfo (username ε) ≈ ε
port ε ≈ ε

Scheme-Based Authority Validation

If an URL has a scheme, then a number of additional requirements may be enforced on the authority; specifically, these should be enforced or fixed whilst resolving (as opposed to rebasing) an URL.


  • file and web URLs must have an authority
  • file URL authority must not have a userinfo nor a port component
  • web URL must have a non-empty authority
  • file- and web URL host must not be an opaque host

Scheme-Based Authority Normalisation

If an URL has a scheme, then a number of additional equivalences apply to the authority. Normalisation according to these rules involves the removal of default ports, and similarly, removing the host from a file-URL if its value is localhost.

scheme http ・authority (xs・port 80) ≈ scheme http ・authority xs
scheme ws ・authority (xs・port 80) ≈ scheme ws ・authority xs
scheme ftp ・authority (xs・port 21) ≈ scheme ftp ・authority xs
scheme wss ・authority (xs・port 443) ≈ scheme wss ・authority xs
scheme https ・authority (xs・port 443) ≈ scheme https ・authority xs
scheme file ・authority (host localhost) ≈ scheme file ・authority ε

These rules apply in combination with the following rule that states that scheme equivalence is case-insensitive.

scheme scheme ≈ scheme (lowercase scheme)

Printing

Printing is the process of converting an URL to an URL-string. Printing can be stratified into the following three phases:

  1. Normalising the URL for printing.
  2. Converting each of the components of the URL to a string.
  3. Composing the final URL-string from the printed components.

Normalise for Printing

To print an URL, it must first be subjected to an additional normalisation operation, and it must be percent encoded as follows.

Normalise for Printing

If an URL does not have a drive nor an authority but it does have an empty first dir component (dir ε) then it must be normalised for printing by inserting a component (dir .) immediately before its first dir component.

Otherwise, if the first component of an URL is a dir or a file component and its value starts with a scheme-like string, i.e.

value :: alpha (alpha | digit | + | - | .)* : any*,

then the first occurrence of : in value must be replaced with its percent-encoding %3A.

This additional normalisation step is necessary because it is not possible to represent the affected URL as an URL-string otherwise.

Consider for example the URL (path-root /)・(dir ε)・(file foo). Printing this URL without an additional normalisation step would result in //foo which represents the URL (authority (host foo)) instead.

There is a similar issue around URLs with drive letters, but these issues are not properly addressed in the current WHATWG standard, making it difficult for us to prescribe appropriate counter measures.
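Both printing-normalisation rules can be sketched over the (type, value) representation as follows (a hedged sketch of the two rules above):

```python
import re

SCHEME_LIKE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:")

def normalise_for_printing(url):
    """url is an ordered list of (type, value) pairs."""
    types = [t for t, _ in url]
    if "drive" not in types and "authority" not in types and "dir" in types:
        i = types.index("dir")
        if url[i][1] == "":
            # insert (dir .) before the empty first dir component
            return url[:i] + [("dir", ".")] + url[i:]
    if url and url[0][0] in ("dir", "file") and SCHEME_LIKE.match(url[0][1]):
        t, v = url[0]
        # percent-encode the first : of a scheme-like first segment
        return [(t, v.replace(":", "%3A", 1))] + url[1:]
    return url
```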

Printing URLs

To convert an Authority to an Authority-string, replace each of the sub-components with their component-value concatenated with their prefix– or postfix-sigil, if any. This results in a sequence of strings, which is converted to the final result by concatenating them in order.

To convert an URL to an URL-string, first convert the value of its authority component (if any) to an Authority-string. Then continue to replace each component with their component-value concatenated with their identifying sigil, if any. This results in a sequence of strings, which is converted to a single string by simply concatenating them in order.
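The sigil table from the URL Model section translates directly into a printing sketch (the authority value is assumed to be an Authority-string already):

```python
SIGILS = {
    "scheme":    lambda v: v + ":",
    "authority": lambda v: "//" + v,
    "drive":     lambda v: "/" + v,
    "path-root": lambda v: "/",
    "dir":       lambda v: v + "/",
    "file":      lambda v: v,
    "query":     lambda v: "?" + v,
    "fragment":  lambda v: "#" + v,
}

def print_url(url):
    """Concatenate each component value with its prefix- or postfix-sigil."""
    return "".join(SIGILS[t](v) for t, v in url)
```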

Concluding

This final section specifies the behaviour of a parse-resolve-and-normalise operation that characterises the behaviour of the ‘basic URL parser’ as described in the WHATWG URL standard.

It's easy!
parse-resolve-and-normalise (string, base) :=
  • Preprocess string
  • Detect the parser mode, with the fallback mode as indicated by the base
  • Parse the preprocessed string according to the grammar to obtain url1
  • Force resolve url1 against base
  • Normalise and percent encode the result.

That results in an URL-structure. One can then wrap around that to describe the URL class of web browsers:

  • The href getter returns the printed URL.
  • The protocol getter returns the scheme + : or ε if absent.
  • The username and password getters return ε if absent, or the value of the corresponding component otherwise.
  • The host getter returns ε if absent, the value of the host if the port is absent, and the value of the host + : + the value of the port otherwise.
  • The hostname getter returns ε if the URL has no host, otherwise it returns the value of the host.
  • pathname (…)
  • The search getter returns ε if the URL has no query, otherwise it returns ? + the value of the query.
  • The hash getter returns ε if the URL has no fragment, otherwise it returns # + the value of the fragment.