URL Specification

This document provides a concise formal definition of URLs and the language of URLs. It offers a rephrasing of the WHATWG URL standard in an effort to better reveal the internal structure of URLs and the operations on them. It provides a formal grammar and a set of elementary operations that are eventually combined in such a way that the end result is equivalent to the WHATWG standard.

The goals of this document are:

  • To provide a modular, concise formal specification of URLs that is behaviourally equivalent to, and covers all of, the WHATWG URL standard, with the exception of the web API's setters and the url-encoded form format.
  • To provide a general model for URLs that can express relative references, and to define reference resolution by means of a number of elementary operations, in such a way that the end result is in agreement with the WHATWG standard.
  • To provide a concise way to describe the differences between the WHATWG standard and RFC 3986 and RFC 3987, as part of a larger effort:
  • To enable and support work towards an update of RFC 3987 that resolves the incompatibilities with the WHATWG standard.
  • To enable and support editorial changes to the WHATWG standard, in the hope of recovering the structural framework for relative references, normalisation and the hierarchy of equivalence relations on URLs that was put forward in RFC 3986.

Status of this document

This is a development version of the specification.

Structure of this document

As of this writing, this document has a compact format that is focused specifically on the definition and specification of URLs and operations on URLs.

The sections are laid out as follows.

Preliminaries

This section introduces basic concepts and notational conventions which are used throughout the rest of this document.

Sequences

This specification uses an algebraic notation for sequences that makes no distinction between one-element sequences and their first element. It uses the following notation:

  • The empty sequence is denoted by ε.
  • The concatenation of two sequences S and T is denoted by S・T.

Furthermore, the notation e : T is used to denote the concatenation of a single-element e in front of a sequence T.

The notation e : T may be used to inductively specify operations on sequences. For example, the length of a sequence is defined by length ε = 0 and length (e : T) = 1 + length T.
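This inductive style translates directly into code. A minimal sketch (in Python, with tuples standing in for sequences; the names are illustrative and not part of this specification):

```python
# The empty sequence ε is modelled as the empty tuple.
EMPTY = ()

def cons(e, t):
    """e : T — prepend the single element e to the sequence T."""
    return (e,) + tuple(t)

def length(s):
    """length ε = 0 and length (e : T) = 1 + length T."""
    if s == EMPTY:
        return 0
    return 1 + length(s[1:])
```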

Parentheses are used for disambiguation and as visual aids. Paired parentheses are not part of a sequence-constructing operation.

Components

The word component is used throughout this document to mean a tagged value, denoted by (type value), where type is taken from a predefined set of component-types. The value of a component may itself be a sequence of components, in which case the component may be referred to as a compound-component for clarity.

Both component-types and their corresponding component-constructing operator are typeset in boldface. For example, scheme denotes a component-type, whilst (scheme value) denotes a scheme-component. Again, parentheses are used to disambiguate and as visual aids. They are not part of the component-constructing operator.

When a collection of components contains exactly one component for a given component-type, then the type may be used as a reference to the component's value. For example, given a sequence of components S := (dir x・file y・query z), the phrase “the query of S” identifies the value z whereas the phrase “the query component of S” identifies (query z).

Strings and Grammars

Strings and Code-Points

For the purpose of this specification,

  • A string is a sequence of characters.
  • A character is a single Unicode code point.
  • A code point is a natural number n < 17 × 2¹⁶.
  • The empty string is a sequence of zero code points. It is denoted by ε.

Code points are denoted by a number in hexadecimal notation preceded by u+, in boldface. In addition, code points that correspond to printable ASCII characters are often denoted by their corresponding glyph, typeset in monospace and on a screened background. For example, u+41 and A denote the same code point. Strings that contain only printable ASCII characters are often denoted as a connected sequence of glyphs, typeset likewise.

The printable ASCII characters are the code points in the range u+20 to u+7E, inclusive. Note that this includes the space character u+20.

Character Sets

A character-set is a set of characters. A character-range is the largest character-set that includes a given least character c and a greatest character d. Such a character-range is denoted by { c–d }. The union of e.g. { a–b } and { c–d }, is denoted by { a–b, c–d }, and this notation is generalised to n-ary unions. Common character-sets that are used throughout this document are defined below:

any := { u+0–u+10FFFF }
control-c0 := { u+0–u+1F }
c0-space := { u+0–u+20 }
printable-ASCII := { u+20–u+7E } — i.e. { –~ }
octal-digit := { 0–7 }
digit := { 0–9 }
hex-digit := { 0–9,  A–F,  a–f }
digit-nonzero := { 1–9 }
alpha := { A–Z, a–z }
del-c1 := { u+7F–u+9F }
control-c1 := { u+80–u+9F }
latin-1 := { u+A0–u+FF }
surrogate := { u+D800–u+DFFF }
non-char := { u+FDD0–u+FDEF } ∪ { c | c in any and (c + 2) mod 2¹⁶  ≤  1 }
base-char := any \ control-c0 \ del-c1 \ surrogate \ non-char

Grammars

The notation name ::= expression is used to define a production rule of a grammar, where the expression uses square brackets ( [ … ] ) for optional rules, a postfix star ( * ) for zero-or-more, a postfix plus ( + ) for one-or-more, an infix vertical line ( | ) for alternatives, monospaced type for literal strings and an epsilon ( ε ) for the empty string. Parentheses are used for grouping and disambiguation.

Pattern Testing

The shorthand notation string :: rule is used to express that a string string can be generated by the production rule rule of a given grammar.
Likewise, the notation string :: expression is used to express that string can be generated by expression.

Percent Coding

This subsection is analogous to the section Percent-Encoding of RFC 3986 and the section Percent-encoded bytes of the WHATWG standard.

Bytes

For the purpose of this specification,

  • A byte is a natural number n < 2⁸.
  • A byte-sequence is a sequence of bytes.

Percent Encoded Bytes

A non-empty byte-sequence may be percent-encoded as a string by rendering each individual byte in two-digit hexadecimal notation, prefixed by %.

pct-encoded-byte ::= % hex-digit hex-digit
pct-encoded-bytes ::= pct-encoded-byte+

Percent Encoded String

A percent-encoded-string is a string that may have zero or more percent-encoded byte-sequences embedded within it, as follows:

pct-encoded-string ::= ( pct-encoded-bytes  |  uncoded )*
uncoded ::= non-pct+
non-pct := any \ { % }

The grammar above may be relaxed in situations where error recovery is desired. This is achieved by adding a pct-invalid rule as follows:

pct-encoded-string ::= ( pct-encoded-bytes  |  pct-invalid  |  uncoded )*
pct-invalid ::= % [ hex-digit ]

There is an ambiguity in the loose grammar, due to the overlap between pct-encoded-byte on the one hand and pct-invalid combined with uncoded on the other. The ambiguity must be resolved by preferring the pct-encoded-byte rule over the pct-invalid rule wherever possible.

Percent Encoding

To percent-encode a string string using a given character-set encode-set …

To do: complete this definition. It must also discuss the encoding override, which is allowed only for the query; all other components must use UTF-8.
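Although the definition above is still to be completed, the intended operation can be indicated by a minimal sketch (Python; the function name and the set-based representation of the encode-set are assumptions, and the sketch always uses UTF-8, i.e. it ignores the encoding override):

```python
def percent_encode(string, encode_set):
    """Percent-encode `string`: characters in `encode_set`, and all
    non-ASCII characters, are UTF-8 encoded and each resulting byte is
    rendered as a pct-encoded-byte %XX."""
    out = []
    for ch in string:
        if ch in encode_set or ord(ch) > 0x7E:
            out.extend('%{:02X}'.format(b) for b in ch.encode('utf-8'))
        else:
            out.append(ch)
    return ''.join(out)
```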

Percent Decoding

Percent decoding of strings is stratified into the following phases:

  • Analyse the string into pct-encoded-bytes, pct-invalid and uncoded parts according to the pct-encoded-string grammar.
  • Convert the pct-encoded-bytes to a byte-sequence and decode that byte-sequence to a string, whilst leaving the pct-invalid and uncoded parts unmodified.
  • Recompose the string.
To do: complete this definition. Note that the decoding of the pct-encoded-bytes must use UTF-8.
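The three phases above can be sketched in a few lines (Python; the regular expression encodes the loose pct-encoded-string grammar, and the alternation order prefers pct-encoded-bytes over pct-invalid, resolving the ambiguity as required):

```python
import re

# pct-encoded-bytes | pct-invalid | uncoded, tried in that order.
TOKENS = re.compile(r'((?:%[0-9A-Fa-f]{2})+)|(%[0-9A-Fa-f]?)|([^%]+)')

def percent_decode(string):
    out = []
    for enc, invalid, uncoded in TOKENS.findall(string):
        if enc:   # pct-encoded-bytes: convert to bytes, then UTF-8 decode
            data = bytes(int(enc[i+1:i+3], 16) for i in range(0, len(enc), 3))
            out.append(data.decode('utf-8', errors='replace'))
        else:     # pct-invalid and uncoded parts are left unmodified
            out.append(invalid or uncoded)
    return ''.join(out)   # recompose the string
```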

URL Model

An URL is a sequence of components that is subject to a number of additional constraints. The ordering of the components in the sequence is analogous to the hierarchical syntax of an URI as described in Hierarchical Identifiers in RFC 3986.

It is important to stress the distinction between an URL and an URL-string.
An URL is a structure, indeed a special sequence of components, whereas an URL-string is a special kind of string that represents an URL. Conversions between URLs and URL-strings are described in the sections Parsing and Printing.

URL

An URL is a sequence of components that occur in ascending order by component-type, where component-type is taken from the ordered set:

scheme < authority < drive < path-root < dir < file < query < fragment.

  • An URL contains at most one component per type, except for dir components, of which it may have any finite amount.
  • path-root-constraint: If an URL has an authority or a drive component, and it has a dir or a file component, then it also has a path-root component.

Authority

An Authority is a sequence of components ordered by their type, taken from the ordered set:

username < password < host < port.

  • Authorities have at most one component per type.
  • If an Authority has a password component then it also has a username component.
  • If an Authority has a username or a port component then it also has a host component.

Host

A Host is either:

  • An ipv6-address,
  • an opaque-host,
  • an ipv4-address, or
  • a domain-name.

URL components

Whenever present, the components of an URL are subject to the following constraints:

  • scheme-string: The scheme of an URL is a string scheme :: alpha (alpha | digit | + | - | .)*.
  • The authority of an URL is an Authority.
  • The path-root of an URL is the string /.
  • The drive of an URL is a string drive :: alpha (: | |).
  • The file of an URL is a nonempty string.
  • The host of an Authority is a Host.
  • The port of an Authority is either ε or a natural number n < 2¹⁶.
  • For all other components present, the components' values are strings.
  • A dir component is a component (dir name) where name is a string.
  • The query of an URL is a string.
  • The fragment of an URL is a string.

Note that arbitrary code points are allowed in component values unless explicitly specified otherwise. The restrictions on the code points in an URL-string are discussed in the Parsing section.

Types and Shapes

An URL must meet a number of additional soft constraints in order to be called valid. However, implementations must tolerate URLs that are not valid, as an error recovery strategy.

Valid URL

A valid URL must not have a username or a password component and it must not have components that contain invalid percent-encode sequences.

There is a further categorisation of URLs that is used in several of the operations that will be defined later.

Web-URL

A web-URL is an URL that has a web-scheme. A web-scheme is a string scheme such that (lowercase scheme) :: http | https | ws | wss | ftp.

File-URL

A file-URL is an URL that has a file-scheme. A file-scheme is a string scheme such that (lowercase scheme) :: file.
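These two categorisations can be sketched directly (Python; the function names are illustrative):

```python
# The five web-schemes, compared case-insensitively.
WEB_SCHEMES = {'http', 'https', 'ws', 'wss', 'ftp'}

def is_web_scheme(scheme):
    """(lowercase scheme) :: http | https | ws | wss | ftp"""
    return scheme.lower() in WEB_SCHEMES

def is_file_scheme(scheme):
    """(lowercase scheme) :: file"""
    return scheme.lower() == 'file'
```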

Reference Resolution

This section defines reference resolution operations that are analogous to the algorithms that are described in the chapter Reference Resolution of RFC 3986. The operations that are defined in this section are used in a subsequent section to define a parse-resolve-and-normalise operation that accurately describes the behaviour of the ‘basic URL parser’ as defined in the WHATWG URL standard.

Reference resolution as defined in this section does not involve URL-strings. It operates on URLs as defined in the URL Model section above. In contrast with RFC 3986 and with the WHATWG standard, it does not do additional normalisation, which is relegated to the section Equivalences and Normalisation instead.

Order and Prefix operations

A property that is particularly useful is the order of an URL. Colloquially, the order is the type of the first component of an URL. The order may be used as an argument to specify various prefixes of an URL.

The Order of an URL

The order of an URL (ord url) is defined to be:

  • fragment if url is the empty URL.
  • The type of its first component otherwise.

Order-Limited Prefix

The order-limited prefix (url upto order) is defined to be
the shortest prefix of url that contains:

  • all components of url with a type strictly smaller than order and
  • all dir components with a type weakly smaller than order.
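The order and the order-limited prefix can be sketched as follows (Python; an URL is represented here as a list of (type, value) pairs sorted by component-type, which is an assumption of this sketch, not part of the specification):

```python
# Component-types in their defined order.
TYPES = ['scheme', 'authority', 'drive', 'path-root',
         'dir', 'file', 'query', 'fragment']
RANK = {t: i for i, t in enumerate(TYPES)}

def ord_(url):
    """The order of an URL: the type of its first component,
    or fragment for the empty URL."""
    return url[0][0] if url else 'fragment'

def upto(url, order):
    """(url upto order). Since the components are sorted by type,
    the shortest qualifying prefix is obtained by filtering."""
    return [(t, v) for t, v in url
            if RANK[t] < RANK[order] or (t == 'dir' and t == order)]
```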

The Goto Operation

Based on the order and the order-limited prefix one can define a “goto” operation that is analogous to the “merge” operation that is defined in the subsection Transform References of RFC 3986. I have chosen the name “goto” to avert incorrect assumptions about commutativity. The operation is not commutative, but it is associative.

Goto

The goto operation (url1 goto url2) is defined to return
the shortest URL that has url1 upto (ord url2) as a prefix and url2 as a postfix.

Goto Properties

The goto operation has a number of pleasing mathematical properties, as follows:

  • ord (url1 goto url2) is the least type of {ord url1, ord url2}.
  • (url1 goto url2) goto url3 = url1 goto (url2 goto url3).
  • ε goto url2 = url2.
  • url1 goto ε = url1 — if url1 does not have a fragment.
  • url2 is a postfix of (url1 goto url2).

Be aware that the goto operation does a bit more than sequence concatenation. In some cases it creates a path-root component to satisfy the path-root-constraint of the URL model. For example, if url1 is the URL represented by //host and url2 is the URL represented by foo/bar then (url1 goto url2) is represented by //host/foo/bar, thus containing a path-root even though neither url1 nor url2 has one.
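The goto operation, including the path-root insertion just described, can be sketched as follows (Python, over the same illustrative (type, value)-pair representation; the small helpers are repeated so the sketch is self-contained):

```python
TYPES = ['scheme', 'authority', 'drive', 'path-root',
         'dir', 'file', 'query', 'fragment']
RANK = {t: i for i, t in enumerate(TYPES)}

def ord_(url):
    return url[0][0] if url else 'fragment'

def upto(url, order):
    return [(t, v) for t, v in url
            if RANK[t] < RANK[order] or (t == 'dir' and t == order)]

def goto(url1, url2):
    """(url1 goto url2): the shortest URL with url1 upto (ord url2) as a
    prefix and url2 as a postfix, inserting a path-root component when
    the path-root-constraint demands one."""
    merged = upto(url1, ord_(url2)) + list(url2)
    types = {t for t, _ in merged}
    if ('path-root' not in types
            and {'authority', 'drive'} & types
            and {'dir', 'file'} & types):
        merged = ([c for c in merged if RANK[c[0]] < RANK['path-root']]
                  + [('path-root', '/')]
                  + [c for c in merged if RANK[c[0]] > RANK['path-root']])
    return merged
```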

Forcing

There is an additional operation on URLs that I have named force. The force operation is used as a final step in the process of resolving a web-URL or a file-URL. The operation ensures that the resulting URL has an authority component and a path-root component. Note that it is possible for the force operation to fail.

Forced File URL

To force a file-URL url:

  • If url does not have an authority then set its authority component to (authority ε).
  • If otherwise the authority of url has a username or a port then fail.
  • If url does not have a drive then set its path-root component to (path-root /).
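The three steps above can be sketched as follows (Python; the URL is represented here as a dict from component-type to value, with the authority itself a dict of its components — an assumption of this sketch):

```python
def force_file(url):
    """Force a file-URL; may fail (raises ValueError)."""
    auth = url.get('authority')
    if auth is None:
        url['authority'] = {}      # (authority ε), the empty authority
    elif 'username' in auth or 'port' in auth:
        raise ValueError('force failed: authority has a username or a port')
    if 'drive' not in url:
        url['path-root'] = '/'
    return url
```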

Forced Web URL

To force a web-URL url:

  • Set the path-root component of url to (path-root /).
  • If url has a non-empty authority then return.
  • Otherwise let component be the first dir or file component whose value is not ε.
    If no such component exists, fail.
  • Remove all dir or file components that precede component and remove component as well.
  • Let auth be the value of component parsed as an Authority and set the authority of url to auth. If the value cannot be parsed as an Authority then fail.

Force Properties

The force operation lacks a number of nice properties with respect to the goto operation. The following equalities do not hold in general:

force ((force url1) goto (force url2)) ≠ force (url1 goto url2)
force ((force url1) goto url2) ≠ force (url1 goto url2)
force (url1 goto (force url2)) ≠ force (url1 goto url2)

Resolution

The subsection Transform References of RFC 3986 specifies two variants of reference resolution: a generic, strict variant and an alternative non-strict variant. This specification adds a third variant that is used to specify the behaviour of web-browsers. I have chosen to name the strict variant generic resolution and the non-strict variant legacy resolution, so that the words strict and non-strict can be used as modifiers later on.


Generic Resolution

The generic resolution (generic-resolve url1 url2) of an URL url1 against an URL url2 is defined to be url2 goto url1 — if url2 has a scheme or url1 has a scheme. Otherwise resolution fails.

Legacy Resolution

The legacy resolution (legacy-resolve url1 url2) is defined to be the generic resolution (generic-resolve ~url1 url2) where ~url1 is:

  • url1 with its scheme removed, if both url1 and url2 have a scheme and the values of their scheme components compare case-insensitively equal, or
  • url1 itself otherwise.
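The scheme-stripping step that produces ~url1 can be sketched as follows (Python, over the illustrative (type, value)-pair representation; the name legacy_adjust is an assumption):

```python
def scheme_of(url):
    """The value of the scheme component, or None if absent."""
    return next((v for t, v in url if t == 'scheme'), None)

def legacy_adjust(url1, url2):
    """~url1: url1 without its scheme when both URLs have a scheme and
    the scheme values compare case-insensitively equal."""
    s1, s2 = scheme_of(url1), scheme_of(url2)
    if s1 is not None and s2 is not None and s1.lower() == s2.lower():
        return [(t, v) for t, v in url1 if t != 'scheme']
    return list(url1)
```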

Based on the generic resolution, the legacy resolution and the force operations as defined above, it is now possible to define a reference resolution operation that characterises the behaviour that is specified in the WHATWG standard.


WHATWG Resolution

The WHATWG-resolution of url1 against url2 is defined as:

  • force (legacy-resolve url1 url2)
    • if url1 is a web-URL or a file-URL, or
    • if url1 does not have a scheme and url2 is a web-URL or a file-URL;
  • generic-resolve url1 url2
    • if otherwise the first component of url1 is a scheme or a fragment, or
    • if url2 has an authority or a path-root;
  • otherwise, the operation fails.

Any application of the force operation that modifies its input must issue a validation warning. If in the process of WHATWG-resolution either the force operation, or the internally used generic resolution operation fails, then the WHATWG-resolution fails as well.

Parsing

Parsing is the process of converting an URL-string to an URL.
Parsing is stratified into the following phases:

  1. Preprocessing.
  2. Selecting a parser mode.
  3. Parsing.
  4. Decoding and parsing the host.

Preprocessing

The appendix Delimiting a URI in Context of RFC 3986 states that surrounding white-space should be removed from an URI when it is extracted from its surrounding context. The WHATWG standard makes this more explicit and specifies a preprocessing step that removes specific control and white-space characters from the input string before parsing.

Preprocessing

Before parsing, the input string input must be preprocessed:

  1. Remove all leading and trailing c0-space characters from input.
  2. Remove all u+9 (tab), u+A (line-feed) and u+D (carriage-return) characters from input.
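The two preprocessing steps can be sketched as follows (Python; the function name is illustrative):

```python
# c0-space: the characters u+0 through u+20, inclusive.
C0_SPACE = ''.join(chr(c) for c in range(0x21))

def preprocess(input_string):
    """Strip leading and trailing c0-space characters, then remove all
    tab (u+9), line-feed (u+A) and carriage-return (u+D) characters."""
    trimmed = input_string.strip(C0_SPACE)
    for ch in '\t\n\r':
        trimmed = trimmed.replace(ch, '')
    return trimmed
```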

Parser modes

Unfortunately, URL-parsing depends on the scheme of the URL being parsed. As a consequence, scheme-less URL parsing is ambiguous. This can be resolved by explicitly specifying a parser-mode. The parser-mode only influences how scheme-less URLs are parsed.

Parser Mode

There are three distinct parser-modes:

web-mode,  file-mode, and generic-mode.

The parser-mode-for an URL-string input, given a fallback parser-mode supplied-mode, is then defined to be:

  • web-mode — if input starts with a web-scheme followed by :
  • file-mode — if input starts with a file-scheme followed by :
  • generic-mode — if otherwise input starts with a scheme-string followed by :
  • supplied-mode — otherwise.

In practice, it is possible and advisable to begin parsing with a supplied parser-mode, and to update the parser-mode whilst parsing, as soon as a scheme has been detected.
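The mode selection can be sketched as follows (Python; the scheme pattern follows the scheme-string rule, and the function name is illustrative):

```python
import re

WEB_SCHEMES = {'http', 'https', 'ws', 'wss', 'ftp'}

# scheme ::= alpha (alpha | digit | + | - | .)* followed by ':'
SCHEME_RE = re.compile(r'([A-Za-z][A-Za-z0-9+\-.]*):')

def parser_mode_for(input_string, supplied_mode):
    m = SCHEME_RE.match(input_string)
    if m is None:
        return supplied_mode
    scheme = m.group(1).lower()
    if scheme in WEB_SCHEMES:
        return 'web-mode'
    if scheme == 'file':
        return 'file-mode'
    return 'generic-mode'
```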

URL Grammar

This subsection specifies the grammar for URL-strings. Implementations must use this grammar for parsing URLs from strings. The grammar is parameterised by a parser-mode. Specifically, the auth-path rule, the authority rule and the s rule have different versions for different parser-modes.

url ::= [ scheme : ] ( auth-path | path ) [ ? query ] [ # fragment ]
The auth-path rule in web-mode and in generic-mode:
auth-path ::= s s authority [ path-root  path-rel ]
The auth-path rule in file-mode:
auth-path ::= auth-drive [ path-root  path-rel ]
auth-drive ::= [ s s authorityε ] [ s ] drive  |  s s authority [ s drive ]
Rules for the authority and the path:
authority ::= ε  |  [ credentials @ ] host [ : port ]
authorityε ::= ε
credentials ::= username [ : password ]
path ::= [ path-root ]  path-rel
path-rel ::= ( dir s )* [ file ]
Rules for the components:
scheme ::= alpha (alpha | digit | + | - | .)*
username ::= ( pcts | uchar )*
password ::= ( pcts | pchar )*
host ::= [ ip6-address ]  |  opaque-host
opaque-host ::= ( pcts | hchar )+
port ::= ε | digit+
drive ::= alpha (: | |)
path-root ::= s
dir ::= ( pcts | pchar )*
file ::= ( pcts | pchar )+
query ::= ( pcts | qchar )*
fragment ::= ( pcts | url-char )*
Where pcts is defined as follows:
pcts ::= pct-encoded-bytes | pct-invalid
The rules are based on the following character-sets:
url-char := any \ { u+9, u+A, u+D }
uchar := url-char \  s  \ { #, ?, : }
hchar := url-char \  s  \ { #, ?, :, @ }  \ { u+0, u+20, <, >, [, \, ], ^, | }
pchar := url-char \  s  \ { #, ? }
qchar := url-char \ { # }
Where s depends on the parser-mode:
s := { / }  —  in generic-mode
s := { /, \ }  —  otherwise

Strict URL Grammar

For an URL-string to be considered valid, it must conform to a more restricted grammar. The strict grammar is obtained by modifying the auth-drive and pcts rules and the url-char and s character-sets as follows:

auth-drive ::= s drive  |  s s authority  [ s drive ]
pcts ::= pct-encoded-bytes  —  note the absence of pct-invalid
The rules are based on the following character-sets:
url-char := base-char \ { u+20, ", #, %, <, >, [, \, ], ^, `, {, |, } }
s := { / }  —  in all modes.

IPv6 Addresses

To do: double-check the following, and include it here.

IPv6 Address — Strict

An IPv6 address-string is a representation of a natural number n < 2¹²⁸ that is used as an identifier. It is accurately described by the production rule IPv6address in the Host section of RFC 3986.

The IPv6 address parser in the WHATWG standard however implies a more tolerant definition of IPv6 address-strings.

IPv6 Address — loose

There are additional, semantic constraints on the ipv4-address, ipv6-address and the port.

A note about multiple slashes

Web browsers interpret any amount of slashes after a web-scheme as the start of the authority component. Consider the following URL-strings:

  1. http:foo/bar
  2. http:/foo/bar
  3. http://foo/bar
  4. http:///foo/bar

Web browsers treat all these examples as equivalent to http://foo/bar. It is tempting to try to express this behaviour on the level of the grammar. For example one might consider using the following rule:

auth-path  ::=  s* authority [ path-root [ dir s ]* [ file ] ]

However, the examples above do behave differently with respect to reference resolution. For example, if they are resolved against the base-URL that is represented by http://host/, then the results are as follows:

  1. http://host/foo/bar
  2. http://host/foo/bar
  3. http://foo/bar
  4. http://foo/bar

As such, collapsing the multiple slashes cannot be expressed within the grammar. Instead, the force operation, as defined in the section on Reference Resolution, implements this behaviour.

Host Parsing

This section still has to be written.
  • Percent decode.
  • Puny decode.
  • Apply IDNA/ Nameprep normalisation.
  • Detect and interpret IPv4 addresses.
  • Err on 'forbidden host codepoints'.

IPv4 Address

An IPv4 address-string consists of one to four dot-separated numbers with an optional trailing dot. The numbers may use decimal, octal or hexadecimal notation, as follows:

ip4-address ::= num [. num [. num [. num ] ] ] [.]
num ::= num-dec | num-oct | num-hex
num-dec ::= 0 | (digit-nonzero digit*)
num-oct ::= 0 octal-digit*
num-hex ::= (0x | 0X) hex-digit*

Note that 0x is parsed as a hexadecimal number. (It will be interpreted as 0).
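The interpretation of the numbers, and their combination into a single 32-bit address value, can be sketched as follows (Python; range validation is omitted, and the function names are illustrative):

```python
def ip4_num(num):
    """Interpret a single num: a 0x/0X prefix selects hexadecimal
    (a bare 0x is interpreted as 0), a leading 0 selects octal."""
    if num[:2] in ('0x', '0X'):
        return int(num[2:], 16) if len(num) > 2 else 0
    if len(num) > 1 and num.startswith('0'):
        return int(num, 8)
    return int(num, 10)

def ip4_value(address):
    """Combine one to four numbers; the last number fills all the
    remaining low-order bytes of the 32-bit value."""
    parts = [ip4_num(p) for p in address.rstrip('.').split('.')]
    value = 0
    for p in parts[:-1]:
        value = (value << 8) | p
    return (value << (8 * (5 - len(parts)))) | parts[-1]
```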

IPv4 Address – Strict

A valid IPv4 address-string consists of four dot-separated numbers, without a trailing dot. The numbers must use decimal notation, as follows:

ip4-address-strict ::= num-dec . num-dec . num-dec . num-dec
In addition, each of the components must represent a number n ≤ 255

Equivalences and Normalisation

This section is analogous to the section Normalization and Comparison of RFC 3986. The RFC, however, does not prescribe a particular normal form; the WHATWG standard does, albeit implicitly.

Path Segment Normalisation

Path segment normalisation involves the interpretation of dotted-segments. Colloquially, a single-dot segment means “select the current directory”, whereas a double-dot segment means “select the parent directory”. Dotted segments are defined by the following rules, where the addition of %2e and %2E is motivated by security concerns. Again, this is in accordance with the WHATWG standard.

dot ::= . | %2e | %2E
dots ::= dot  dot

Path equivalence is defined by the following equations. The equations must be exhaustively applied from left-to-right to normalise an URL.

drive  x | ≈ drive  x :
path-root /・dir y ≈ path-root / — if y :: dots
path-root /・file y ≈ path-root / — if y :: dots
dir x ≈ ε — if x :: dot
file x ≈ ε — if x :: dot
dir x・dir y ≈ ε — if y :: dots and not x :: dots
dir x・file y ≈ ε — if y :: dots and not x :: dots
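Exhaustive left-to-right application of these equations can be sketched as a single stack-based pass (Python, over the illustrative (type, value)-pair representation):

```python
DOT = {'.', '%2e', '%2E'}

def is_dot(seg):
    return seg in DOT

def is_dots(seg):
    """dots ::= dot dot — any concatenation of two dot spellings."""
    return any(seg[:i] in DOT and seg[i:] in DOT
               for i in range(1, len(seg)))

def normalise_path(url):
    """Apply the path equations left-to-right, exhaustively."""
    out = []
    for t, v in url:
        if t == 'drive' and v.endswith('|'):
            v = v[:-1] + ':'              # drive x| ≈ drive x:
        if t in ('dir', 'file'):
            if is_dot(v):
                continue                   # dir/file x ≈ ε if x :: dot
            if is_dots(v) and out:
                pt, pv = out[-1]
                if pt == 'path-root':
                    continue               # path-root・… y ≈ path-root
                if pt == 'dir' and not is_dots(pv):
                    out.pop()              # dir x・… y ≈ ε
                    continue
        out.append((t, v))
    return out
```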

Authority Normalisation

Authority equivalence is defined by the following equations. Like path normalisation, the equations must be exhaustively applied from left-to-right to normalise an URL. This has the same effect as removing any empty port or password component and, if the URL then has no password, also removing any empty username component.

password ε ≈ ε
username ε・host h ≈ host h
port ε ≈ ε

Scheme-Based Authority Normalisation

If an URL has a scheme, then a number of additional equivalences apply to the authority. Normalisation according to these rules involves the removal of default ports, and similarly, removing the host from a file-URL if its value is localhost.

scheme http ・authority (xs・port 80) ≈ scheme http ・authority xs
scheme ws ・authority (xs・port 80) ≈ scheme ws ・authority xs
scheme ftp ・authority (xs・port 21) ≈ scheme ftp ・authority xs
scheme wss ・authority (xs・port 443) ≈ scheme wss ・authority xs
scheme https ・authority (xs・port 443) ≈ scheme https ・authority xs
scheme file ・authority (host localhost) ≈ scheme file ・authority ε

These rules apply in combination with the following rule that states that scheme equivalence is case-insensitive.

scheme scheme ≈ scheme (lowercase scheme)
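Taken together, these scheme-based equations can be sketched as one normalisation step (Python; the dict-based representation, with the authority itself a dict, is an assumption of this sketch):

```python
DEFAULT_PORTS = {'http': 80, 'ws': 80, 'ftp': 21, 'wss': 443, 'https': 443}

def normalise_scheme_authority(url):
    """Lowercase the scheme, then drop a default port, and drop a
    localhost host from a file-URL."""
    if 'scheme' in url:
        url['scheme'] = url['scheme'].lower()
    scheme = url.get('scheme')
    auth = url.get('authority')
    if auth is not None:
        if 'port' in auth and auth['port'] == DEFAULT_PORTS.get(scheme):
            del auth['port']
        if scheme == 'file' and auth.get('host') == 'localhost':
            del auth['host']
    return url
```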

Percent Coding Normalisation

There is a natural notion of equivalence for percent-encoded-strings: one could consider two percent-encoded strings to be equivalent if their percent-decodings are equal. However, for URL-components, this notion of equivalence is too strong.

… RFC 3986 specifies reserved characters to allow for additional, application-specific interpretation. For reserved characters, one cannot assume that they are equivalent to their percent-encoding.

The WHATWG standard does not specify any semantics for percent-encoded-bytes in the components of an URL other than the host; any percent-encoded bytes that are present in other components are not decoded. It does, however, specify sets of characters that must be encoded.

Percent Encode Profile

A percent-encode-profile is a mapping that maps each component-type in the set { username, password, host, dir, file, query, fragment } to a percent-encode-set.

Percent Encode Profiles

There are four distinct percent-encode-profiles that are relevant for this specification: generic, special, minimal and minimal-special. They are specified by per-profile tables of percent-encode-sets over the printable-ASCII ranges u+20–u+27, u+3A–u+40, u+5B–u+60 and u+7B–u+7E. [Tables not reproduced here.]

Unfortunately, like parsing, the percent-encoding of URLs as prescribed by the WHATWG standard, is scheme dependent.

Encode Profile Selection

The percent-encode-profile-for a given URL is

  • the special profile — if the URL is a file-URL or a web-URL,
  • the special profile — if the URL does not have a scheme,
  • the generic profile — if otherwise the URL has an authority or a path-root, or
  • the minimal profile — otherwise.
Note that these profiles do not lead to valid URL-strings after printing. This is as specified by the WHATWG standard.
I prefer renaming generic to normal and special to normal-special, adding a valid profile, and changing the minimal profile to be truly minimal, encoding only characters that would cause reparse bugs.

Printing

Printing is the process of converting an URL to an URL-string. To print an URL, it must first be subjected to an additional normalisation operation, and it must be percent encoded as follows.

Normalise for Printing

If an URL has neither a drive nor an authority, but it does have an empty first dir component (dir ε), then it must be normalised for printing by inserting a component (dir .) immediately before its first dir component.

This additional normalisation step is necessary because it is not possible to represent the affected URLs as an URL-string.

Consider for example the URL (path-root /)・(dir ε)・(file foo). Printing this URL without an additional normalisation step would result in //foo which represents the URL (authority (host foo)) instead.

This is not sufficient. There are similar issues around URLs with drive letters, which is an open issue.

  • To print to an ASCII URL-string, percent-encode the characters in any \ printable-ASCII.
  • Otherwise, percent-encode the control-c0, del-c1, surrogate and non-char characters.
  • In addition, percent-encode a subset of printable-ASCII, depending on the component, as indicated by the percent coding table.
  • Let output be the empty string. Then, for each of the components (type value) of the URL, in tree order, convert the component to a string depending on its type according to the following table, and append it to output.
component    printed form
scheme       value :
authority    // followed by the printed credentials, host and port
 credentials  userpass @ — omitted if absent
 userpass     value [ : value ]
 host         value
 port         : value
drive        / value
root         value
dir          value /
file         value
query        ? value
fragment     # value

Concluding

This final section specifies the behaviour of a parse-resolve-and-normalise operation that characterises the behaviour of the ‘basic URL parser’ as described in the WHATWG URL standard.

It's easy!
parse-resolve-and-normalise (string, base) :=
  • preprocess string
  • detect the parser mode, with the fallback mode as indicated by the base
  • parse the preprocessed string according to the grammar to obtain url1
  • force resolve url1 against base
  • normalise and percent encode the result.
That results in an URL (model). One can then wrap around that to describe the URL class of web browsers:
  • The href getter returns the printed URL.
  • The protocol getter returns the scheme + :, or ε if the URL has no scheme.
  • The username and password getters return ε if the corresponding component is absent, or its value otherwise.
  • The host getter returns ε if absent, the value of the host if the port is absent, and the value of the host + : + the value of the port otherwise.
  • The hostname getter returns ε if the URL has no host, otherwise it returns the value of the host.
  • pathname (…)
  • The search getter returns ε if the URL has no query, otherwise it returns the value of the query.
  • The hash getter returns ε if the URL has no fragment, otherwise it returns the value of the fragment.

Note that there is a loss of information about the structure of the URL.
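The getters above can be sketched over the component model as follows (Python; a flat dict from component-type to value is assumed, and href and pathname are omitted, the former because it requires the full printing operation):

```python
def getters(url):
    """The web-API getters, over a dict from component-type to value."""
    host, port = url.get('host'), url.get('port')
    return {
        'protocol': url['scheme'] + ':' if 'scheme' in url else '',
        'username': url.get('username', ''),
        'password': url.get('password', ''),
        'hostname': host if host is not None else '',
        'host': ('' if host is None
                 else host + ('' if port is None else ':' + str(port))),
        'search': url.get('query', ''),
        'hash': url.get('fragment', ''),
    }
```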