URL Specification
This document provides a concise formal definition of URLs and the language of URLs. It offers a rephrasing of the WHATWG URL standard in an effort to better reveal the internal structure of URLs and the operations on them. It provides a formal grammar and a set of elementary operations that are combined in such a way that the end result is equivalent to the WHATWG standard.
The goals of this document are:
- To provide a modular, concise formal specification of URLs that is behaviourally equivalent with the WHATWG URL standard, covering all of it except the web API's setters and the url-encoded form format.
- To provide a general model for URLs that can express relative references and to define reference resolution by means of a number of elementary operations, in such a way that the end result is in agreement with the WHATWG standard.
- To provide a concise way to describe the differences between the WHATWG standard and RFC 3986 and RFC 3987 as part of a larger effort.
- To enable and support work towards an update of RFC 3987 that resolves the incompatibilities with the WHATWG standard.
- To enable and support editorial changes to the WHATWG standard in the hope of recovering the structural framework for relative references, normalisation and the hierarchy of equivalence relations on URLs that was put forward in RFC 3986.
Status of this document
This is a development version of the specification.
- This is version 0.9.0 of the specification.
- The versioning scheme is specified by Semantic Versioning.
- This document is licensed under a Creative Commons - CC BY-SA 2.0 license.
- The development of this specification can be discussed on GitHub.
- There is a reference implementation to accompany this specification.
- The reference implementation currently passes 630 out of the 632 URL constructor tests from the web-platform-tests.
Structure of this document
As of this writing, this document has a compact format that is focused specifically on the definition and specification of URLs and operations on URLs.
The sections are laid out as follows.
Preliminaries
This section introduces basic concepts and notational conventions which are used throughout the rest of this document.
Sequences
This specification uses an algebraic notation for sequences that makes no distinction between one-element sequences and their first element. It uses the following notation:
- The empty sequence is denoted by ε.
- The concatenation of two sequences S and T is denoted by S・T.
Furthermore, the notation e : T is used to denote the concatenation of a single-element e in front of a sequence T.
The notation e : T may be used to inductively specify operations on sequences. For example, the length of a sequence is defined by length ε = 0 and length (e : T) = 1 + length T.
Parentheses are used for disambiguation and as visual aides. Paired parentheses are not part of a sequence-constructing operation.
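The inductive style of definition used above can be mirrored directly in code. The following sketch models sequences as Python tuples; the concrete representation and the names `EPSILON`, `cons` and `length` are assumptions of this example, not part of the specification.

```python
# A minimal sketch of the sequence notation, using Python tuples
# as the concrete representation (an assumption of this example).

EPSILON = ()  # the empty sequence, written ε in the text

def cons(e, t):
    """e : T — prepend the single element e to the sequence T."""
    return (e,) + tuple(t)

def length(s):
    """Inductive definition: length ε = 0 and length (e : T) = 1 + length T."""
    if s == EPSILON:
        return 0
    e, *t = s
    return 1 + length(tuple(t))
```

For example, `length(cons('a', cons('b', EPSILON)))` evaluates to 2, following the two inductive equations step by step.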
Components
The word component is used throughout this document to mean a tagged value, denoted by (type value), where type is taken from a predefined set of component-types. The value of a component may be a sequence of components itself, in which case the component may be referred to as a compound-component for clarity.
Both component-types and their corresponding component-constructing operator are typeset in boldface. For example, scheme denotes a component-type, whilst (scheme value) denotes a scheme-component. Again, parentheses are used to disambiguate and as visual aides. They are not part of the component-constructing operator.
When a collection of components contains exactly one component for a given component-type, then the type may be used as a reference to the component's value. For example, given a sequence of components S := (dir x・file y・query z), the phrase “the query of S” identifies the value z whereas the phrase “the query component of S” identifies (query z).
Strings and Grammars
Strings and Code-Points
For the purpose of this specification,
- A string is a sequence of characters.
- A character is a single Unicode code point.
- A code point is a natural number n < 17 × 2^16.
- The empty string is a sequence of zero code points. It is denoted by ε.
Code points are denoted by a number in hexadecimal notation
preceded by u+, in boldface. In addition, code points
that correspond to printable ASCII characters are
often denoted by their corresponding glyph, typeset in monospace and on a
screened background.
For example, u+41 and A
denote the same
code point. Strings that contain only
printable ASCII characters
are often denoted as a connected sequence of glyphs, typeset likewise.
The printable ASCII characters are codepoints in the range u+20 to u+7E, inclusive. Note that this includes the space character u+20.
Character Sets
A character-set is a set of characters. A character-range is the largest character-set that includes a given least character c and a greatest character d. Such a character-range is denoted by { c–d }. The union of e.g. { a–b } and { c–d }, is denoted by { a–b, c–d }, and this notation is generalised to n-ary unions. Common character-sets that are used throughout this document are defined below:
any := { u+0–u+10FFFF }
control-c0 := { u+0–u+1F }
c0-space := { u+0–u+20 }
printable-ASCII := { u+20–u+7E } — i.e. the glyphs from space to ~
octal-digit := { 0–7 }
digit := { 0–9 }
hex-digit := { 0–9, A–F, a–f }
digit-nonzero := { 1–9 }
alpha := { A–Z, a–z }
del-c1 := { u+7F–u+9F }
control-c1 := { u+80–u+9F }
latin-1 := { u+A0–u+FF }
surrogate := { u+D800–u+DFFF }
non-char := { u+FDD0–u+FDEF } ∪ { c | c in any and (c + 2) mod 2^16 ≤ 1 }
base-char := any \ control-c0 \ del-c1 \ surrogate \ non-char
Grammars
The notation name ::= expression is used to define a production rule of a grammar, where the expression uses square brackets ( [ … ] ) for optional rules, a postfix star ( * ) for zero-or-more, a postfix plus ( + ) for one-or-more, an infix vertical line ( | ) for alternatives, monospaced type for literal strings and an epsilon ( ε ) for the empty string. Parentheses are used for grouping and disambiguation.
Pattern Testing
The shorthand notation string :: rule is used to express that a string string can be generated by the production rule rule of a given grammar. Likewise, the notation string :: expression is used to express that string can be generated by expression.
Percent Coding
This subsection is analogous to the section Percent-Encoding of RFC 3986 and the section Percent-encoded bytes of the WHATWG standard.
Bytes
For the purpose of this specification,
- A byte is a natural number n < 2^8.
- A byte-sequence is a sequence of bytes.
Percent Encoded Bytes
A non-empty byte-sequence may be percent-encoded as a string by rendering each individual byte in two-digit hexadecimal notation, prefixed by %.

pct-encoded-byte ::= % hex-digit hex-digit
pct-encoded-bytes ::= pct-encoded-byte+
Percent Encoded String
A percent-encoded-string is a string that may have zero or more percent-encoded byte-sequences embedded within it, as follows:
pct-encoded-string ::= ( pct-encoded-bytes | uncoded )*
uncoded ::= non-pct+
non-pct := any \ { % }
The grammar above may be relaxed in situations where error recovery is desired. This is achieved by adding a pct-invalid rule as follows:
pct-encoded-string ::= ( pct-encoded-bytes | pct-invalid | uncoded )*
pct-invalid ::= % [ hex-digit ]
There is an ambiguity in the loose grammar due to the overlap between pct-encoded-byte and pct-invalid combined with uncoded. The ambiguity must be resolved by using the pct-encoded-byte rule instead of the pct-invalid rule wherever possible.
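This disambiguation rule can be captured by ordered alternation. The sketch below tokenises a string according to the loose pct-encoded-string grammar; since Python's regular-expression alternation prefers the leftmost matching branch, placing the pct-encoded-bytes branch before pct-invalid implements the rule. The function name `tokenise` is an assumption of this example.

```python
import re

# Loose pct-encoded-string grammar as a tokeniser. Branch order encodes
# the disambiguation rule: pct-encoded-bytes is tried before pct-invalid.
TOKEN = re.compile(r"""
    (?P<bytes>(?:%[0-9A-Fa-f]{2})+)   # pct-encoded-bytes
  | (?P<invalid>%[0-9A-Fa-f]?)        # pct-invalid: a % and at most one hex digit
  | (?P<uncoded>[^%]+)                # uncoded: non-pct+
""", re.VERBOSE)

def tokenise(s):
    """Split s into (kind, text) parts per the loose grammar."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(s)]
```

For instance, `tokenise("a%41%z")` yields an uncoded part, a pct-encoded-byte, a pct-invalid `%` and a final uncoded part.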
Percent Encoding
To percent-encode a string string using a given character-set encode-set …
Wrap this up. Furthermore, this needs to discuss the encoding override, which is only allowed for the query; all other components must use UTF8.
Percent Decoding
Percent decoding of strings is stratified into the following phases:
- Analyse the string into pct-encoded-bytes, pct-invalid and uncoded parts according to the pct-encoded-string grammar.
- Convert the pct-encoded-bytes to a byte-sequence and decode the byte-sequence to a string whilst leaving the pct-invalid and uncoded parts unmodified.
- Recompose the string.
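The three phases can be sketched as follows. The decoding of byte-sequences to strings is assumed here to use UTF-8 with replacement characters; that choice, like the function name `percent_decode`, is an assumption of this example.

```python
import re

# Phase 1: analyse the string per the loose pct-encoded-string grammar.
TOKEN = re.compile(
    r"(?P<bytes>(?:%[0-9A-Fa-f]{2})+)|(?P<invalid>%[0-9A-Fa-f]?)|(?P<uncoded>[^%]+)")

def percent_decode(s):
    out = []
    for m in TOKEN.finditer(s):
        if m.lastgroup == "bytes":
            # Phase 2: convert pct-encoded-bytes to a byte-sequence, then
            # decode it (UTF-8 is an assumption of this sketch).
            data = bytes(int(h, 16) for h in m.group()[1:].split("%"))
            out.append(data.decode("utf-8", errors="replace"))
        else:
            # pct-invalid and uncoded parts are left unmodified.
            out.append(m.group())
    # Phase 3: recompose the string.
    return "".join(out)
```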
URL Model
An URL is a sequence of components that is subject to a number of additional constraints. The ordering of the components in the sequence is analogous to the hierarchical syntax of an URI as described in Hierarchical Identifiers in RFC 3986.
It is important to stress the distinction between an URL and an URL-string. An URL is a structure, indeed a special sequence of components, whereas an URL-string is a special kind of string that represents an URL. Conversions between URLs and URL-strings are described in the sections Parsing and Printing.
URL
An URL is a sequence of components that occur in ascending order by component-type, where component-type is taken from the ordered set:
scheme < authority < drive < path-root < dir < file < query < fragment.
- An URL contains at most one component per type, except for dir components, of which it may have any finite amount.
- path-root-constraint: If an URL has an authority or a drive component, and it has a dir or a file component, then it also has a path-root component.
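As an illustration, the constraints above can be checked mechanically. The sketch below represents an URL as a Python list of (type, value) pairs; that representation and the helper name `is_url` are assumptions of this example, not part of the specification.

```python
# Component-types in their specified order.
ORDER = ["scheme", "authority", "drive", "path-root",
         "dir", "file", "query", "fragment"]

def is_url(components):
    """Check the URL model constraints over a list of (type, value) pairs."""
    types = [t for t, _ in components]
    ranks = [ORDER.index(t) for t in types]
    if ranks != sorted(ranks):
        return False                      # ascending order by component-type
    for t in set(types):
        if t != "dir" and types.count(t) > 1:
            return False                  # at most one component per type
    if "authority" in types or "drive" in types:
        if ("dir" in types or "file" in types) and "path-root" not in types:
            return False                  # the path-root-constraint
    return True
```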
Authority
An authority is a sequence of components ordered by their type, taken from the ordered set:
username < password < host < port.
- Authorities have at most one component per type.
- If an Authority has a password component then it also has a username component.
- If an Authority has a username or a port component then it also has a host component.
Host
A Host is either:
- An ipv6-address,
- an opaque-host,
- an ipv4-address, or
- a domain-name.
URL components
Whenever present, the components of an URL are subject to the following constraints:
- scheme-string: The scheme of an URL is a string scheme :: alpha ( alpha | digit | + | - | . )*.
- The authority of an URL is an Authority.
- The path-root of an URL is the string /.
- The drive of an URL is a string drive :: alpha ( : | | ), i.e. an alpha character followed by a colon or a vertical bar.
- The file of an URL is a nonempty string.
- The host of an Authority is a Host.
- The port of an Authority is either ε or a natural number n < 2^16.
- For all other components present, the components' values are strings.
Note that arbitrary code points are allowed in component values unless explicitly specified otherwise. The restrictions on the code points in an URL-string are discussed in the Parsing section.
Types and Shapes
A number of additional soft constraints must be met for an URL to be called valid. However, implementations must tolerate URLs that are not valid as an error recovery strategy.
Valid URL
A valid URL must not have a username or a password component and it must not have components that contain invalid percent-encode sequences.
There is a further categorisation of URLs that is used in several of the operations that will be defined later.
Web-URL
A web-URL is an URL that has a web-scheme. A web-scheme is a string scheme such that (lowercase scheme) :: http | https | ws | wss | ftp.
File-URL
A file-URL is an URL that has a file-scheme. A file-scheme is a string scheme such that (lowercase scheme) :: file.
Reference Resolution
This section defines reference resolution operations that are analogous to the algorithms that are described in the chapter Reference Resolution of RFC 3986. The operations that are defined in this section are used in a subsequent section to define a parse-resolve-and-normalise operation that accurately describes the behaviour of the ‘basic URL parser’ as defined in the WHATWG URL standard.
Reference resolution as defined in this section does not involve URL-strings. It operates on URLs as defined in the URL Model section above. In contrast with RFC 3986 and with the WHATWG standard, it does not do additional normalisation, which is relegated to the section Equivalences and Normalisation instead.
Order and Prefix operations
A property that is particularly useful is the order of an URL. Colloquially, the order is the type of the first component of an URL. The order may be used as an argument to specify various prefixes of an URL.
The Order of an URL
The order of an URL (ord url) is defined to be:
- fragment if url is the empty URL.
- The type of its first component otherwise.
Order-Limited Prefix
The order-limited prefix (url upto order) is defined to be the shortest prefix of url that contains:
- all components of url with a type strictly smaller than order and
- all dir components with a type weakly smaller than order.
The Goto Operation
Based on the order and the order-limited prefix one can define a “goto” operation that is analogous to the “merge” operation that is defined in the subsection Transform References of RFC 3986. I have chosen the name “goto” to avert incorrect assumptions about commutativity. The operation is not commutative, but it is associative.
Goto
The goto operation (url1 goto url2) is defined to return the shortest URL that has url1 upto (ord url2) as a prefix and url2 as a postfix.
Goto Properties
The goto operation has a number of pleasing mathematical properties, as follows:
- ord (url1 goto url2) is the least type of {ord url1, ord url2}.
- (url1 goto url2) goto url3 = url1 goto (url2 goto url3).
- ε goto url2 = url2.
- url1 goto ε = url1 — if url1 does not have a fragment.
- url2 is a postfix of (url1 goto url2).
Be aware that the goto operation does a bit more than sequence concatenation. In some cases it creates a path-root component to satisfy the path-root-constraint of the URL model. For example, if url1 is the URL represented by //host and url2 is the URL represented by foo/bar, then (url1 goto url2) is represented by //host/foo/bar, thus containing a path-root even though neither url1 nor url2 has one.
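The order, the order-limited prefix and the goto operation, including the path-root insertion just described, can be sketched as follows. URLs are represented here as Python lists of (type, value) pairs; that representation and all the function names are assumptions of this example.

```python
ORDER = ["scheme", "authority", "drive", "path-root",
         "dir", "file", "query", "fragment"]
RANK = {t: i for i, t in enumerate(ORDER)}

def ord_(url):
    """ord url: fragment for the empty URL, else the first component's type."""
    return "fragment" if not url else url[0][0]

def upto(url, order):
    """url upto order: components strictly below order, plus dir components
    weakly below order."""
    o = RANK[order]
    return [(t, v) for t, v in url
            if RANK[t] < o or (t == "dir" and RANK[t] <= o)]

def goto(url1, url2):
    """url1 goto url2, inserting a path-root when the result needs one."""
    result = upto(url1, ord_(url2)) + list(url2)
    types = [t for t, _ in result]
    if (("authority" in types or "drive" in types)
            and ("dir" in types or "file" in types)
            and "path-root" not in types):
        i = next(k for k, (t, _) in enumerate(result)
                 if RANK[t] > RANK["path-root"])
        result.insert(i, ("path-root", "/"))
    return result
```

With this sketch, the //host and foo/bar example above comes out as expected: the result gains a path-root component that neither argument had.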
Forcing
There is an additional operation on URLs that I have named force. The force operation is used as a final step in the process of resolving a web-URL or a file-URL. The operation ensures that the resulting URL has an authority component and a path-root component. Note that it is possible for the force operation to fail.
Forced File URL
To force a file-URL url:
- If url does not have an authority then set its authority component to (authority ε).
- If otherwise the authority of url has a username or a port then fail.
- If url does not have a drive then set its path-root component to (path-root /).
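A sketch of this operation, representing an URL as a list of (type, value) pairs with a nested list for the authority value; both representation choices and the name `force_file_url` are assumptions of this example.

```python
def force_file_url(url):
    """Force a file-URL; raises ValueError where the specification fails."""
    url = list(url)
    if not any(t == "authority" for t, _ in url):
        # Set the authority component to (authority ε), after any scheme.
        i = 1 if url and url[0][0] == "scheme" else 0
        url.insert(i, ("authority", []))
    else:
        auth = next(v for t, v in url if t == "authority")
        if any(t in ("username", "port") for t, _ in auth):
            raise ValueError("force failed: authority has a username or port")
    types = [t for t, _ in url]
    if "drive" not in types and "path-root" not in types:
        # Set the path-root component to (path-root /).
        url.insert(types.index("authority") + 1, ("path-root", "/"))
    return url
```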
Forced Web URL
To force a web-URL url:
- Set the path-root component of url to (path-root /).
- If url has a non-empty authority then return.
- Otherwise let component be the first dir or file component whose value is not ε. If no such component exists, fail.
- Remove all dir or file components that precede component, and remove component as well.
- Let auth be the value of component parsed as an Authority and set the authority of url to auth. If the value cannot be parsed as an Authority then fail.
Force Properties
The force operation lacks a number of nice properties with respect to the goto operation. The following equalities do not hold in general:
force ((force url1) goto (force url2)) ≠ force (url1 goto url2)
force ((force url1) goto url2) ≠ force (url1 goto url2)
force (url1 goto (force url2)) ≠ force (url1 goto url2)
Resolution
The subsection Transform References of RFC 3986 specifies two variants of reference resolution: a generic, strict variant and an alternative non-strict variant. This specification adds a third variant that is used to specify the behaviour of web-browsers. I have chosen to name the strict variant generic resolution and the non-strict variant legacy resolution, so that the words strict and non-strict can be used as modifiers later on.
Generic Resolution
The generic resolution (generic-resolve url1 url2) of an URL url1 against an URL url2 is defined to be url2 goto url1 — if url2 has a scheme or url1 has a scheme. Otherwise resolution fails.
Legacy Resolution
The legacy resolution (legacy-resolve url1 url2) is defined to be the generic resolution (generic-resolve ~url1 url2) where ~url1 is:
- url1 with its scheme removed if both url1 and url2 have a scheme, and the value of their scheme components case-insensitively compare equal, or
- url1 itself otherwise.
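The computation of ~url1 is the only new ingredient of legacy resolution, and can be sketched in isolation. URLs are again represented as lists of (type, value) pairs; the helper names are assumptions of this example.

```python
def scheme_of(url):
    """The value of the scheme component, or None if absent."""
    return next((v for t, v in url if t == "scheme"), None)

def drop_matching_scheme(url1, url2):
    """Compute ~url1: drop url1's scheme when both URLs have a scheme and
    their values case-insensitively compare equal."""
    s1, s2 = scheme_of(url1), scheme_of(url2)
    if s1 is not None and s2 is not None and s1.lower() == s2.lower():
        return [(t, v) for t, v in url1 if t != "scheme"]
    return list(url1)
```

Legacy resolution is then generic resolution applied to ~url1 and url2; the scheme removal is what makes, e.g., http:foo/bar resolve like a relative reference against an http base URL.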
Based on the generic resolution, and the legacy resolution and the force operations as defined above, it is now possible to define a reference resolution operation that can be used to characterise the behaviour that is specified in the WHATWG standard.
WHATWG Resolution
The WHATWG-resolution of url1 against url2 is defined as:
- force (legacy-resolve url1 url2)
  - if url1 is a web-URL or a file-URL, or
  - if url1 does not have a scheme and url2 is a web-URL or a file-URL;
- generic-resolve url1 url2
  - if otherwise the first component of url1 is a scheme or a fragment, or
  - if url2 has an authority or a path-root;
- otherwise, the operation fails.
Any application of the force operation that modifies its input must issue a validation warning. If in the process of WHATWG-resolution either the force operation, or the internally used generic resolution operation fails, then the WHATWG-resolution fails as well.
Parsing
Parsing is the process of converting an URL-string to an URL. Parsing is stratified into the following phases:
- Preprocessing.
- Selecting a parser mode.
- Parsing.
- Decoding and parsing the host.
Preprocessing
The appendix Delimiting a URI in Context of RFC 3986 states that surrounding white-space should be removed from an URI when it is extracted from its surrounding context. The WHATWG standard makes this more explicit and specifies a preprocessing step that removes specific control and white-space characters from the input string before parsing.
Preprocessing
Before parsing, the input string input must be preprocessed:
- Remove all leading and trailing c0-space characters from input.
- Remove all u+9 (tab), u+A (line-feed) and u+D (carriage-return) characters from input.
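These two steps can be transcribed directly; the function name `preprocess` is an assumption of this sketch.

```python
# c0-space is the range u+0–u+20, so it covers ASCII white-space and controls.
C0_SPACE = "".join(chr(c) for c in range(0x21))

def preprocess(input_string):
    """Strip leading/trailing c0-space, then remove tab, LF and CR."""
    s = input_string.strip(C0_SPACE)
    return s.replace("\t", "").replace("\n", "").replace("\r", "")
```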
Parser modes
Unfortunately, URL-parsing depends on the scheme of the URL being parsed. As a consequence, scheme-less URL parsing is ambiguous. This can be resolved by explicitly specifying a parser-mode. The parser-mode only influences how scheme-less URLs are parsed.
Parser Mode
There are three distinct parser-modes: web-mode, file-mode, and generic-mode.
The parser-mode for an URL-string input, given a fallback parser-mode supplied-mode, is then defined to be:
- web-mode — if input starts with a web-scheme followed by :,
- file-mode — if input starts with a file-scheme followed by :,
- generic-mode — if otherwise input starts with a scheme-string followed by :,
- supplied-mode — otherwise.
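The selection can be sketched with a regular expression for the scheme-string rule; the function and constant names are assumptions of this example.

```python
import re

WEB_SCHEMES = {"http", "https", "ws", "wss", "ftp"}
# scheme ::= alpha ( alpha | digit | + | - | . )*, followed by a colon.
SCHEME = re.compile(r"([A-Za-z][A-Za-z0-9+\-.]*):")

def parser_mode_for(input_string, supplied_mode):
    m = SCHEME.match(input_string)
    if m is None:
        return supplied_mode          # no scheme: fall back to supplied-mode
    scheme = m.group(1).lower()       # web- and file-schemes are lowercased
    if scheme in WEB_SCHEMES:
        return "web-mode"
    if scheme == "file":
        return "file-mode"
    return "generic-mode"
```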
In practice, it is possible and advisable to begin parsing with a supplied parser-mode, and to update the parser-mode whilst parsing, as soon as a scheme has been detected.
URL Grammar
This subsection specifies the grammar for URL-strings. Implementations must use this grammar for parsing URLs from strings. The grammar is parameterised by a parser-mode. Specifically, the auth-path rule, the authority rule and the s rule have different versions for different parser-modes.
url ::= [ scheme : ] ( auth-path | path ) [ ? query ] [ # fragment ]

The auth-path rule in web-mode and in generic-mode:

auth-path ::= s s authority [ path-root path-rel ]

The auth-path rule in file-mode:

auth-path ::= auth-drive [ path-root path-rel ]
auth-drive ::= [ s s authorityε ] [ s ] drive | s s authority [ s drive ]

Rules for the authority and the path:

authority ::= ε | [ credentials @ ] host [ : port ]
authorityε ::= ε
credentials ::= username [ : password ]
path ::= [ path-root ] path-rel
path-rel ::= ( dir s )* [ file ]

Rules for the components:

scheme ::= alpha ( alpha | digit | + | - | . )*
username ::= ( pcts | uchar )*
password ::= ( pcts | pchar )*
host ::= [ ip6-address ] | opaque-host
opaque-host ::= ( pcts | hchar )+
port ::= ε | digit+
drive ::= alpha ( : | | )
path-root ::= s
dir ::= ( pcts | pchar )*
file ::= ( pcts | pchar )+
query ::= ( pcts | qchar )*
fragment ::= ( pcts | url-char )*

Where pcts is defined as follows:

pcts ::= pct-encoded-bytes | pct-invalid

The rules are based on the following character-sets:

url-char := any \ { % }
uchar := url-char \ s \ { #, ?, : }
hchar := url-char \ s \ { #, ?, :, @ } \ { u+0, u+20, <, >, [, \, ], ^, | }
pchar := url-char \ s \ { #, ? }
qchar := url-char \ { # }

Where s depends on the parser-mode:

s := { / } — in generic-mode
s := { /, \ } — otherwise
Strict URL Grammar
For an URL-string to be considered valid, it must conform to a more restricted grammar. The strict grammar is obtained by modifying the auth-drive and pcts rules and the url-char and s character-sets as follows:
auth-drive ::= s drive | s s authority [ s drive ]
pcts ::= pct-encoded-bytes — note the absence of pct-invalid

The rules are based on the following character-sets:

url-char := base-char \ { u+20, ", #, %, <, >, [, \, ], ^, `, {, |, } }
s := { / } — in all modes.
|
IPv6 Addresses
Double check the following, and include it here
IPv6 Address — Strict
An IPv6 address-string is a representation of a natural number n < 2^128 that is used as an identifier. It is accurately described by the production rule IPv6address in the Host section of RFC 3986.
The IPv6 address parser in the WHATWG standard however implies a more tolerant definition of IPv6 address-strings.
IPv6 Address — loose
A note about multiple slashes
Web browsers interpret any amount of slashes after a web-scheme as the start of the authority component. Consider the following URL-strings:
http:foo/bar
http:/foo/bar
http://foo/bar
http:///foo/bar
Web browsers treat all these examples as equivalent to http://foo/bar.
It is tempting to try to express this behaviour on the level of the grammar. For example, one might consider using the following rule:
auth-path  ::= s* authority [ path-root [ dir s ]* [ file ] ]
However, the examples above do behave differently with respect to reference resolution. For example, if they are resolved against the base-URL that is represented by http://host/, then the results are as follows:
http://host/foo/bar
http://host/foo/bar
http://foo/bar
http://foo/bar
As such, collapsing the multiple slashes cannot be expressed within the grammar. Instead, the force operation as defined in the section on Reference Resolution implements this behaviour.
Host Parsing
- Percent decode.
- Puny decode.
- Apply IDNA/ Nameprep normalisation.
- Detect and interpret IPv4 addresses.
- Err on 'forbidden host codepoints'.
IPv4 Address
An IPv4 address-string consists of one up to four dot-separated numbers with an optional trailing dot. The numbers may use decimal, octal or hexadecimal notation, as follows:
ip4-address ::= num [ . num [ . num [ . num ] ] ] [ . ]
num ::= num-dec | num-oct | num-hex
num-dec ::= 0 | ( digit-nonzero digit* )
num-oct ::= 0 octal-digit*
num-hex ::= ( 0x | 0X ) hex-digit*

Note that 0x is parsed as a hexadecimal number. (It will be interpreted as 0.)
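The interpretation of a single num can be sketched as follows; the function name is an assumption of this example, and assembling the parsed numbers into a 32-bit address is not covered here.

```python
def parse_ipv4_num(s):
    """Interpret one ip4 num: hex with 0x/0X prefix, octal with a leading
    zero, decimal otherwise. A bare '0x' is interpreted as 0."""
    if s[:2] in ("0x", "0X"):
        return int(s[2:], 16) if len(s) > 2 else 0
    if s.startswith("0") and len(s) > 1:
        return int(s, 8)
    return int(s, 10)
```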
IPv4 Address – Strict
A valid IPv4 address-string consists of four dot-separated numbers, without a trailing dot. The numbers must use decimal notation, as follows:
ip4-address-strict ::= num-dec . num-dec . num-dec . num-dec
Equivalences and Normalisation
This section is analogous to the section Normalization and Comparison of RFC 3986. The RFC, however, does not prescribe a particular normal form. The WHATWG standard does, albeit implicitly.
Path Segment Normalisation
Path segment normalisation involves the interpretation of dotted-segments. Colloquially, a single-dot segment has the meaning of “select the current directory” whereas a double-dot segment has the meaning of “select the parent directory”. Dotted segments are defined by the following rules, where the addition of %2e and %2E has been motivated by security concerns. Again this is in accordance with the WHATWG standard.
dot ::= . | %2e | %2E
dots ::= dot dot
Path equivalence is defined by the following equations. The equations must be exhaustively applied from left-to-right to normalise an URL.
drive x| ≈ drive x:
path-root /・dir y ≈ path-root / — if y :: dots
path-root /・file y ≈ path-root / — if y :: dots
dir x ≈ ε — if x :: dot
file x ≈ ε — if x :: dot
dir x・dir y ≈ ε — if y :: dots and not x :: dots
dir x・file y ≈ ε — if y :: dots and not x :: dots
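The exhaustive left-to-right application of these equations can be sketched as a rewriting loop over a list of (type, value) pairs; that representation, the restart-after-each-rule strategy and the function names are assumptions of this example.

```python
import re

# dot ::= . | %2e | %2E ; dots ::= dot dot
DOT  = re.compile(r"^(\.|%2e)$", re.IGNORECASE)
DOTS = re.compile(r"^(\.|%2e){2}$", re.IGNORECASE)

def is_dot(s):  return DOT.match(s) is not None
def is_dots(s): return DOTS.match(s) is not None

def normalise_path(url):
    """Exhaustively apply the path-equivalence rules, left to right."""
    url = list(url)
    changed = True
    while changed:
        changed = False
        for i, (t, v) in enumerate(url):
            if t == "drive" and v.endswith("|"):
                url[i] = ("drive", v[:-1] + ":"); changed = True; break
            if t in ("dir", "file") and is_dot(v):
                del url[i]; changed = True; break         # dir/file dot ≈ ε
            if i > 0 and t in ("dir", "file") and is_dots(v):
                pt, pv = url[i - 1]
                if pt == "path-root":
                    del url[i]; changed = True; break     # dots under the root
                if pt == "dir" and not is_dots(pv):
                    del url[i - 1:i + 1]; changed = True; break
    return url
```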
Authority Normalisation
Authority equivalence is defined by the following equations. Like path normalisation, the equations must be exhaustively applied from left-to-right to normalise an URL. This has the same effect as removing any empty port or password component, and if the URL does not have a password after that, to also remove any empty username component.
password ε ≈ ε
username ε・host h ≈ host h
port ε ≈ ε
Scheme-Based Authority Normalisation
If an URL has a scheme, then a number of additional equivalences apply to the authority. Normalisation according to these rules involves the removal of default ports, and similarly, removing the host from a file-URL if its value is localhost.
scheme http・authority (xs・port 80) ≈ scheme http・authority xs
scheme ws・authority (xs・port 80) ≈ scheme ws・authority xs
scheme ftp・authority (xs・port 21) ≈ scheme ftp・authority xs
scheme wss・authority (xs・port 443) ≈ scheme wss・authority xs
scheme https・authority (xs・port 443) ≈ scheme https・authority xs
scheme file・authority (host localhost) ≈ scheme file・authority ε
These rules apply in combination with the following rule that states that scheme equivalence is case-insensitive.
scheme scheme | ≈ | scheme (lowercase scheme) |
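Together, these equivalences amount to dropping a default port, and dropping a localhost host from a file-URL. A sketch, assuming authorities are represented as lists of (type, value) pairs and that the names used here are invented for this example:

```python
DEFAULT_PORTS = {"http": 80, "ws": 80, "ftp": 21, "wss": 443, "https": 443}

def normalise_authority(scheme, authority):
    """Apply the scheme-based equivalences to an authority value."""
    scheme = scheme.lower()    # scheme equivalence is case-insensitive
    if scheme == "file" and authority == [("host", "localhost")]:
        return []              # (authority ε)
    return [(t, v) for t, v in authority
            if not (t == "port" and DEFAULT_PORTS.get(scheme) == v)]
```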
Percent Coding Normalisation
There is a natural notion of equivalence for percent-encoded-strings: one could consider two percent-encoded strings to be equivalent if their percent-decodings are equal. However, for URL-components, this notion of equivalence is too strong.
… RFC3986 specifies reserved characters to allow for additional, application-specific interpretation. For reserved characters, one cannot assume that they are equivalent to their percent-encoding.
The WHATWG standard does not specify the semantics of percent-encoded-bytes in the components of an URL other than the host, and any percent-encoded bytes that are present in other components are not decoded. It does however specify sets of characters that must be encoded.
Percent Encode Profile
A percent-encode-profile is a mapping that maps each component-type in the set { username, password, host, dir, file, query, fragment } to a percent-encode-set.
Percent Encode Profiles
There are four distinct percent-encode-profiles that are relevant for this specification: generic, special, minimal and minimal-special. Each profile assigns, per component-type, a percent-encode-set drawn from the printable-ASCII ranges u+20–u+27, u+3A–u+40, u+5B–u+60 and u+7B–u+7E.
Unfortunately, like parsing, the percent-encoding of URLs as prescribed by the WHATWG standard is scheme-dependent.
Encode Profile Selection
The percent-encode-profile-for a given URL is
- the special profile — if the URL is a file-URL or a web-URL,
- the special profile — if the URL does not have a scheme,
- the generic profile — if otherwise the URL has an authority or a path-root, or
- the minimal profile — otherwise.
Printing
Printing is the process of converting an URL to an URL-string. To print an URL, it must first be subjected to an additional normalisation operation, and it must be percent encoded as follows.
Normalise for Printing
If an URL does not have a drive nor an authority, but it does have an empty first dir component (dir ε), then it must be normalised for printing by inserting a component (dir .) immediately before its first dir component.
This additional normalisation step is necessary because it is not possible to represent the affected URLs as an URL-string. Consider for example the URL (path-root /)・(dir ε)・(file foo). Printing this URL without an additional normalisation step would result in //foo, which represents the URL (authority (host foo)) instead.
This is not sufficient. There are similar issues around URLs with drive letters, which is an open issue.
- To print to an ASCII URL-string, percent encode all characters in any \ printable-ASCII.
- Otherwise percent encode control-c0, del-c1, surrogates and non-chars.
- In addition percent encode a subset of printable-ASCII, depending on the component, as indicated by the percent coding table.
- Let output be the empty string; then, for each of the components (type value) of the URL in tree order, convert the component to a string, depending on its type according to the following table, and append it to output.
scheme — printed as value :
authority — printed as // value
drive — printed as / value
path-root — printed as /
dir — printed as value /
file — printed as value
query — printed as ? value
fragment — printed as # value
Concluding
This final section specifies the behaviour of a parse-resolve-and-normalise operation that characterises the behaviour of the ‘basic URL parser’ as described in the WHATWG URL standard.
parse-resolve-and-normalise (string, base) :=
- preprocess string
- detect the parser mode, with the fallback mode as indicated by the base
- parse the preprocessed string according to the grammar to obtain url1
- force resolve url1 against base
- normalise and percent encode the result.
- The href getter returns the printed URL.
- The protocol getter returns the scheme + : or ε if absent.
- The username and password getters return ε if absent, or the value of the corresponding component otherwise.
- The host getter returns ε if absent, the value of the host if the port is absent, and the value of the host + : + the value of the port otherwise.
- The hostname getter returns ε if the URL has no host, otherwise it returns the value of the host.
- pathname (…)
- The search getter returns ε if the URL has no query, otherwise it returns the value of the query.
- The hash getter returns ε if the URL has no fragment, otherwise it returns the value of the fragment.
Note that there is a loss of information about the structure of the URL.