URL Specification
This document provides a concise formal definition of URLs and the language of URLs. It offers a rephrasing of the WHATWG URL standard in an effort to better reveal the internal structure of URLs and the operations on them. It provides a formal grammar and a set of elementary operations that are eventually combined together in such a way that the end result is equivalent with the WHATWG standard.
The goals of this project are:
- To provide a modular, concise formal specification of URLs that is behaviourally equivalent with, and covers all of the WHATWG URL standard with the exception of the web API's setters and the url-encoded form format.
- To provide a general model for URLs that can express relative references and to define reference resolution by means of a number of elementary operations, in such a way that the end result is in agreement with the WHATWG standard.
- To provide a concise way to describe the differences between the WHATWG standard and RFC 3986 and RFC 3987 as part of a larger effort.
- To enable and support work towards an update of RFC 3987 that resolves the incompatibilities with the WHATWG standard.
- To enable and support editorial changes to the WHATWG standard in the hope to recover the structural framework for relative references, normalisation and the hierarchy of equivalence relations that was put forward in RFC 3986.
Status
- This is version 0.10.1 of the specification.
- This document is licensed under a Creative Commons - CC BY-SA 2.0 license.
- Development can be discussed on GitHub. Questions are welcome there as well.
- There is a Reference Implementation to accompany this specification.
- The reference implementation passes all the URL-constructor tests in the web-platform test-suite.
Prelude
This section introduces basic concepts and notational conventions that are used throughout the rest of this document.
Components
The word component is used throughout this document to mean a tagged value, denoted by (type value), where type is taken from a predefined set of component-types.
The component-type is typeset in boldface and it may be used stand-alone as a component-type or in prefix position to denote a component.
For example, scheme denotes a component-type, whilst e.g. (scheme http) denotes a scheme-component that has the string http as its value.
When we are dealing with a collection of components, and the collection contains exactly one component with a given component-type, then we may use the type of the component to refer to its value directly.
For example, if S is a sequence of components (dir x・file y・query z) then the phrase “the query of S” is used to refer to the value z whereas “the query component of S” is used to refer to the component (query z) as a whole.
The value of a component may be a sequence of components itself, in which case the component may be referred to as a compound-component for clarity.
Sequences
This specification uses a notation for sequences that makes no distinction between one-element sequences and single elements. It uses the following notation:
- The empty sequence is denoted by ε.
- The concatenation of two sequences S and T is denoted by S・T.
Strings and Code-Points
For the purpose of this specification,
- A string is a sequence of characters.
- A character is a single Unicode code point.
- A code point is a natural number n < 17 × 2¹⁶.
- The empty string is a sequence of zero code points. It is denoted by ε.
Code points are denoted by a number in hexadecimal notation preceded by u+, in boldface. Code points that correspond to printable ASCII characters are often denoted by their corresponding glyph, typeset in monospace and on a screened background. For example, u+41 and `A` denote the same code point. Strings that contain only printable ASCII characters are often denoted as a connected sequence of glyphs, typeset likewise.
The printable ASCII characters are code points in the range u+20 to u+7E, inclusive. Note that this includes the space character u+20.
Character Sets
A character-set is a set of characters. A character-range is the largest character-set that includes a given least character c and a greatest character d. Such a character-range is denoted by { c–d }. The union of e.g. { a–b } and { c–d } is denoted by { a–b, c–d }, and this notation is generalised to n-ary unions. Common character-sets that are used throughout this document are defined below:
Grammars
The notation name ::= expression is used to define a production rule of a grammar, where the expression uses square brackets ( [ … ] ) for optional rules, a postfix star ( * ) for zero-or-more, a postfix plus ( + ) for one-or-more, an infix vertical line ( | ) for alternatives, monospaced type for literal strings and an epsilon ( ε ) for the empty string. Concatenation takes the highest precedence, followed by ( * ) and ( + ), and ( | ) is used with lowest operator precedence. Parentheses are used for grouping and disambiguation.
Pattern Testing
The shorthand notation string :: rule is used to express that a string string can be generated by the production rule rule of a given grammar. Likewise, the notation string :: expression is used to express that string can be generated by expression.
Percent Coding
This subsection is analogous to the section Percent-Encoding of RFC 3986 and the section Percent-encoded bytes of the WHATWG standard.
Bytes
For the purpose of this specification,
- A byte is a natural number n < 2⁸.
- A byte-sequence is a sequence of bytes.
Percent Encoded Bytes
A non-empty byte-sequence may be percent-encoded as a string by rendering each individual byte in two-digit hexadecimal notation, prefixed by `%`.
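As a non-normative illustration, the following Python sketch renders a byte-sequence as a percent-encoded string; the use of uppercase hexadecimal digits is an assumption made here, not something the definition above prescribes.

```python
def percent_encode_bytes(bs: bytes) -> str:
    # Each byte becomes "%" followed by its two-digit hexadecimal representation.
    return "".join("%{:02X}".format(b) for b in bs)

# percent_encode_bytes(b"\xe2\x98\x85")  ==  "%E2%98%85"
```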
Percent Encoded String
A percent-encoded-string is a string that may have zero or more percent-encoded-byte-sequences embedded within, as follows:
The grammar above may be relaxed in situations where error recovery is desired. This is achieved by adding a pct-invalid rule as follows:
There is an ambiguity in the loose grammar due to the overlap between pct-encoded-bytes and pct-invalid combined with uncoded. The ambiguity must be resolved by using the pct-encoded-bytes rule instead of the pct-invalid rule wherever possible.
Percent Encoding
To percent-encode a string string using a given character-set encode-set …
The only tricky bit in this is the encoding of multi-byte UTF-8 sequences, and the optional encoding override that specifies a non-UTF-8 encoding for the query component.
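The elided procedure may be pictured with the following sketch, which reuses percent_encode_bytes from the previous sketch and assumes that characters in the encode-set, as well as all non-ASCII characters, are encoded via their UTF-8 bytes, with no encoding override in effect.

```python
def percent_encode(s: str, encode_set: set) -> str:
    out = []
    for ch in s:
        if ch in encode_set or ord(ch) > 0x7E:
            # A multi-byte character yields one %XX escape per UTF-8 byte.
            out.append(percent_encode_bytes(ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)

# percent_encode("a b", {" "})  ==  "a%20b"
# percent_encode("★", set())    ==  "%E2%98%85"
```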
Percent Decoding
Percent decoding of strings is stratified into the following phases:
- Analyse the string into pct-encoded-bytes, pct-invalid and uncoded parts according to the pct-encoded-string grammar.
- Convert each of the pct-encoded-bytes parts to a byte-sequence and decode the byte-sequence to a string, assuming that it uses the Unicode UTF-8 encoding. Leave the pct-invalid and uncoded parts unmodified.
- Recompose the string by concatenating each of the parts.
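A sketch of these phases, in which maximal runs of valid escapes play the role of the pct-encoded-bytes parts; how malformed UTF-8 is handled (here: replacement characters) is an assumption, not part of the text above.

```python
import re

def percent_decode(s: str) -> str:
    def decode_run(m):
        data = bytes(int(h, 16) for h in re.findall(r"%([0-9A-Fa-f]{2})", m.group(0)))
        return data.decode("utf-8", errors="replace")
    # A maximal run of %XX escapes corresponds to one pct-encoded-bytes part;
    # pct-invalid and uncoded parts fall outside the pattern and are kept as-is.
    return re.sub(r"(?:%[0-9A-Fa-f]{2})+", decode_run, s)

# percent_decode("a%20b%E2%98%85")  ==  "a b★"
# percent_decode("100%")            ==  "100%"
```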
URL Model
An URL, or specifically an URL-structure, is conceptually distinct from an URL-string. An URL-structure is a collection of components that adheres to a number of constraints, whereas an URL-string is a string that represents an URL. An URL-string has an internal structure that can be parsed as an URL-structure. Conversely, an URL-structure may be converted to an URL-string.
URL
An URL is a sequence of components. The components are ordered by their type in an ascending order where the types are taken from the ordered set:
scheme < authority < drive < path-root < dir < file < query < fragment.
An URL must contain at most one component of each type except for dir components, of which it may have any finite amount. It must uphold the path-root-constraint: if an URL has an authority or a drive component, and it also has a dir or a file component, then it must also have a path-root component. If present, the values of the components have the following structure:
The scheme is a string scheme :: alpha (alpha | digit | `+` | `-` | `.`)*.
The authority component is a compound-component and its value is an Authority.
The drive is a two-character string consisting of an alpha followed by `:` or `|`.
The path-root is the single-character string `/`.
The dir, file, query and fragment are percent-encoded-strings. However, the file must be a nonempty percent-encoded-string, if present.
In an URL-string the component values are delineated from each other by means of sigils. Most component-types have an associated sigil that is used either in a prefix or postfix position to its value. However, in the case of the path-root the sigil is the single `/` character itself, and the file, host and username components do not have an associated sigil at all.
The components of an URL are paired with sigils as follows:

| Component | | URL-string |
|---|---|---|
| scheme scheme | ⟼ | scheme `:` |
| authority auth | ⟼ | `//` auth |
| drive x | ⟼ | `/` x |
| path-root s | ⟼ | `/` |
| dir name | ⟼ | name `/` |
| file name | ⟼ | name |
| query query | ⟼ | `?` query |
| fragment frag | ⟼ | `#` frag |
Note that the second character `:` or `|` of a drive component value is not considered to be its sigil, but is instead a part of its value.
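To make the model concrete, the sketches in the remainder of this document use a simple, non-normative representation: an URL is a list of (type, value) pairs ordered by component-type, with values left opaque.

```python
# Component-types in ascending order, as in the URL model above.
ORDER = ["scheme", "authority", "drive", "path-root", "dir", "file", "query", "fragment"]
RANK = {t: i for i, t in enumerate(ORDER)}

def is_well_formed(url):
    """Check the ordering, the multiplicity and the path-root-constraint."""
    ranks = [RANK[t] for t, _ in url]
    if ranks != sorted(ranks):
        return False
    types = [t for t, _ in url]
    if any(types.count(t) > 1 for t in ORDER if t != "dir"):
        return False
    present = set(types)
    # path-root-constraint: an authority or a drive together with a dir or a
    # file component requires a path-root component.
    if (present & {"authority", "drive"}) and (present & {"dir", "file"}):
        return "path-root" in present
    return True

# is_well_formed([("path-root", "/"), ("dir", "a"), ("file", "b")])  ==  True
# is_well_formed([("authority", "host"), ("file", "b")])             ==  False
```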
Types and Shapes
There is a further categorisation of URLs that is used in several of the operations that will be defined later.
Web-URL
A web-URL is an URL that has a web-scheme. A web-scheme is a string scheme such that (lowercase scheme) :: http | https | ws | wss | ftp.
File-URL
A file-URL is an URL that has a file-scheme. A file-scheme is a string scheme such that (lowercase scheme) :: file.
Schemeless-URL
A schemeless-URL is an URL that does not have a scheme component.
Authority Model
The authority component of an URL is a compound-component and its value is an Authority structure:
Authority
An Authority is a sequence of components ordered by their type, taken from the ordered set:
userinfo < host < port.
An Authority must have at most one component per type, but if it does have a userinfo or a port component then it must also have a host component. If present, the values of the components have the following structure:
The userinfo is a Userinfo structure.
The password is a percent-encoded-string.
The host is a Host structure.
The port is a natural number n < 2¹⁶ or the empty string ε.
In an URL-string, the components of an Authority are paired with sigils as follows:

| Component | | Authority-string |
|---|---|---|
| userinfo info | ⟼ | info `@` |
| host value | ⟼ | value |
| port ε | ⟼ | `:` |
| port n | ⟼ | `:` n |
Userinfo
The Userinfo is a sequence of components ordered by their type, taken from the ordered set:
username < password.
The Userinfo must have a single username component, and at most one password component.
Finally, the components of the Userinfo are paired with sigils as follows:

| Component | | Userinfo-string |
|---|---|---|
| username name | ⟼ | name |
| password pass | ⟼ | `:` pass |
Host
A Host is either:
- An ipv6-address,
- an opaque-host,
- an ipv4-address, or
- a domain-name.
The WHATWG standard specifies scheme-dependent expectations on the Host of an URL. For any generic URL it merely enforces that its opaque-host does not contain certain characters, as is specified by the URL-grammar. However, for file and web-URLs it attempts to parse their opaque-host as a domain or ipv4-address. This additional Host parsing is stratified into the following phases:
- Percent decode.
- Apply domain-to-ASCII
- Err on 'forbidden domain codepoints'.
- Detect and interpret IPv4 addresses.
IPv6 Address
IPv6 Address — Strict
An IPv6 address-string is a representation of a natural number n < 2¹²⁸ that is used as an identifier. It is accurately described by the production rule IPv6address in the Host section of RFC 3986.
The IPv6 address parser in the WHATWG standard however implies a more tolerant definition of IPv6 address-strings.
IPv6 Address — Loose
Domain and IPv4 Address
IPv4 Address
An IPv4 address-string consists of one up to four dot-separated numbers with an optional trailing dot. The numbers may use decimal, octal or hexadecimal notation, as follows:
Note that `0x` is parsed as a hexadecimal number; it is interpreted as 0.
There is an additional semantic constraint that can render an ipv4-address that uses the loose grammar invalid: it must not represent a number that exceeds the addressable range of 2³², where the number of segments determines the magnitude of each segment. …
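The constraint may be pictured as follows; the sketch assumes the segments have already been parsed into numbers and that a trailing dot has been removed.

```python
def ipv4_value(segments):
    """Return the address as a number n < 2**32, or None if out of range."""
    assert 1 <= len(segments) <= 4
    *head, last = segments
    # Each leading segment stands for one byte; the last segment supplies the
    # remaining bytes, so its allowed magnitude depends on the segment count.
    if any(s > 0xFF for s in head) or last >= 256 ** (5 - len(segments)):
        return None
    value = last
    for i, s in enumerate(reversed(head), start=5 - len(segments)):
        value += s * 256 ** i
    return value

# ipv4_value([127, 0, 0, 1])  ==  0x7F000001
# ipv4_value([0x7F, 1])       ==  0x7F000001   (e.g. "127.1")
```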
IPv4 Address – Strict
A strict IPv4 address-string consists of four dot-separated numbers, without a trailing dot, where in addition each number must use the shortest possible decimal representation of a natural number n ≤ 255.
The use of the shortest decimal representation can be expressed as follows:
Component Characters
When it comes to handling characters that may occur in URL-component values, we have 'for historical reasons' ended up in a situation where we distinguish, per component, between characters that are:
- v: valid;
- E: valid but percent-encoded;
- T: invalid but tolerated;
- F: invalid and fixed by percent-encoding;
- R: invalid and rejected.
In order to specify how characters must be handled in components, it is useful to first divide the entire space of characters into non-overlapping character sets as follows:
This collection of character sets covers the entire character space. To exhaustively specify how characters in components should be handled, we divide each of these sets into further subdivisions and specify per component the status of the characters in each subset:
| | | username | password | opaque-host | dir and file | opaque path | query | fragment |
|---|---|---|---|---|---|---|---|---|
| unreserved | - . _ ~ | v | v | v | v | v | v | v |
| | alpha ∪ digit | v | v | v | v | v | v | v |
| | other-unicode | E | E | E | E | E | E | E |
| sub-delims | ! $ & ( ) * + , | v | v | v | v | v | v | v |
| | ' | v | v | v | v | v | E1 | v |
| | ; = | E | E | v | v | v | v | v |
| gen-delims | @ | F | F | R | v | v | v | v |
| | : | n/a | F | n/a | v | v | v | v |
| | / | n/a | n/a | n/a | n/a | T | v | v |
| | ? | n/a | n/a | n/a | n/a | n/a | v | v |
| | # | n/a | n/a | n/a | n/a | n/a | n/a | T |
| pct | % | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| invalid | u+9, u+A, u+D | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| | u+0 | F | F | R | F | F | F | F |
| | other control | F | F | F | F | F | F | F |
| | u+20 (space) | F | F | R | F | T | F | F |
| | " | F | F | T | F | T | F | F |
| | < > | F | F | R | F | T | F | F |
| | ` | F | F | T | F | T | T | F |
| | ^ | F | F | R | F | T | T | T |
| | [ \ ] \| | F | F | R | T | T | T | T |
| | { } | F | F | T | F | T | T | T |
Note that applying the rules for encoding component characters does not necessarily produce valid URLs, specifically, if a component contains a character that is marked as T: invalid but tolerated.
Note that the apostrophe ' must instead be left untouched in the query of non-special URLs, and that the path of generic URLs that have neither an authority nor a path-root is handled as per the opaque path column.
Reference Resolution
This section defines reference resolution operations that are analogous to the algorithms that are described in the chapter Reference Resolution of RFC 3986. The operations that are defined in this section are used in a subsequent section to define a parse-resolve-and-normalise operation that accurately describes the behaviour of the ‘basic URL parser’ as defined in the WHATWG URL standard.
Reference resolution as defined in this section does not involve URL-strings. It operates on URLs as defined in the URL Model section above. In contrast with RFC 3986 and with the WHATWG standard, it does not do additional normalisation, which is relegated to the section Equivalences and Normalisation instead.
Order and Prefix
A property that is particularly useful is the order of an URL. Colloquially, the order is the type of the first component of an URL. The order may be used as an argument to specify various prefixes of an URL.
The Order of an URL
The order of an URL (ord url) is defined to be:
- fragment if url is the empty URL.
- The type of its first component otherwise.
Order Prefix
The order-prefix (url upto order) is defined to be the shortest prefix of url that contains:
- all components of url with a type strictly smaller than order and
- all dir components with a type weakly smaller than order.
Reference Transformation
Based on the order and the order-prefix we define a “rebase” operation that slightly generalises reference transformation as specified in the section Transform References of RFC 3986.
Rebase
The rebase operation (url onto base-url) is defined to return the shortest URL that has base-url upto (ord url) as a prefix and url as a postfix.
Be aware that rebase does a bit more than simple sequence concatenation. It may add a path-root component to satisfy the path-root-constraint of the URL model. For example, if base is the URL represented by //host and input is the URL represented by foo/bar then (input onto base) is represented by //host/foo/bar, thus containing a path-root even though neither base nor input has one.
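A sketch of ord, upto and rebase over the component-list representation introduced earlier (ORDER, RANK):

```python
def ord_url(url):
    # The order of the empty URL is fragment; otherwise the type of its first component.
    return url[0][0] if url else "fragment"

def upto(url, order):
    # Components with a type strictly below `order`, plus dir components weakly below it.
    return [(t, v) for t, v in url
            if RANK[t] < RANK[order] or (t == "dir" and RANK[t] <= RANK[order])]

def rebase(url, base):
    result = upto(base, ord_url(url)) + list(url)
    present = {t for t, _ in result}
    # Insert a path-root where needed to satisfy the path-root-constraint.
    if (present & {"authority", "drive"}) and (present & {"dir", "file"}) \
            and "path-root" not in present:
        i = next(i for i, (t, _) in enumerate(result) if RANK[t] > RANK["path-root"])
        result.insert(i, ("path-root", "/"))
    return result

# rebase([("dir", "foo"), ("file", "bar")], [("authority", "host")]) ==
#   [("authority", "host"), ("path-root", "/"), ("dir", "foo"), ("file", "bar")]
```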
Rebase Properties
The rebase operation has a number of pleasing mathematical properties, as follows:
- ord (url2 onto url1) is the least type of { ord url1, ord url2 }
- (url3 onto url2) onto url1 = url3 onto (url2 onto url1)
- url1 onto ε = url1
- ε onto url1 = url1 — if url1 does not have a fragment.
Forcing
Forcing is used as a final step in the process of resolving a web-URL or a file-URL. It ensures that the forced URL has an authority component and a path-root component. Note that it is possible for the force operations to fail.
Forcing a File URL
To force a file-URL url:
- If url does not have an authority then set its authority component to (authority ε).
- If otherwise the authority of url has a username or a port then fail.
- If url does not have a drive then set its path-root component to (path-root `/`).
Forcing a Web URL
To force a web-URL url:
- Set the path-root component of url to (path-root `/`).
- If url has a non-empty authority then return.
- Otherwise let component be the first dir or file component whose value is not ε. If no such component exists, fail.
- Remove all dir or file components that precede component and remove component as well.
- Let auth be the value of component parsed as an Authority and set the authority of url to auth. If the value cannot be parsed as an Authority then fail.
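Both force operations can be sketched over the same representation. Authority values are taken to be component lists here, and parse_authority is a hypothetical helper that parses a string as an Authority or returns None; neither is part of the normative text.

```python
class ForceError(Exception):
    pass

def has(components, t):
    return any(x == t for x, _ in components)

def set_component(url, comp):
    # Insert comp at its ordered position if no component of that type is present.
    if not has(url, comp[0]):
        i = next((i for i, (t, _) in enumerate(url) if RANK[t] > RANK[comp[0]]), len(url))
        url.insert(i, comp)

def force_file(url):
    url = list(url)
    auth = next((v for t, v in url if t == "authority"), None)
    if auth is None:
        set_component(url, ("authority", []))           # (authority ε)
    elif has(auth, "userinfo") or has(auth, "port"):    # a userinfo always has a username
        raise ForceError("file-URL with a username or a port")
    if not has(url, "drive"):
        set_component(url, ("path-root", "/"))
    return url

def force_web(url, parse_authority):
    url = list(url)
    set_component(url, ("path-root", "/"))
    auth = next((v for t, v in url if t == "authority"), None)
    if auth:                                            # a non-empty authority: done
        return url
    # Promote the first non-empty dir or file component to an authority.
    idx = next((i for i, (t, v) in enumerate(url) if t in ("dir", "file") and v != ""), None)
    if idx is None:
        raise ForceError("no component to promote to an authority")
    value = url[idx][1]
    url = [c for i, c in enumerate(url) if i > idx or c[0] not in ("dir", "file")]
    new_auth = parse_authority(value)
    if new_auth is None:
        raise ForceError("the value does not parse as an Authority")
    url = [c for c in url if c[0] != "authority"]       # replace an empty authority, if any
    set_component(url, ("authority", new_auth))
    return url
```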
Force Properties
The force operation does not have pleasing mathematical properties with respect to the rebase operation. The following equalities do not hold in general:
- force ((force url1) onto (force url2)) ≠ force (url1 onto url2)
- force ((force url1) onto url2) ≠ force (url1 onto url2)
- force (url1 onto (force url2)) ≠ force (url1 onto url2)
Resolution
The subsection Transform References of RFC 3986 specifies two variants of reference resolution: a generic, strict variant and an alternative non-strict variant. I have chosen to rename the non-strict variant to legacy resolution. This specification adds a third variant that characterises the behaviour that is specified in the WHATWG standard.
Strict Resolution
The strict-resolution (strict-resolve url base) of an URL url onto an URL base is defined to be url onto base — if url has a scheme or base has a scheme. Otherwise resolution fails.
Legacy Resolution
The legacy-resolution (legacy-resolve url base) is defined to be the strict-resolution (strict-resolve ~url base) where ~url is:
- url with its scheme removed if both url and base have a scheme and the value of their scheme components case-insensitively compare equal, or
- url itself otherwise.
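A sketch of the two variants, in terms of the rebase sketch given earlier:

```python
class ResolveError(Exception):
    pass

def scheme_of(url):
    return next((v for t, v in url if t == "scheme"), None)

def strict_resolve(url, base):
    # Fails only when neither the input nor the base has a scheme.
    if scheme_of(url) is None and scheme_of(base) is None:
        raise ResolveError("neither the input nor the base has a scheme")
    return rebase(url, base)

def legacy_resolve(url, base):
    s, b = scheme_of(url), scheme_of(base)
    if s is not None and b is not None and s.lower() == b.lower():
        # Drop the scheme of the input, so that the prefix taken from the
        # base starts at its scheme component.
        url = [c for c in url if c[0] != "scheme"]
    return strict_resolve(url, base)
```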
WHATWG Resolution
The whatwg-resolution of url onto base is defined to be:
- force (legacy-resolve url base)
  - if url is a web-URL or a file-URL, or
  - if url does not have a scheme and base is a web-URL or a file-URL;
- strict-resolve url base
  - if otherwise the first component of url is a scheme or a fragment, or
  - if base has an authority or a path-root;
- otherwise, the operation fails.
If force modifies its input then “applications are encouraged” to issue a validation warning. If in the process of whatwg-resolution either the force operation, or the internally used strict-resolution operation fails, then the whatwg-resolution fails as well.
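Combining the earlier sketches (legacy_resolve, force_file, force_web, ord_url and has), whatwg-resolution may be pictured as follows:

```python
WEB_SCHEMES = {"http", "https", "ws", "wss", "ftp"}

def is_web(url):
    s = scheme_of(url)
    return s is not None and s.lower() in WEB_SCHEMES

def is_file(url):
    s = scheme_of(url)
    return s is not None and s.lower() == "file"

def whatwg_resolve(url, base, parse_authority):
    if is_web(url) or is_file(url) or \
            (scheme_of(url) is None and (is_web(base) or is_file(base))):
        result = legacy_resolve(url, base)
        # The resolved URL has a web- or file-scheme here; force accordingly.
        return force_file(result) if is_file(result) else force_web(result, parse_authority)
    if ord_url(url) in ("scheme", "fragment") or has(base, "authority") or has(base, "path-root"):
        return strict_resolve(url, base)
    raise ResolveError("the operation fails")
```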
Parsing
Parsing is the process of converting an URL-string to an URL.
Parsing is stratified into the following phases:
- Preprocessing.
- Selecting a parser mode.
- Parsing.
- Decoding and parsing the host.
Preprocessing
The appendix Delimiting a URI in Context of RFC 3986 states that surrounding white-space should be removed from an URI when it is extracted from its surrounding context. The WHATWG standard makes this more explicit and specifies a preprocessing step that removes specific control and white-space characters from the input string before parsing.
Before parsing, the input string input must be preprocessed:
- Remove all leading and trailing c0-space characters from input.
- Remove all u+9 (tab), u+A (line-feed) and u+D (carriage-return) characters from input.
- If the result starts with a web-scheme followed by `:`, or with a file-scheme followed by `:`, or if it does not start with a scheme followed by `:` at all, then furthermore replace all occurrences of `\` that occur before the first `?` or `#` character, or before the end of the string otherwise, with the `/` character.
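A sketch of these steps; the backslash replacement follows the reading given in the last item above and should be taken as an approximation.

```python
import re

C0_AND_SPACE = "".join(chr(c) for c in range(0x21))   # C0 controls and the space character
SPECIAL_SCHEMES = {"http", "https", "ws", "wss", "ftp", "file"}

def preprocess(s: str) -> str:
    s = s.strip(C0_AND_SPACE)
    s = s.replace("\t", "").replace("\n", "").replace("\r", "")
    scheme = re.match(r"[A-Za-z][A-Za-z0-9+.\-]*:", s)
    if scheme is None or scheme.group(0)[:-1].lower() in SPECIAL_SCHEMES:
        # Replace backslashes up to the first "?" or "#", or in the whole
        # string if there is none.
        m = re.search(r"[?#]", s)
        cut = m.start() if m else len(s)
        s = s[:cut].replace("\\", "/") + s[cut:]
    return s

# preprocess("  http:\\\\example.com\\foo  ")  ==  "http://example.com/foo"
```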
URL Grammar
We can now specify the full grammar for URL-strings, using an alternative version for the auth-path rule for file-URLs:
url ::= [ scheme `:` ] ( auth-path | path ) [ `?` query ] [ `#` fragment ]

The general auth-path rule:

auth-path ::= `//` authority [ path-root path-rel ]

The auth-path rule for file-URLs:

auth-path ::= auth-drive [ path-root path-rel ]
auth-drive ::= auth-drive-invalid | `//` authority [ `/` drive ] | `/` drive
auth-drive-invalid ::= [ `//` authorityε ] drive
drive ::= alpha ( `:` | `|` )
A forced file-URL however does not have a username, password or a port, and will have a path-root. Moreover, if it has a non-empty authority then its host will not be an opaque-host, as it will have been processed and verified to be a domain.

Rules for the authority and the path:

authorityε ::= ε
authority ::= ε | [ userinfo `@` ] host [ `:` port ]
userinfo ::= username [ `:` password ]
host ::= `[` ip6-address `]` | opaque-host
port ::= ε | digit+
path ::= [ path-root ] path-rel
path-root ::= `/`
path-rel ::= ( dir `/` )* [ file ]

If the port is not the empty string, then it must in addition be a decimal representation of a natural number n < 2¹⁶, where leading zeroes are allowed.
A forced web-URL will have a non-empty authority and a path-root. Moreover, if it has a host then it will not be an opaque-host, as it will have been processed and verified to be a domain.

Rules for the components:

scheme ::= alpha ( alpha | digit | `+` | `-` | `.` )*
username ::= ( uchar | pct )*
password ::= ( pchar | pct )*
opaque-host ::= ( hchar | pct )+
dir ::= ( pchar | pct )*
file ::= ( pchar | pct )+
query ::= ( qchar | pct )*
fragment ::= ( fchar | pct )*

Using the following rule for valid or invalid percent-encoded bytes:

pct ::= pct-encoded-byte | pct-invalid

The rules are based on the following character-sets:

uchar := any \ { `%`, `#`, `?`, `/`, `:` }
hchar := any \ { `%`, `#`, `?`, `/`, `:`, `@` } \ { u+0, u+20, `<`, `>`, `[`, `\`, `]`, `^`, `|` }
pchar := any \ { `%`, `#`, `?`, `/` }
qchar := any \ { `%`, `#` }
fchar := any \ { `%` }
This grammar provides a lenient and a strict description of URL strings at once, as follows: a strictly valid URL-string does not use the pct-invalid nor the auth-drive-invalid rules, and it must not contain in its components any characters that are specified as R, T or F for that component-type as per the Component Characters section. Moreover, a strictly valid URL must not contain a Userinfo component in its authority.
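For a rough, non-normative picture of the url rule, the following sketch splits an URL-string into components using the well-known splitting expression from appendix B of RFC 3986; it ignores drive letters, the file-specific rules and all validation.

```python
import re

URL_RE = re.compile(r"^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?$")

def split_url(s):
    scheme, authority, path, query, fragment = URL_RE.match(s).groups()
    url = []
    if scheme is not None:
        url.append(("scheme", scheme))
    if authority is not None:
        url.append(("authority", authority))
    if path.startswith("/"):
        url.append(("path-root", "/"))
        path = path[1:]
    segments = path.split("/")
    for name in segments[:-1]:
        url.append(("dir", name))
    if segments[-1] != "":
        url.append(("file", segments[-1]))
    if query is not None:
        url.append(("query", query))
    if fragment is not None:
        url.append(("fragment", fragment))
    return url

# split_url("http://host/a/b?q#f") ==
#   [("scheme", "http"), ("authority", "host"), ("path-root", "/"),
#    ("dir", "a"), ("file", "b"), ("query", "q"), ("fragment", "f")]
```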
Equivalences and Normalisation
This section is analogous to the section Normalization and Comparison of RFC 3986. The RFC, however, does not prescribe a particular normal form; the WHATWG standard does, albeit implicitly.
Path Segment Normalisation
Path segment normalisation involves the interpretation of dotted-segments. Colloquially, a single-dot segment has the meaning of “select the current directory” whereas a double-dot segment has the meaning of “select the parent directory”. Dotted segments are defined by the following rules, where the addition of %2e and %2E has been motivated by security concerns. Again this is in accordance with the WHATWG standard.
Path equivalence is defined by the following equations. The equations can be exhaustively applied from left-to-right to normalise an URL.
- drive x`|` ≈ drive x`:`
- dir x ≈ ε — if x :: dot
- dir x・dir y ≈ ε — if y :: dots and not x :: dots
- path-root `/`・dir y ≈ path-root `/` — if y :: dots
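The dotted-segment equations can be applied in a single left-to-right pass. The sketch below assumes that dot matches . and %2e and that dots matches their double-dot combinations, case-insensitively, in line with the WHATWG standard.

```python
def is_dot(v):
    return v.lower() in (".", "%2e")

def is_dots(v):
    return v.lower() in ("..", ".%2e", "%2e.", "%2e%2e")

def normalise_path(url):
    out = []
    for t, v in url:
        if t == "drive" and v.endswith("|"):
            v = v[0] + ":"                                  # drive x| ≈ drive x:
        if t == "dir" and is_dot(v):
            continue                                        # dir x ≈ ε, if x :: dot
        if t == "dir" and is_dots(v):
            if out and out[-1][0] == "dir" and not is_dots(out[-1][1]):
                out.pop()                                   # dir x・dir y ≈ ε
                continue
            if out and out[-1][0] == "path-root":
                continue                                    # path-root・dir y ≈ path-root
        out.append((t, v))
    return out

# normalise_path([("path-root", "/"), ("dir", "a"), ("dir", ".."), ("file", "b")]) ==
#   [("path-root", "/"), ("file", "b")]
```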
Authority Normalisation
Authority equivalence is defined by the following equations. Like path normalisation, the equations must be exhaustively applied from left-to-right to normalise an URL. This has the same effect as removing any empty port or password component and, if the URL does not have a password after that, also removing any empty username component.
- password ε ≈ ε
- userinfo (username ε) ≈ ε
- port ε ≈ ε
Scheme-Based Authority Validation
If an URL has a scheme, then a number of additional requirements may be enforced on the authority; specifically, these should be enforced or fixed whilst resolving (as opposed to rebasing) an URL:
- A file-URL and a web-URL must have an authority component.
- The authority of a file-URL must not have a userinfo nor a port component.
- A web-URL must have a non-empty authority.
- The host of a file-URL or a web-URL must not be an opaque-host.
Scheme-Based Authority Normalisation
If an URL has a scheme, then a number of additional equivalences apply to the authority. Normalisation according to these rules involves the removal of default ports and, similarly, removing the host from a file-URL if its value is localhost.
- scheme http ・ authority (xs・port 80) ≈ scheme http ・ authority xs
- scheme ws ・ authority (xs・port 80) ≈ scheme ws ・ authority xs
- scheme ftp ・ authority (xs・port 21) ≈ scheme ftp ・ authority xs
- scheme wss ・ authority (xs・port 443) ≈ scheme wss ・ authority xs
- scheme https ・ authority (xs・port 443) ≈ scheme https ・ authority xs
- scheme file ・ authority (host localhost) ≈ scheme file ・ authority ε
These rules apply in combination with the following rule that states that scheme equivalence is case-insensitive.
- scheme scheme ≈ scheme (lowercase scheme)
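A sketch of these equivalences over the component-list representation, taking authority values to be component lists and ports to be numbers:

```python
DEFAULT_PORTS = {"http": 80, "ws": 80, "ftp": 21, "wss": 443, "https": 443}

def normalise_scheme_based(url):
    scheme = next((v.lower() for t, v in url if t == "scheme"), None)
    out = []
    for t, v in url:
        if t == "scheme":
            v = v.lower()                                   # scheme ≈ lowercase scheme
        elif t == "authority":
            v = [(at, av) for at, av in v
                 if not (at == "port" and av == DEFAULT_PORTS.get(scheme))]
            if scheme == "file" and v == [("host", "localhost")]:
                v = []                                      # file localhost ≈ authority ε
        out.append((t, v))
    return out

# normalise_scheme_based([("scheme", "HTTP"),
#                         ("authority", [("host", "example.com"), ("port", 80)])]) ==
#   [("scheme", "http"), ("authority", [("host", "example.com")])]
```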
Printing
Printing is the process of converting an URL to an URL-string. Printing can be stratified into the following three phases:
- Normalising the URL for printing.
- Converting each of the components of the URL to a string.
- Composing the final URL-string from the printed components.
Normalise for Printing
Before an URL is printed, it must first be subjected to an additional normalisation operation, and it must be percent-encoded, as follows.
If an URL does not have a drive nor an authority, but it does have an empty first dir component (dir ε), then it must be normalised for printing by inserting a component (dir `.`) immediately before its first dir component.

Otherwise, if the first component of an URL is a dir or a file component and its value starts with a scheme-like string, i.e. the value matches alpha ( alpha | digit | `+` | `-` | `.` )* `:` any*, then the first occurrence of `:` in the value must be replaced with its percent-encoding %3A.
This additional normalisation step is necessary because it is not possible to represent the affected URL as an URL-string otherwise. Consider for example the URL (path-root `/`)・(dir ε)・(file foo). Printing this URL without an additional normalisation step would result in //foo, which represents the URL (authority (host foo)) instead.
There is a similar issue around URLs with drive letters, but these issues are not properly addressed in the current WHATWG standard, making it difficult for us to prescribe appropriate counter measures.
Printing URLs
To convert an Authority to an Authority-string, replace each of the sub-components with their component-value concatenated with their prefix– or postfix-sigil, if any. This results in a sequence of strings, which is converted to the final result by concatenating them in order.
To convert an URL to an URL-string, first convert the value of its authority component (if any) to an Authority-string. Then continue to replace each component with their component-value concatenated with their identifying sigil, if any. This results in a sequence of strings, which is converted to a single string by simply concatenating them in order.
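A sketch of the sigil pairing, assuming the value of the authority component has already been converted to an Authority-string:

```python
# Sigils per component-type, as in the tables of the URL Model section.
SIGILS = {
    "scheme":    lambda v: v + ":",
    "authority": lambda v: "//" + v,
    "drive":     lambda v: "/" + v,
    "path-root": lambda v: "/",
    "dir":       lambda v: v + "/",
    "file":      lambda v: v,
    "query":     lambda v: "?" + v,
    "fragment":  lambda v: "#" + v,
}

def print_url(url):
    return "".join(SIGILS[t](v) for t, v in url)

# print_url([("scheme", "http"), ("authority", "host"), ("path-root", "/"),
#            ("dir", "a"), ("file", "b"), ("query", "q")])  ==  "http://host/a/b?q"
```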
Concluding
This final section specifies the behaviour of a parse-resolve-and-normalise operation that characterises the behaviour of the ‘basic URL parser’ as described in the WHATWG URL standard.
parse-resolve-and-normalise (string, base) :=
- Preprocess string
- Detect the parser mode, with the fallback mode as indicated by the base
- Parse the preprocessed string according to the grammar to obtain url1
- Force resolve url1 against base
- Normalise and percent encode the result.
That results in an URL-structure. One can then wrap an interface around it to describe the URL class of web browsers:
- The href getter returns the printed URL.
- The protocol getter returns the scheme + `:`, or ε if absent.
- The username and password getters return ε if absent, or the value of the corresponding component otherwise.
- The host getter returns ε if absent, the value of the host if the port is absent, and the value of the host + `:` + the value of the port otherwise.
- The hostname getter returns ε if the URL has no host, otherwise it returns the value of the host.
- pathname (…)
- The search getter returns ε if the URL has no query, otherwise it returns `?` + the value of the query.
- The hash getter returns ε if the URL has no fragment, otherwise it returns `#` + the value of the fragment.