- Absolutize(uriRef, baseUri)
-
Resolves a URI reference to absolute form, implementing the reference
resolution algorithm of RFC 3986 section 5. The URI reference is
considered to be relative to the given base URI.
It is the caller's responsibility to ensure that the base URI matches
the absolute-URI syntax rule of RFC 3986, and that its path component
does not contain '.' or '..' segments if the scheme is hierarchical.
Unexpected results may occur otherwise.
This function only conducts a minimal sanity check in order to determine
if relative resolution is possible: it raises a UriException if the base
URI does not have a scheme component. While it is true that the base URI
is irrelevant if the URI reference has a scheme, an exception is raised
in order to signal that the given string does not even come close to
meeting the criteria to be usable as a base URI.
It is the caller's responsibility to determine whether the
URI reference constitutes a "same-document reference", as defined in RFC
2396 or RFC 3986. As per the spec, dereferencing a same-document
reference "should not" involve retrieval of a new representation of the
referenced resource. Note that the two specs define same-document
reference differently: RFC 2396 says it is *only* the cases where the
reference is the empty string, or "#" followed by a fragment; RFC 3986
requires comparing the base URI to the absolute form of the
reference (as produced by the spec's resolution algorithm), minus its
fragment component, if any.
This function is similar to urlparse.urljoin() and urllib.basejoin().
Those functions, however, are (as of Python 2.3) outdated, buggy, and/or
designed to produce results acceptable for use with other core Python
libraries, rather than being earnest implementations of the relevant
specs. Their problems are most noticeable in their handling of
same-document references and 'file:' URIs, both being situations that
come up far too often to consider the functions reliable enough for
general use.
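The difference between the two same-document definitions can be made concrete with a small sketch. It uses Python 3's urllib.parse.urljoin() as a stand-in for Absolutize(), purely so that the example is self-contained; the helper names are hypothetical and are not part of this module.

```python
from urllib.parse import urljoin, urldefrag

def is_same_document_rfc2396(uri_ref):
    """RFC 2396 rule: only the empty string or a bare '#fragment' qualifies."""
    return uri_ref == "" or uri_ref.startswith("#")

def is_same_document_rfc3986(uri_ref, base_uri):
    """RFC 3986 rule: resolve first, then compare with the base, ignoring fragments."""
    resolved = urljoin(base_uri, uri_ref)  # stand-in for Absolutize(uri_ref, base_uri)
    return urldefrag(resolved).url == urldefrag(base_uri).url

base = "http://example.org/docs/page"
for ref in ("", "#top", "page", "other"):
    print(repr(ref), is_same_document_rfc2396(ref), is_same_document_rfc3986(ref, base))
```

The reference "page" is the interesting case: it is not a same-document reference under the RFC 2396 rule, but it resolves to the base URI itself and so qualifies under the RFC 3986 rule.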
- BaseJoin(base, uriRef)
-
Merges a base URI reference with another URI reference, returning a
new URI reference.
It behaves exactly the same as Absolutize(), except the arguments
are reversed, and it accepts any URI reference (even a relative URI)
as the base URI. If the base has no scheme component, it is
evaluated as if it did, and then the scheme component of the result
is removed from the result, unless the uriRef had a scheme. Thus, if
neither argument has a scheme component, the result won't have one.
This function is named BaseJoin because it is very much like
urllib.basejoin(), but it follows the current RFC 3986 algorithms
for path merging, dot segment elimination, and inheritance of query
and fragment components.
WARNING: This function exists for two reasons: (1) because of a need
within the 4Suite repository to perform URI reference absolutization
using base URIs that are stored (inappropriately) as absolute paths
in the subjects of statements in the RDF model, and (2) because of
a similar need to interpret relative repo paths in a 4Suite product
setup.xml file as being relative to a path that can be set outside
the document. When these needs go away, this function probably will,
too, so it is not advisable to use it.
- GetScheme(uriRef)
-
Obtains, with optimum efficiency, just the scheme from a URI reference.
Returns a string, or if no scheme could be found, returns None.
- IsAbsolute(identifier)
-
Given a string believed to be a URI or URI reference, tests that it is
absolute (as per RFC 3986), not relative -- i.e., that it has a scheme.
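As an illustration of what GetScheme() and IsAbsolute() test for, the following sketch matches the scheme production of RFC 3986 (a letter followed by letters, digits, '+', '-' or '.', terminated by ':'). The function names are hypothetical, and this is not necessarily how the module implements either function.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), followed by ":"
_SCHEME = re.compile(r"([A-Za-z][A-Za-z0-9+.\-]*):")

def get_scheme(uri_ref):
    """Return the scheme of a URI reference, or None if it has none."""
    match = _SCHEME.match(uri_ref)
    return match.group(1) if match else None

def is_absolute(identifier):
    """A URI reference is treated as absolute when it has a scheme."""
    return get_scheme(identifier) is not None

print(get_scheme("http://example.org/a"))  # http
print(get_scheme("/a/b:c"))                # None
```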
- MakeUrllibSafe(uriRef)
-
Makes the given RFC 3986-conformant URI reference safe for passing
to legacy urllib functions. The result may not be a valid URI.
As of Python 2.3.3, urllib.urlopen() does not fully support
internationalized domain names and does not strip fragment components;
on Windows, it also expects file URIs to use '|' instead of ':' in the
path component corresponding to the drivespec. In addition, it relies on
urllib.unquote(), which mishandles unicode arguments. This function
produces a URI reference that works around these issues, although
the IDN workaround is limited to Python 2.3. May raise a
UnicodeEncodeError if the URI reference is Unicode and erroneously
contains non-ASCII characters.
- MatchesUriRefSyntax(s)
-
This function returns true if the given string could be a URI reference,
as defined in RFC 3986, just based on the string's syntax.
A URI reference can be a URI or certain portions of one, including the
empty string, and it can have a fragment component.
- MatchesUriSyntax(s)
-
This function returns true if the given string could be a URI, as defined
in RFC 3986, just based on the string's syntax.
A URI is by definition absolute (begins with a scheme) and does not end
with a #fragment. It also must adhere to various other syntax rules.
- NormalizeCase(uriRef, doHost=False)
-
Returns the given URI reference with the case of the scheme,
percent-encoded octets, and, optionally, the host, all normalized,
implementing section 6.2.2.1 of RFC 3986. The normal form of
scheme and host is lowercase, and the normal form of
percent-encoded octets is uppercase.
The URI reference can be given as either a string or as a sequence as
would be provided by the SplitUriRef function. The return value will
be a string or tuple.
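A minimal sketch of the case normalization described above, written against Python 3's urllib.parse rather than this module. The function name is hypothetical, and the sketch accepts strings only. Note that urlsplit()/urlunsplit() may drop an empty query or fragment delimiter, so this is an approximation and not a faithful reimplementation of NormalizeCase().

```python
import re
from urllib.parse import urlsplit, urlunsplit

_PCT = re.compile(r"%[0-9A-Fa-f]{2}")

def normalize_case(uri_ref, do_host=False):
    """Lowercase the scheme (and optionally the host); uppercase %xx hex digits."""
    parts = urlsplit(uri_ref)  # urlsplit() already lowercases the scheme
    netloc = parts.netloc
    if do_host and netloc:
        # Leave any userinfo untouched; lowercase only the host[:port] part.
        userinfo, at, hostport = netloc.rpartition("@")
        netloc = userinfo + at + hostport.lower()
    result = urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
    return _PCT.sub(lambda m: m.group(0).upper(), result)

print(normalize_case("HTTP://User@Example.COM/a%2fb"))
print(normalize_case("HTTP://User@Example.COM/a%2fb", do_host=True))
```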
- NormalizePathSegments(path)
-
Given a string representing the path component of a URI reference having a
hierarchical scheme, returns the string with dot segments ('.' and '..')
removed, implementing section 6.2.2.3 of RFC 3986. If the path is
relative, it is returned with no changes.
- NormalizePathSegmentsInUri(uri)
-
Given a string representing a URI or URI reference having a hierarchical
scheme, returns the string with dot segments ('.' and '..') removed from
the path component, implementing section 6.2.2.3 of RFC 3986. If the
path is relative, the URI or URI reference is returned with no changes.
- NormalizePercentEncoding(s)
-
Given a string representing a URI reference or a component thereof,
returns the string with all percent-encoded octets that correspond to
unreserved characters decoded, implementing section 6.2.2.2 of RFC
3986.
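The normalization in RFC 3986 section 6.2.2.2 can be sketched as follows: a percent-encoded octet is decoded only when it corresponds to an unreserved character (letters, digits, '-', '.', '_', '~'), and is left alone otherwise. The function name is hypothetical and the code is an illustration, not this module's implementation.

```python
import re
import string

UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

def normalize_percent_encoding(s):
    """Decode %xx sequences that stand for unreserved characters only."""
    def decode_if_unreserved(match):
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else match.group(0)
    return re.sub(r"%([0-9A-Fa-f]{2})", decode_if_unreserved, s)

print(normalize_percent_encoding("%7Euser/%41%2Fb"))  # ~user/A%2Fb
```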
- OsPathToUri(path, attemptAbsolute=True, osname=None)
-
This function converts an OS-specific file system path to a URI of
the form 'file:///path/to/the/file'.
In addition, if the path is absolute, any dot segments ('.' or '..') will
be collapsed, so that the resulting URI can be safely used as a base URI
by functions such as Absolutize().
The given path will be interpreted as being one that is appropriate for
use on the local operating system, unless a different osname argument is
given.
If the given path is relative, an attempt may be made to first convert
the path to absolute form by interpreting the path as being relative
to the current working directory. This is the case if the attemptAbsolute
flag is True (the default). If attemptAbsolute is False, a relative
path will result in a URI of the form file:relative/path/to/a/file .
attemptAbsolute has no effect if the given path is not for the
local operating system.
On Windows, the drivespec will become the first step in the path component
of the URI. If the given path contains a UNC hostname, this name will be
used for the authority component of the URI.
Warning: Some libraries, such as urllib.urlopen(), may not behave as
expected when given a URI generated by this function. On Windows you may
want to call re.sub('(/[A-Za-z]):', r'\1|', uri) on the URI to prepare it
for use by functions such as urllib.url2pathname() or urllib.urlopen().
This function is similar to urllib.pathname2url(), but is more featureful
and produces better URIs.
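The substitution quoted in the warning can be tried in isolation. The wrapper function name below is hypothetical; the regular expression is the one given in the warning.

```python
import re

def drivespec_colon_to_pipe(uri):
    """Rewrite '/<letter>:' as '/<letter>|', as suggested for legacy urllib calls."""
    return re.sub('(/[A-Za-z]):', r'\1|', uri)

print(drivespec_colon_to_pipe('file:///C:/x/y/z'))  # file:///C|/x/y/z
print(drivespec_colon_to_pipe('file:///x/y/z'))     # unchanged
```

Because re.sub() replaces every match, a later path segment consisting of a single letter followed by ':' would be rewritten as well; passing count=1 to re.sub() is one way to limit the change to the first occurrence.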
- PathResolve(paths)
-
This function takes a list of file URIs. The first can be
absolute or relative to the URI equivalent of the current working
directory. The rest must be relative to the first.
The function converts them all to OS paths appropriate for the local
system, and then creates a single final path by resolving each path
in the list against the following one. This final path is returned
as a URI.
- PercentDecode(s, encoding='utf-8', decodable=None)
-
[*** Experimental API ***] Reverses the percent-encoding of the given
string.
This function is similar to urllib.unquote(), but can also process a
Unicode string, not just a regular byte string.
By default, all percent-encoded sequences are decoded, but if a byte
string is given via the 'decodable' argument, only the sequences
corresponding to those octets will be decoded.
If the string is Unicode, the percent-encoded sequences are converted to
bytes, then converted back to Unicode according to the encoding given in
the encoding argument. For example, by default, u'abc%E2%80%A2' will be
converted to u'abc\u2022', because byte sequence E2 80 A2 represents
character U+2022 in UTF-8.
If the string is not Unicode, the percent-encoded octets are just
converted to bytes, and the encoding argument is ignored. For example,
'abc%E2%80%A2' will be converted to 'abc\xe2\x80\xa2'.
This function is intended for use on the portions of a URI that are
delimited by reserved characters (see PercentEncode), or on a value from
data of media type application/x-www-form-urlencoded.
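The distinction between decoding to characters and decoding to raw bytes can be illustrated with Python 3's standard library, where it is split across two functions. This is only an analogy for the Unicode and byte-string behaviors described above; it does not use PercentDecode() and does not cover the 'decodable' argument.

```python
from urllib.parse import unquote, unquote_to_bytes

# Text result: the octets E2 80 A2 are decoded as UTF-8 into U+2022.
as_text = unquote('abc%E2%80%A2', encoding='utf-8')

# Byte result: the octets are kept as-is, with no character decoding.
as_bytes = unquote_to_bytes('abc%E2%80%A2')

print(ascii(as_text))  # 'abc\u2022'
print(as_bytes)        # b'abc\xe2\x80\xa2'
```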
- PercentEncode(s, encoding='utf-8', encodeReserved=True, spaceToPlus=False, nlChars=None, reservedChars="/=&+?#;@,:$!*[]()'")
-
[*** Experimental API ***] This function applies percent-encoding, as
described in RFC 3986 sec. 2.1, to the given string, in order to prepare
the string for use in a URI. It replaces characters that are not allowed
in a URI. By default, it also replaces characters in the reserved set,
which normally includes the generic URI component delimiters ":" "/"
"?" "#" "[" "]" "@" and the subcomponent delimiters "!" "$" "&" "'" "("
")" "*" "+" "," ";" "=".
Ideally, this function should be used on individual components or
subcomponents of a URI prior to assembly of the complete URI, not
afterward, because this function has no way of knowing which characters
in the reserved set are being used for their reserved purpose and which
are part of the data. By default it assumes that they are all being used
as data, thus they all become percent-encoded.
The characters in the reserved set can be overridden from the default by
setting the reservedChars argument. The percent-encoding of characters
in the reserved set can be disabled by unsetting the encodeReserved flag.
Do this if the string is an already-assembled URI or a URI component,
such as a complete path.
If the given string is Unicode, the name of the encoding given in the
encoding argument will be used to determine the percent-encoded octets
for characters that are not in the U+0000 to U+007F range. The codec
identified by the encoding argument must return a byte string.
If the given string is not Unicode, the encoding argument is ignored and
the string is interpreted to represent literal octets, rather than
characters. Octets above \x7F will be percent-encoded as-is, e.g., \xa0
becomes %A0, not, say, %C2%A0.
The spaceToPlus flag controls whether space characters are changed to
"+" characters in the result, rather than being percent-encoded.
Generally, this is not required, and given the status of "+" as a
reserved character, is often undesirable. But it is required in certain
situations, such as when generating application/x-www-form-urlencoded
content or RFC 3151 public identifier URNs, so it is supported here.
The nlChars argument, if given, is a sequence type in which each member
is a substring that indicates a "new line". Occurrences of this substring
will be replaced by '%0D%0A' in the result, as is required when generating
application/x-www-form-urlencoded content.
This function is similar to urllib.quote(), but is more conformant and
Unicode-friendly. Suggestions for improvements welcome.
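Several of the behaviors described above have rough counterparts in Python 3's urllib.parse.quote(), which makes for a compact illustration. The mapping is approximate and is an assumption of this example: 'safe' plays the role of disabling encoding for selected reserved characters, and quote_plus() corresponds to spaceToPlus=True.

```python
from urllib.parse import quote, quote_plus

# Reserved characters treated as data: everything outside the unreserved set is encoded.
print(quote("a b/c", safe=""))        # a%20b%2Fc

# An already-assembled path: keep '/' as a delimiter.
print(quote("a b/c", safe="/"))       # a%20b/c

# Form-style encoding: spaces become '+', and a literal '&' is encoded.
print(quote_plus("a b&c"))            # a+b%26c

# Text is encoded via the named codec; bytes are encoded octet by octet.
print(quote("\xa0", safe="", encoding="utf-8"))  # %C2%A0
print(quote(b"\xa0", safe=""))                   # %A0
```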
- PublicIdToUrn(publicid)
-
Converts a public identifier to a URN that conforms to RFC 3151.
- Relativize(targetUri, againstUri, subPathOnly=False)
-
This function returns a relative URI that is consistent with `targetUri`
when resolved against `againstUri`. If no such relative URI exists, for
whatever reason, this function returns `None`.
To be precise, if a string called `rel` exists such that
``Absolutize(rel, againstUri) == targetUri``, then `rel` is returned by
this function. In these cases, `Relativize` is in a sense the inverse
of `Absolutize`. In all other cases, `Relativize` returns `None`.
The following idiom may be useful for obtaining compliant relative
reference strings (e.g. for `path`) for use in other methods of this
package::
    path = Relativize(OsPathToUri(path), OsPathToUri('.'))
If `subPathOnly` is `True`, then this function will only return a relative
reference if such a reference exists relative to the last hierarchical
segment of `againstUri`. In particular, this relative reference will
not start with '/' or '../'.
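The defining property (resolving the result against `againstUri` gives back `targetUri`) suggests a simple, conservative sketch: compute a candidate from the two paths, then verify it by resolution. The code below uses Python 3's posixpath and urllib.parse.urljoin() as stand-ins, and the function name is hypothetical. Unlike Relativize(), it may return None in cases where a valid relative reference does exist (for example, targets ending in '/'), and it does not implement `subPathOnly`.

```python
import posixpath
from urllib.parse import urljoin, urlsplit

def relativize_sketch(target_uri, against_uri):
    """Return a relative reference that resolves to target_uri, or None."""
    target, against = urlsplit(target_uri), urlsplit(against_uri)
    if (target.scheme, target.netloc) != (against.scheme, against.netloc):
        return None
    if not (target.path.startswith("/") and against.path.startswith("/")):
        return None
    # Relative references resolve against the "directory" of the base path.
    base_dir = posixpath.dirname(against.path)
    rel = posixpath.relpath(target.path, base_dir)
    if target.query:
        rel += "?" + target.query
    if target.fragment:
        rel += "#" + target.fragment
    # Keep the candidate only if it round-trips through resolution.
    return rel if urljoin(against_uri, rel) == target_uri else None

print(relativize_sketch("http://a/b/c/g", "http://a/b/c/d"))  # g
print(relativize_sketch("http://a/x/y", "http://a/b/c/d"))    # ../../x/y
```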
- RemoveDotSegments(path)
-
Supports Absolutize() by implementing the remove_dot_segments function
described in RFC 3986 sec. 5.2.4. It collapses most of the '.' and '..'
segments out of a path without eliminating empty segments. It is intended
to be used during the path merging process and may not give expected
results when used independently. Use NormalizePathSegments() or
NormalizePathSegmentsInUri() if more general normalization is desired.
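For reference, the remove_dot_segments routine of RFC 3986 sec. 5.2.4 can be sketched as below. This follows the steps in the RFC and is not necessarily identical to this module's code; the function name mirrors the RFC's pseudocode.

```python
def remove_dot_segments(path):
    """Collapse '.' and '..' segments, preserving empty segments."""
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]
            if output:
                output.pop()
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading '/', if any) to the output.
            start = 1 if path.startswith("/") else 0
            end = path.find("/", start)
            if end == -1:
                end = len(path)
            output.append(path[:end])
            path = path[end:]
    return "".join(output)

print(remove_dot_segments("/a/b/c/./../../g"))    # /a/g
print(remove_dot_segments("mid/content=5/../6"))  # mid/6
```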
- SplitAuthority(authority)
-
Given a string representing the authority component of a URI, returns
a tuple consisting of the subcomponents (userinfo, host, port). No
percent-decoding is performed.
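A rough sketch of such a split, assuming the authority has the RFC 3986 shape [userinfo "@"] host [":" port]. The function name is hypothetical, and returning None for an absent userinfo or port is an assumption of this example, not documented behavior of SplitAuthority().

```python
def split_authority(authority):
    """Split an authority string into (userinfo, host, port); no decoding is done."""
    userinfo, at, hostport = authority.rpartition("@")
    if not at:
        userinfo = None
    if hostport.startswith("[") and "]" in hostport:
        # IP literal such as [::1]; colons inside the brackets are not port separators.
        end = hostport.index("]") + 1
        host, rest = hostport[:end], hostport[end:]
    else:
        host, colon, rest = hostport.partition(":")
        rest = colon + rest
    port = rest[1:] if rest.startswith(":") else None
    return userinfo, host, port

print(split_authority("user:pw@example.org:8080"))
print(split_authority("[::1]:80"))
```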
- SplitFragment(uri)
-
Given a URI or URI reference, returns a tuple consisting of
(base, fragment), where base is the portion before the '#' that
precedes the fragment component.
- SplitUriRef(uriref)
-
Given a valid URI reference as a string, returns a tuple representing the
generic URI components, as per RFC 3986 appendix B. The tuple's structure
is (scheme, authority, path, query, fragment).
All values will be strings (possibly empty) or None if undefined.
Note that per RFC 3986, there is no distinction between a path and
an "opaque part", as there was in RFC 2396.
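The regular expression from RFC 3986 appendix B is short enough to show in full. The sketch below uses it to produce the five-tuple described above, together with the matching recomposition step from RFC 3986 sec. 5.3. The function names are hypothetical and the code is illustrative only.

```python
import re

# Regular expression from RFC 3986 appendix B.
_URI_REF = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?", re.DOTALL)

def split_uri_ref(uri_ref):
    """Return (scheme, authority, path, query, fragment); absent parts are None."""
    m = _URI_REF.match(uri_ref)
    return m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)

def unsplit_uri_ref(parts):
    """Recompose the five components into a URI reference string."""
    scheme, authority, path, query, fragment = parts
    result = ""
    if scheme is not None:
        result += scheme + ":"
    if authority is not None:
        result += "//" + authority
    result += path
    if query is not None:
        result += "?" + query
    if fragment is not None:
        result += "#" + fragment
    return result

print(split_uri_ref("http://www.ics.uci.edu/pub/ietf/uri/#Related"))
```

Note how 'file:///x' yields an empty-string authority while '?q' yields None for it: the difference between an empty component and an undefined one is what allows the split to be reversed exactly.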
- StripFragment(uriRef)
-
Returns the given URI or URI reference with the fragment component, if
any, removed.
- UnsplitUriRef(uriRefSeq)
-
Given a sequence as would be produced by SplitUriRef(), assembles and
returns a URI reference as a string.
- UriToOsPath(uri, attemptAbsolute=True, encoding='utf-8', osname=None)
-
This function converts a URI reference to an OS-specific file system path.
If the URI reference is given as a Unicode string, then the encoding
argument determines how percent-encoded components are interpreted, and
the result will be a Unicode string. If the URI reference is a regular
byte string, the encoding argument is ignored and the result will be a
byte string in which percent-encoded octets have been converted to the
bytes they represent. For example, the trailing path segment of
u'file:///a/b/%E2%80%A2' will by default be converted to u'\u2022',
because sequence E2 80 A2 represents character U+2022 in UTF-8. If the
string were not Unicode, the trailing segment would become the 3-byte
string '\xe2\x80\xa2'.
The osname argument determines for what operating system the resulting
path is appropriate. It defaults to os.name and is typically the value
'posix' on Unix systems (including Mac OS X and Cygwin), and 'nt' on
Windows NT/2000/XP.
This function is similar to urllib.url2pathname(), but is more featureful
and produces better paths.
If the given URI reference is not relative, its scheme component must be
'file', and an exception will be raised if it isn't.
In accordance with RFC 3986, RFC 1738 and RFC 1630, an authority
component that is the string 'localhost' will be treated the same as an
empty authority.
Dot segments ('.' or '..') in the path component are NOT collapsed.
If the path component of the URI reference is relative and the
attemptAbsolute flag is True (the default), then the resulting path
will be made absolute by considering the path to be relative to the
current working directory. There is no guarantee that such a result
will be an accurate interpretation of the URI reference.
attemptAbsolute has no effect if the
result is not being produced for the local operating system.
Fragment and query components of the URI reference are ignored.
If osname is 'posix', the authority component must be empty or just
'localhost'. An exception will be raised otherwise, because there is no
standard way of interpreting other authorities. Also, if '%2F' is in a
path segment, it will be converted to r'\/' (a backslash-escaped forward
slash). The caller may need to take additional steps to prevent this from
being interpreted as if it were a path segment separator.
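The 'posix' rules can be approximated in a few lines with Python 3's urllib.parse. This sketch is only an illustration under simplifying assumptions: the function name is hypothetical, ValueError stands in for whatever exception the module raises, and '%2F' is simply decoded to '/' instead of being backslash-escaped as described above.

```python
from urllib.parse import unquote, urlsplit

def file_uri_to_posix_path(uri):
    """Convert a file URI reference to a POSIX path (query and fragment ignored)."""
    parts = urlsplit(uri)
    if parts.scheme and parts.scheme != "file":
        raise ValueError("only 'file' URIs can be converted: %r" % uri)
    if parts.netloc not in ("", "localhost"):
        raise ValueError("unsupported authority: %r" % parts.netloc)
    # Dot segments are left alone; percent-encoded octets are decoded as UTF-8.
    return unquote(parts.path, encoding="utf-8")

print(file_uri_to_posix_path("file://localhost/x/y/z"))  # /x/y/z
```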
If osname is 'nt', a drivespec is recognized as the first occurrence of a
single letter (A-Z, case-insensitive) followed by '|' or ':', occurring as
either the first segment of the path component, or (incorrectly) as the
entire authority component. A UNC hostname is recognized as a non-empty,
non-'localhost' authority component that has not been recognized as a
drivespec, or as the second path segment if the first path segment is
empty. If a UNC hostname is detected, the result will begin with
'\\<hostname>\'. If a drivespec was detected also, the first path segment
will be '$<driveletter>$'. If a drivespec was detected but a UNC hostname
was not, then the result will begin with '<driveletter>:'.
Windows examples:
'file:x/y/z' => r'x\y\z';
'file:/x/y/z' (not recommended) => r'\x\y\z';
'file:///x/y/z' => r'\x\y\z';
'file:///c:/x/y/z' => r'C:\x\y\z';
'file:///c|/x/y/z' => r'C:\x\y\z';
'file:///c:/x:/y/z' => r'C:\x:\y\z' (bad path, valid interpretation);
'file://c:/x/y/z' (not recommended) => r'C:\x\y\z';
'file://host/share/x/y/z' => r'\\host\share\x\y\z';
'file:////host/share/x/y/z' => r'\\host\share\x\y\z';
'file://host/x:/y/z' => r'\\host\x:\y\z' (bad path, valid interp.);
'file://localhost/x/y/z' => r'\x\y\z';
'file://localhost/c:/x/y/z' => r'C:\x\y\z';
'file:///C:%5Cx%5Cy%5Cz' (not recommended) => r'C:\x\y\z'
- UrlOpen(url, *args, **kwargs)
-
A replacement/wrapper for urllib2.urlopen().
Simply calls MakeUrllibSafe() on the given URL and passes the result
and all other args to urllib2.urlopen().
- UrnToPublicId(urn)
-
Converts a URN that conforms to RFC 3151 to a public identifier.
For example, the URN
"urn:publicid:%2B:IDN+example.org:DTD+XML+Bookmarks+1.0:EN:XML"
will be converted to the public identifier
"+//IDN example.org//DTD XML Bookmarks 1.0//EN//XML"
Raises a UriException if the given URN cannot be converted.
Query and fragment components, if present, are ignored.
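The transcription rules of RFC 3151 are simple enough to sketch in both directions. The function names below are hypothetical, ValueError stands in for UriException, only uppercase hex escapes are recognized, and query and fragment components are not handled, so this is an approximation of PublicIdToUrn() and UrnToPublicId() and not their implementation.

```python
_PREFIX = "urn:publicid:"
# Characters escaped before the '//' and '::' delimiters are transcribed.
_ESCAPES = (("%", "%25"), (";", "%3B"), ("+", "%2B"),
            ("'", "%27"), ("?", "%3F"), ("#", "%23"))
# Escapes undone after the delimiters are restored; '%25' must come last.
_UNESCAPES = (("%2B", "+"), ("%3A", ":"), ("%2F", "/"), ("%3B", ";"),
              ("%27", "'"), ("%3F", "?"), ("%23", "#"), ("%25", "%"))

def public_id_to_urn(public_id):
    text = " ".join(public_id.split())  # normalize runs of whitespace
    for char, escaped in _ESCAPES:
        text = text.replace(char, escaped)
    text = text.replace("::", ";").replace(":", "%3A")
    text = text.replace("//", ":").replace("/", "%2F")
    return _PREFIX + text.replace(" ", "+")

def urn_to_public_id(urn):
    if not urn.lower().startswith(_PREFIX):
        raise ValueError("not a publicid URN: %r" % urn)
    text = urn[len(_PREFIX):]
    text = text.replace(":", "//").replace(";", "::").replace("+", " ")
    for escaped, char in _UNESCAPES:
        text = text.replace(escaped, char)
    return text

urn = "urn:publicid:%2B:IDN+example.org:DTD+XML+Bookmarks+1.0:EN:XML"
print(urn_to_public_id(urn))
```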