diff --git a/doc/library-detail.adoc b/doc/library-detail.adoc
new file mode 100644
index 000000000..870fa0486
--- /dev/null
+++ b/doc/library-detail.adoc
@@ -0,0 +1,22 @@
+Boost.URL is a portable C++ library which provides containers and algorithms for handling URLs as described by https://tools.ietf.org/html/rfc3986[RFC 3986].
+It supports parsing, inspection, modification, normalization, and resolution of URLs, with interfaces designed for network programs that need to process URLs efficiently and securely from untrusted sources.
+
+[source,cpp]
+----
+url_view uv("https://www.example.com/path/to/file.txt?id=1001&name=John%20Doe");
+
+for (auto v : uv.params())
+    std::cout << v.key << "=" << v.value << "\n";
+// id=1001
+// name=John Doe
+
+url u = uv;
+u.set_scheme("http")
+ .set_encoded_host("boost.org")
+ .set_encoded_path("/index.htm")
+ .remove_query()
+ .params().append({"key", "value"});
+
+std::cout << u;
+// http://boost.org/index.htm?key=value
+----
diff --git a/doc/modules/ROOT/nav.adoc b/doc/modules/ROOT/nav.adoc
index 956f359fc..21a5ce818 100644
--- a/doc/modules/ROOT/nav.adoc
+++ b/doc/modules/ROOT/nav.adoc
@@ -23,5 +23,6 @@
 ** xref:examples/file-router.adoc[]
 ** xref:examples/router.adoc[]
 ** xref:examples/sanitize.adoc[]
+* xref:design.adoc[]
 * xref:reference.adoc[Reference]
 * xref:HelpCard.adoc[]
diff --git a/doc/modules/ROOT/pages/design.adoc b/doc/modules/ROOT/pages/design.adoc
new file mode 100644
index 000000000..0d85c9f07
--- /dev/null
+++ b/doc/modules/ROOT/pages/design.adoc
@@ -0,0 +1,76 @@
+//
+// Copyright (c) 2023 Alan de Freitas (alandefreitas@gmail.com)
+//
+// Distributed under the Boost Software License, Version 1.0. (See accompanying
+// file LICENSE_1_0.txt or copy at https://www.boost.org/LICENSE_1_0.txt)
+//
+// Official repository: https://github.com/boostorg/url
+//
+
+= Design Rationale
+:navtitle: Design Rationale
+
+This section documents the rationale behind design decisions in Boost.URL that are not obvious from the API alone.
+For a general overview of the library's goals and features, see the xref:index.adoc[introduction].
+
+== Character Type
+
+Boost.URL uses `char` as its character type.
+The library does not provide class templates parameterized on character type (e.g. `basic_url_view`).
+
+URLs are sequences of ASCII octets as defined by https://tools.ietf.org/html/rfc3986[RFC 3986,window=blank_].
+In practice, URLs are always handled as `char` strings: in HTTP headers, in JSON, in configuration files, and in every major programming language's URL library.
+Wide character types (`wchar_t`, `char16_t`, `char32_t`) are not used for URLs in any real-world context, so supporting them would add complexity with no practical benefit.
+
+This also means the library does not provide a `char8_t` (C++20) instantiation.
+While `char8_t` is portably correct for ASCII/UTF-8 text, its adoption in the C++ ecosystem remains limited: the standard library does not fully support it for I/O or formatting, and no major framework has adopted it in public APIs.
+Using `char` means Boost.URL interoperates directly with `std::string`, `std::string_view`, string literals, and the rest of the ecosystem without conversion.
+
+=== EBCDIC
+
+The C++ standard does not require that `char` use an ASCII-compatible encoding.
+On EBCDIC platforms (primarily IBM z/OS), the character literal `'/'` does not have the value `0x2F`, so a URL parser that compares `char` values against ASCII constants would malfunction.
+
+In practice, this is not a concern for Boost.URL:
+
+* z/OS is the only remaining platform where EBCDIC is relevant for C++ compilation.
+* The z/OS C++ compilers support an ASCII compilation mode (`-qascii` or `-fzos-le-char-mode=ascii`) that makes `char` literals use ASCII values. This mode exists specifically for open-source software that assumes ASCII.
+* Real-world C++ libraries that handle URLs and HTTP on z/OS (such as cpp-httplib and DuckDB) use this ASCII mode rather than adding EBCDIC transcoding.
+* The z/OS REST and web services ecosystem is almost entirely Java-based. No evidence exists of C++ code parsing RFC 3986 URIs in EBCDIC `char` encoding.
+* WG21 is moving in this direction as well: P3688 (ASCII character utilities) proposes `char`-based functions that treat input as ASCII regardless of literal encoding.
+
+On EBCDIC platforms where ASCII mode is not used, `char8_t` provides a portably correct alternative since it is guaranteed to use UTF-8 (an ASCII superset).
+A future extension to support `char8_t` constructor overloads on the concrete `char`-based types could address this without requiring templates, since both `char` and `char8_t` are single-byte types and the conversion between them is trivial for ASCII content.
+
+== No Dynamic Allocation by Default
+
+The library is designed so that most operations do not require dynamic memory allocation.
+
+cpp:url_view[] does not retain ownership of the underlying string buffer and does not allocate memory.
+Like a cpp:string_view[], it references the original string directly.
+As long as the contents of the original string are unmodified, constructed URL views always contain a valid URL in its correctly serialized form.
+
+Accessor functions return views referring to substrings and sub-ranges of the underlying URL.
+By referencing the relevant portion of the URL string internally, components can represent percent-decoded strings and be converted to other types without allocation.
+cpp:decode_view[] and its decoding functions perform no memory allocations unless the result needs to be stored in another container.
+Objects can be recycled to reuse their memory, deferring allocations until the application actually needs them.
+
+This makes the library suitable for performance-sensitive network programs and embedded devices.
+
+== Error Handling
+
+The library uses error codes rather than exceptions as its primary error reporting mechanism.
+If input does not match the URL grammar, an error code is reported through cpp:result[] rather than throwing.
+This allows the library to be used in environments that disable exceptions (`-fno-exceptions`), which is detected automatically.
+
+== URL Validity Invariant
+
+All modifications to a cpp:url[] leave it in a valid state.
+It is not possible for a cpp:url[] to hold syntactically illegal text.
+All modifying functions perform validation on their input: attempting to set the scheme or port to an invalid string results in an exception, while other components are automatically percent-encoded as needed.
+All non-const operations offer the strong exception safety guarantee.
+
+== No IRIs
+
+The library does not handle https://www.rfc-editor.org/rfc/rfc3987.html[Internationalized Resource Identifiers,window=blank_] (IRIs).
+IRIs are different from URLs: they come from Unicode strings instead of low-ASCII strings and are covered by a separate specification.
diff --git a/doc/modules/ROOT/pages/index.adoc b/doc/modules/ROOT/pages/index.adoc
index 62f8025ef..dc0a9a33e 100644
--- a/doc/modules/ROOT/pages/index.adoc
+++ b/doc/modules/ROOT/pages/index.adoc
@@ -28,6 +28,7 @@ While the library is general purpose, special care has been taken to ensure that
 Interfaces are provided for using error codes instead of exceptions as needed, and most algorithms have the means to opt out of dynamic memory allocation.
 Another feature of the library is that all modifications leave the URL in a valid state.
 Code which uses this library is easy to read, flexible, and performant.
+See the xref:design.adoc[design rationale] for more on these design principles.
 
 Boost.URL offers these features:
 
@@ -42,7 +43,7 @@ Boost.URL offers these features:
 
 [NOTE]
 ====
-Currently the library does not handle
+The library does not handle
 https://www.rfc-editor.org/rfc/rfc3987.html[Internationalized Resource Identifiers,window=blank_] (IRIs).
 These are different from URLs, come from Unicode strings instead of low-ASCII strings, and are covered by a separate specification.
 ====
diff --git a/doc/modules/ROOT/pages/quicklook.adoc b/doc/modules/ROOT/pages/quicklook.adoc
index 98fd04dbc..f26cfdb0d 100644
--- a/doc/modules/ROOT/pages/quicklook.adoc
+++ b/doc/modules/ROOT/pages/quicklook.adoc
@@ -234,8 +234,8 @@ id=42&name=John Doe Jingleheimer-Schmidt
 --
 ====
 
-cpp:decode_view[] and its decoding functions are designed to perform no memory allocations unless the algorithm where it's being used needs the result to be in another container.
-The design also permits recycling objects to reuse their memory, and at least minimize the number of allocations by deferring them until the result is in fact needed by the application.
+cpp:decode_view[] and its decoding functions perform no memory allocations unless the result needs to be stored in another container.
+Objects can be recycled to reuse their memory, deferring allocations until the application actually needs them.
 In the example above, the memory owned by `str` can be reused to store other results.
 This is also useful when manipulating URLs:
 