Description
While RDF 1.2 already builds on Unicode and related IRI specifications, it currently does not explicitly restrict the use of so-called problematic Unicode characters. These include, among others:
- Surrogates (e.g., unpaired UTF-16 surrogates, which must not appear in UTF-8),
- Legacy control codes (e.g., C0/C1 controls, which have no clear semantics in RDF literals),
- Noncharacters (e.g., U+7FFFF and others that Unicode designates as permanently invalid).
The recently published RFC 9839 provides a clear and concise framework for identifying these problematic classes and offers subsets of Unicode suitable for use in text fields of protocols and data formats.
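As a rough illustration of the three classes listed above, the following Python sketch labels individual code points. The function names `classify` and `find_problematic` are hypothetical, and the ranges reflect one reading of RFC 9839 (C0 controls other than tab, LF, and CR; DEL; C1 controls; U+FDD0..U+FDEF and the last two code points of every plane), so they should be checked against the RFC itself before being relied on.

```python
from typing import List, Optional, Tuple


def classify(cp: int) -> Optional[str]:
    """Return the problematic class of a code point, or None if it is in no class.

    Ranges are an assumption based on a reading of RFC 9839, not a normative copy.
    """
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"
    # C0 controls, except the commonly useful tab, line feed, and carriage return.
    if cp < 0x20 and cp not in (0x09, 0x0A, 0x0D):
        return "legacy-control"
    # DEL (U+007F) and the C1 controls (U+0080..U+009F).
    if 0x7F <= cp <= 0x9F:
        return "legacy-control"
    # U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF in every plane.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"
    return None


def find_problematic(text: str) -> List[Tuple[int, str, str]]:
    """List (index, code point, class) for each problematic character in text."""
    found = []
    for index, ch in enumerate(text):
        kind = classify(ord(ch))
        if kind is not None:
            found.append((index, f"U+{ord(ch):04X}", kind))
    return found
```

For instance, `classify(0x7FFFF)` reports the noncharacter mentioned in the list, and `find_problematic` can be run over the lexical form of a literal to locate offending positions.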
Why this matters for RDF
- Literals: RDF literals may currently contain any Unicode string, but if problematic characters are present, different parsers or libraries may handle them inconsistently. For example, ordering, regex filtering, or equality comparisons in SPARQL may yield results that differ between implementations.
- IRIs: While IRIs already exclude some code points, RFC 9839 could serve as a consistent additional reference for implementers to avoid interoperability issues with surrogate code points or noncharacters.
- Serializations: RDF serializations such as Turtle, N-Triples, and RDF/XML inherit restrictions from their host grammars (e.g., XML partially excludes certain controls, Turtle allows escaping). However, these rules are uneven. Aligning them with RFC 9839 would provide a coherent policy across serializations and improve robustness of RDF processing.
Proposal
- Reference RFC 9839 in RDF 1.2 (possibly in the section on Literals and IRIs) as a recommended baseline for excluding problematic Unicode characters.
- Clarify how this applies across serializations (Turtle, N-Triples, RDF/XML), so that implementers consistently reject or sanitize problematic code points.
- Provide non-normative guidance that RDF data should conform to one of the RFC 9839 subsets (e.g., Scalars) to ensure interoperability.
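To make the "reject or sanitize" idea concrete, here is a minimal Python sketch of a conformance check against the RFC 9839 subsets. The names `SUBSETS` and `conforms` are hypothetical, and the range tables are transcribed from memory of the RFC's definitions (Unicode Scalars, XML Characters, Unicode Assignables); they are an assumption and should be verified against the RFC's ABNF before use.

```python
# Inclusive (low, high) code point ranges per subset.
# Assumption: transcribed from a reading of RFC 9839, not copied normatively.
_SCALARS = [(0x0000, 0xD7FF), (0xE000, 0x10FFFF)]

_XML_CHARS = [
    (0x09, 0x09), (0x0A, 0x0A), (0x0D, 0x0D),
    (0x20, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF),
]

_ASSIGNABLES = [
    (0x09, 0x09), (0x0A, 0x0A), (0x0D, 0x0D),
    (0x20, 0x7E), (0xA0, 0xD7FF),
    (0xE000, 0xFDCF), (0xFDF0, 0xFFFD),
]
# Planes 1..16, each without its last two code points (noncharacters).
_ASSIGNABLES += [(p * 0x10000, p * 0x10000 + 0xFFFD) for p in range(1, 17)]

SUBSETS = {
    "scalars": _SCALARS,
    "xml": _XML_CHARS,
    "assignables": _ASSIGNABLES,
}


def conforms(text: str, subset: str = "assignables") -> bool:
    """Return True if every code point of text lies inside the named subset."""
    ranges = SUBSETS[subset]
    return all(
        any(low <= ord(ch) <= high for low, high in ranges)
        for ch in text
    )
```

A parser could call `conforms(lexical_form, "scalars")` to reject lone surrogates while still accepting everything else, or use the stricter `"assignables"` subset when legacy controls and noncharacters should also be refused. Which subset to recommend is exactly the choice the proposal leaves open.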
Benefits
- Increased interoperability between RDF implementations,
- Simplified parser behavior and clearer error handling,
- Prevention of subtle bugs due to inconsistent Unicode handling,
- Alignment of RDF 1.2 with current IETF best practices for Unicode in protocols and data formats.