Consider referencing RFC 9839 to restrict problematic Unicode characters in RDF 1.2 literals and IRIs #177

@domel

While RDF 1.2 already builds on Unicode and related IRI specifications, it currently does not explicitly restrict the use of so-called problematic Unicode characters. These include, among others:

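  • Surrogates (code points U+D800 to U+DFFF, which are reserved for UTF-16 and must not appear, paired or unpaired, in UTF-8),
  • Legacy control codes (e.g., C0/C1 controls other than tab, line feed, and carriage return, which have no clear semantics in RDF literals),
  • Noncharacters (e.g., U+FDD0 to U+FDEF and the last two code points of each plane, such as U+FFFE, U+FFFF, and U+7FFFF, which Unicode permanently reserves for internal use and does not intend for interchange).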

The recently published RFC 9839 provides a clear and concise framework for identifying these problematic classes and offers subsets of Unicode suitable for use in text fields of protocols and data formats.
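
To make the three classes concrete, here is a minimal Python sketch that classifies a code point. The function names (`classify_code_point`, `find_problematic`) are hypothetical, and the ranges are my reading of the Unicode definitions of these classes, with tab, line feed, and carriage return assumed to be the only controls treated as acceptable. It is an illustration, not a transcription of the RFC 9839 grammar.

```python
def classify_code_point(cp: int) -> str:
    """Return 'surrogate', 'legacy-control', 'noncharacter', or 'ok'."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp:#x}")
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"
    # Assumption: tab, LF, and CR are the controls considered useful.
    if cp in (0x09, 0x0A, 0x0D):
        return "ok"
    # Remaining C0 controls, DEL, and the C1 controls.
    if cp <= 0x1F or 0x7F <= cp <= 0x9F:
        return "legacy-control"
    # U+FDD0..U+FDEF plus the last two code points of every plane.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"
    return "ok"


def find_problematic(text: str) -> list:
    """List (index, 'U+XXXX', class) for every problematic character."""
    hits = []
    for index, ch in enumerate(text):
        kind = classify_code_point(ord(ch))
        if kind != "ok":
            hits.append((index, f"U+{ord(ch):04X}", kind))
    return hits
```

A caller could run `find_problematic` over a literal's lexical form and report the offending positions before deciding what to do with them.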

Why this matters for RDF

  1. Literals: RDF literals may currently contain any Unicode string, but if problematic characters are present, different parsers or libraries may handle them inconsistently. For example, ordering, regex filtering, or equality comparisons in SPARQL may yield results that differ from one implementation to another.

  2. IRIs: While IRIs already exclude some code points, RFC 9839 could serve as a consistent additional reference for implementers to avoid interoperability issues with surrogate code points or noncharacters.

  3. Serializations: RDF serializations such as Turtle, N-Triples, and RDF/XML inherit restrictions from their host grammars (e.g., XML partially excludes certain controls, Turtle allows escaping). However, these rules are uneven. Aligning them with RFC 9839 would provide a coherent policy across serializations and improve robustness of RDF processing.
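
As an illustration of how escaping can let a problematic code point through, the sketch below expands `\uXXXX` and `\UXXXXXXXX` numeric escapes of the kind used in Turtle and N-Triples and refuses any escape that denotes a surrogate. The helper name `decode_uchar_escapes` is hypothetical, and the sketch deliberately ignores other escapes (such as `\\` or `\n`), so it is not a complete string unescaper.

```python
import re

# Matches \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
_UCHAR = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_uchar_escapes(lexical: str) -> str:
    # Expand numeric escapes, rejecting surrogates and out-of-range values.
    def expand(match):
        cp = int(match.group(1) or match.group(2), 16)
        if cp > 0x10FFFF:
            raise ValueError(f"{match.group(0)} is beyond U+10FFFF")
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"{match.group(0)} denotes a surrogate code point")
        return chr(cp)

    return _UCHAR.sub(expand, lexical)
```

Without such a check the outcome depends on the host language. In Python, for instance, `chr(0xD800)` is an acceptable `str`, but `chr(0xD800).encode("utf-8")` raises `UnicodeEncodeError`, so the failure only surfaces when the data is written out.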

Proposal

  • Reference RFC 9839 in RDF 1.2 (possibly in the section on Literals and IRIs) as a recommended baseline for excluding problematic Unicode characters.
  • Clarify how this applies across serializations (Turtle, N-Triples, RDF/XML), so that implementers consistently reject or sanitize problematic code points.
  • Provide non-normative guidance that RDF data should conform to one of the RFC 9839 subsets (e.g., Scalars) to ensure interoperability.
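
The "reject or sanitize" choice could look roughly like the following sketch. Everything here is an assumption made for illustration: the names `is_problematic` and `check_literal`, the `policy` parameter, and the use of U+FFFD as the replacement character. The set of excluded code points is meant to approximate the strictest of the RFC 9839 subsets and should be checked against the RFC before being relied on.

```python
REPLACEMENT = "\uFFFD"  # assumed replacement character for sanitizing


def is_problematic(ch: str) -> bool:
    cp = ord(ch)
    return (
        0xD800 <= cp <= 0xDFFF                   # surrogates
        or (cp <= 0x1F and ch not in "\t\n\r")   # C0 controls except tab, LF, CR
        or 0x7F <= cp <= 0x9F                    # DEL and C1 controls
        or 0xFDD0 <= cp <= 0xFDEF                # noncharacter block
        or (cp & 0xFFFE) == 0xFFFE               # last two code points of a plane
    )


def check_literal(lexical: str, policy: str = "reject") -> str:
    """Return the lexical form unchanged, sanitized, or raise ValueError."""
    if policy not in ("reject", "sanitize"):
        raise ValueError(f"unknown policy: {policy!r}")
    bad = [i for i, ch in enumerate(lexical) if is_problematic(ch)]
    if not bad:
        return lexical
    if policy == "reject":
        raise ValueError(f"problematic code point at index {bad[0]}")
    return "".join(REPLACEMENT if is_problematic(ch) else ch for ch in lexical)
```

Rejecting keeps the data unchanged and pushes the error back to the producer, while sanitizing silently alters the lexical form, so whichever policy is chosen would need to be the same across serializations for equality comparisons to stay consistent.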

Benefits

  • Increased interoperability between RDF implementations,
  • Simplified parser behavior and clearer error handling,
  • Prevention of subtle bugs due to inconsistent Unicode handling,
  • Alignment of RDF 1.2 with current IETF best practices for Unicode in protocols and data formats.

Labels: i18n-tracker (Group bringing to attention of Internationalization, or tracked by i18n but not needing response.)
