Consider referencing RFC 9839 to restrict problematic Unicode characters in RDF 1.2 literals and IRIs

While RDF 1.2 already builds on Unicode and the related IRI specifications, it does not explicitly restrict the use of so-called *problematic Unicode characters*. These include:  

- **Surrogates** (`U+D800` to `U+DFFF`; these are not Unicode scalar values and cannot appear in well-formed UTF-8),  
- **Legacy control codes** (e.g., most C0 controls and the C1 controls, which have no clear semantics in RDF literals),  
- **Noncharacters** (e.g., `U+FFFE`, `U+FFFF`, `U+7FFFF`, and others that Unicode permanently reserves for internal use rather than open interchange).  

The recently published [RFC 9839](https://www.rfc-editor.org/rfc/rfc9839.html) provides a clear and concise framework for identifying these problematic classes and offers subsets of Unicode suitable for use in text fields of protocols and data formats.  
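
To make the three classes concrete, here is a minimal Python sketch of a classifier. The function name `problematic_class` is hypothetical, and the ranges reflect my reading of RFC 9839 (in particular, that tab, LF, and CR are not counted as legacy controls while DEL is); please check them against the RFC before relying on them.

```python
from typing import Optional


def problematic_class(cp: int) -> Optional[str]:
    """Return the problematic class of code point `cp`, or None if it is in none.

    Ranges are an assumption based on a reading of RFC 9839.
    """
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp:#x}")
    # Surrogates: reserved for UTF-16 encoding, never valid scalar values.
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"
    # Legacy controls: C0 controls other than tab, LF, and CR ...
    if cp < 0x20 and cp not in (0x09, 0x0A, 0x0D):
        return "legacy control"
    # ... plus DEL and the C1 controls.
    if 0x7F <= cp <= 0x9F:
        return "legacy control"
    # Noncharacters: U+FDD0..U+FDEF and the last two code points of each plane.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"
    return None
```

For instance, `problematic_class(0x7FFFF)` returns `"noncharacter"`, while `problematic_class(ord("é"))` returns `None`.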

### Why this matters for RDF  
1. **Literals:** RDF literals may currently contain any Unicode string, but if problematic characters are present, different parsers or libraries may handle them inconsistently. For example, ordering, regex filtering, or equality comparisons in SPARQL may yield implementation-dependent results.  

2. **IRIs:** While IRIs already exclude some code points, RFC 9839 could serve as a consistent additional reference for implementers to avoid interoperability issues with surrogate code points or noncharacters.  

3. **Serializations:** RDF serializations such as Turtle, N-Triples, and RDF/XML inherit restrictions from their host grammars (e.g., XML partially excludes certain controls, Turtle allows escaping). However, these rules are uneven. Aligning them with RFC 9839 would provide a coherent policy across serializations and improve robustness of RDF processing.  
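
As one illustration of how uneven the host grammars are, the sketch below encodes the `Char` production of XML 1.0, which is what an RDF/XML document using XML 1.0 inherits. The function name `is_xml10_char` is hypothetical. Note that this production rejects most C0 controls but still accepts C1 controls and the noncharacters outside the Basic Multilingual Plane.

```python
def is_xml10_char(cp: int) -> bool:
    """True if `cp` matches the XML 1.0 `Char` production:
    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    """
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# A C0 control is excluded, but a C1 control and a plane-1 noncharacter pass.
for cp in (0x01, 0x85, 0x1FFFE):
    print(f"U+{cp:04X}: {is_xml10_char(cp)}")
```

So a literal containing `U+0085` or `U+1FFFE` can be carried by RDF/XML even though both code points fall into the problematic classes listed above.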

### Proposal  
- Reference RFC 9839 in RDF 1.2 (possibly in the section on Literals and IRIs) as a **recommended baseline** for excluding problematic Unicode characters.  
- Clarify how this applies across serializations (Turtle, N-Triples, RDF/XML), so that implementers consistently reject or sanitize problematic code points.  
- Provide non-normative guidance that RDF data should conform to one of the RFC 9839 subsets (e.g., *Unicode Scalars*, or the stricter *Unicode Assignables*) to ensure interoperability.  
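
To sketch what "reject or sanitize" could look like for a literal's lexical form, here is a small Python example. Everything in it is an assumption for illustration only: the helper names `is_problematic`, `reject`, and `sanitize` are hypothetical, the ranges follow my reading of RFC 9839, and replacing with `U+FFFD` is just one possible sanitizing policy, not something RDF 1.2 prescribes.

```python
REPLACEMENT = "\uFFFD"  # assumed sanitizing policy: substitute U+FFFD


def is_problematic(ch: str) -> bool:
    """True if `ch` is a surrogate, legacy control, or noncharacter
    (ranges assumed from a reading of RFC 9839)."""
    cp = ord(ch)
    return (
        0xD800 <= cp <= 0xDFFF                  # surrogates
        or (cp < 0x20 and ch not in "\t\n\r")   # C0 controls except tab, LF, CR
        or 0x7F <= cp <= 0x9F                   # DEL and C1 controls
        or 0xFDD0 <= cp <= 0xFDEF               # noncharacter block
        or (cp & 0xFFFE) == 0xFFFE              # last two code points of each plane
    )


def reject(text: str) -> str:
    """Strict policy: raise on the first problematic code point."""
    for i, ch in enumerate(text):
        if is_problematic(ch):
            raise ValueError(
                f"problematic code point U+{ord(ch):04X} at index {i}"
            )
    return text


def sanitize(text: str) -> str:
    """Lenient policy: replace each problematic code point with U+FFFD."""
    return "".join(REPLACEMENT if is_problematic(ch) else ch for ch in text)
```

A parser could apply `reject` when reading untrusted input and `sanitize` when repairing legacy data; which of the two the specification should recommend is exactly the kind of question this issue is meant to raise.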

### Benefits  
- Increased interoperability between RDF implementations,  
- Simplified parser behavior and clearer error handling,  
- Prevention of subtle bugs due to inconsistent Unicode handling,  
- Alignment of RDF 1.2 with current IETF best practices for Unicode in protocols and data formats.
