About this HTML issue
Unicode allows some characters to be represented in multiple ways. For example, the accented letter “é” can be stored as a single precomposed character (U+00E9) or as two separate code points: the base letter “e” (U+0065) followed by a combining acute accent (U+0301). While these look identical when rendered, they are fundamentally different at the byte level. Unicode Normalization Form C (NFC) is the canonical form that prefers the single precomposed representation whenever one exists.
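The difference between the two representations can be seen directly in a short Python sketch. It uses only the standard `unicodedata` module; the variable names are illustrative and not part of any validator.

```python
import unicodedata

precomposed = "\u00e9"   # "é" as a single code point (U+00E9)
decomposed = "e\u0301"   # "e" (U+0065) followed by combining acute accent (U+0301)

# The two strings render identically but are not equal.
same_before = precomposed == decomposed          # False

# Their UTF-8 encodings differ as well.
bytes_precomposed = precomposed.encode("utf-8")  # b'\xc3\xa9'
bytes_decomposed = decomposed.encode("utf-8")    # b'e\xcc\x81'

# NFC composes the pair into the single precomposed character.
same_after = unicodedata.normalize("NFC", decomposed) == precomposed  # True

print(same_before, bytes_precomposed, bytes_decomposed, same_after)
```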
The W3C Character Model for the World Wide Web recommends that text in HTML documents be in NFC, and validators flag text that is not. This matters for several reasons:
- String matching and search: Non-NFC text can cause failures when browsers or scripts try to match strings, compare attribute values, or process CSS selectors. Two visually identical strings in different normalization forms won’t match with simple byte comparison.
- Accessibility: Screen readers and assistive technologies may behave inconsistently when encountering decomposed character sequences.
- Interoperability: Different browsers, search engines, and tools may handle non-NFC text differently, leading to unpredictable behavior.
- Fragment identifiers and IDs: If an id attribute contains non-NFC characters, fragment links (#id) may fail to work correctly.
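As a rough illustration of the matching problem, the following Python sketch models an id stored in decomposed form and a fragment typed in precomposed form. The data and the `nfc` helper are hypothetical, and the sketch models a plain string comparison rather than the behavior of any particular browser.

```python
import unicodedata

# Hypothetical data: the id was saved decomposed, the link was typed precomposed.
page_ids = {"resume\u0301"}   # id="resumé" stored as "e" + U+0301
fragment = "resum\u00e9"      # href="#resumé" stored as U+00E9

def nfc(text: str) -> str:
    """Return text in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)

# A plain comparison fails even though both strings look the same.
naive_match = fragment in page_ids                               # False

# Normalizing both sides before comparing makes them match.
normalized_match = nfc(fragment) in {nfc(i) for i in page_ids}   # True

print(naive_match, normalized_match)
```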
This issue most commonly appears when text is copied from word processors, PDFs, or other applications that use decomposed Unicode forms (NFD), or when content is generated by software that doesn’t normalize its output.
How to Fix It
- Identify the affected text: The validator will point to the specific line containing non-NFC characters. The characters will often look normal visually, so you’ll need to inspect them at the code-point level.
- Convert to NFC: Use a text editor or command-line tool that supports Unicode normalization. Many programming languages provide built-in normalization functions.
- Prevent future issues: Configure your text editor or build pipeline to save files in NFC. When accepting user input, normalize it server-side before storing or embedding in HTML.
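The first step, locating the affected text, could be sketched in Python as follows. The function name `find_non_nfc_lines` and the sample markup are assumptions made for illustration. The sketch flags any line that changes when normalized to NFC.

```python
import unicodedata

def find_non_nfc_lines(text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs for lines that are not in NFC."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        # A line is in NFC exactly when normalizing it changes nothing.
        if unicodedata.normalize("NFC", line) != line:
            hits.append((number, line))
    return hits

# Hypothetical sample: line 1 is precomposed, line 2 is decomposed.
sample = "<p>Caf\u00e9</p>\n<p>Cafe\u0301</p>\n"
for number, line in find_non_nfc_lines(sample):
    print(f"line {number} is not in NFC: {line!a}")
```

Printing with `!a` escapes non-ASCII characters, which makes the combining accent visible instead of rendering it on top of the base letter.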
In Python, you can normalize a string:
import unicodedata
normalized = unicodedata.normalize('NFC', original_string)
In JavaScript (Node.js or browser):
const normalized = originalString.normalize('NFC');
On the command line (using uconv from ICU):
uconv -x NFC input.html > output.html
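Where uconv is not available, a whole file can be converted with a few lines of Python. This is a minimal sketch that assumes the file is UTF-8 encoded; the function name `normalize_file` is hypothetical. It reads and writes bytes so that line endings are preserved exactly.

```python
import unicodedata
from pathlib import Path

def normalize_file(source: Path, destination: Path) -> bool:
    """Write an NFC copy of source to destination.

    Returns True if normalization changed the text, False otherwise.
    Assumes the source file is encoded as UTF-8.
    """
    original = source.read_bytes().decode("utf-8")
    normalized = unicodedata.normalize("NFC", original)
    destination.write_bytes(normalized.encode("utf-8"))
    return normalized != original
```

For example, `normalize_file(Path("input.html"), Path("output.html"))` would mirror the uconv command above, and the return value could be used in a build step to report which files needed conversion.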
Examples
Incorrect (decomposed form — NFD)
In this example, the letter “é” is represented as two code points (e + combining acute accent), which triggers the validation warning. The source may look identical to the correct version, but the underlying bytes differ:
<!-- "é" here is stored as U+0065 U+0301 (decomposed) -->
<p>Résumé available upon request.</p>
Correct (precomposed form — NFC)
Here, the same text uses the single precomposed character é (U+00E9):
<!-- "é" here is stored as U+00E9 (precomposed) -->
<p>Résumé available upon request.</p>
Incorrect in attributes
Non-NFC text in attribute values also triggers this issue:
<!-- The id contains a decomposed character -->
<h2 id="resumé">Résumé</h2>
Correct in attributes
<!-- The id uses the precomposed NFC character -->
<h2 id="resumé">Résumé</h2>
While these examples look the same in rendered text, the difference is in how the characters are encoded. To verify your text is in NFC, you can paste it into a Unicode inspector tool or use the normalization functions mentioned above. For further reading, the W3C provides an excellent guide on Normalization in HTML and CSS.
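If no inspector tool is at hand, a small Python sketch can list the code points of a string and check whether it is in NFC. The helper name `describe` is an assumption, and `unicodedata.is_normalized` requires Python 3.8 or later.

```python
import unicodedata

def describe(text: str) -> list[str]:
    """List each code point in text as 'U+XXXX NAME'."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]

decomposed = "e\u0301"   # two code points
precomposed = "\u00e9"   # one code point

print(describe(decomposed))
# ['U+0065 LATIN SMALL LETTER E', 'U+0301 COMBINING ACUTE ACCENT']
print(describe(precomposed))
# ['U+00E9 LATIN SMALL LETTER E WITH ACUTE']

print(unicodedata.is_normalized("NFC", decomposed))   # False
print(unicodedata.is_normalized("NFC", precomposed))  # True
```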