HTML Guides for Unicode Normalization Form C
Learn how to identify and fix common HTML validation errors flagged by the W3C Validator, so your pages are standards-compliant and render correctly across every browser.
Unicode allows some characters to be represented in multiple ways. For example, the accented letter “é” can be stored as a single precomposed character (U+00E9) or as two separate code points: the base letter “e” (U+0065) followed by a combining acute accent (U+0301). While these look identical when rendered, they are different sequences of code points and therefore different bytes. Unicode Normalization Form C (NFC) decomposes text canonically and then recomposes it, so the single precomposed representation is used whenever one exists.
The HTML specification and the W3C Character Model for the World Wide Web recommend that all text in HTML documents be in NFC, and the validator warns about text that is not. This matters for several reasons:
- String matching and search: Non-NFC text can cause failures when browsers or scripts try to match strings, compare attribute values, or process CSS selectors. Two visually identical strings in different normalization forms won’t match with simple byte comparison.
- Accessibility: Screen readers and assistive technologies may behave inconsistently when encountering decomposed character sequences.
- Interoperability: Different browsers, search engines, and tools may handle non-NFC text differently, leading to unpredictable behavior.
- Fragment identifiers and IDs: If an id attribute contains non-NFC characters, fragment links (#id) may fail to work correctly.
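The matching problem described in the list above can be reproduced in a few lines. This is a minimal sketch using Python's standard unicodedata module; the variable names are illustrative only.

```python
import unicodedata

precomposed = "\u00e9"   # "é" as one code point (NFC)
decomposed = "e\u0301"   # "e" followed by a combining acute accent (NFD)

# The strings render the same but differ code point by code point.
print(precomposed == decomposed)          # False
print(len(precomposed), len(decomposed))  # 1 2

# Normalizing both sides to NFC makes the comparison succeed.
same = (unicodedata.normalize("NFC", precomposed)
        == unicodedata.normalize("NFC", decomposed))
print(same)                               # True
```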
This issue most commonly appears when text is copied from word processors, PDFs, or other applications that use decomposed Unicode forms (NFD), or when content is generated by software that doesn’t normalize its output.
How to Fix It
- Identify the affected text: The validator will point to the specific line containing non-NFC characters. The characters will often look normal visually, so you’ll need to inspect them at the code-point level.
- Convert to NFC: Use a text editor or command-line tool that supports Unicode normalization. Many programming languages provide built-in normalization functions.
- Prevent future issues: Configure your text editor or build pipeline to save files in NFC. When accepting user input, normalize it server-side before storing or embedding in HTML.
In Python, you can normalize a string:
import unicodedata
normalized = unicodedata.normalize('NFC', original_string)
In JavaScript (Node.js or browser):
const normalized = originalString.normalize('NFC');
On the command line (using uconv from ICU):
uconv -x NFC input.html > output.html
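For the "identify" step, a short script can report which lines of a document are not in NFC. The sketch below is one possible approach, not the validator's own method; find_non_nfc_lines is a hypothetical helper name, and the sample markup is invented for illustration.

```python
import unicodedata

def find_non_nfc_lines(text):
    """Return (line_number, line) pairs for lines that are not in NFC."""
    hits = []
    # Split on "\n" only, so line numbers match what most editors show.
    for number, line in enumerate(text.split("\n"), start=1):
        if unicodedata.normalize("NFC", line) != line:
            hits.append((number, line))
    return hits

# Line 1 uses precomposed U+00E9; line 2 uses "e" + U+0301.
sample = "<p>Caf\u00e9</p>\n<p>Cafe\u0301</p>\n"
for number, line in find_non_nfc_lines(sample):
    print(f"line {number}: {line!r}")
```

Only the second line of the sample is reported, because the first already uses the precomposed character.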
Examples
Incorrect (decomposed form — NFD)
In this example, the letter “é” is represented as two code points (e + combining acute accent), which triggers the validation warning. The source may look identical to the correct version, but the underlying bytes differ:
<!-- "é" here is stored as U+0065 U+0301 (decomposed) -->
<p>Résumé available upon request.</p>
Correct (precomposed form — NFC)
Here, the same text uses the single precomposed character é (U+00E9):
<!-- "é" here is stored as U+00E9 (precomposed) -->
<p>Résumé available upon request.</p>
Incorrect in attributes
Non-NFC text in attribute values also triggers this issue:
<!-- The id contains a decomposed character -->
<h2 id="resumé">Résumé</h2>
Correct in attributes
<!-- The id uses the precomposed NFC character -->
<h2 id="resumé">Résumé</h2>
While these examples look the same in rendered text, the difference is in how the characters are encoded. To verify your text is in NFC, you can paste it into a Unicode inspector tool or use the normalization functions mentioned above. For further reading, the W3C provides an excellent guide on Normalization in HTML and CSS.
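If no Unicode inspector is at hand, a rough equivalent can be sketched in Python. The helper name describe_code_points is an assumption made for this example; unicodedata.name is part of the standard library.

```python
import unicodedata

def describe_code_points(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
    ]

# A decomposed "é" shows up as two entries instead of one.
for entry in describe_code_points("e\u0301"):
    print(entry)
# U+0065 LATIN SMALL LETTER E
# U+0301 COMBINING ACUTE ACCENT
```

Any entry whose name starts with COMBINING is a hint that the text may be decomposed.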
Unicode allows certain characters, especially accented letters and other composed characters, to be represented in multiple ways. For example, the letter “é” can be a single precomposed character (U+00E9, NFC form) or a base letter “e” (U+0065) followed by a combining acute accent (U+0301, NFD form). While they may look identical on screen, they are different byte sequences. NFC is the recommended form for attribute values, because it ensures consistent behavior across browsers, search engines, and assistive technologies.
This matters for several important reasons:
- String matching and comparison: Browsers and scripts may compare attribute values byte-by-byte. An id value in NFD form won’t match a CSS selector or fragment identifier targeting the NFC form, causing broken links and broken styles.
- Accessibility: Screen readers and other assistive technologies may process NFC and NFD strings differently, potentially mispronouncing text or failing to match ARIA references.
- Interoperability: Different operating systems produce different normalization forms by default (macOS file systems historically use NFD, for example). Copying text from various sources can introduce non-NFC characters without any visual indication.
- Standards compliance: The WHATWG HTML specification and W3C guidance on normalization explicitly recommend NFC for all HTML content.
The issue most commonly appears when attribute values contain accented characters (like in id, class, alt, title, or value attributes) that were copied from a source using NFD normalization, or when files are created on systems that default to NFD.
To fix the problem, you need to convert the affected attribute values to NFC. You can do this by:
- Retyping the characters directly in your editor, which usually produces NFC by default.
- Using a programming tool such as Python’s unicodedata.normalize('NFC', text), JavaScript’s text.normalize('NFC'), or similar utilities in your language of choice.
- Using a text editor that supports normalization conversion (some editors have built-in Unicode normalization features or plugins).
- Running a batch conversion on your HTML files before deployment as part of your build process.
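The batch conversion in the last option could look roughly like the following. This is a sketch under stated assumptions: the files are UTF-8 encoded, normalize_html_files is a hypothetical helper name, and the whole file (not just attribute values) is normalized.

```python
import pathlib
import unicodedata

def normalize_html_files(root):
    """Rewrite every .html file under root in NFC; return the paths changed."""
    changed = []
    for path in sorted(pathlib.Path(root).rglob("*.html")):
        # Work on bytes so existing line endings are left untouched.
        original = path.read_bytes().decode("utf-8")  # assumes UTF-8 files
        normalized = unicodedata.normalize("NFC", original)
        if normalized != original:
            path.write_bytes(normalized.encode("utf-8"))
            changed.append(path)
    return changed
```

Calling normalize_html_files on a build output directory before deployment would rewrite only the files that actually contain non-NFC text and return their paths for logging.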
Examples
Incorrect: Attribute value uses NFD (decomposed form)
In this example, the id attribute value for “résumé” uses decomposed characters (base letter + combining accent), which triggers the validation warning. The decomposition is invisible in source code but present at the byte level.
<!-- The "é" here is stored as "e" + combining acute accent (NFD) -->
<div id="résumé">
<p>My résumé content</p>
</div>
Correct: Attribute value uses NFC (precomposed form)
Here, the id attribute value uses precomposed characters, which is the correct NFC form.
<!-- The "é" here is stored as a single precomposed character (NFC) -->
<div id="résumé">
<p>My résumé content</p>
</div>
While these two examples look identical in source view, they differ at the byte level. You can verify the normalization form using browser developer tools or a hex editor.
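The byte-level difference can also be shown without a hex editor. The following is a small sketch that prints the UTF-8 bytes of each form; it assumes Python 3.8 or later for the separator argument of bytes.hex.

```python
nfc = "\u00e9"    # precomposed "é"
nfd = "e\u0301"   # "e" followed by a combining acute accent

# UTF-8 encodes the precomposed form in two bytes, the decomposed form in three.
print(nfc.encode("utf-8").hex(" "))   # c3 a9
print(nfd.encode("utf-8").hex(" "))   # 65 cc 81
```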
Checking and fixing with JavaScript
You can programmatically normalize attribute values:
<script>
// Compare a string with its NFC-normalized copy
const text = "résumé";
const nfcText = text.normalize("NFC"); // the fixed value to use instead
console.log(text === nfcText); // true if text was already NFC, false if it was NFD
</script>
Checking and fixing with Python
import unicodedata
text = "re\u0301sume\u0301"  # NFD form: each "e" is followed by U+0301
normalized = unicodedata.normalize('NFC', text)
print(text == normalized)  # False: the original was not in NFC
print(normalized)  # Outputs NFC form: "résumé"
If you encounter this validation warning, inspect the flagged attribute value carefully and ensure all characters are in their precomposed NFC form. Adding a normalization step to your build pipeline is a reliable way to prevent this issue from recurring.
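A build pipeline could also check attribute values directly rather than waiting for the validator. The sketch below uses Python's standard html.parser module; the class name NonNfcAttributeFinder and the sample markup are assumptions made for illustration, and this is not how the validator itself works.

```python
import unicodedata
from html.parser import HTMLParser

class NonNfcAttributeFinder(HTMLParser):
    """Collect (line, tag, attribute) for attribute values not in NFC."""

    def __init__(self):
        super().__init__()
        self.findings = []

    def handle_starttag(self, tag, attrs):
        line, _column = self.getpos()
        for name, value in attrs:
            # value is None for attributes written without a value.
            if value is not None and unicodedata.normalize("NFC", value) != value:
                self.findings.append((line, tag, name))

finder = NonNfcAttributeFinder()
# The id is decomposed (NFD); the class value is already precomposed (NFC).
finder.feed('<div id="re\u0301sume\u0301">\n<p class="caf\u00e9">text</p>\n</div>')
finder.close()
print(finder.findings)   # [(1, 'div', 'id')]
```

A build script could fail when findings is not empty, so that decomposed attribute values never reach production.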