HTML Encoding Guides
Learn how to identify and fix common HTML validation errors flagged by the W3C Validator — so your pages are standards-compliant and render correctly across every browser. Also check our Accessibility Guides.
When the validator parses your HTML (especially in XHTML mode or when serialized as XML), every element name must conform to the XML 1.0 naming rules. These rules require that element names begin with a letter (a–z, A–Z) or an underscore (_), followed by any combination of letters, digits, hyphens (-), underscores, periods (.), or combining characters. Characters like spaces, angle brackets, slashes, or other special symbols within a tag name make it unrepresentable in XML 1.0.
This error most commonly occurs due to:
- Typos in tag names — accidentally inserting a space, extra character, or symbol into a tag name.
- Malformed closing tags — forgetting the slash or placing characters incorrectly in a closing tag.
- Template syntax errors — template engine placeholders leaking into the final HTML output.
- Copy-paste issues — invisible or non-ASCII characters sneaking into tag names from rich-text editors.
This matters because browsers may not parse malformed tags as intended, leading to broken layouts or missing content. Screen readers and assistive technologies rely on well-formed markup to interpret page structure. Additionally, any system that processes your HTML as XML (such as RSS feed generators, EPUB renderers, or XHTML-serving environments) will reject documents with invalid element names entirely.
How to Fix
- Inspect the flagged line — look carefully at the element name the validator is complaining about. Check for stray characters, spaces, or symbols.
- Correct any typos — replace the malformed tag with the correct HTML element name.
- Validate template output — if you use a templating engine, ensure the rendered HTML doesn’t contain unprocessed template tokens inside tag names.
- Check for invisible characters — paste the tag name into a plain-text editor or use a hex viewer to spot hidden characters.
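These checks can be scripted. Here is a minimal Python sketch; the regex is a simplified, ASCII-only approximation of the XML 1.0 Name production, and a real scan would first skip comments and doctypes:

```python
import re

# Simplified ASCII approximation of the XML 1.0 Name production:
# a letter or underscore, then letters, digits, hyphens, underscores, periods.
NAME_RE = re.compile(r'^[A-Za-z_][A-Za-z0-9._-]*$')

TAG_RE = re.compile(r'</?\s*([^\s>/]+)')

def invalid_tag_names(html):
    """Yield tag names that are not valid XML 1.0 element names.
    Rough sketch: comments and doctypes would need to be skipped first."""
    for match in TAG_RE.finditer(html):
        name = match.group(1)
        if not NAME_RE.match(name):
            yield name

print(list(invalid_tag_names('<s#ection>About</s#ection><{{tag}}>x</{{tag}}>')))
# ['s#ection', 's#ection', '{{tag}}', '{{tag}}']
```

Running this over template output before deployment is a cheap way to catch unresolved placeholders like `{{tagName}}` early.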
Examples
Typo with a space in the tag name
A space inside the tag name creates an invalid element name:
<!-- Wrong: space in the element name -->
<di v class="container">
<p>Hello world</p>
</di v>
Fix by removing the accidental space:
<!-- Correct -->
<div class="container">
<p>Hello world</p>
</div>
Special character in a tag name
An accidental special character makes the name unrepresentable in XML 1.0:
<!-- Wrong: stray hash character in the tag name -->
<s#ection>
<h2>About</h2>
</s#ection>
Fix by using the correct element name:
<!-- Correct -->
<section>
<h2>About</h2>
</section>
Malformed closing tag
A missing or misplaced slash can produce a garbled tag name:
<!-- Wrong: slash is in the wrong place -->
<p>Some text<p/>
Fix with a properly formed closing tag:
<!-- Correct -->
<p>Some text</p>
Template placeholder leaking into output
Unprocessed template syntax can produce invalid element names in the rendered HTML:
<!-- Wrong: unresolved template variable in element name -->
<{{tagName}}>Content</{{tagName}}>
Ensure your template engine resolves the variable before serving the HTML. The rendered output should be:
<!-- Correct: after template processing -->
<article>Content</article>
The HTML specification explicitly forbids certain Unicode code points from appearing anywhere in an HTML document. These include most ASCII control characters (such as U+0000 NULL, U+0008 BACKSPACE, or U+000B VERTICAL TAB), as well as Unicode noncharacters like U+FFFE, U+FFFF, and the range U+FDD0 to U+FDEF. When the W3C validator encounters one of these code points, it reports the error “Forbidden code point” followed by the specific value.
These characters are forbidden because they have no defined meaning in HTML and can cause unpredictable behavior across browsers and platforms. Some may be silently dropped, others may produce rendering glitches, and some could interfere with parsing. Screen readers and other assistive technologies may also behave erratically when encountering these characters, making this an accessibility concern as well.
How forbidden characters get into your code
- Copy-pasting from external sources like word processors, PDFs, or databases that embed invisible control characters.
- Faulty text editors or build tools that introduce stray bytes during file processing.
- Incorrect character encoding where byte sequences are misinterpreted, resulting in forbidden code points.
- Programmatic content generation where strings aren’t properly sanitized before being inserted into HTML.
How to fix it
- Identify the character and its location. The validator message includes the code point (e.g., U+000B) and the line number. Use a text editor that can show invisible characters (such as VS Code with the “Render Whitespace” or “Render Control Characters” setting enabled, or a hex editor).
- Remove or replace the character. In most cases, the forbidden character serves no purpose and can simply be deleted. If it was standing in for a space or line break, replace it with the appropriate standard character.
- Sanitize content at the source. If your HTML is generated dynamically, strip forbidden code points from strings before outputting them. In JavaScript, you can use a regular expression to remove them.
// Remove common forbidden code points
text = text.replace(/[\x00-\x08\x0B\x0E-\x1F\x7F\uFDD0-\uFDEF\uFFFE\uFFFF]/g, '');
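If your pipeline is in Python rather than JavaScript, the same cleanup might look like this (the character class mirrors the regex above):

```python
import re

# Strip C0 controls (keeping the whitespace characters TAB, LF, FF, CR),
# DELETE, the U+FDD0-U+FDEF noncharacters, and U+FFFE/U+FFFF.
FORBIDDEN = re.compile('[\x00-\x08\x0b\x0e-\x1f\x7f\ufdd0-\ufdef\ufffe\uffff]')

def strip_forbidden(text: str) -> str:
    return FORBIDDEN.sub('', text)

print(strip_forbidden('Hello\x0bWorld'))  # HelloWorld
```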
Examples
Incorrect — contains a forbidden control character
In this example, a vertical tab character (U+000B) is embedded between “Hello” and “World.” It is invisible in most editors but the validator will flag it.
<!-- The ␋ below represents U+000B VERTICAL TAB, an invisible forbidden character -->
<p>Hello␋World</p>
Correct — forbidden character removed
<p>Hello World</p>
Incorrect — NULL character in an attribute value
A U+0000 NULL character may appear inside an attribute, often from programmatic output.
<!-- The attribute value contains a U+0000 NULL byte -->
<div title="Some�Text">Content</div>
Correct — NULL character removed from attribute
<div title="SomeText">Content</div>
Allowed control characters
Not all control characters are forbidden. The following whitespace characters are explicitly permitted in HTML:
- U+0009 — Horizontal tab (regular tab character)
- U+000A — Line feed (newline)
- U+000C — Form feed
- U+000D — Carriage return
<pre>Line one
Line two with a tab</pre>
This is valid because it uses only standard whitespace characters (U+000A for the newline and U+0009 for the tab).
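The allowed/forbidden distinction can be expressed as a small Python predicate. This is a sketch covering only the code points discussed in this guide, not the HTML spec's full list (note that the spec counts U+000C FORM FEED as permitted whitespace):

```python
def is_allowed_in_html(ch: str) -> bool:
    """Rough check against the code points this guide discusses."""
    cp = ord(ch)
    if cp in (0x09, 0x0A, 0x0C, 0x0D):   # TAB, LF, FF, CR: permitted whitespace
        return True
    if cp < 0x20 or cp == 0x7F:          # other C0 controls and DELETE
        return False
    if 0xFDD0 <= cp <= 0xFDEF or cp in (0xFFFE, 0xFFFF):  # noncharacters
        return False
    return True

print(is_allowed_in_html('\t'), is_allowed_in_html('\x0b'))  # True False
```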
A < character appearing where an attribute name is expected typically means a closing > is missing on the previous tag, causing the browser to interpret the next tag as an attribute.
This error occurs when you forget to close an HTML element’s opening tag with >. The validator sees the < of the next element and thinks it’s still parsing attributes of the previous element. It’s a common typo that can cascade into multiple confusing errors.
For example, if you write <div without the closing >, the following <p> tag gets parsed as if it were an attribute of the div, triggering this error.
HTML Examples
❌ Incorrect
<div class="container"
<p>Hello, world!</p>
</div>
The <div> tag is missing its closing > after "container", so the validator sees <p> as part of the div’s attribute list.
✅ Correct
<div class="container">
<p>Hello, world!</p>
</div>
Make sure every opening tag is properly closed with >. If the error points to a specific line, check the tag immediately before that line for a missing >.
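A rough way to hunt for this mistake across a file is to look for a start tag that runs into another < before its closing >. A heuristic Python sketch, not a real parser (attribute values that legitimately contain a literal < would be false positives):

```python
import re

# A start tag in which another '<' appears before the closing '>'.
OPEN_TAG_UNCLOSED = re.compile(r'<[A-Za-z][^<>]*<')

def find_unclosed_tags(html):
    return [m.group(0) for m in OPEN_TAG_UNCLOSED.finditer(html)]

broken = '<div class="container"\n  <p>Hello, world!</p>\n</div>'
print(find_unclosed_tags(broken))  # one hit: the <div ...> that never closed
```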
When a browser or validator reads your HTML document, it looks at the <meta charset="..."> declaration to determine how to decode the bytes in the file. Every character encoding maps bytes to characters differently. UTF-8 and Windows-1252 share the same mappings for basic ASCII characters (letters A–Z, digits, common punctuation), but they diverge for bytes in the 0x80–0x9F range. Windows-1252 uses these bytes for characters like €, “, ”, —, and ™, while UTF-8 treats them as invalid or interprets them as parts of multi-byte sequences. When the declared encoding doesn’t match the actual encoding, the validator raises this error, and browsers may render characters incorrectly.
This is a problem for several reasons:
- Broken text display: Characters like curly quotes (“ ”), em dashes (—), and accented letters (é, ñ) can appear as mojibake — sequences like â€œ or Ã© — confusing your readers.
- Standards compliance: The HTML specification requires that the declared encoding match the actual byte encoding of the file. A mismatch is a conformance error.
- Accessibility: Screen readers and other assistive technologies rely on correct character interpretation. Garbled text is unintelligible to these tools.
- Search engines: Encoding mismatches can cause search engines to index corrupted text, hurting your content’s discoverability.
How to fix it
The best approach is to re-save your file in UTF-8 encoding. Most modern text editors and IDEs support this:
- VS Code: Click the encoding indicator in the bottom status bar (it may say “Windows 1252”), select “Save with Encoding,” and choose “UTF-8.”
- Sublime Text: Go to File → Save with Encoding → UTF-8.
- Notepad++: Go to Encoding → Convert to UTF-8, then save the file.
- Vim: Run :set fileencoding=utf-8 then :w.
After re-saving, make sure your <meta charset="utf-8"> declaration remains in the <head>. The <meta charset> tag should appear as early as possible — ideally as the first element inside <head> — because the browser needs to know the encoding before parsing any other content.
If your workflow or legacy system absolutely requires Windows-1252 encoding, you can change the declaration to <meta charset="windows-1252"> instead. However, this is strongly discouraged. UTF-8 is the universal standard for the web, supports virtually all characters from all languages, and is recommended by the WHATWG HTML specification.
Examples
Incorrect — encoding mismatch triggers the error
The file is saved in Windows-1252, but the meta tag declares UTF-8:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>My Page</title>
</head>
<body>
<!-- The byte 0x93 in Windows-1252 represents “ but is invalid in UTF-8 -->
<p>She said, “Hello!”</p>
</body>
</html>
This produces the validator error: Internal encoding declaration “utf-8” disagrees with the actual encoding of the document (“windows-1252”).
Correct — file saved as UTF-8 with matching declaration
Re-save the file in UTF-8 encoding. The meta tag and the file’s byte encoding now agree:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>My Page</title>
</head>
<body>
<p>She said, “Hello!”</p>
</body>
</html>
Alternative — declaration changed to match Windows-1252 file
If you cannot change the file encoding, update the charset declaration to match. This eliminates the mismatch error but is not the recommended approach:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="windows-1252">
<title>My Page</title>
</head>
<body>
<p>She said, “Hello!”</p>
</body>
</html>
Tips for preventing this issue
- Configure your editor to default to UTF-8 for all new files.
- If you copy text from Microsoft Word or other desktop applications, be aware that they often use Windows-1252 curly quotes and special characters. Pasting this text into a UTF-8 file is fine as long as your editor properly converts the characters to UTF-8 bytes when saving.
- Use <meta charset="utf-8"> as the very first element inside <head> so the encoding is established before the browser encounters any other content.
- If your server sends an HTTP Content-Type header with a charset parameter, make sure it also matches — for example, Content-Type: text/html; charset=utf-8.
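To confirm a file's bytes really are valid UTF-8 before (or instead of) trusting the editor's status bar, a small Python sketch using the standard codecs:

```python
def first_invalid_utf8(data: bytes):
    """Offset of the first byte that breaks UTF-8 decoding, or None."""
    try:
        data.decode('utf-8')
        return None
    except UnicodeDecodeError as err:
        return err.start

# A left curly quote saved in Windows-1252 is the single byte 0x93,
# which can never start a valid UTF-8 sequence.
cp1252 = 'She said, \u201cHello!\u201d'.encode('windows-1252')
print(first_invalid_utf8(cp1252))                   # 10
print(first_invalid_utf8('Héllo'.encode('utf-8')))  # None
```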
When a browser loads an HTML document, it needs to know which character encoding to use to correctly interpret the bytes in the file. The <meta> tag’s charset attribute (or the http-equiv="Content-Type" declaration) tells the browser what encoding to expect. If this declaration says windows-1251 but the file is actually saved as utf-8, the browser faces conflicting signals — the declared encoding disagrees with the actual byte content.
This mismatch matters for several reasons:
- Broken text rendering: Characters outside the basic ASCII range (such as accented letters, Cyrillic, CJK characters, emoji, and special symbols) may display as garbled or replacement characters (often seen as Ð sequences, �, or other mojibake).
- Data integrity: Form submissions and JavaScript string operations may produce corrupted data if the browser interprets the encoding incorrectly.
- Standards compliance: The WHATWG HTML Living Standard requires that the encoding declaration match the actual encoding of the document. Validators flag this mismatch as an error.
- Inconsistent behavior: Different browsers may handle the conflict differently — some may trust the <meta> tag, others may sniff the actual encoding — leading to unpredictable results across user agents.
How to fix it
- Determine the actual encoding of your file. Most modern text editors (VS Code, Sublime Text, Notepad++) display the file encoding in the status bar. If your file is saved as UTF-8 (which is the recommended encoding for all new web content), your <meta> tag must reflect that.
- Update the <meta> tag to declare utf-8 instead of windows-1251.
- Prefer the shorter charset syntax introduced in HTML5, which is simpler and equivalent to the older http-equiv form.
- Place the encoding declaration within the first 1024 bytes of the document, ideally as the first element inside <head>, so the browser encounters it before parsing other content.
Examples
❌ Incorrect: declared encoding doesn’t match actual file encoding
The file is saved as UTF-8 but the <meta> tag declares windows-1251:
<head>
<meta charset="windows-1251">
<title>My Page</title>
</head>
Or using the older http-equiv syntax:
<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1251">
<title>My Page</title>
</head>
✅ Correct: declared encoding matches the actual UTF-8 file
Using the modern HTML5 charset attribute:
<head>
<meta charset="utf-8">
<title>My Page</title>
</head>
Or using the equivalent http-equiv form:
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>My Page</title>
</head>
✅ Correct: full document example
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>My Page</title>
</head>
<body>
<p>Hello, world!</p>
</body>
</html>
What if you actually need windows-1251?
If your content genuinely requires windows-1251 encoding (for example, a legacy Cyrillic text file), you need to re-save the file in windows-1251 encoding using your text editor. However, UTF-8 is strongly recommended for all web content because it supports every Unicode character and is the default encoding for HTML5. Converting your file to UTF-8 and updating the <meta> tag accordingly is almost always the better path forward.
When a browser or validator reads your HTML file, it interprets the raw bytes according to a character encoding — most commonly UTF-8. Each encoding has rules about which byte sequences are valid. For example, in UTF-8, bytes above 0x7F must follow specific multi-byte patterns. If the validator encounters a byte or sequence of bytes that violates these rules, it reports a “malformed byte sequence” error because it literally cannot decode the bytes into meaningful characters.
This problem commonly arises in a few scenarios:
- Encoding mismatch: Your file is saved as Windows-1252 (or Latin-1, ISO-8859-1) but the document declares UTF-8, or vice versa. Characters like curly quotes (“ ”), em dashes (—), or accented letters (é, ñ) are encoded differently across these encodings, producing invalid byte sequences when interpreted under the wrong one.
- Copy-pasting from word processors: Content copied from Microsoft Word or similar applications often includes “smart quotes” and special characters encoded in Windows-1252, which can produce malformed bytes in a UTF-8 file.
- File corruption: The file was partially corrupted during transfer (e.g., FTP in the wrong mode) or by a tool that modified it without respecting its encoding.
- Mixed encodings: Parts of the file were written or appended using different encodings, resulting in some sections containing invalid byte sequences.
This is a serious problem because browsers may display garbled text (mojibake), skip characters entirely, or substitute replacement characters (�). It also breaks accessibility tools like screen readers, which may mispronounce or skip corrupted text. Search engines may index garbled content, harming your SEO.
How to Fix It
- Declare UTF-8 encoding in your HTML with <meta charset="utf-8"> as the first element inside <head>.
- Save your file as UTF-8 in your text editor. Most editors have an option like “Save with Encoding” or “File Encoding” in the status bar or save dialog. Choose “UTF-8” or “UTF-8 without BOM.”
- Re-encode the file if it was originally saved in a different encoding. Tools like iconv on the command line can convert between encodings:
iconv -f WINDOWS-1252 -t UTF-8 input.html -o output.html
- Replace problematic characters by re-typing them or using HTML character references if needed.
- Check your server configuration. If your server sends a Content-Type header with a charset that conflicts with the file’s actual encoding (e.g., Content-Type: text/html; charset=iso-8859-1 for a UTF-8 file), the validator will use the HTTP header’s encoding, causing mismatches.
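To pinpoint the byte the validator is complaining about, a small Python sketch that reports the line number and value of the first undecodable byte:

```python
def locate_malformed(data: bytes, encoding: str = 'utf-8'):
    """(line, hex byte value) of the first undecodable byte, or None."""
    try:
        data.decode(encoding)
        return None
    except UnicodeDecodeError as err:
        line = data[:err.start].count(b'\n') + 1
        return line, hex(data[err.start])

# 0xE9 is "é" in Windows-1252 but an invalid lone byte in UTF-8.
print(locate_malformed(b'<p>Resum\xe9</p>'))  # (1, '0xe9')
```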
Examples
Incorrect — Encoding mismatch
A file saved in Windows-1252 but declaring UTF-8. The byte 0xE9 represents é in Windows-1252 but is an invalid lone byte in UTF-8, triggering the malformed byte sequence error.
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>My Page</title>
</head>
<body>
<!-- If the file is saved as Windows-1252, the é below is byte 0xE9, -->
<!-- which is not a valid UTF-8 sequence -->
<p>Resumé</p>
</body>
</html>
Correct — File properly saved as UTF-8
The same document, but the file is actually saved in UTF-8 encoding. The character é is stored as the two-byte sequence 0xC3 0xA9, which is valid UTF-8.
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>My Page</title>
</head>
<body>
<p>Resumé</p>
</body>
</html>
Alternative — Using character references
If you can’t resolve the encoding issue immediately, you can use HTML character references to avoid non-ASCII bytes entirely:
<p>Resum&#233;</p>
Or using the named reference:
<p>Resum&eacute;</p>
Both render as “Resumé” regardless of file encoding, though this is a workaround — properly saving the file as UTF-8 is the preferred long-term solution.
The < and > characters have special meaning in HTML — they signal the start and end of tags. When the parser encounters </>, it sees what looks like a closing tag with no element name, which is invalid in HTML. This sequence can appear in your markup for two main reasons:
- Unescaped text content: You’re trying to display the literal characters </> as visible text on the page (common in tutorials, documentation, or code snippets), but the browser interprets them as markup rather than content.
- Mistyped end tag: You intended to write a proper closing tag like </p> or </div> but accidentally omitted the element name, leaving just </>.
This matters because browsers may silently discard the malformed tag or interpret it in unexpected ways, leading to broken layouts or missing content. Screen readers and other assistive technologies may also struggle with the resulting DOM structure. Properly escaping special characters and writing well-formed tags ensures consistent rendering across all browsers and devices.
To fix this, determine which scenario applies. If you want to display the literal text </>, replace < with < and > with >. If you meant to close an element, add the correct element name between </ and >.
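If you generate pages from Python, the standard library's html.escape performs exactly this substitution:

```python
from html import escape

# html.escape converts markup-significant characters to entity references
# (&, <, >, and by default quotes as well).
print(escape('In JSX, self-closing tags use the </> syntax.'))
# In JSX, self-closing tags use the &lt;/&gt; syntax.
```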
Examples
Unescaped angle brackets in text content
This triggers the error because the parser sees </> as an invalid closing tag:
<!-- ❌ Bad: raw </> in text content -->
<p>In JSX, self-closing tags use the </> syntax.</p>
Escape the angle brackets using HTML character entities:
<!-- ✅ Good: properly escaped characters -->
<p>In JSX, self-closing tags use the &lt;/&gt; syntax.</p>
Mistyped closing tag with missing element name
This triggers the error because the closing tag has no name:
<!-- ❌ Bad: empty closing tag -->
<div class="container">
<p>Some content here.</p>
</>
Add the correct element name to the closing tag:
<!-- ✅ Good: proper closing tag -->
<div class="container">
<p>Some content here.</p>
</div>
Displaying code snippets with angle brackets
When writing about HTML or XML in your page content, all angle brackets in text must be escaped:
<!-- ❌ Bad: unescaped tags in text -->
<p>Use <strong> to make text bold and </strong> to close it.</p>
<!-- ✅ Good: escaped tags in text -->
<p>Use &lt;strong&gt; to make text bold and &lt;/strong&gt; to close it.</p>
Using the <code> element for inline code
Even inside <code> elements, angle brackets must still be escaped — the <code> element only changes visual presentation, it does not prevent HTML parsing:
<!-- ❌ Bad: unescaped inside <code> -->
<p>A React fragment looks like <code><></code> and <code></></code>.</p>
<!-- ✅ Good: escaped inside <code> -->
<p>A React fragment looks like <code>&lt;&gt;</code> and <code>&lt;/&gt;</code>.</p>
Using <pre> blocks for larger code examples
The same escaping rules apply within <pre> elements:
<!-- ✅ Good: escaped characters inside pre -->
<pre><code>&lt;div&gt;
&lt;p&gt;Hello, world!&lt;/p&gt;
&lt;/div&gt;</code></pre>
If you frequently need to display code and find manual escaping tedious, consider using a JavaScript-based syntax highlighting library that handles escaping automatically, or use a build tool or templating engine that escapes HTML entities for you.
Unicode allows some characters to be represented in multiple ways. For example, the accented letter “é” can be stored as a single precomposed character (U+00E9) or as two separate code points: the base letter “e” (U+0065) followed by a combining acute accent (U+0301). While these look identical when rendered, they are fundamentally different at the byte level. Unicode Normalization Form C (NFC) is the canonical form that prefers the single precomposed representation whenever one exists.
The HTML specification and the W3C Character Model for the World Wide Web require that all text in HTML documents be in NFC. This matters for several reasons:
- String matching and search: Non-NFC text can cause failures when browsers or scripts try to match strings, compare attribute values, or process CSS selectors. Two visually identical strings in different normalization forms won’t match with simple byte comparison.
- Accessibility: Screen readers and assistive technologies may behave inconsistently when encountering decomposed character sequences.
- Interoperability: Different browsers, search engines, and tools may handle non-NFC text differently, leading to unpredictable behavior.
- Fragment identifiers and IDs: If an id attribute contains non-NFC characters, fragment links (#id) may fail to work correctly.
This issue most commonly appears when text is copied from word processors, PDFs, or other applications that use decomposed Unicode forms (NFD), or when content is generated by software that doesn’t normalize its output.
How to Fix It
- Identify the affected text: The validator will point to the specific line containing non-NFC characters. The characters will often look normal visually, so you’ll need to inspect them at the code-point level.
- Convert to NFC: Use a text editor or command-line tool that supports Unicode normalization. Many programming languages provide built-in normalization functions.
- Prevent future issues: Configure your text editor or build pipeline to save files in NFC. When accepting user input, normalize it server-side before storing or embedding in HTML.
In Python, you can normalize a string:
import unicodedata
normalized = unicodedata.normalize('NFC', original_string)
In JavaScript (Node.js or browser):
const normalized = originalString.normalize('NFC');
On the command line (using uconv from ICU):
uconv -x NFC input.html > output.html
Examples
Incorrect (decomposed form — NFD)
In this example, the letter “é” is represented as two code points (e + combining acute accent), which triggers the validation warning. The source may look identical to the correct version, but the underlying bytes differ:
<!-- "é" here is stored as U+0065 U+0301 (decomposed) -->
<p>Résumé available upon request.</p>
Correct (precomposed form — NFC)
Here, the same text uses the single precomposed character é (U+00E9):
<!-- "é" here is stored as U+00E9 (precomposed) -->
<p>Résumé available upon request.</p>
Incorrect in attributes
Non-NFC text in attribute values also triggers this issue:
<!-- The id contains a decomposed character -->
<h2 id="resumé">Résumé</h2>
Correct in attributes
<!-- The id uses the precomposed NFC character -->
<h2 id="resumé">Résumé</h2>
While these examples look the same in rendered text, the difference is in how the characters are encoded. To verify your text is in NFC, you can paste it into a Unicode inspector tool or use the normalization functions mentioned above. For further reading, the W3C provides an excellent guide on Normalization in HTML and CSS.
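You can inspect a string's normalization form directly in Python (unicodedata.is_normalized requires Python 3.8+):

```python
import unicodedata

nfd = 'Re\u0301sume\u0301'               # "e" + combining acute (decomposed)
nfc = unicodedata.normalize('NFC', nfd)  # precomposed "Résumé"

print(nfd == nfc)                             # False: different code points
print(len(nfd), len(nfc))                     # 8 6
print(unicodedata.is_normalized('NFC', nfd))  # False
print(unicodedata.is_normalized('NFC', nfc))  # True
```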
Character encoding tells the browser how to map the raw bytes of your HTML file into readable characters. When no encoding is declared, browsers must rely on heuristics or defaults to figure out which encoding to use. The validator’s fallback to “windows-1252” reflects the behavior described in the HTML specification: if no encoding information is found, parsers may default to this legacy encoding. This can cause serious problems — characters like curly quotes, em dashes, accented letters, emoji, and non-Latin scripts can appear as garbled text (often called “mojibake”) if the actual encoding of the file doesn’t match what the browser assumes.
Why this matters
- Correctness: If your file is saved as UTF-8 (which is the default in most modern editors) but the browser interprets it as windows-1252, multi-byte characters will render incorrectly.
- Standards compliance: The WHATWG HTML Living Standard requires that a character encoding declaration be present. The encoding must also be declared within the first 1024 bytes of the document.
- Security: Ambiguous encoding can be exploited in certain cross-site scripting (XSS) attacks where an attacker takes advantage of encoding mismatches.
- Interoperability: Different browsers may choose different fallback encodings, leading to inconsistent rendering across platforms.
How to fix it
The simplest and recommended fix is to add <meta charset="utf-8"> as the first child element inside <head>, before <title> or any other elements. It must appear within the first 1024 bytes of the document so the parser encounters it early enough.
UTF-8 is the universal standard for the web. It can represent every character in Unicode, is backward-compatible with ASCII, and is the encoding recommended by the WHATWG HTML specification. Unless you have a very specific reason to use another encoding, always use UTF-8.
You should also ensure your text editor or build tool is actually saving the file in UTF-8 encoding. Declaring <meta charset="utf-8"> while the file is saved in a different encoding will still produce garbled text.
An alternative (less common) approach is to declare the encoding via an HTTP Content-Type header sent by the server, such as Content-Type: text/html; charset=utf-8. However, the <meta> tag is still recommended as a fallback for when files are viewed locally or cached without headers.
Examples
❌ Missing character encoding declaration
This document has no encoding declaration, triggering the validator warning:
<!DOCTYPE html>
<html lang="en">
<head>
<title>My Page</title>
</head>
<body>
<p>Héllo, wörld! — enjoy “quotes” and emöji 🎉</p>
</body>
</html>
✅ Fixed with <meta charset="utf-8">
Adding the <meta charset="utf-8"> tag as the first element in <head> resolves the issue:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>My Page</title>
</head>
<body>
<p>Héllo, wörld! — enjoy “quotes” and emöji 🎉</p>
</body>
</html>
❌ Encoding declared too late
The <meta charset> must come before other content in the <head>. Placing it after a <title> that contains non-ASCII characters means the parser may have already committed to a wrong encoding:
<!DOCTYPE html>
<html lang="en">
<head>
<title>Café Menu</title>
<meta charset="utf-8">
</head>
<body>
<p>Welcome to the café.</p>
</body>
</html>
✅ Encoding declared before all other content
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Café Menu</title>
</head>
<body>
<p>Welcome to the café.</p>
</body>
</html>
The key takeaway: always place <meta charset="utf-8"> as the very first element inside <head>. This is a small addition that prevents a whole class of character rendering and security issues.
Unicode allows certain characters — especially accented letters and other composed characters — to be represented in multiple ways. For example, the letter “é” can be a single precomposed character (U+00E9, NFC form) or a base letter “e” (U+0065) followed by a combining acute accent (U+0301, NFD form). While they may look identical on screen, they are different byte sequences. The HTML specification requires that all attribute values use NFC to ensure consistent behavior across browsers, search engines, and assistive technologies.
This matters for several important reasons:
- String matching and comparison: Browsers and scripts may compare attribute values byte-by-byte. An id value in NFD form won’t match a CSS selector or fragment identifier targeting the NFC form, causing broken links and broken styles.
- Accessibility: Screen readers and other assistive technologies may process NFC and NFD strings differently, potentially mispronouncing text or failing to match ARIA references.
- Interoperability: Different operating systems produce different normalization forms by default (macOS file systems historically use NFD, for example). Copying text from various sources can introduce non-NFC characters without any visual indication.
- Standards compliance: The WHATWG HTML specification and W3C guidance on normalization explicitly recommend NFC for all HTML content.
The issue most commonly appears when attribute values contain accented characters (like in id, class, alt, title, or value attributes) that were copied from a source using NFD normalization, or when files are created on systems that default to NFD.
To fix the problem, you need to convert the affected attribute values to NFC. You can do this by:
- Retyping the characters directly in your editor, which usually produces NFC by default.
- Using a programming tool such as Python’s unicodedata.normalize('NFC', text), JavaScript’s text.normalize('NFC'), or similar utilities in your language of choice.
- Using a text editor that supports normalization conversion (some editors have built-in Unicode normalization features or plugins).
- Running a batch conversion on your HTML files before deployment as part of your build process.
Examples
Incorrect: Attribute value uses NFD (decomposed form)
In this example, the id attribute value for “résumé” uses decomposed characters (base letter + combining accent), which triggers the validation error. The decomposition is invisible in source code but present at the byte level.
<!-- The "é" here is stored as "e" + combining acute accent (NFD) -->
<div id="résumé">
<p>My résumé content</p>
</div>
Correct: Attribute value uses NFC (precomposed form)
Here, the id attribute value uses precomposed characters, which is the correct NFC form.
<!-- The "é" here is stored as a single precomposed character (NFC) -->
<div id="résumé">
<p>My résumé content</p>
</div>
While these two examples look identical in source view, they differ at the byte level. You can verify the normalization form using browser developer tools or a hex editor.
Checking and fixing with JavaScript
You can programmatically normalize attribute values:
<script>
// Check if a string is in NFC
const text = "résumé";
const nfcText = text.normalize("NFC");
console.log(text === nfcText); // false if original was NFD
</script>
Checking and fixing with Python
import unicodedata
text = "r\u0065\u0301sume\u0301" # NFD form
normalized = unicodedata.normalize('NFC', text)
print(normalized) # Outputs NFC form: "résumé"
If you encounter this validation error, inspect the flagged attribute value carefully and ensure all characters are in their precomposed NFC form. Adding a normalization step to your build pipeline is a reliable way to prevent this issue from recurring.