How to use XML or plain text sitemaps to organize large site validations

by Jaime Iniesta

TL;DR: Rocket Validator supports XML and plain text sitemaps. Use them to organize batch validation of large sites.


Rocket Validator is a fully automated web crawler that assists you in validating large sites. To perform batch HTML and accessibility checking on the web pages of a large site, you just need to give it a starting URL, and it will automatically crawl the site, scrape the links, and validate each web page found.

Our web spider finds internally linked web pages by scraping the HTML of each page, and adds only newly found web pages to the site validation report.

Because there are many paths to traverse when following links on a site, there’s no guarantee of the exact URLs our web spider will find when the site is larger than the limit specified on the report. It can also take a while to discover the unique web pages on a site by following links and discarding repeated web pages.

When you want more control over the exact URLs to validate on a web site, and you want to make the job easier, and therefore faster, for our web crawler, you can use an XML or plain text sitemap as the starting URL.

Chances are your site already has a sitemap - typically these are named sitemap.xml. For example, here’s our XML sitemap and here is the plain text version. We use these sitemaps to submit our web pages to search engines, and these same sitemaps can be used with the Rocket Validator crawler.

XML Sitemaps

According to sitemaps.org,

Sitemaps are an easy way for webmasters to inform search engines about pages on their sites that are available for crawling. In its simplest form, a Sitemap is an XML file that lists URLs for a site along with additional metadata about each URL (when it was last updated, how often it usually changes, and how important it is, relative to other URLs in the site) so that search engines can more intelligently crawl the site.

Although the XML Sitemaps protocol can include metadata about the web pages, Rocket Validator only takes into account the URLs, as specified in the loc tag. In its simplest form, here’s the structure we expect for an XML sitemap:

<?xml version="1.0" encoding="UTF-8"?>

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
     <loc>http://www.example.com/first</loc>
   </url>

   <url>
     <loc>http://www.example.com/second</loc>
   </url>
</urlset> 

In this example, we see 2 web pages being listed. As long as the content type is text/xml and this structure is respected, Rocket Validator will parse your XML sitemaps.
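To check which URLs a sitemap like the one above exposes, the loc values can be pulled out with a few lines of Python. This is only a minimal sketch using the standard library, not how Rocket Validator parses sitemaps internally; the names extract_locs and SAMPLE_SITEMAP are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample, mirroring the structure shown in the article.
SAMPLE_SITEMAP = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    "  <url><loc>http://www.example.com/first</loc></url>\n"
    "  <url><loc>http://www.example.com/second</loc></url>\n"
    "</urlset>\n"
)


def extract_locs(sitemap_xml: str) -> list[str]:
    """Return the URL found in each <url>/<loc> element, in document order."""
    # Parse from bytes so the encoding declaration is honoured.
    root = ET.fromstring(sitemap_xml.encode("utf-8"))
    locs = root.findall("sm:url/sm:loc", SITEMAP_NS)
    # Ignore any other metadata (lastmod, changefreq, priority).
    return [loc.text.strip() for loc in locs if loc.text]


if __name__ == "__main__":
    for url in extract_locs(SAMPLE_SITEMAP):
        print(url)
```

Any metadata elements in the sitemap are simply skipped, which matches the idea that only the loc values matter for validation.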

Plain text sitemaps

There’s a simpler alternative when you just need to list URLs, and you don’t need to pass additional metadata - just list the URLs in plain text, one URL per line, like this:

http://www.example.com/first
http://www.example.com/second

In this example, we see the same 2 web pages being listed. As long as the content type is text/plain and there’s one URL per line, Rocket Validator will parse your plain text sitemaps.
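If you already have a list of URLs, producing a plain text sitemap takes very little code. The sketch below assumes the URLs live in a Python list; build_text_sitemap is a hypothetical helper name, and serving the result with a text/plain content type is left to your web server.

```python
def build_text_sitemap(urls):
    """Return a plain text sitemap: one URL per line, without duplicates."""
    seen = set()
    lines = []
    for url in urls:
        url = url.strip()
        # Skip blank entries and URLs that were already listed.
        if url and url not in seen:
            seen.add(url)
            lines.append(url)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sitemap = build_text_sitemap([
        "http://www.example.com/first",
        "http://www.example.com/second",
        "http://www.example.com/first",  # duplicate, dropped
    ])
    print(sitemap, end="")
```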

Organizing large sites using sitemaps

You can use XML or plain text sitemaps to organize the web pages you want to batch-check in your Rocket Validator site reports. There are many reasons to do that:

  • Controlling exactly which URLs to include in the report. Instead of leaving it to whichever paths our web crawler happens to follow while discovering your internal web pages, you can specify the exact URLs to validate using a sitemap.
  • Speeding up crawling. By giving our web crawler a specific list of web pages to include, you’re making its job easier and therefore faster.
  • Including more web pages than the maximum allowed in a report. Depending on your subscription plan, there’s a limit on the maximum number of web pages that a site report can include. For example, a Pro subscription gives you up to 5,000 web pages per report. One way to validate a site with 10,000 web pages is to craft 2 separate sitemaps, one for the first 5,000 web pages and another for the remaining 5,000.
  • Organizing web pages by sections. You may want to run different reports on different sections of a site, for example one report for the Blog and another for the Store. A good way to organize this is with sitemaps: https://example.com/blog_sitemap.txt can cover the web pages on the Blog, and https://example.com/store_sitemap.txt the web pages on the Store. Remember to set max_pages to match the number of URLs in the sitemap, so that deep crawling doesn’t pick up other web pages outside that section.
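The 10,000-page case above comes down to cutting a URL list into chunks no larger than the report limit. Here is a rough sketch of that step; split_into_sitemaps is a hypothetical helper, and the max_pages argument simply stands in for whatever limit your plan allows.

```python
def split_into_sitemaps(urls, max_pages):
    """Split a list of URLs into consecutive chunks of at most max_pages."""
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    return [urls[i:i + max_pages] for i in range(0, len(urls), max_pages)]


if __name__ == "__main__":
    # Made-up URLs, only to illustrate the split.
    all_urls = [f"https://example.com/page/{n}" for n in range(1, 10001)]
    chunks = split_into_sitemaps(all_urls, max_pages=5000)
    for number, chunk in enumerate(chunks, start=1):
        print(f"sitemap {number}: {len(chunk)} URLs")
```

Each chunk can then be written out as its own sitemap and used as the starting URL of a separate report.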

Some tips

Paginating sitemaps

If you’re generating your sitemaps dynamically, you can consider including pagination parameters in the sitemap URL. For example:

https://example.com/sitemap.php?page_size=1000&page=1

Then, you can tell your sitemap.php script to split the URLs into pages of size page_size, and to return only the page whose number is given by page.
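The slicing logic behind such a script is small. The sketch below shows it in Python rather than PHP, and assumes that page numbers start at 1, as in the example URL; sitemap_page is a hypothetical name, and fetching the full URL list is left out.

```python
def sitemap_page(urls, page_size, page):
    """Return the URLs belonging to the given 1-based page number."""
    if page_size < 1 or page < 1:
        raise ValueError("page_size and page must be at least 1")
    start = (page - 1) * page_size
    # Slicing past the end just yields a shorter (or empty) list.
    return urls[start:start + page_size]


if __name__ == "__main__":
    # Made-up URLs, only to illustrate the paging.
    all_urls = [f"https://example.com/page/{n}" for n in range(1, 2501)]
    print(len(sitemap_page(all_urls, page_size=1000, page=1)))
    print(len(sitemap_page(all_urls, page_size=1000, page=3)))
```

A page beyond the last one returns an empty list, so the script could respond with an empty sitemap or an error, whichever suits your setup.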

Validating fewer web pages

While trying to validate a whole site is tempting, typically you’ll only want to validate a representative sample of your web pages. For example, if you have a blog, chances are all its posts share the same layout, so instead of validating every post, you can consider validating only one. Your sitemap could, for example, include only the post list, a sample post, and a tag list.
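One possible way to pick such a sample automatically is to keep a single URL per combination of site section and path depth. This is only a heuristic sketch, built on the assumption that pages at the same depth within a section share a layout; sample_per_layout is a hypothetical helper and the URLs are made up.

```python
from urllib.parse import urlsplit


def sample_per_layout(urls):
    """Keep the first URL seen for each (top-level section, path depth) pair."""
    seen = set()
    sample = []
    for url in urls:
        segments = [s for s in urlsplit(url).path.split("/") if s]
        # Heuristic key: pages in the same section at the same depth
        # are assumed to share a layout.
        key = (segments[0] if segments else "", len(segments))
        if key not in seen:
            seen.add(key)
            sample.append(url)
    return sample


if __name__ == "__main__":
    for url in sample_per_layout([
        "https://example.com/blog/",
        "https://example.com/blog/post-1",
        "https://example.com/blog/post-2",
        "https://example.com/tags/html",
    ]):
        print(url)
```

It is worth reviewing the resulting list by hand, since a site may well use different layouts at the same depth.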
