Documentation

Understanding how the sitemap generator works under the hood

Overview

This sitemap generator intelligently crawls websites to discover all accessible pages and generates a standards-compliant XML sitemap. It's designed to handle both traditional server-side rendered (SSR) pages and modern client-side rendered (CSR) applications like React, Vue, and Angular.

The generator uses a hybrid approach: it first attempts to extract links using simple HTTP requests and HTML parsing. If it detects a CSR application, it automatically falls back to Puppeteer for JavaScript rendering.
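
A rough sketch of that hybrid flow, assuming Node 18+ (with global fetch) and the node-html-parser package for parsing; extractLinksWithPuppeteer refers to the sketch in the Puppeteer Fallback section, and detectCSR is shown under Client-Side Rendering Detection:

const { parse } = require("node-html-parser");

// Sketch of the hybrid flow: cheap HTTP parsing first, Puppeteer only when
// detectCSR flags the page as client-side rendered
async function extractLinks(url, config) {
  const response = await fetch(url);
  const html = await response.text();
  const root = parse(html);

  if (detectCSR(html, root, config)) {
    // Page looks client-side rendered; render it with Puppeteer instead
    return extractLinksWithPuppeteer(url, config);
  }
  // Plain HTML path: pull hrefs straight out of the parsed document
  return root
    .querySelectorAll("a")
    .map((a) => a.getAttribute("href"))
    .filter(Boolean);
}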

Algorithm Flow

1. Sitemap Discovery: Check robots.txt for existing sitemaps and common sitemap paths.
2. Initialize Crawl Queue: Start with the base URL and any discovered sitemap URLs.
3. Concurrent Crawling: Process up to 5 URLs simultaneously using breadth-first search.
4. CSR Detection: Analyze the HTML structure to determine whether a page is client-side rendered.
5. Link Extraction: Extract links from the HTML, or use Puppeteer if CSR is detected.
6. Generate Sitemap: Create XML with URLs, priorities, and last modification dates.
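
Tied together, the flow looks roughly like the sketch below. parseRobotsTxt and buildXml are sketched in later sections; crawl stands in for the BFS worker pool and is illustrative, not the actual API:

async function generateSitemap(baseUrl, config) {
  // 1. Sitemap discovery: read robots.txt (parseRobotsTxt is sketched
  //    in the robots.txt Handling section below)
  const robotsTxt = await fetch(new URL("/robots.txt", baseUrl)).then((r) => r.text());
  const { sitemaps, disallowed } = parseRobotsTxt(robotsTxt);

  // 2. Seed the queue with the base URL and any discovered sitemap URLs
  const queue = [baseUrl, ...sitemaps];

  // 3-5. Crawl breadth-first with bounded concurrency; each worker runs
  //      CSR detection and extracts links via HTML parsing or Puppeteer
  const pages = await crawl(queue, { ...config.crawler, disallowed });

  // 6. Serialize the collected URLs into a sitemaps.org-compliant document
  return buildXml(pages);
}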

Client-Side Rendering Detection

The CSR detection algorithm analyzes several signals to determine if a page requires JavaScript execution to render its content:

Detection Criteria

  • Short HTML: Less than 200 characters suggests minimal initial content
  • Empty Body: Less than 5 child nodes in the body element
  • Framework Markers: Presence of #root or #__next elements
  • Script Heavy: More than 10 script tags with low content-to-script ratio
  • Loading Indicators: Text like "loading" or "spinner" in the initial HTML

function detectCSR(html, root, config) {
  const body = root.querySelector("body");
  const bodyChildCount = body ? body.childNodes.length : 0;
  const scriptCount = root.querySelectorAll("script").length;

  // Check the individual CSR indicators against the configured thresholds
  const isShortHtml = html.length < config.csr.minimalContentLength;
  const hasEmptyBody = bodyChildCount < config.csr.minimalChildNodes;
  const hasRootDiv = config.csr.rootSelectors.some(
    (selector) => root.querySelector(selector) !== null
  );
  const hasManyScripts = scriptCount > config.csr.scriptCountThreshold;
  const lowContentRatio =
    scriptCount > 0 && html.length / scriptCount < config.csr.contentScriptRatio;

  // No single weak signal decides on its own: a very short document is
  // conclusive, everything else must appear in combination
  return (
    isShortHtml ||
    (hasRootDiv && hasEmptyBody) ||
    (hasEmptyBody && hasManyScripts) ||
    (lowContentRatio && hasRootDiv)
  );
}

Crawling Strategy

The crawler uses a breadth-first search (BFS) algorithm with concurrent processing to efficiently discover pages:

Concurrency

Processes up to 5 URLs simultaneously to maximize throughput while being respectful to the target server.
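
A minimal way to cap concurrency at 5, for illustration (a real worker pool would refill slots as individual requests finish rather than waiting on whole batches):

// Process the queue in batches of `concurrency` URLs (5 by default)
async function processBatch(queue, concurrency, handleUrl) {
  while (queue.length > 0) {
    const batch = queue.splice(0, concurrency);
    // Each handler may push newly discovered URLs back onto the queue
    await Promise.all(batch.map((url) => handleUrl(url, queue)));
  }
}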

Depth Tracking

Tracks link depth from the homepage to calculate priority scores and understand site structure.

Deduplication

Uses a visited set to prevent crawling the same URL multiple times, normalizing URLs by removing query strings and fragments.
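
Normalization can be done with the WHATWG URL API, roughly like this (a sketch; the actual rules may differ, e.g. around trailing slashes):

const visited = new Set();

// Strip query strings and fragments so variants map to one canonical URL
function normalizeUrl(href, baseUrl) {
  const url = new URL(href, baseUrl);
  url.search = "";
  url.hash = "";
  return url.toString();
}

function markVisited(href, baseUrl) {
  const key = normalizeUrl(href, baseUrl);
  if (visited.has(key)) return false; // already crawled
  visited.add(key);
  return true;
}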

Scope Control

Only crawls URLs within the same hostname, preventing external link following and respecting robots.txt disallow rules.
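
A scope check along these lines keeps the crawl on-site (illustrative; the disallowed list comes from the robots.txt parsing shown in the next section):

// Only crawl same-host URLs that robots.txt does not disallow
function inScope(href, baseUrl, disallowedPaths) {
  const url = new URL(href, baseUrl);
  if (url.hostname !== new URL(baseUrl).hostname) return false;
  return !disallowedPaths.some((path) => url.pathname.startsWith(path));
}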

Priority Calculation

Each URL in the sitemap is assigned a priority value between 0.1 and 1.0 based on its depth from the homepage:

Homepage (depth 0):      1.0
First level (depth 1):   0.9
Second level (depth 2):  0.8
Deep pages (depth 9+):   0.1

priority = Math.max(0.1, 1.0 - depth * 0.1)

For example, a page at depth 3 receives max(0.1, 1.0 - 0.3) = 0.7.

robots.txt Handling

The generator respects robots.txt directives to ensure ethical crawling:

Sitemap Discovery

Parses robots.txt for existing sitemap declarations and prioritizes them for crawling.

Sitemap: https://example.com/sitemap.xml

Disallow Rules

Respects Disallow directives by checking URLs against disallowed paths before adding them to the crawl queue.

Disallow: /admin/
Disallow: /private/
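
Both directives can be pulled out with a simple line-based parser, sketched here (real robots.txt handling has more subtleties, such as per-user-agent groups and wildcards, which this ignores):

// Naive robots.txt parser: collects Sitemap and Disallow lines,
// ignoring user-agent groups and wildcard patterns
function parseRobotsTxt(text) {
  const sitemaps = [];
  const disallowed = [];
  for (const line of text.split("\n")) {
    const [rawKey, ...rest] = line.split(":");
    const key = rawKey.trim().toLowerCase();
    const value = rest.join(":").trim(); // sitemap URLs contain ":" themselves
    if (key === "sitemap" && value) sitemaps.push(value);
    if (key === "disallow" && value) disallowed.push(value);
  }
  return { sitemaps, disallowed };
}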

Puppeteer Fallback

When CSR is detected, the generator uses Puppeteer to render the page and extract links from the fully rendered DOM:

1. Launch Browser: Starts a headless Chrome instance shared across all CSR pages.
2. Navigate & Wait: Loads the page and waits for networkidle2 (all network connections idle).
3. Wait for Selectors: Waits up to 10 seconds for critical selectors such as anchor tags or framework root elements.
4. Extract Links: Uses page.evaluate() to query the DOM for all href attributes and canonical/alternate links.
5. Cleanup: Closes the page context to free resources while keeping the browser instance alive.
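
The fallback corresponds roughly to this Puppeteer sketch; the timeouts and selectors mirror the configuration shown below, but treat it as an outline rather than the exact implementation:

const puppeteer = require("puppeteer");

let browser; // shared across all CSR pages

async function extractLinksWithPuppeteer(url, config) {
  if (!browser) browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  try {
    // Wait until the network is (mostly) idle so the app has rendered
    await page.goto(url, {
      waitUntil: config.puppeteer.waitUntil,
      timeout: config.puppeteer.gotoTimeout,
    });
    // Give the framework up to 10s to mount its root or render anchors
    await page
      .waitForSelector("a[href], #root, #__next", {
        timeout: config.puppeteer.waitForSelectorsTimeout,
      })
      .catch(() => {}); // proceed with whatever did render
    // Pull every href out of the fully rendered DOM
    return await page.evaluate(() =>
      Array.from(
        document.querySelectorAll("a[href], link[rel=canonical], link[rel=alternate]"),
        (el) => el.href
      )
    );
  } finally {
    await page.close(); // free the page, keep the browser alive
  }
}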

XML Sitemap Generation

The final sitemap follows the sitemaps.org protocol specification:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2026-04-18T12:00:00+00:00</lastmod>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/about</loc>
    <lastmod>2026-04-15T08:30:00+00:00</lastmod>
    <priority>0.9</priority>
  </url>
  <!-- Additional URLs... -->
</urlset>
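
Generating that document is mostly string assembly; a sketch follows (the entry shape { url, lastmod, priority } is assumed, and loc values should be XML-escaped as shown):

// Escape the five XML special characters in URLs
function escapeXml(value) {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

function buildXml(entries) {
  const urls = entries
    .map(
      ({ url, lastmod, priority }) =>
        `  <url>\n` +
        `    <loc>${escapeXml(url)}</loc>\n` +
        `    <lastmod>${lastmod}</lastmod>\n` +
        `    <priority>${priority.toFixed(1)}</priority>\n` +
        `  </url>`
    )
    .join("\n");
  return (
    `<?xml version="1.0" encoding="UTF-8"?>\n` +
    `<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n` +
    `${urls}\n</urlset>`
  );
}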

Performance Considerations

Concurrent Requests

Limits concurrency to 5 to balance speed with server load

Shared Browser

Reuses single Puppeteer instance across all CSR pages

HTTP First

Attempts lightweight HTTP parsing before expensive Puppeteer rendering

Stream Updates

Uses Server-Sent Events to provide real-time progress without polling
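
For context, Server-Sent Events only require a text/event-stream response with data: lines; a minimal Node sketch (an illustrative endpoint, not the generator's actual route):

const http = require("http");

// Minimal SSE endpoint: the client receives one progress event per message
http.createServer((req, res) => {
  res.writeHead(200, {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    Connection: "keep-alive",
  });
  // In the generator, this would fire as each page finishes crawling
  const send = (progress) => res.write(`data: ${JSON.stringify(progress)}\n\n`);
  send({ crawled: 1, queued: 4, url: "https://example.com/" });
}).listen(3000);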

Configuration

The algorithm can be tuned using these configuration parameters:

const config = {
  csr: {
    minimalContentLength: 200,      // HTML shorter than this suggests CSR
    minimalChildNodes: 5,           // Body with fewer children suggests CSR
    scriptCountThreshold: 10,       // More scripts than this is script-heavy
    contentScriptRatio: 1000,       // HTML chars per script below this is suspicious
    rootSelectors: ["#root", "#__next"]
  },
  puppeteer: {
    waitForSelectorsTimeout: 10000, // Selector wait time
    gotoTimeout: 60000,             // Page load timeout
    waitUntil: "networkidle2"       // Wait strategy
  },
  crawler: {
    concurrency: 5,                 // Parallel requests
    maxPages: 100                   // Maximum pages
  }
}