Documentation
Understanding how the sitemap generator works under the hood
Overview
This sitemap generator intelligently crawls websites to discover all accessible pages and generates a standards-compliant XML sitemap. It's designed to handle both traditional server-side rendered (SSR) pages and modern client-side rendered (CSR) applications like React, Vue, and Angular.
The generator uses a hybrid approach: it first attempts to extract links using simple HTTP requests and HTML parsing. If it detects a CSR application, it automatically falls back to Puppeteer for JavaScript rendering.
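A simplified sketch of that decision path, assuming the node-html-parser package for static parsing; collectLinks and extractLinksWithPuppeteer are illustrative names, and detectCSR is the helper shown later on this page:

const { parse } = require("node-html-parser");

async function collectLinks(url, config) {
  // Cheap path first: a plain HTTP fetch plus static HTML parsing
  const response = await fetch(url);
  const html = await response.text();
  const root = parse(html);

  if (detectCSR(html, root, config)) {
    // CSR detected: fall back to Puppeteer so JavaScript can render the content
    return extractLinksWithPuppeteer(url, config);
  }

  // SSR/static page: links are already present in the initial HTML
  return root
    .querySelectorAll("a[href]")
    .map((anchor) => anchor.getAttribute("href"));
}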
Algorithm Flow
1. Sitemap Discovery: Check robots.txt for existing sitemaps and common sitemap paths
2. Initialize Crawl Queue: Start with the base URL and any discovered sitemap URLs
3. Concurrent Crawling: Process up to 5 URLs simultaneously using breadth-first search
4. CSR Detection: Analyze the HTML structure to determine whether a page is client-side rendered
5. Link Extraction: Extract links from the HTML, or use Puppeteer if CSR is detected
6. Generate Sitemap: Create the XML with URLs, priorities, and last modification dates
Client-Side Rendering Detection
The CSR detection algorithm analyzes several signals to determine if a page requires JavaScript execution to render its content:
Detection Criteria
- Short HTML: Less than 200 characters suggests minimal initial content
- Empty Body: Fewer than 5 child nodes in the body element
- Framework Markers: Presence of #root or #__next elements
- Script Heavy: More than 10 script tags with a low content-to-script ratio
- Loading Indicators: Text like "loading" or "spinner" in the initial HTML
function detectCSR(html, root, config) {
  const body = root.querySelector("body");
  const bodyChildCount = body ? body.childNodes.length : 0;
  // Count script tags to gauge how script-heavy the page is
  const scriptCount = root.querySelectorAll("script").length;

  // Check various CSR indicators (thresholds come from the config shown below)
  const isShortHtml = html.length < config.csr.minimalContentLength;
  const hasEmptyBody = bodyChildCount < config.csr.minimalChildNodes;
  const hasRootDiv = config.csr.rootSelectors.some(
    (selector) => root.querySelector(selector) !== null
  );
  const hasManyScripts = scriptCount > config.csr.scriptCountThreshold;
  const lowContentRatio = html.length / scriptCount < config.csr.contentScriptRatio;

  return (
    isShortHtml ||
    (hasRootDiv && hasEmptyBody) ||
    (hasEmptyBody && hasManyScripts) ||
    (lowContentRatio && hasRootDiv)
  );
}

Crawling Strategy
The crawler uses a breadth-first search (BFS) algorithm with concurrent processing to efficiently discover pages:
Concurrency
Processes up to 5 URLs simultaneously to maximize throughput while being respectful to the target server.
Depth Tracking
Tracks link depth from the homepage to calculate priority scores and understand site structure.
Deduplication
Uses a visited set to prevent crawling the same URL multiple times, normalizing URLs by removing query strings and fragments.
Scope Control
Only crawls URLs within the same hostname, preventing external link following and respecting robots.txt disallow rules.
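A minimal sketch of that loop; collectLinks is the hybrid extractor sketched earlier, while normalizeUrl and isAllowedByRobots stand in for helpers described in the surrounding sections:

async function crawl(baseUrl, config) {
  const origin = new URL(baseUrl);
  const visited = new Set();
  const results = [];
  // BFS queue: each entry carries the URL and its depth from the homepage
  let queue = [{ url: baseUrl, depth: 0 }];

  while (queue.length > 0 && results.length < config.crawler.maxPages) {
    // Take up to `concurrency` URLs from the front of the queue
    const batch = queue.splice(0, config.crawler.concurrency);

    const discovered = await Promise.all(
      batch.map(async ({ url, depth }) => {
        const normalized = normalizeUrl(url); // strips query string and fragment
        if (visited.has(normalized)) return [];
        visited.add(normalized);
        results.push({ url: normalized, depth });

        const links = await collectLinks(normalized, config);
        return links
          .map((href) => new URL(href, normalized))
          .filter((link) => link.hostname === origin.hostname) // same-host scope
          .filter((link) => isAllowedByRobots(link.pathname))  // respect Disallow rules
          .map((link) => ({ url: link.href, depth: depth + 1 }));
      })
    );

    queue = queue.concat(discovered.flat());
  }

  return results;
}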
Priority Calculation
Each URL in the sitemap is assigned a priority value between 0.1 and 1.0 based on its depth from the homepage:
priority = Math.max(0.1, 1.0 - depth * 0.1)
For example, the homepage (depth 0) gets 1.0, a page one click deep gets 0.9, two clicks deep gets 0.8, and anything nine or more clicks deep is floored at 0.1.
robots.txt Handling
The generator respects robots.txt directives to ensure ethical crawling:
Sitemap Discovery
Parses robots.txt for existing sitemap declarations and prioritizes them for crawling.
Disallow Rules
Respects Disallow directives by checking URLs against disallowed paths before adding them to the crawl queue.
Disallow: /private/
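A simplified sketch of both behaviors, assuming rules are matched as plain path prefixes (real robots.txt matching also handles wildcards and per-agent groups):

async function loadRobots(baseUrl) {
  const response = await fetch(new URL("/robots.txt", baseUrl));
  const text = response.ok ? await response.text() : "";

  const sitemaps = [];
  const disallowed = [];
  for (const line of text.split("\n")) {
    const [rawKey, ...rest] = line.split(":");
    const key = rawKey.trim().toLowerCase();
    const value = rest.join(":").trim();
    if (key === "sitemap" && value) sitemaps.push(value);    // existing sitemap declarations
    if (key === "disallow" && value) disallowed.push(value); // blocked path prefixes
  }

  return {
    sitemaps, // crawled first when present
    isAllowed: (pathname) => !disallowed.some((path) => pathname.startsWith(path)),
  };
}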
Puppeteer Fallback
When CSR is detected, the generator uses Puppeteer to render the page and extract links from the fully rendered DOM:
1. Launch Browser: Starts a headless Chrome instance shared across all CSR pages
2. Navigate & Wait: Loads the page and waits for networkidle2 (no more than two open network connections for at least 500 ms)
3. Wait for Selectors: Waits up to 10 seconds for critical selectors such as anchor tags or framework root elements
4. Extract Links: Uses page.evaluate() to query the DOM for all href attributes and canonical/alternate links
5. Cleanup: Closes the page context to free resources while keeping the browser instance alive
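A condensed sketch of these steps; the structure follows the list above, with error handling and some details omitted:

const puppeteer = require("puppeteer");

let browserPromise;

async function extractLinksWithPuppeteer(url, config) {
  // 1. Launch the shared headless browser once and reuse it across all CSR pages
  browserPromise ??= puppeteer.launch({ headless: true });
  const browser = await browserPromise;
  const page = await browser.newPage();

  try {
    // 2. Navigate and wait for network activity to settle
    await page.goto(url, {
      waitUntil: config.puppeteer.waitUntil,
      timeout: config.puppeteer.gotoTimeout,
    });

    // 3. Wait for anchors or a framework root element, but don't fail if they never appear
    await page
      .waitForSelector("a[href], #root, #__next", {
        timeout: config.puppeteer.waitForSelectorsTimeout,
      })
      .catch(() => {});

    // 4. Query the fully rendered DOM for link targets
    return await page.evaluate(() =>
      Array.from(
        document.querySelectorAll("a[href], link[rel='canonical'], link[rel='alternate']")
      )
        .map((el) => el.href)
        .filter(Boolean)
    );
  } finally {
    // 5. Close the page context but keep the shared browser alive
    await page.close();
  }
}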
XML Sitemap Generation
The final sitemap follows the sitemaps.org protocol specification:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://example.com/</loc>
<lastmod>2026-04-18T12:00:00+00:00</lastmod>
<priority>1.0</priority>
</url>
<url>
<loc>https://example.com/about</loc>
<lastmod>2026-04-15T08:30:00+00:00</lastmod>
<priority>0.9</priority>
</url>
<!-- Additional URLs... -->
</urlset>
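A minimal sketch of how the collected entries might be serialized into this structure (the entry shape and helper name are illustrative):

function generateSitemapXml(entries) {
  // Escape the five XML special characters in URL values
  const escapeXml = (value) =>
    value.replace(/[<>&'"]/g, (ch) =>
      ({ "<": "&lt;", ">": "&gt;", "&": "&amp;", "'": "&apos;", '"': "&quot;" }[ch])
    );

  const urls = entries
    .map(
      (entry) =>
        "  <url>\n" +
        `    <loc>${escapeXml(entry.url)}</loc>\n` +
        `    <lastmod>${entry.lastmod}</lastmod>\n` +
        `    <priority>${entry.priority.toFixed(1)}</priority>\n` +
        "  </url>"
    )
    .join("\n");

  return (
    '<?xml version="1.0" encoding="UTF-8"?>\n' +
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n' +
    urls +
    "\n</urlset>"
  );
}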
Performance Considerations
Concurrent Requests
Limits concurrency to 5 to balance speed with server load
Shared Browser
Reuses single Puppeteer instance across all CSR pages
HTTP First
Attempts lightweight HTTP parsing before expensive Puppeteer rendering
Stream Updates
Uses Server-Sent Events to provide real-time progress without polling
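A minimal sketch of that stream, assuming the crawler exposes an EventEmitter-style interface (event names and payloads here are illustrative):

function streamProgress(res, crawler) {
  // Standard SSE headers: keep the connection open and disable caching
  res.writeHead(200, {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    Connection: "keep-alive",
  });

  // Push one event per crawled page instead of having the client poll
  crawler.on("progress", (progress) => {
    res.write(`data: ${JSON.stringify(progress)}\n\n`);
  });

  // Signal completion and close the stream
  crawler.on("done", (summary) => {
    res.write(`event: done\ndata: ${JSON.stringify(summary)}\n\n`);
    res.end();
  });
}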
Configuration
The algorithm can be tuned using these configuration parameters:
const config = {
csr: {
minimalContentLength: 200, // Min HTML length
minimalChildNodes: 5, // Min body children
scriptCountThreshold: 10, // Script tag threshold
contentScriptRatio: 1000, // Content/script ratio
rootSelectors: ["#root", "#__next"]
},
puppeteer: {
waitForSelectorsTimeout: 10000, // Selector wait time
gotoTimeout: 60000, // Page load timeout
waitUntil: "networkidle2" // Wait strategy
},
crawler: {
concurrency: 5, // Parallel requests
maxPages: 100 // Maximum pages
}
}
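For example, a gentler run could override the crawler settings before starting the crawl sketched earlier (plain object spreading, not a dedicated API):

const gentleConfig = {
  ...config,
  crawler: { concurrency: 2, maxPages: 50 },
};
const pages = await crawl("https://example.com", gentleConfig);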