Domain-Specific Web Content Crawler with Depth Control & Text Extraction
This template implements a recursive web crawler inside n8n. Starting from a given URL, it crawls linked pages up to a maximum depth (default: 3), extracts text and links, and returns the collected content via webhook.
🚀 How It Works
- Webhook Trigger
  Accepts a JSON body with a url field.
  Example payload: { "url": "https://example.com" }
- Initialization
  Sets the crawl parameters: url, domain, maxDepth = 3, and depth = 0.
  Initializes global static data (pending, visited, queued, pages).
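In n8n, this state would live in workflow static data; as a standalone sketch, the initialization step amounts to building one state object (the function name and exact field shapes here are illustrative, not the template's own):

```javascript
// Sketch of the crawl-state initialization described above.
// Field names (pending, visited, queued, pages) follow the template's
// description; everything else is an assumption.
function initCrawlState(startUrl, maxDepth = 3) {
  const { hostname } = new URL(startUrl);
  return {
    url: startUrl,
    domain: hostname,   // only links on this host will be followed
    maxDepth,
    depth: 0,
    pending: 0,         // requests still in flight
    visited: new Set(), // URLs already crawled
    queued: [],         // URLs waiting to be crawled
    pages: [],          // collected { url, depth, content } results
  };
}
```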
- Recursive Crawling
  Fetches each page (HTTP Request node).
  Extracts body text and links (HTML node).
  Cleans and deduplicates links, filtering out:
  - External domains (only same-site links are followed)
  - Anchors (#) and mailto:, tel:, javascript: links
  - Non-HTML files (.pdf, .docx, .xlsx, .pptx)
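The link cleaning above can be sketched as a standalone function (the template implements this inside a Code node; names and regexes here are illustrative assumptions, not the template's own code):

```javascript
// Sketch of the link cleaning/filtering step: resolve, normalize,
// restrict to the same site, skip non-HTML documents, and deduplicate.
const SKIPPED_EXTENSIONS = /\.(pdf|docx|xlsx|pptx)$/i;

function filterLinks(links, pageUrl, domain) {
  const seen = new Set();
  const result = [];
  for (const href of links) {
    // Drop bare anchors and non-HTTP schemes.
    if (!href || href.startsWith('#')) continue;
    if (/^(mailto|tel|javascript):/i.test(href)) continue;
    let url;
    try {
      url = new URL(href, pageUrl); // resolve relative links against the page
    } catch {
      continue; // malformed URL
    }
    url.hash = ''; // page#a and page#b are the same page
    // Same-site only: treat apex and www as equivalent.
    const host = url.hostname.replace(/^www\./, '');
    if (host !== domain.replace(/^www\./, '')) continue;
    if (SKIPPED_EXTENSIONS.test(url.pathname)) continue;
    const normalized = url.toString();
    if (seen.has(normalized)) continue; // deduplicate
    seen.add(normalized);
    result.push(normalized);
  }
  return result;
}
```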
- Depth Control & Queue
  Tracks visited URLs.
  Stops at maxDepth to prevent infinite loops.
  Uses SplitInBatches to loop over the queue.
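Logically, the queue plus depth control is a breadth-first crawl. A minimal standalone sketch (in the workflow, SplitInBatches drives this loop and the state lives in static data; fetchPage and extractLinks stand in for the HTTP Request and HTML nodes):

```javascript
// Depth-limited BFS crawl sketch. fetchPage(url) -> page content,
// extractLinks(content, url) -> array of same-site links; both are
// placeholders for the template's HTTP Request and HTML nodes.
async function crawl(startUrl, maxDepth, fetchPage, extractLinks) {
  const visited = new Set();
  const queue = [{ url: startUrl, depth: 0 }];
  const pages = [];
  while (queue.length > 0) {
    const { url, depth } = queue.shift();
    if (visited.has(url)) continue; // already crawled
    visited.add(url);
    const content = await fetchPage(url);
    pages.push({ url, depth, content });
    if (depth >= maxDepth) continue; // stop descending past maxDepth
    for (const link of extractLinks(content, url)) {
      if (!visited.has(link)) queue.push({ url: link, depth: depth + 1 });
    }
  }
  return pages;
}
```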
- Data Collection
  Saves each crawled page (url, depth, content) into pages[].
  When pending reaches 0, combines the results.
- Output
  Responds via the Webhook node with combinedContent (all pages concatenated) and pages[] (an array of individual results). Results exceeding ~12,000 characters are split into chunks.
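The chunking step can be sketched as a plain split function (the ~12,000-character threshold comes from the template; splitting on fixed character offsets is an assumption about how it chunks):

```javascript
// Sketch of response chunking: split oversized combined content into
// fixed-size pieces. The threshold follows the template's ~12,000-character
// limit; the exact chunk boundaries are an assumption.
const CHUNK_SIZE = 12000;

function chunkContent(text, size = CHUNK_SIZE) {
  const chunks = [];
  for (let i = 0; i < text.length; i += size) {
    chunks.push(text.slice(i, i + size));
  }
  return chunks;
}
```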
🛠️ Setup Instructions
- Import Template
  Load the template from n8n Community Templates.
- Configure Webhook
  Open the Webhook node.
  Copy the Test URL (for development) or the Production URL (after deployment).
  You'll POST crawl requests to this endpoint.
- Run a Test
  Send a POST request with a JSON body:

  curl -X POST https://<your-n8n>/webhook/<id> \
    -H "Content-Type: application/json" \
    -d '{"url": "https://example.com"}'
- View Response
  The crawler returns a JSON object containing combinedContent and pages[].
⚙️ Configuration
**maxDepth**
Default: 3. Adjust in the Init Crawl Params (Set) node.
**Timeouts**
The HTTP Request node timeout is 5 seconds per request; increase it if needed.
**Filtering Rules**
- Only same-domain links are followed (apex and www are treated as the same site)
- Skips anchors, mailto:, tel:, and javascript: links
- Skips document links (.pdf, .docx, .xlsx, .pptx)
- You can tweak the regex and logic in the Queue & Dedup Links (Code) node
📌 Limitations
- No JavaScript rendering (static HTML only)
- No authentication, cookie, or session handling
- Large sites can be slow or hit timeouts; chunking mitigates response size
✅ Example Use Cases
- Extract text across your site for AI ingestion / embeddings
- SEO/content audits and internal link checks
- Build a lightweight page corpus for downstream processing in n8n
⏱️ Estimated Setup Time
~10 minutes (import → set webhook → test request)