Domain-Specific Web Content Crawler with Depth Control & Text Extraction

This template implements a recursive web crawler inside n8n. Starting from a given URL, it crawls linked pages up to a maximum depth (default: 3), extracts text and links, and returns the collected content via webhook.

🚀 How It Works

  1. Webhook Trigger
    Accepts a JSON body with a url field.
    Example payload:

    { "url": "https://example.com" }

  2. Initialization
    Sets crawl parameters: url, domain, maxDepth = 3, and depth = 0.
    Initializes global static data (pending, visited, queued, pages).

  3. Recursive Crawling
    Fetches each page (HTTP Request).
    Extracts body text and links (HTML node).
    Cleans and deduplicates links.
    Filters out:
    External domains (only same-site links are followed)
    Anchors (#) and mailto:, tel:, javascript: links
    Non-HTML files (.pdf, .docx, .xlsx, .pptx)

  4. Depth Control & Queue
    Tracks visited URLs
    Stops at maxDepth to prevent infinite loops
    Uses the SplitInBatches node to loop over the queue

  5. Data Collection
    Saves each crawled page (url, depth, content) into pages[]
    When pending = 0, combines results

  6. Output
    Responds via the Webhook node with:
    combinedContent (all pages concatenated)
    pages[] (array of individual results)
    Large results are chunked when exceeding ~12,000 characters
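The link filtering and depth control in steps 3–4 can be sketched in plain JavaScript. This is a standalone sketch based on the description above, not the template's actual Code node: inside n8n the sets would live in `$getWorkflowStaticData('global')`, and the exact function and parameter names here are assumptions.

```javascript
// Sketch of the queue/dedup decision from steps 3–4.
// visited/queued stand in for the workflow's global static data.
const MAX_DEPTH = 3;

// link: raw absolute URL found on a page
// depth: the depth the linked page would have if crawled
// Returns true and marks the link as queued, or false to skip it.
function shouldQueue(link, baseDomain, depth, visited, queued) {
  if (depth >= MAX_DEPTH) return false;                    // depth cutoff
  if (/^(mailto:|tel:|javascript:|#)/i.test(link)) return false;
  let url;
  try { url = new URL(link); } catch { return false; }     // not a valid URL
  // Treat apex and www as the same site.
  const host = url.hostname.replace(/^www\./, '');
  if (host !== baseDomain.replace(/^www\./, '')) return false;
  // Skip non-HTML document links.
  if (/\.(pdf|docx|xlsx|pptx)$/i.test(url.pathname)) return false;
  const key = url.origin + url.pathname;                   // dedup ignoring #hash
  if (visited.has(key) || queued.has(key)) return false;
  queued.add(key);                                         // mark as queued (side effect)
  return true;
}
```

In the workflow itself, pages passing this check are appended to the pending queue that SplitInBatches loops over.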

🛠️ Setup Instructions

  1. Import Template
    Load from n8n Community Templates.

  2. Configure Webhook
    Open the Webhook node
    Copy the Test URL (development) or Production URL (after deploy)
    You’ll POST crawl requests to this endpoint

  3. Run a Test
    Send a POST with JSON:

    curl -X POST https://<your-n8n>/webhook/<id> \
      -H "Content-Type: application/json" \
      -d '{"url": "https://example.com"}'

  4. View Response
    The crawler returns a JSON object containing combinedContent and pages[].

⚙️ Configuration

**maxDepth**
Default: 3. Adjust in the Init Crawl Params (Set) node.

**Timeouts**
HTTP Request node timeout is 5 seconds per request; increase if needed.

**Filtering Rules**
Only same-domain links are followed (apex and www treated as same-site)
Skips anchors, mailto:, tel:, javascript:
Skips document links (.pdf, .docx, .xlsx, .pptx)
You can tweak the regex and logic in the Queue & Dedup Links (Code) node
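Before those filtering rules apply, raw hrefs from the HTML node need to be resolved to absolute URLs and deduplicated. A minimal sketch of that cleanup step, with the function name and exact behavior assumed rather than taken from the template:

```javascript
// Sketch: resolve raw hrefs against the page URL and deduplicate.
// hrefs: array of href strings as extracted by the HTML node
// pageUrl: absolute URL of the page they were found on
function normalizeLinks(hrefs, pageUrl) {
  const seen = new Set();
  const out = [];
  for (const href of hrefs) {
    let abs;
    try { abs = new URL(href, pageUrl); } catch { continue; } // skip unparseable hrefs
    abs.hash = '';                 // drop #anchors so /page and /page#top dedup together
    const key = abs.toString();
    if (!seen.has(key)) { seen.add(key); out.push(key); }
  }
  return out;
}
```

Relative links ("/a", "a", "./a") and anchored variants of the same page all collapse to one absolute URL here, which keeps the queue small.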

📌 Limitations

No JavaScript rendering (static HTML only)
No authentication/cookies/session handling
Large sites can be slow or hit timeouts; chunking mitigates response size
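The response-size chunking mentioned above can be sketched as a simple fixed-size split. The ~12,000-character threshold comes from the description; the function name and exact splitting behavior are assumptions about what the workflow's Code node does:

```javascript
// Sketch: split combinedContent into ~12,000-character chunks
// so a single webhook response stays a manageable size.
const CHUNK_SIZE = 12000;

function chunkContent(combinedContent, size = CHUNK_SIZE) {
  const chunks = [];
  for (let i = 0; i < combinedContent.length; i += size) {
    chunks.push(combinedContent.slice(i, i + size));
  }
  return chunks;
}
```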

✅ Example Use Cases

Extract text across your site for AI ingestion / embeddings
SEO/content audit and internal link checks
Build a lightweight page corpus for downstream processing in n8n

⏱️ Estimated Setup Time

~10 minutes (import → set webhook → test request)

Author: Le Nguyen
Created: 9/28/2025
Updated: 11/17/2025
