Orchestrate Web Crawls with Scrapyd and Automated Data Enrichment
How it works This workflow runs a spider job in the background via Scrapyd, using a YAML config that defines selectors and parsing rules. When triggered, it schedules the spider with parameters (query, project ID, page limits, etc.). The workflow polls Scrapyd until the job finishes. Once complete, it fetches the output items, enriches them (parse JSONL, deduplicate, extract ID/part number/make/model/part name, normalize price), sorts results, and returns structured JSON. Optional debug outputs such as logs, HTML dumps, and screenshots are also collected.
How to use Use the manual trigger for testing, or replace it with webhook, schedule, or other triggers. Adjust the runtime parameters (q, project_id, pages, etc.) directly in the workflow when running. The background spider config (YAML and spider code) must be updated separately — this workflow only orchestrates and enriches results, it does not define scraping logic.
Requirements
Scrapyd service for job scheduling & status tracking
A deployed spider with a valid YAML config (adjust selectors there)
JSON Lines output (items.jl) from the spider
Endpoints for optional artifacts (logs, HTML, screenshots)
n8n with HTTP, Wait, Code, and Aggregate nodes enabled
Customising this workflow
Update the YAML config if the target website structure changes
Modify the enrichment code to extract different fields (e.g., categories, ratings)
Adjust deduplication (cheapest, newest, or other logic)
Toggle debug retrieval depending on performance/storage needs
Extend webhook response to integrate with databases, APIs, or downstream workflows
Related Templates
Send structured logs to BetterStack from any workflow using HTTP Request
Send structured logs to BetterStack from any workflow using HTTP Request Who is this for? This workflow is perfect for...
Provide latest euro exchange rates from European Central Bank via Webhook
What is this workflow doing? This simple workflow is pulling the latest Euro foreign exchange reference rates from the E...
Convert Tour PDFs to Vector Database using Google Drive, LangChain & OpenAI
🧩 Workflow: Process Tour PDF from Google Drive to Pinecone Vector DB with OpenAI Embeddings Overview This workflow au...
🔒 Please log in to import templates to n8n and favorite templates
Workflow Visualization
Loading...
Preparing workflow renderer
Comments (0)
Login to post comments