Transform Websites into a Conversational Knowledge Base with OpenAI RAG & Supabase
Overview
This advanced automation workflow combines deep web scraping with Retrieval-Augmented Generation (RAG) to transform websites into intelligent, queryable knowledge bases. The system recursively crawls target websites, extracts content, and indexes all data in a vector database for conversational AI access.
How the system works
Intelligent Web Scraping and RAG Pipeline
- **Recursive Web Scraper** - Automatically crawls every accessible page of a target website
- **Data Extraction** - Collects text, metadata, emails, links, and PDF documents
- **Supabase Integration** - Stores content in PostgreSQL tables for scalability
- **RAG Vectorization** - Generates embeddings and stores them for semantic search
- **AI Query Layer** - Connects embeddings to an AI chat engine with citations
- **Error Handling** - Automatically re-triggers failed URLs
Setup Instructions
Estimated setup time: 30-45 minutes
Prerequisites
- Self-hosted n8n instance (v0.200.0 or higher)
- Supabase account and project (PostgreSQL enabled)
- OpenAI/Gemini/Claude API key for embeddings and chat
- Optional: external vector database (Pinecone, Qdrant)
Detailed configuration steps
Step 1: Supabase configuration
- **Project creation**: New Supabase project with PostgreSQL enabled
- **Credential generation**: API keys (anon key and service_role key) and connection string
- **Security configuration**: RLS policies according to your access requirements
Step 2: Connect Supabase to n8n
- **Configure the Supabase node**: Add credentials under n8n Credentials
- **Test the connection**: Verify with a simple query (see the sketch below)
- **Configure PostgreSQL**: Direct connection for advanced operations
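For reference, a minimal connection test with the supabase-js client might look like this. The `pages` table is the one created in Step 3; the environment variable names are assumptions, so adapt them to your credential setup:

```typescript
// Minimal Supabase connection test using the supabase-js client.
// SUPABASE_URL and SUPABASE_SERVICE_ROLE_KEY come from the credentials
// generated in Step 1; the `pages` table is created in Step 3.
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
);

async function testConnection(): Promise<void> {
  const { data, error } = await supabase.from("pages").select("url").limit(1);
  if (error) throw new Error(`Supabase connection failed: ${error.message}`);
  console.log("Connection OK, sample row:", data);
}

testConnection();
```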
Step 3: Preparing the database
- **Main tables**:
  - `pages`: URLs, content, metadata, scraping status
  - `documents`: extracted and processed PDF files
  - `embeddings`: vectors for semantic search
  - `links`: link graph for navigation
- **Management functions**: Scripts to reactivate failed URLs and manage retries (a sketch follows)
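A hypothetical reactivation helper, assuming the `pages` table tracks `status` and `retry_count` columns (both illustrative names, not confirmed by the workflow):

```typescript
// Reset failed pages so the scraper picks them up again on the next run.
// Column names (`status`, `retry_count`) are assumptions; adapt them
// to your actual schema.
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
);

async function reactivateFailedUrls(maxRetries = 3): Promise<number> {
  const { data, error } = await supabase
    .from("pages")
    .update({ status: "pending" })
    .eq("status", "failed")
    .lt("retry_count", maxRetries)
    .select("id");
  if (error) throw error;
  return data?.length ?? 0; // number of URLs queued for retry
}
```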
Step 4: Configuring automation
- **Recursive scraper**: Starting URL, crawling depth, CSS selectors
- **HTTP extraction**: User-Agent, headers, timeouts, and retry policies (see the fetch sketch below)
- **Supabase persistence**: Batch insertion, data validation, duplicate handling
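As a sketch of those HTTP extraction settings, here is a polite page fetch with a custom User-Agent and timeout. The bot name and header values are placeholders; retries are handled separately in Step 5:

```typescript
// Fetch a page with explicit headers and a hard timeout via AbortController.
// Works in Node 18+ where fetch is built in.
async function fetchPage(url: string, timeoutMs = 15_000): Promise<string> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(url, {
      headers: {
        // Placeholder identity; use your own bot name and contact URL.
        "User-Agent": "KnowledgeBaseBot/1.0 (+https://example.com/bot)",
        "Accept": "text/html,application/pdf",
      },
      signal: controller.signal,
    });
    if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
    return await res.text();
  } finally {
    clearTimeout(timer);
  }
}
```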
Step 5: Error handling and re-executions
- **Failure monitoring**: Automatic detection of failed URLs
- **Manual triggers**: Selective re-execution by domain or date
- **Recovery sub-workflows**: Retry logic with exponential backoff (sketched below)
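The retry logic can be expressed as a small exponential-backoff wrapper, sketched here around the hypothetical `fetchPage` helper from Step 4:

```typescript
// Retry an async task with exponential backoff: 1s, 2s, 4s, ...
// plus a little jitter so parallel retries don't fire in lockstep.
async function withBackoff<T>(
  task: () => Promise<T>,
  maxAttempts = 4,
  baseDelayMs = 1_000
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (err) {
      if (attempt + 1 >= maxAttempts) throw err; // give up after maxAttempts
      const delay = baseDelayMs * 2 ** attempt + Math.random() * 250;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

// Usage: await withBackoff(() => fetchPage("https://example.com"));
```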
Step 6: RAG processing
- **Embedding generation**: Text-embedding models with intelligent chunking
- **Vector storage**: Supabase pgvector or an external vector database
- **Conversational engine**: Connection to chat models with source citations (see the indexing sketch below)
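A minimal sketch of the indexing step, assuming OpenAI's `text-embedding-3-small` model and an `embeddings` table with a pgvector column. The chunk size and overlap are illustrative defaults, not the workflow's actual settings:

```typescript
// Chunk page text, embed each chunk with OpenAI, and store the vectors
// in the `embeddings` table backed by pgvector.
import OpenAI from "openai";
import { createClient } from "@supabase/supabase-js";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
);

// Fixed-size sliding-window chunking with overlap to preserve context
// across chunk boundaries; sizes here are illustrative defaults.
function chunkText(text: string, size = 1000, overlap = 200): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += size - overlap) {
    chunks.push(text.slice(i, i + size));
  }
  return chunks;
}

async function indexPage(pageId: number, text: string): Promise<void> {
  const chunks = chunkText(text);
  const { data } = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: chunks, // batch all chunks in one request
  });
  const rows = data.map((d) => ({
    page_id: pageId,          // assumed foreign key to `pages`
    chunk: chunks[d.index],   // original text of this chunk
    embedding: d.embedding,   // pgvector column
  }));
  const { error } = await supabase.from("embeddings").insert(rows);
  if (error) throw error;
}
```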
Data structure
Main Supabase tables

| Table | Content | Usage |
|-------|---------|-------|
| pages | URLs, HTML content, metadata | Main storage for scraped content |
| documents | PDF files, extracted text | Downloaded and processed documents |
| embeddings | Vectors, text chunks | Semantic search and RAG |
| links | Link graph, navigation | Relationships between pages |
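For orientation, the rows of the two central tables might be typed roughly as follows; any column not named in the table above is an assumption:

```typescript
// Illustrative row shapes mirroring the tables above.
interface PageRow {
  id: number;
  url: string;
  html: string;
  metadata: Record<string, unknown>;
  status: "pending" | "done" | "failed"; // assumed scraping status values
}

interface EmbeddingRow {
  id: number;
  page_id: number;     // assumed foreign key to `pages`
  chunk: string;       // text chunk the vector was computed from
  embedding: number[]; // stored as pgvector
}
```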
Use cases
**Business and enterprise**
- Competitive intelligence with conversational querying
- Market research across complex web domains
- Compliance monitoring and regulatory watch

**Research and academia**
- Literature extraction with semantic search
- Building datasets from fragmented sources

**Legal and technical**
- Scraping legal repositories with intelligent queries
- Technical documentation transformed into a conversational assistant
Key features
**Advanced scraping**
- Recursive crawling with automatic link discovery
- Multi-format extraction (HTML, PDF, emails)
- Intelligent error handling and retries

**Intelligent RAG**
- Contextual embeddings for semantic search
- Multi-document queries with citations
- Intuitive conversational interface

**Performance and scalability**
- Processing of thousands of pages per execution
- Embedding cache for fast responses
- Scalable architecture with Supabase

Technical Architecture
Main flow: Target URL → Recursive scraping → Content extraction → Supabase storage → Vectorization → Conversational interface
Supported types: HTML pages, PDF documents, metadata, links, emails
Performance specifications
- **Capacity**: 10,000+ pages per run
- **Response time**: under 5 seconds for RAG queries
- **Accuracy**: over 90% relevance for specific domains
- **Scalability**: distributed architecture via Supabase
Advanced configuration
**Customization**
- Crawling depth and scope controls
- Domain and content-type filters
- Chunking settings to optimize RAG

**Monitoring**
- Real-time monitoring in Supabase
- Cost and performance metrics
- Detailed conversation logs