Convert Tour PDFs to Vector Database using Google Drive, LangChain & OpenAI
🧩 Workflow: Process Tour PDF from Google Drive to Pinecone Vector DB with OpenAI Embeddings
Overview
This workflow automates extracting tour information from PDF files stored in a Google Drive folder, cleaning and vectorizing the extracted text, and storing the vectors in a Pinecone vector database for efficient querying. This is especially useful for building AI-powered search or recommendation systems for travel packages.
Setup:
Prerequisites
A folder in Google Drive with PDF tour package brochures
Pinecone account + API key
OpenAI API key
n8n cloud or self-hosted instance
Workflow Setup Steps
Trigger
Node: Manual Trigger (When clicking 'Test workflow')
Used for manual testing and execution of the workflow.
Google Drive Integration
Step 1: Store Tour Packages in PDF Format
Upload your curated tour package brochures, covering the tours, activities, and sightseeing on offer, in PDF format into a designated Google Drive folder.
Step 2: Search Folder
Node: PDF Tour Package Folder (Google Drive)
This node searches the designated folder for files (filter by MIME type = application/pdf if needed).
Step 3: Download PDFs
Node: Download Package Files (Google Drive)
Downloads each matching PDF file found in the previous step.
Process Each PDF File
Step 4: Loop Through Files
Node: Loop Over each PDF file
Iterates through each downloaded PDF file to extract, clean, split, and embed.
Data Preparation & Embedding
Step 5: Data Loader
Node: Data Loader
Reads each PDF’s content using a compatible loader. It passes clean raw text to the next node.
Often integrated with document loaders like pdf-loader, Unstructured, or pdfplumber.
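Raw PDF extraction typically leaves hyphenated line breaks and ragged whitespace, so a small cleanup pass before splitting improves chunk quality. A minimal sketch of what such cleaning might look like outside n8n (the function name and exact rules here are illustrative, not part of the Data Loader node):

```python
import re

def clean_pdf_text(raw: str) -> str:
    """Normalize whitespace in text extracted from a PDF."""
    # Rejoin words hyphenated across line breaks: "sight-\nseeing" -> "sightseeing"
    text = re.sub(r"-\n(?=\w)", "", raw)
    # Fold single line breaks into spaces, but keep blank-line paragraph breaks
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()

raw = "Day 1: City sight-\nseeing tour.\nIncludes   lunch.\n\nDay 2: Beach day."
print(clean_pdf_text(raw))
# -> Day 1: City sightseeing tour. Includes lunch.
#
#    Day 2: Beach day.
```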
Step 6: Recursive Text Splitter
Node: Recursive Character Text Splitter
Splits large blocks of text into manageable segments using overlapping-window logic (e.g., 1,000 characters with a 50-character overlap).
This ensures contextual preservation for long documents during embedding.
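The overlapping-window idea can be shown in plain Python. This is a simplified stand-in for LangChain's Recursive Character Text Splitter, which additionally prefers splitting at paragraph and sentence boundaries rather than at fixed offsets:

```python
def split_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 50) -> list[str]:
    """Slide a fixed-size window over the text, stepping forward by
    (chunk_size - overlap) so consecutive chunks share context."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# 300 characters, 100-char chunks, 10-char overlap -> 4 chunks
chunks = split_with_overlap("abcdefghij" * 30, chunk_size=100, overlap=10)
print(len(chunks))  # -> 4
```

The overlap means the last 10 characters of each chunk reappear at the start of the next, so a sentence cut at a boundary still appears whole in one of the two chunks.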
Step 7: Generate Embeddings
Node: Embeddings OpenAI
Uses OpenAI's text-embedding-3-small model to vectorize the split chunks.
Outputs vector representations for each content chunk.
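In code, the real call would go through the openai package's embeddings endpoint with model="text-embedding-3-small". The sketch below uses a deterministic stub embedder in its place (so it runs without an API key) purely to show the shape of the output the node produces per chunk:

```python
import hashlib

def fake_embed(text: str, dims: int = 8) -> list[float]:
    """Deterministic stand-in for text-embedding-3-small
    (the real model returns 1536-dimensional float vectors)."""
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255.0 for b in digest[:dims]]

def embed_chunks(chunks: list[str], embed_fn=fake_embed) -> list[dict]:
    """Pair each text chunk with its vector, one record per chunk."""
    return [{"text": c, "vector": embed_fn(c)} for c in chunks]

records = embed_chunks(["Day 1: city tour", "Day 2: beach"])
print(len(records), len(records[0]["vector"]))  # -> 2 8
```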
Store in Pinecone
Step 8: Pinecone Vector Store
Node: Pinecone Vector Store - Store...
Stores each embedding along with its metadata (source PDF name, chunk ID, etc.).
This becomes the basis for fast, semantic search via RAG workflows or agents.
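Conceptually, each write to Pinecone carries an id, the vector values, and a metadata dict. A sketch of assembling those records (field names are illustrative; an actual write would pass such records to the Pinecone client's index upsert call):

```python
def build_upsert_records(pdf_name: str, chunks: list[str],
                         vectors: list[list[float]]) -> list[dict]:
    """Assemble Pinecone-style records: a unique id per chunk,
    the embedding values, and metadata for filtering and display."""
    return [
        {
            "id": f"{pdf_name}#chunk-{i}",
            "values": vec,
            "metadata": {"source": pdf_name, "chunk_id": i, "text": chunk},
        }
        for i, (chunk, vec) in enumerate(zip(chunks, vectors))
    ]

records = build_upsert_records(
    "bali-tour.pdf", ["Day 1 ...", "Day 2 ..."], [[0.1, 0.2], [0.3, 0.4]]
)
print(records[0]["id"])  # -> bali-tour.pdf#chunk-0
```

Keeping the original chunk text in the metadata is what lets a downstream RAG workflow show the matching passage, not just a similarity score.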
🛠️ Tools & Nodes Used
Google Drive (Search & Download)
Searches for all PDF files in a specified Google Drive folder and downloads each file for processing.
SplitInBatches (Loop Over Items)
Loops through each file found in the folder, ensuring each is processed individually.
Default Data Loader (LangChain)
Reads and extracts text from the PDF files.
Recursive Character Text Splitter (LangChain)
Splits the extracted text into manageable chunks for embedding.
OpenAI Embeddings (LangChain)
Converts each text chunk into a vector using OpenAI's embedding model.
Pinecone Vector Store (LangChain)
Stores the resulting vectors in a Pinecone index for fast similarity search and querying.
🔗 Workflow Steps Explained
Trigger: The workflow starts manually for testing, or can be scheduled.
Google Drive Search: Finds all PDF files in the specified folder.
Loop Over Files: Each file is processed one at a time using the SplitInBatches node.
Download File: Downloads the current PDF file from Google Drive.
Extract Text: The Default Data Loader node reads the PDF and extracts its text content.
Text Splitting: The Recursive Character Text Splitter breaks the text into chunks (e.g., 1,000 characters with a 50-character overlap) to optimize embedding quality.
Vectorization: Each chunk is sent to the OpenAI Embeddings node to generate vector representations.
Store in Pinecone: The vectors are inserted into a Pinecone index, making them available for semantic search and recommendations.
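At query time, the semantic search this index enables reduces to nearest-neighbour lookup: embed the user's question with the same model, then rank stored vectors by cosine similarity. Pinecone does this at scale with approximate search; the core computation it approximates looks like this (toy 2-D vectors for illustration):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec: list[float], stored: list[tuple], k: int = 2) -> list[str]:
    """Return the ids of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), item_id) for item_id, vec in stored]
    return [item_id for _, item_id in sorted(scored, reverse=True)[:k]]

stored = [("beach-day", [0.9, 0.1]), ("city-tour", [0.1, 0.9]), ("hiking", [0.7, 0.7])]
print(top_k([1.0, 0.0], stored, k=2))  # -> ['beach-day', 'hiking']
```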
🚀 What Can Be Improved in the Next Version?
Error Handling: Add error-handling nodes to manage failed downloads or extraction issues gracefully.
File Type Filtering: Ensure only PDF files are processed by adding a filter node.
Metadata Storage: Store additional metadata (e.g., file name, tour ID) alongside vectors in Pinecone for richer search results.
Parallel Processing: Optimize for large folders by processing multiple files in parallel (with care for API rate limits).
Automated Triggers: Replace the manual trigger with a time-based or webhook trigger for full automation.
Data Validation: Add checks to ensure extracted text contains valid tour data before vectorization.
User Feedback: Integrate notifications (e.g., email or Slack) to report when processing is complete or if issues arise.
💡 Summary
This workflow demonstrates how n8n can orchestrate a powerful AI data pipeline using Google Drive, LangChain, OpenAI, and Pinecone. It’s a great foundation for building intelligent search or recommendation features for travel and tour data.