Automate Document Ingestion & RAG System with Google Drive, Sheets & OpenAI

  1. Overview

The IngestionDocs workflow is a fully automated document ingestion and knowledge management system built with n8n. Its purpose is to continuously ingest organizational documents from Google Drive, transform them into vector embeddings using OpenAI, store them in Pinecone, and make them searchable and retrievable through an AI-powered Q&A interface.

This ensures that employees always have access to the most up-to-date knowledge base without requiring manual intervention.

  2. Key Objectives

Automated Ingestion → Seamlessly process new and updated documents from Google Drive.
Change Detection → Track and differentiate between new, updated, and previously processed documents.
Knowledge Base Construction → Convert documents into embeddings for semantic search.
AI-Powered Assistance → Provide an intelligent Q&A system for employees to query manuals.
Scalable & Maintainable → Modular design using n8n, LangChain, and Pinecone.

  3. Workflow Breakdown

A. Document Monitoring and Retrieval

The workflow begins with two Google Drive triggers:
File Created Trigger → Fires when a new document is uploaded.
File Updated Trigger → Fires when an existing document is modified.
A search operation then lists the files in the designated Google Drive folder.
Non-downloadable items (e.g., subfolders) are filtered out.
Each remaining file is downloaded, and a SHA256 hash of its content is generated. The hash acts as a fingerprint: any change to the file's bytes produces a different hash.
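
The hashing step can be sketched in Python as follows. This is an illustrative sketch, not the workflow's actual node code: the function name `sha256_of_stream` and the block size are assumptions, and the in-memory byte streams stand in for files downloaded from Google Drive.

```python
import hashlib
import io
from typing import BinaryIO


def sha256_of_stream(stream: BinaryIO, block_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a binary stream, read block by block."""
    digest = hashlib.sha256()
    # Reading in blocks keeps memory use flat even for large documents.
    for block in iter(lambda: stream.read(block_size), b""):
        digest.update(block)
    return digest.hexdigest()


# Hypothetical file contents standing in for downloaded documents.
original = sha256_of_stream(io.BytesIO(b"Employee manual v1"))
same = sha256_of_stream(io.BytesIO(b"Employee manual v1"))
edited = sha256_of_stream(io.BytesIO(b"Employee manual v2"))

print(original == same)    # identical bytes give identical hashes
print(original == edited)  # a one-character edit gives a different hash
```

Because the hash depends only on the bytes, renaming a file without editing it leaves the hash unchanged.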

B. Record Management (Google Sheets Integration)

To keep track of ingestion states, the workflow uses a Google Sheets-based Record Manager.
Each file entry contains:
Id (Google Drive file ID)
Name (file name)
hashId (SHA256 checksum)
The workflow compares the current file's hash with the stored one:
New Document → File not found in records → Inserted into the Record Manager.
Already Processed → File exists and hash matches → Skipped.
Updated Document → File exists but hash differs → Record is updated.

This guarantees that only new or modified content is processed, avoiding duplication.
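
The three-way decision above can be sketched as a small Python function. This is a simplified model under stated assumptions: a dictionary keyed by file ID stands in for the Google Sheet, and the names `Record`, `Status`, and `classify` are hypothetical, not identifiers from the workflow.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class Status(Enum):
    NEW = "new"
    ALREADY_PROCESSED = "already_processed"
    UPDATED = "updated"


@dataclass
class Record:
    id: str       # Google Drive file ID
    name: str     # file name
    hash_id: str  # SHA256 checksum of the file content


def classify(records: Dict[str, Record], file_id: str, name: str, hash_id: str) -> Status:
    """Compare a file against the record store and update the store in place."""
    existing = records.get(file_id)
    if existing is None:
        # File not found in records: insert it.
        records[file_id] = Record(file_id, name, hash_id)
        return Status.NEW
    if existing.hash_id == hash_id:
        # Same content as last time: nothing to do.
        return Status.ALREADY_PROCESSED
    # Known file with different content: refresh the stored record.
    existing.name = name
    existing.hash_id = hash_id
    return Status.UPDATED


records: Dict[str, Record] = {}
first = classify(records, "file-1", "manual.pdf", "aaa")
second = classify(records, "file-1", "manual.pdf", "aaa")
third = classify(records, "file-1", "manual.pdf", "bbb")
print(first, second, third)
```

Only the NEW and UPDATED outcomes would continue to the vectorization stage; ALREADY_PROCESSED ends the run for that file.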

C. Document Processing and Vectorization

Once a document is marked as new or updated:
The Default Data Loader extracts its content (binary files are supported).
Pages are split into individual chunks.
Metadata such as the file ID and name is attached to each chunk.
The Recursive Character Text Splitter divides the content into manageable segments with overlap, so that text near a boundary appears in both neighboring segments.
OpenAI Embeddings (text-embedding-3-large) transform each text chunk into a semantic vector.
The Pinecone Vector Store stores these vectors in the configured index:
For new documents, embeddings are inserted into a namespace based on the file name.
For updated documents, the namespace is cleared first, then re-ingested with fresh embeddings.

This process builds a scalable and queryable knowledge base.
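
The chunking and namespace handling can be illustrated with a simplified Python sketch. Several things here are assumptions made for illustration: a fixed-size sliding window replaces the Recursive Character Text Splitter (which additionally prefers to break on separators such as paragraphs), a plain dictionary stands in for the Pinecone index, the embedding call is omitted, and the function names and the default sizes are hypothetical.

```python
from typing import Dict, List


def split_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Cut text into fixed-size chunks where each chunk repeats the tail of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


def ingest(store: Dict[str, List[dict]], file_id: str, file_name: str, text: str,
           is_update: bool, chunk_size: int = 1000, overlap: int = 200) -> int:
    """Store chunks under a namespace named after the file; clear it first on update."""
    if is_update:
        store.pop(file_name, None)  # drop stale chunks from the previous version
    namespace = store.setdefault(file_name, [])
    for index, chunk in enumerate(split_with_overlap(text, chunk_size, overlap)):
        namespace.append({
            "text": chunk,  # a real pipeline would store the embedding vector as well
            "metadata": {"file_id": file_id, "file_name": file_name, "chunk": index},
        })
    return len(namespace)


store: Dict[str, List[dict]] = {}
print(split_with_overlap("abcdefghij", chunk_size=4, overlap=2))
print(ingest(store, "file-1", "manual.pdf", "abcdefghij", is_update=False, chunk_size=4, overlap=2))
print(ingest(store, "file-1", "manual.pdf", "abcdef", is_update=True, chunk_size=4, overlap=2))
```

Clearing the namespace before re-ingesting matters because an edited document may produce fewer chunks than before; without the clear, leftover chunks from the old version would remain searchable.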

D. Knowledge Base Q&A Interface

The workflow also provides an interactive form-based user interface:
Form Trigger → Collects employee questions.
LangChain AI Agent → Receives the question, retrieves relevant context from Pinecone using vector similarity search, and generates the response with the OpenAI Chat Model (gpt-4.1-mini).
Answer Formatting → Responses are returned in HTML format for readability and styled with a custom CSS theme. Answers may include references to page numbers when available.

This creates a self-service knowledge base assistant that employees can query in natural language.
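
The retrieval step relies on vector similarity search. The following Python sketch shows the idea with cosine similarity over hand-written toy vectors. It is only an illustration: the similarity metric actually used depends on how the Pinecone index is configured, real embeddings from text-embedding-3-large have far more dimensions, and the names `cosine_similarity` and `top_k` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float], chunks: List[Tuple[str, Sequence[float]]], k: int = 2) -> List[str]:
    """Return the texts of the k chunks whose vectors are most similar to the query."""
    ranked = sorted(chunks, key=lambda item: cosine_similarity(query, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy three-dimensional "embeddings", invented for this example.
chunks = [
    ("vacation policy", [1.0, 0.0, 0.0]),
    ("expense reports", [0.0, 1.0, 0.0]),
    ("leave requests", [0.9, 0.1, 0.0]),
]
question_vector = [1.0, 0.0, 0.0]

context = top_k(question_vector, chunks, k=2)
print(context)
```

The retrieved chunk texts are what the agent passes to the chat model as context; the model never sees the vectors themselves.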

  4. Technologies Used

n8n → Orchestration of the entire workflow.
Google Drive API → File monitoring, listing, and downloading.
Google Sheets API → Record manager for tracking file states.
OpenAI API → text-embedding-3-large for semantic vector creation; gpt-4.1-mini for conversational Q&A.
Pinecone → Vector database for embedding storage and retrieval.
LangChain → Document loaders, text splitters, vector store connectors, and agent logic.
Crypto (SHA256) → File hash generation for change detection.
Form Trigger + Form Node → Employee-facing Q&A submission and answer display.
Custom CSS → Provides a modern, responsive, styled UI for the knowledge base.

  5. End-to-End Data Flow

Employee uploads or updates a document → Google Drive detects the change.
Workflow downloads and hashes the file → The hash identifies the content and reveals modifications.
Record Manager (Google Sheets) → Decides whether to skip, insert, or update the record.
Document Processing → Splitting + Embedding + Storing into Pinecone.
Knowledge Base Updated → The latest version of documents is indexed.
Employee asks a question via the web form.
AI Agent retrieves the most relevant chunks from Pinecone + uses gpt-4.1-mini → Generates a contextual answer.
Answer displayed in styled HTML → Delivered back to the employee through the form interface.

  6. Benefits

Always Up-to-Date → Automatically syncs documents when uploaded or changed.
No Duplicates → SHA256 hashing ensures that only new or changed documents are reprocessed.
Searchable Knowledge Base → Employees can query documents semantically, not just by keywords.
Enhanced Productivity → Answers are immediate, reducing time spent browsing manuals.
Scalable → New documents and users can be added without workflow redesign.

✅ In summary, IngestionDocs is a robust AI-driven document ingestion and retrieval system that integrates Google Drive, Google Sheets, OpenAI, and Pinecone within n8n. It continuously builds and maintains a knowledge base of manuals while offering employees an intelligent, user-friendly Q&A assistant for fast and accurate knowledge retrieval.

Complexity: intermediate
Author: Mohamed Abdelwahab
Created: 9/10/2025
Updated: 9/24/2025
