Migrate large Hugging Face datasets to MongoDB with a looping subworkflow
This n8n template provides a production-ready, memory-safe pipeline for ingesting large Hugging Face datasets into MongoDB using batch pagination.
It is designed as a reusable data ingestion layer for RAG systems, recommendation engines, analytics pipelines, and ML workflows.
The template includes:
- A main workflow that orchestrates pagination and looping
- A subworkflow that fetches dataset rows, sanitizes them, and inserts them into MongoDB safely
🚀 What This Template Does
- Fetches rows from a Hugging Face dataset using the datasets-server API
- Processes data in configurable batches (offset + length)
- Removes Hugging Face _id fields to avoid MongoDB duplicate key errors
- Inserts clean documents into MongoDB
- Automatically loops until all dataset rows are ingested
- Handles large datasets without memory overflow
🧩 Architecture Overview
Main Workflow (Orchestrator)
- Starts the ingestion process
- Defines dataset, batch size, and MongoDB collection
- Repeatedly calls the subworkflow until no rows remain

Subworkflow (Batch Processor)
- Fetches a single batch of rows from Hugging Face
- Splits rows into individual items
- Removes _id fields
- Inserts documents into MongoDB
- Returns batch statistics to the main workflow
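
The subworkflow's steps can be sketched in Python as follows. This is an illustration, not the template's actual node code: `process_batch`, the `insert_many` callable, and the statistics field names (`fetched`, `inserted`, `next_offset`) are hypothetical, and the response shape (a `rows` array whose entries wrap the record in a `row` key) is an assumption about the datasets-server payload.

```python
from typing import Any, Callable, Dict, List

Doc = Dict[str, Any]


def process_batch(
    payload: Dict[str, Any],
    offset: int,
    insert_many: Callable[[List[Doc]], None],
) -> Dict[str, int]:
    """Split one response into items, drop _id, insert, and report statistics."""
    # Assumed response shape: {"rows": [{"row_idx": 0, "row": {...}}, ...]}
    items = [entry["row"] for entry in payload.get("rows", [])]
    # Remove the source _id so the database can assign its own.
    docs = [{k: v for k, v in item.items() if k != "_id"} for item in items]
    if docs:
        insert_many(docs)  # hypothetical stand-in for the MongoDB insert
    return {
        "fetched": len(items),
        "inserted": len(docs),
        "next_offset": offset + len(items),
    }


if __name__ == "__main__":
    # Made-up payload and an in-memory list instead of a real collection.
    sample = {"rows": [{"row_idx": 0, "row": {"_id": "a", "name": "first"}},
                       {"row_idx": 1, "row": {"name": "second"}}]}
    stored: List[Doc] = []
    print(process_batch(sample, 0, stored.extend))
```

Returning the next offset with the counts lets the caller decide whether to loop again without inspecting the rows themselves.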
🔁 Workflow Logic (High-Level)
1. Set initial configuration:
   - Dataset name
   - Split (train, test, etc.)
   - Batch size
   - Offset
2. Fetch rows from Hugging Face
3. If rows exist:
   - Split rows into items
   - Remove _id
   - Insert into MongoDB
   - Increase offset
4. Repeat until no rows are returned
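
The loop above can be sketched as a small Python function. It is a minimal illustration under stated assumptions: `run_ingestion` and the `run_batch(offset, length)` callable are hypothetical names, with `run_batch` standing in for one subworkflow call that returns how many rows it processed.

```python
from typing import Callable


def run_ingestion(
    run_batch: Callable[[int, int], int],
    batch_size: int = 100,
    start_offset: int = 0,
) -> int:
    """Call run_batch repeatedly until it reports that no rows were returned."""
    offset = start_offset
    total = 0
    while True:
        processed = run_batch(offset, batch_size)  # one batch at a time
        if processed == 0:
            break  # an empty batch means the dataset is exhausted
        total += processed
        offset += processed  # advance past the rows just handled
    return total


if __name__ == "__main__":
    # Simulated dataset of 250 rows, so the sketch runs without network access.
    dataset_size = 250
    fake_batch = lambda off, length: max(0, min(length, dataset_size - off))
    print(run_ingestion(fake_batch, batch_size=100))
```

Because only one batch is held at a time, memory use depends on the batch size rather than on the size of the dataset.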
📦 Default Configuration
| Parameter | Default Value |
|---------|--------------|
| Dataset | MongoDB/airbnb_embeddings |
| Config | default |
| Split | train |
| Batch Size | 100 |
| MongoDB Collection | airbnb |
All values can be changed easily from the Config_Start node.
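
As a rough illustration of how these defaults map onto a request, the sketch below builds a datasets-server URL in Python. The `/rows` path and the query parameter names are assumptions based on the public datasets-server API rather than on this workflow's JSON, and `IngestConfig` and `rows_url` are hypothetical names.

```python
from dataclasses import dataclass
from urllib.parse import urlencode

# Assumed endpoint; only the host name appears in the template description.
ROWS_ENDPOINT = "https://datasets-server.huggingface.co/rows"


@dataclass
class IngestConfig:
    """Defaults taken from the configuration table above."""
    dataset: str = "MongoDB/airbnb_embeddings"
    config: str = "default"
    split: str = "train"
    batch_size: int = 100
    collection: str = "airbnb"


def rows_url(cfg: IngestConfig, offset: int) -> str:
    """Build the URL for one batch starting at the given offset."""
    query = urlencode({
        "dataset": cfg.dataset,
        "config": cfg.config,
        "split": cfg.split,
        "offset": offset,
        "length": cfg.batch_size,
    })
    return f"{ROWS_ENDPOINT}?{query}"


if __name__ == "__main__":
    print(rows_url(IngestConfig(), offset=0))
```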
🛠 Prerequisites
- n8n (self-hosted or cloud)
- MongoDB (local or hosted)
- MongoDB credentials configured in n8n
- Internet access to datasets-server.huggingface.co
▶️ How to Use
1. Import the workflow JSON into n8n
2. Configure MongoDB credentials in the MongoDB node
3. Update dataset parameters if needed:
   - Dataset name
   - Split
   - Batch size
   - Collection name
4. Run the workflow using the Manual Trigger
5. Monitor execution until completion
🧠 Why _id Is Removed
Hugging Face dataset rows often include an _id field.
MongoDB requires _id values to be unique, so reusing these values can cause insertion failures.
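This template:
- Removes the Hugging Face _id
- Lets MongoDB generate its own ObjectId
- Prevents duplicate key errors
- Allows re-runs and incremental ingestion without insertion failures (a re-run inserts the same rows again as new documents unless you add upsert logic)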
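
The effect can be demonstrated without a database. In the Python sketch below, `FakeCollection` is a hypothetical in-memory stand-in that only mimics the unique `_id` rule; it is not a MongoDB driver, and the sample row is made up.

```python
from typing import Any, Dict


class FakeCollection:
    """In-memory stand-in that mimics MongoDB's unique _id rule."""

    def __init__(self) -> None:
        self.docs: Dict[Any, Dict[str, Any]] = {}
        self._counter = 0

    def insert_one(self, doc: Dict[str, Any]) -> Any:
        stored = dict(doc)  # copy so the caller's dict is not modified
        if "_id" not in stored:
            # Stand-in for the ObjectId MongoDB would generate.
            self._counter += 1
            stored["_id"] = f"generated-{self._counter}"
        if stored["_id"] in self.docs:
            raise ValueError(f"duplicate key: {stored['_id']!r}")
        self.docs[stored["_id"]] = stored
        return stored["_id"]


def strip_source_id(row: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the row without its top-level _id field."""
    return {k: v for k, v in row.items() if k != "_id"}


if __name__ == "__main__":
    row = {"_id": "hf-1", "name": "sample listing"}  # made-up sample row

    with_id = FakeCollection()
    with_id.insert_one(row)
    try:
        with_id.insert_one(row)  # a second run reuses the same _id
    except ValueError as err:
        print("re-run failed:", err)

    without_id = FakeCollection()
    without_id.insert_one(strip_source_id(row))
    without_id.insert_one(strip_source_id(row))  # succeeds, stores a second copy
    print("documents stored:", len(without_id.docs))
```

The second half shows the trade-off: stripping `_id` removes the failure, but repeated runs then accumulate duplicate documents.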
🔍 Ideal Use Cases
✅ RAG (Retrieval-Augmented Generation)
- Store dataset content as source documents
- Add embeddings later using OpenAI, Mistral, or local models
- Connect MongoDB to a vector database or hybrid search

✅ Recommendation Systems
- Build item catalogs from public datasets
- Use embeddings or metadata for similarity search
- Combine with user behavior data downstream

✅ ML & Analytics Pipelines
- Centralize dataset ingestion
- Normalize data before training or analysis
⚙️ Recommended Enhancements
You can easily extend this template with:
- **Upsert logic** using a deterministic hash (idempotent ingestion)
- **Embedding generation** before or after insertion
- **Schema validation** or field filtering
- **Rate-limit handling & backoff**
- **Parallel ingestion** for faster processing
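
One possible shape for the upsert enhancement is sketched below in Python. It is an assumption about how a deterministic hash could be applied, not part of the template: `content_key` and `upsert_spec` are hypothetical helpers, and the hash covers the whole document except `_id`.

```python
import hashlib
import json
from typing import Any, Dict, Tuple


def content_key(doc: Dict[str, Any]) -> str:
    """Stable SHA-256 hex digest of a document's content, ignoring _id."""
    payload = {k: v for k, v in doc.items() if k != "_id"}
    # Sorted keys and fixed separators make the JSON text independent of key order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def upsert_spec(doc: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Build a (filter, update) pair suitable for an upsert keyed on the hash."""
    payload = {k: v for k, v in doc.items() if k != "_id"}
    return {"_id": content_key(doc)}, {"$setOnInsert": payload}


if __name__ == "__main__":
    flt, update = upsert_spec({"_id": "hf-1", "name": "sample listing"})
    # With a driver such as PyMongo this pair could be passed as
    # collection.update_one(flt, update, upsert=True)
    print(flt["_id"][:12], update)
```

Because the key is derived from the content, running the same batch twice targets the same document instead of creating a second copy.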
⚠️ Notes & Best Practices
- Reduce batch size if you encounter memory limits
- Verify dataset license before production use
- Add indexes in MongoDB for faster downstream querying
- Use upserts if you plan to re-run ingestion frequently
📄 License & Disclaimer
This workflow template is provided as-is.
You are responsible for:
- Dataset licensing compliance
- Infrastructure costs
- Downstream data usage
Hugging Face datasets are subject to their respective licenses.
⭐ Template Summary
Category: Data Ingestion
Complexity: Intermediate
Scalability: High
Memory Safe: Yes
Production Ready: Yes