Reduce LLM Costs with Semantic Caching using Redis Vector Store and HuggingFace
Stop Paying for the Same Answer Twice
Your LLM is answering the same questions over and over. "What's the weather?" "How's the weather today?" "Tell me about the weather." Same answer, three API calls, triple the cost. This workflow fixes that.
What Does It Do?
Semantic caching with superpowers. When someone asks a question, it checks if you've answered something similar before. Not exact matches—semantic similarity. If it finds a match, boom, instant cached response. No LLM call, no cost, no waiting.
First time: "What's your refund policy?" → Calls LLM, caches answer
Next time: "How do refunds work?" → Instant cached response (it knows these are the same!)
Result: Faster responses + way lower API bills
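The trick is comparing meanings, not exact strings. Here's a minimal sketch of that comparison in plain Python (the embedding model and the cosine-distance helper are illustrative; the workflow itself does this with its Hugging Face embeddings and Redis Vector Store nodes):

```python
# Minimal sketch: "is this the same question, worded differently?"
# Assumes the sentence-transformers package; model choice is illustrative.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    # 0.0 = identical meaning, larger = less similar
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

q1 = model.encode("What's your refund policy?")
q2 = model.encode("How do refunds work?")
q3 = model.encode("What's the weather today?")

print(cosine_distance(q1, q2))  # small distance -> same question, serve the cached answer
print(cosine_distance(q1, q3))  # large distance -> different question, call the LLM
```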
The Flow
1. A question comes in through the chat interface
2. Vector search checks Redis for semantically similar past questions
3. Smart decision: cache hit? Return instantly. Cache miss? Ask the LLM.
4. New answers get cached automatically for next time
5. Conversation memory keeps context across the whole chat
It's like having a really smart memo pad that understands meaning, not just exact words.
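If you'd rather picture the flow as code than as nodes, here's a rough Python equivalent. The index name (`qa_cache`), key prefix, field names, chat model, and embedding model are all assumptions for the illustration; in the real workflow these steps are wired together with n8n's Redis Vector Store, Hugging Face, and OpenAI nodes:

```python
# Sketch of the cache-or-call flow, under the assumptions noted above.
import numpy as np
import redis
from redis.commands.search.query import Query
from openai import OpenAI
from sentence_transformers import SentenceTransformer

r = redis.Redis()
llm = OpenAI()
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

DISTANCE_THRESHOLD = 0.3  # mirrors the workflow's distanceThreshold

def answer(question: str) -> str:
    vec = embedder.encode(question).astype(np.float32).tobytes()

    # 1. Vector search: find the nearest cached question in Redis
    query = (
        Query("*=>[KNN 1 @embedding $vec AS distance]")
        .sort_by("distance")
        .return_fields("answer", "distance")
        .dialect(2)
    )
    result = r.ft("qa_cache").search(query, query_params={"vec": vec})

    # 2. Cache hit: a similar enough question was already answered -> no LLM call
    if result.docs and float(result.docs[0].distance) <= DISTANCE_THRESHOLD:
        return result.docs[0].answer

    # 3. Cache miss: ask the LLM
    reply = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    ).choices[0].message.content

    # 4. Cache the new answer for next time (key scheme is illustrative)
    r.hset(f"qa_cache:{abs(hash(question))}", mapping={
        "question": question,
        "answer": reply,
        "embedding": vec,
    })
    return reply
```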
Quick Start
You'll need:
- OpenAI API key (for the chat model)
- Hugging Face API key (for embeddings)
- Redis 8.x (for vector magic)
Get it running:
1. Drop in your credentials
2. Hit the chat interface
3. Watch your API costs drop as the cache fills up
That's it. No complex setup, no configuration hell.
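If you're curious what the Redis side looks like, here's a rough sketch of the kind of vector index involved. The index name, key prefix, and 384-dimension schema are assumptions tied to the example embedding model above; the n8n Redis Vector Store node manages its own index for you:

```python
# One-time setup sketch: a Redis vector index for cached Q&A pairs.
import redis
from redis.commands.search.field import TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType

r = redis.Redis()
r.ft("qa_cache").create_index(
    fields=[
        TextField("question"),
        TextField("answer"),
        VectorField("embedding", "FLAT", {
            "TYPE": "FLOAT32",
            "DIM": 384,               # matches all-MiniLM-L6-v2 embeddings
            "DISTANCE_METRIC": "COSINE",
        }),
    ],
    definition=IndexDefinition(prefix=["qa_cache:"], index_type=IndexType.HASH),
)
```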
Tune It Your Way
The distanceThreshold in the "Analyze results from store" node is your control knob:
- Lower (0.2): Strict matching, fewer false positives, more LLM calls
- Higher (0.5): Loose matching, more cache hits, occasional weird matches
- Default (0.3): Sweet spot for most use cases
Play with it. Find what works for your questions.
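In code terms, the knob is just a distance cutoff (assuming cosine distance, as in the sketches above, where lower means more similar):

```python
# Illustrative: how the threshold turns a distance into a hit/miss decision.
DISTANCE_THRESHOLD = 0.3

def is_cache_hit(distance: float) -> bool:
    # 0.0 would be an identical question; anything under the threshold
    # is treated as "same question, different words".
    return distance <= DISTANCE_THRESHOLD

print(is_cache_hit(0.12))  # True  -> near-duplicate question, serve from cache
print(is_cache_hit(0.45))  # False -> falls through to the LLM
```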
Hack It Up
Some ideas to get you started:
Add TTL**: Make cached answers expire after a day/week/month Category filters**: Different caches for different topics Confidence scores**: Show users when they got a cached vs fresh answer Analytics dashboard**: Track cache hit rates and cost savings Multi-language**: Cache works across languages (embeddings are multilingual!) Custom embeddings**: Swap OpenAI for local models or other providers
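The TTL idea, for example, amounts to putting an expiry on the cached keys. A tiny redis-py sketch (the key name is illustrative):

```python
import redis

r = redis.Redis()
CACHE_TTL_SECONDS = 7 * 24 * 60 * 60  # expire cached answers after a week

# After writing a cached answer, let Redis evict it automatically once it goes stale.
r.expire("qa_cache:12345", CACHE_TTL_SECONDS)
```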
Real Talk 💡
When it shines:
- Customer support (same questions, different words)
- Documentation chatbots (limited knowledge base)
- FAQ systems (obvious use case)
- Internal tools (repetitive queries)
When to skip it:
- Real-time data queries (stock prices, weather, etc.)
- Highly personalized responses
- Questions that need fresh context every time
Pro tip: Start with a higher threshold (0.4-0.5) and tighten it as you see what gets cached. Better to cache too much at first than miss obvious matches.
Built with n8n, Redis, Hugging Face, and OpenAI. Open source, self-hosted, completely under your control.