Compare Different LLM Responses Side-by-Side with Google Sheets
This workflow allows you to easily evaluate and compare the outputs of two language models (LLMs) before choosing one for production.
In the chat interface, both model outputs are shown side by side. Their responses are also logged into a Google Sheet, where they can be evaluated manually or automatically using a more advanced model.
Use Case
You're developing an AI agent, and since LLMs are non-deterministic, you want to determine which one performs best for your specific use case. This template is designed to help you compare them effectively.
How It Works
The user sends a message to the chat interface. The input is duplicated and sent to two different LLMs. Each model processes the same prompt independently, using its own memory context. Their answers, along with the user input and previous context, are logged to Google Sheets. You can review, compare, and evaluate the model outputs manually (or automate it later). In the chat, both responses are also shown one after the other for direct comparison.
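For illustration only, here is a minimal TypeScript sketch of the fan-out step outside n8n, assuming an OpenAI-compatible chat-completions endpoint. API_URL, askModel, and the model names are placeholders, not nodes or settings from the template.

```typescript
// Sketch: send the same user message to two models and collect both answers.
// Everything here (endpoint, env var, model IDs) is an illustrative assumption.

type ChatResult = { model: string; answer: string };

const API_URL = "https://api.example.com/v1/chat/completions"; // hypothetical endpoint
const API_KEY = process.env.LLM_API_KEY ?? "";

async function askModel(model: string, userMessage: string): Promise<ChatResult> {
  const res = await fetch(API_URL, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${API_KEY}`,
    },
    body: JSON.stringify({
      model,
      messages: [{ role: "user", content: userMessage }],
    }),
  });
  const data = await res.json();
  return { model, answer: data.choices?.[0]?.message?.content ?? "" };
}

async function compare(userMessage: string): Promise<ChatResult[]> {
  // Both models receive the identical prompt and run independently.
  const [a, b] = await Promise.all([
    askModel("model-a", userMessage),
    askModel("model-b", userMessage),
  ]);
  // In the workflow, this row (input + both answers) is appended to the Google Sheet.
  console.log([userMessage, a.answer, b.answer]);
  return [a, b];
}
```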
How To Use It
Copy this Google Sheets template (File > Make a Copy). Set up your System Prompt and Tools in the AI Agent node to suit your use case. Start chatting! Each message will trigger both models and log their responses to the spreadsheet.
Note: This version is set up for two models. If you want to compare more, you’ll need to extend the workflow logic and update the sheet.
About Models
You can use OpenRouter or Vertex AI to test models across providers.
If you're using a node for a specific provider, like OpenAI, you can compare different models from that provider (e.g., gpt-4.1 vs gpt-4.1-mini).
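The two model IDs are ultimately just configuration. The pairs below are hypothetical examples of a same-provider and a cross-provider comparison; the exact IDs depend on your provider or aggregator.

```typescript
// Illustrative model pairs only; swap in whichever IDs your provider exposes.
const SAME_PROVIDER_PAIR: [string, string] = ["gpt-4.1", "gpt-4.1-mini"];
// Aggregator-style "provider/model" IDs (e.g. via OpenRouter); check availability.
const CROSS_PROVIDER_PAIR: [string, string] = ["openai/gpt-4.1", "another-provider/another-model"];
```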
Evaluation in Google Sheets
This is ideal for teams, allowing non-technical stakeholders (not just data scientists) to evaluate responses based on real-world needs.
Advanced users can automate this evaluation using a more capable model (like o3 from OpenAI), but note that this will increase token usage and cost.
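As a rough sketch of what automated evaluation could look like (this is not part of the template itself), a stronger "judge" model can be asked to compare the two logged answers; the endpoint, model name, and scoring rubric below are placeholders.

```typescript
// Sketch: ask a more capable judge model which of two logged answers is better.
// The verdict could then be written into an extra column of the same sheet.

async function judge(question: string, answerA: string, answerB: string): Promise<string> {
  const prompt = [
    "You are grading two answers to the same question.",
    `Question: ${question}`,
    `Answer A: ${answerA}`,
    `Answer B: ${answerB}`,
    "Reply with 'A', 'B', or 'tie', plus a one-sentence justification.",
  ].join("\n");

  const res = await fetch("https://api.example.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.LLM_API_KEY ?? ""}`,
    },
    body: JSON.stringify({
      model: "judge-model", // e.g. a more capable model such as o3
      messages: [{ role: "user", content: prompt }],
    }),
  });
  const data = await res.json();
  return data.choices?.[0]?.message?.content ?? "";
}
```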
Token Considerations
Since each input is processed by two different models, the workflow will consume more tokens overall.
Keep an eye on usage, especially if working with longer prompts or running multiple evaluations, as this can impact cost.
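For a rough sense of the overhead, here is back-of-the-envelope math using assumed averages (500 prompt tokens and 300 reply tokens per model); your real numbers will differ.

```typescript
// Illustrative token math only; the figures are assumptions, not measurements.
const promptTokens = 500;
const replyTokens = 300;
const modelsCompared = 2;

// One model: 500 + 300 = 800 tokens per message.
// Two models: 2 * (500 + 300) = 1600 tokens per message,
// plus whatever a judge model consumes if evaluation is automated.
const tokensPerMessage = modelsCompared * (promptTokens + replyTokens);
console.log(`~${tokensPerMessage} tokens per chat message`); // ~1600
```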