Compare GPT-4, Claude & Gemini Responses with Contextual AI's LMUnit Evaluation
PROBLEM
Evaluating and comparing responses from multiple LLMs (OpenAI, Claude, Gemini) can be challenging when done manually.
Each model produces outputs that differ in clarity, tone, and reasoning structure.
Traditional evaluation metrics like ROUGE or BLEU fail to capture nuanced quality differences.
Human evaluations are inconsistent, slow, and difficult to scale.
This workflow automates LLM response quality evaluation using Contextual AI’s LMUnit, a natural language unit testing framework that provides systematic, fine-grained feedback on response clarity and conciseness.
> Note: LMUnit offers natural language-based evaluation with a 1–5 scoring scale, enabling consistent and interpretable results across different model outputs.
How it works
A chat trigger node collects responses from multiple LLMs such as OpenAI GPT-4.1, Claude 4.5 Sonnet, and Gemini 2.5 Flash.
Each model receives the same input prompt to ensure a fair comparison; the responses are then aggregated and paired with each test case.
We use Contextual AI's LMUnit node to evaluate each response using predefined quality criteria:
“Is the response clear and easy to understand?” - Clarity
“Is the response concise and free from redundancy?” - Conciseness
LMUnit then produces an evaluation score (1–5) for each test.
Results are aggregated and formatted into a structured summary showing model-wise performance and overall averages (see the sketch below).
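To make the evaluation step concrete, here is a minimal Python sketch of what the LMUnit node does for each response, written against Contextual AI's REST API. It assumes the `/v1/lmunit` endpoint accepts `query`, `response`, and `unit_test` fields and returns a 1–5 `score`; confirm the exact parameters against the LMUnit API reference. The prompt and model outputs in the usage example are placeholders.

```python
# Minimal sketch of the per-response evaluation and aggregation performed in the
# workflow. Assumes the /v1/lmunit endpoint takes query/response/unit_test and
# returns a "score" (1-5); see the LMUnit API reference for the exact contract.
import os
from statistics import mean

import requests

API_URL = "https://api.contextual.ai/v1/lmunit"  # assumed endpoint, per the API reference
HEADERS = {"Authorization": f"Bearer {os.environ['CONTEXTUALAI_API_KEY']}"}

UNIT_TESTS = {
    "clarity": "Is the response clear and easy to understand?",
    "conciseness": "Is the response concise and free from redundancy?",
}

def score_response(prompt: str, response: str, unit_test: str) -> float:
    """Run one natural-language unit test against one model response."""
    payload = {"query": prompt, "response": response, "unit_test": unit_test}
    r = requests.post(API_URL, json=payload, headers=HEADERS, timeout=60)
    r.raise_for_status()
    return r.json()["score"]  # 1-5 scale

def evaluate_models(prompt: str, model_outputs: dict[str, str]) -> dict:
    """Score every model on every criterion and compute per-model averages."""
    summary = {}
    for model, output in model_outputs.items():
        scores = {name: score_response(prompt, output, test)
                  for name, test in UNIT_TESTS.items()}
        summary[model] = {**scores, "average": mean(scores.values())}
    return summary

if __name__ == "__main__":
    prompt = "Explain what a vector database is in two sentences."
    outputs = {
        "gpt-4.1": "...",            # response from the OpenAI node
        "claude-4.5-sonnet": "...",  # response from the Anthropic node
        "gemini-2.5-flash": "...",   # response from the Gemini node
    }
    print(evaluate_models(prompt, outputs))
```

In the workflow itself this logic is handled by the LMUnit node and the aggregation nodes; the sketch only shows the shape of the request and of the per-model summary.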
How to set up
Create a free Contextual AI account and obtain your CONTEXTUALAI_API_KEY.
In your n8n instance, add this key as a credential under “Contextual AI.”
Obtain and add credentials for each model provider you wish to test:
OpenAI API Key: platform.openai.com/account/api-keys
Anthropic API Key: console.anthropic.com/settings/keys
Gemini API Key: ai.google.dev/gemini-api/docs/api-key
Start sending prompts through the chat interface to automatically generate model outputs and evaluations.
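If you drive the same comparison from a script rather than the n8n chat interface, the keys above would typically live in environment variables. The variable names below are this sketch's own convention; inside n8n they are stored as credentials instead.

```python
import os

# Hypothetical variable names for a scripted run; n8n keeps these as credentials.
REQUIRED_KEYS = [
    "CONTEXTUALAI_API_KEY",
    "OPENAI_API_KEY",
    "ANTHROPIC_API_KEY",
    "GEMINI_API_KEY",
]
missing = [key for key in REQUIRED_KEYS if not os.environ.get(key)]
if missing:
    raise SystemExit(f"Missing API keys: {', '.join(missing)}")
```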
How to customize the workflow
Add more evaluation criteria (e.g., factual accuracy, tone, completeness) in the LMUnit test configuration (see the sketch after this list).
Include additional LLM providers by duplicating the response generation nodes.
Adjust thresholds and aggregation logic to suit your evaluation goals.
Enhance the final summary formatting for dashboards, tables, or JSON exports.
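As an example of the first and third customizations, the sketch below extends the unit tests from the earlier example and adds a simple pass/fail threshold over the per-model averages produced by `evaluate_models`. The extra criterion wordings and the 4.0 cut-off are illustrative choices, not LMUnit defaults.

```python
# Illustrative customization: a richer set of natural-language unit tests and a
# simple threshold over the per-model averages returned by evaluate_models()
# from the earlier sketch. Wordings and the 4.0 cut-off are examples only.
UNIT_TESTS = {
    "clarity": "Is the response clear and easy to understand?",
    "conciseness": "Is the response concise and free from redundancy?",
    "factual_accuracy": "Are the claims in the response factually correct?",
    "tone": "Is the tone professional and appropriate for the audience?",
    "completeness": "Does the response fully address every part of the question?",
}

def passes_threshold(summary: dict[str, dict[str, float]],
                     minimum: float = 4.0) -> dict[str, bool]:
    """Flag which models meet a minimum average score across all criteria."""
    return {model: scores["average"] >= minimum
            for model, scores in summary.items()}
```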
For detailed API parameters, refer to the LMUnit API reference.
If you have feedback or need support, please email feedback@contextual.ai.