How to Build and Run AI Evaluations in N8N

This guide walks you through setting up AI evaluation workflows in N8N, running scored tests against a ground truth dataset, and using evaluation results to improve agent quality and reduce costs. It is intended for automation builders and agency operators who want objective, measurable quality control over their AI workflows.


Prerequisites

  • Access to an N8N account (cloud or self-hosted)
  • An OpenRouter account with API credentials configured in N8N
  • A basic understanding of N8N workflows and agent nodes
  • A dataset of question-and-answer pairs to use as ground truth (at least 20 rows recommended for getting started)

Step 1: Review the Webinar Overview Slide

Before building anything, familiarise yourself with the four main topics this guide covers: what AI evaluations are, N8N's built-in evaluation nodes, output drift, and using evaluation scores as workflow logic. Understanding these concepts upfront will help you make better decisions as you configure each part of the workflow. Keep these four areas in mind as a framework throughout the setup process.

Overview slide showing the four main topics covered in the AI evaluations guide

Step 2: Review the Five Built-In N8N Evaluation Metric Types

N8N includes five built-in evaluation metrics you can use to score your AI agent's output. These are:

  • Categorisation: checks if the output exactly matches an expected class, returning a score of 0 or 1
  • Correctness: uses an AI judge to assess factual accuracy, scored 1 to 5
  • Helpfulness: assesses whether the response was genuinely useful to the user, scored 1 to 5
  • String Similarity: measures how closely the output text matches a reference answer, scored 0 to 1
  • Tools Used: checks whether the agent called the correct tools in the right situations

For most client-facing AI workflows, Correctness is the recommended starting point, as it provides a nuanced, AI-judged score that reflects real-world quality.
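
To make the two deterministic metrics more concrete, here is a minimal Python sketch of how a 0-or-1 categorisation score and a 0-to-1 string similarity score could be computed. It is an illustration only: this guide does not show N8N's own implementation, the function names are hypothetical, and `difflib.SequenceMatcher` is used as a stand-in similarity measure, so the numbers may differ from what N8N reports.

```python
from difflib import SequenceMatcher


def categorisation_score(actual: str, expected: str) -> int:
    """Return 1 if the output matches the expected class exactly, else 0."""
    # Assumption: surrounding whitespace is ignored before comparing.
    return int(actual.strip() == expected.strip())


def string_similarity_score(actual: str, expected: str) -> float:
    """Return a ratio between 0.0 (nothing in common) and 1.0 (identical)."""
    # Stand-in measure for illustration; N8N's own algorithm may differ.
    return SequenceMatcher(None, actual, expected).ratio()
```

The AI-judged metrics (Correctness and Helpfulness) cannot be reproduced this way, because their scores come from a language model rather than a fixed formula.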

Slide listing the five built-in N8N evaluation metric types with descriptions

Step 3: Understand the Four Key Evaluation Nodes

An N8N evaluation workflow is built using four specific nodes, each with a distinct role:

  • Evaluation Trigger — replaces the normal workflow trigger and pulls each row from your test dataset one at a time, supporting both Google Sheets and N8N Data Tables
  • Check If Evaluating — a branching node that routes traffic into evaluation mode during a test run, and into normal production mode otherwise
  • Set Output — writes the agent's response back into the dataset alongside the expected answer so you can compare them row by row
  • Set Metrics — runs the AI judge against the expected and actual answers to produce a score; the model you choose here should never be changed between evaluation runs, as this will affect comparability

Understanding how these nodes connect will help you build and troubleshoot the workflow confidently.
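
The data flow through these nodes can be sketched as a plain Python loop. This is a conceptual model rather than N8N code: the `agent` and `judge` callables and the dictionary keys (`question`, `expected_answer`, `actual_answer`, `score`) are hypothetical names chosen for the example. The Check If Evaluating branch is left out because the function only models the evaluation path.

```python
from typing import Callable


def run_evaluation(
    rows: list[dict],
    agent: Callable[[str], str],
    judge: Callable[[str, str], float],
) -> float:
    """Mimic one evaluation run and return the average score."""
    scores = []
    for row in rows:  # Evaluation Trigger: one dataset row at a time
        answer = agent(row["question"])  # the agent under test
        row["actual_answer"] = answer  # Set Output: store next to the expected answer
        score = judge(row["expected_answer"], answer)  # Set Metrics: judge scores it
        row["score"] = score
        scores.append(score)
    # An empty dataset yields 0.0 rather than a division error.
    return sum(scores) / len(scores) if scores else 0.0
```

Passing the same `judge` into every call mirrors the advice above about keeping the judge model fixed so that runs stay comparable.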

Diagram showing the four N8N evaluation nodes and how they connect in a workflow

Step 4: Review the Three-Zone Scoring Threshold Logic

Once your evaluation workflow produces a Correctness score, you can use that score as branching logic to route responses automatically. Set up three zones using an If node placed immediately after the Set Metrics node:

  • Score 4 or above (High Confidence): the response is auto-approved and sent directly to the client or the next step in the workflow, with no human review required
  • Score from 2.5 up to, but not including, 4 (Borderline): the response is flagged and routed to a human reviewer via Slack, email, or another notification channel; the reviewer can approve it or provide a corrected answer
  • Score below 2.5 (Low Quality): the response is blocked from being sent; the workflow can either automatically retry with a stricter prompt or escalate to a team member for manual handling

This turns your evaluation from a passive measurement tool into active quality control that runs inside every workflow execution.
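
The three zones reduce to two comparisons. The sketch below expresses the same logic in Python; the function name and the returned labels are hypothetical, and in N8N the equivalent would be conditions on an If node (or a Switch node) rather than code.

```python
def route_by_score(score: float) -> str:
    """Map a 1-to-5 judge score to one of the three routing zones."""
    if score >= 4:
        return "auto_approve"  # high confidence: send without review
    if score >= 2.5:
        return "human_review"  # borderline: flag for a reviewer
    return "block"  # low quality: retry or escalate
```

Checking the highest threshold first means every score lands in exactly one zone, including values such as 3.95 that sit between the listed ranges.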

Slide showing the three scoring zones: high confidence, borderline, and low quality with routing logic

Step 5: Open the FAQ Evaluation Workflow in N8N

Open the FAQ evaluation workflow in your N8N account. You should see an Evaluation Trigger node connected to the FAQ agent, followed by an evaluation agent node and a routing branch. This is the complete flow you will configure and run throughout the remaining steps. Take a moment to trace the path from the trigger through to the routing output so you understand how data moves through the workflow.

N8N canvas showing the FAQ evaluation workflow with trigger, agent, evaluation agent, and routing nodes

Step 6: Open the FAQ Evaluation Data Table

Navigate to Overview > Data Tables in your N8N account and open the FAQ evaluation data table. You should see 20 rows, each containing a question in one column and its correct expected answer in another. This is your ground truth dataset — the Evaluation Trigger node will pull from this table row by row when a test run is initiated. Confirm all 20 rows are present and that both the question and expected answer columns are populated before proceeding.
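
If you keep the dataset as a CSV file, the same completeness check can be scripted before you import or re-import it. The sketch below is one possible approach; the column names `question` and `expected_answer` are assumptions, so adjust them to match the headers in your own file.

```python
import csv
import io


def check_dataset(csv_text: str, required=("question", "expected_answer")) -> list[str]:
    """Return a list of problems found in the dataset; an empty list means it looks complete."""
    reader = csv.DictReader(io.StringIO(csv_text))
    headers = reader.fieldnames or []
    problems = [f"missing column: {col}" for col in required if col not in headers]
    if problems:
        return problems  # cannot check rows without the required columns
    for index, row in enumerate(reader, start=1):
        for col in required:
            # Short rows yield None for missing fields, so treat None as empty.
            if not (row[col] or "").strip():
                problems.append(f"data row {index}: empty {col}")
    return problems
```

To use it, read the file's contents into a string and pass that string to `check_dataset`; an empty result means every row has both a question and an expected answer.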

N8N Data Tables view showing the FAQ evaluation table with 20 rows of questions and expected answers

Step 7: Review the Evaluation Agent's Scoring Prompt

Open the evaluation agent node and inspect its system prompt. It should instruct the AI to act as an expert factual evaluator, compare the agent's output against the ground truth answer, and return a score from 1 to 5 using clearly defined criteria at each level. The prompt should also specify a structured JSON output containing the score, the reasoning behind it, and the actual answer provided by the agent. Confirm the scoring criteria are clearly defined before running any evaluations, as this prompt drives all scoring decisions.
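
Because every downstream decision depends on the judge's structured output, it can help to validate that output before trusting it. The following Python sketch shows one way to do so. The field names `score`, `reasoning`, and `actual_answer` are assumptions based on the description above, so check them against the JSON structure defined in your own prompt.

```python
import json

REQUIRED_FIELDS = {"score", "reasoning", "actual_answer"}  # assumed field names


def parse_judge_output(raw: str) -> dict:
    """Parse the judge's JSON reply and check that it has a valid 1-to-5 score."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"judge output is missing fields: {sorted(missing)}")
    score = float(data["score"])
    if not 1 <= score <= 5:
        raise ValueError(f"score {score} is outside the 1 to 5 range")
    data["score"] = score  # normalise to a float for later comparisons
    return data
```

A reply that fails validation is better treated as an error than silently scored, since a malformed score would distort the run's average.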

Evaluation agent node open in N8N showing the system prompt with scoring criteria from 1 to 5 and JSON output structure

Step 8: Run the First Evaluation With No FAQ Data in the Prompt

For this baseline run, ensure the FAQ agent's system prompt contains no FAQ answers, so the agent has no specific knowledge to draw from. Navigate to the Evaluations tab within the workflow and click Run Evaluation. N8N will process each of the 20 test questions one by one, which may take a couple of minutes to complete. This baseline run is expected to produce a low Correctness score, which confirms the evaluation system is measuring accurately and gives you a starting point to improve from.

N8N Evaluations tab showing the Run Evaluation button and the first evaluation in progress

Step 9: Review the First Evaluation Correctness Score

Once the first run completes, check the Correctness score displayed in the evaluations panel. A score of around 1.5 is expected at this stage, confirming the agent is performing poorly without any FAQ knowledge in its prompt. This low score is not a problem; it is your baseline. Record this result so you can compare it against subsequent runs after making improvements.

N8N evaluations panel showing the first evaluation result with a Correctness score of 1.5

Step 10: Improve the Agent Prompt by Adding the FAQ Knowledge Base

Open the FAQ agent node and paste the full set of 40 FAQ question-and-answer pairs into the system prompt. This gives the agent the specific knowledge it needs to answer the evaluation questions correctly. Once you have pasted in the FAQ content, save the workflow before proceeding to the next step. Do not run the evaluation again yet — you must first clear the previous results from the data table.

FAQ agent node open in N8N showing the system prompt updated with 40 FAQ question and answer pairs

Step 11: Clear Previous Evaluation Results From the Data Table

Before running a second evaluation, navigate back to the FAQ evaluation data table and delete all values in the actual answer column that were populated during the first run. N8N determines which rows to evaluate based on the actual answer field being empty — if these values are not cleared, the evaluation trigger will skip those rows and your results will be incomplete or inaccurate. You will need to delete each value individually. The question and expected answer columns should remain untouched.
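
The row-selection behaviour described in this step can be modelled in a few lines of Python. This is a sketch of the logic only, operating on an in-memory copy of the table; it is not an N8N API, and the `actual_answer` key is a hypothetical column name.

```python
def pending_rows(rows: list[dict]) -> list[dict]:
    """Return the rows that a run would pick up: those with an empty actual answer."""
    return [row for row in rows if not (row.get("actual_answer") or "").strip()]


def clear_actual_answers(rows: list[dict]) -> None:
    """Blank the actual answer on every row, leaving all other columns untouched."""
    for row in rows:
        row["actual_answer"] = ""
```

If `pending_rows` returns fewer rows than the table holds, some actual answers from an earlier run are still present and those rows would be skipped.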

N8N data table showing the actual answer column being cleared row by row before the second evaluation run

Step 12: Run the Second Evaluation With the Improved Prompt

With the data table cleared and the updated FAQ knowledge base now in the agent's system prompt, navigate back to the Evaluations tab and click Run Evaluation again. This time, because the agent has access to the correct answers, you should expect a Correctness score approaching 4.9. Allow the run to complete fully before reviewing the results.

N8N evaluations tab showing the second evaluation run in progress with the improved FAQ prompt

Step 13: Review the Second Evaluation Score and Token Usage

Once the second run completes, review the results in the evaluations panel. You should see a significantly higher Correctness score alongside an increase in token usage, because the larger system prompt (containing all 40 FAQs) requires more tokens to process. This trade-off between accuracy and cost is exactly what evaluations are designed to help you measure and optimise. Note the completion tokens, prompt tokens, total tokens, and execution time for comparison in the next steps.

N8N evaluations panel showing the second run results with a higher Correctness score and increased token usage

Step 14: Change the FAQ Agent to a Cheaper Model (GPT-4.1 Nano)

Open the FAQ agent node and change the AI model from Claude Sonnet 4.6 to GPT-4.1 Nano via OpenAI. This tests whether a cheaper, faster model can match or exceed the quality of the more expensive one when the prompt is well-structured. After changing the model, save the workflow, then clear the actual answer values from the data table again before running the next evaluation.

FAQ agent node in N8N showing the model being changed from Claude Sonnet 4.6 to GPT-4.1 Nano via OpenAI

Step 15: Run the Third Evaluation With GPT-4.1 Nano

After confirming the data table has been cleared of previous actual answer values, navigate to the Evaluations tab and click Run Evaluation with GPT-4.1 Nano selected as the agent model. This run should complete noticeably faster than the previous ones. Allow it to finish fully before comparing results.

N8N evaluations tab showing the third evaluation run in progress using GPT-4.1 Nano

Step 16: Review the Cost Comparison Across Evaluation Runs

With all three runs now complete, compare the Correctness scores and execution times side by side in the evaluations panel. GPT-4.1 Nano should show a Correctness score equal to or higher than that of Claude Sonnet 4.6, while running faster and at a significantly lower cost. This is the core insight that evaluations provide: objective data to justify model selection decisions rather than relying on assumptions about which model is "better".

N8N evaluations panel showing all three evaluation runs side by side with Correctness scores and execution times

Step 17: Check Model Costs in OpenRouter Logs

Open your OpenRouter account and navigate to the Logs section. Filter by model to view the per-request costs for each model used across your evaluation runs: Claude Sonnet 4.6, GPT-4.1 Nano, and the free Nvidia Nemotron model. You should be able to see clearly that GPT-4.1 Nano costs a fraction of what Sonnet costs per request, while the Nemotron model runs at no cost at all. Use this data alongside your Correctness scores to make an informed decision about which model to use in production.
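
One way to combine the two data sources is to pick the cheapest model whose score clears a quality bar you set. The Python sketch below illustrates that decision rule. The `RunResult` structure, its field names, and the `min_score` threshold are all hypothetical; fill them with the scores from your evaluation runs and the costs from your own logs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunResult:
    model: str
    correctness: float  # average judge score for the run, 1 to 5
    cost_per_request: float  # taken from your provider's logs


def cheapest_acceptable(runs: list[RunResult], min_score: float) -> Optional[RunResult]:
    """Return the lowest-cost run that meets the quality bar, or None if none does."""
    acceptable = [run for run in runs if run.correctness >= min_score]
    return min(acceptable, key=lambda run: run.cost_per_request, default=None)
```

Filtering on quality before comparing cost keeps a free but inaccurate model from winning purely on price.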

OpenRouter logs view showing per-request costs for Claude Sonnet, GPT-4.1 Nano, and Nvidia Nemotron across evaluation runs

Step 18: Download the Workflow JSON and Data Table CSV for Reuse

Download the provided workflow JSON file and data table CSV file to use as a starting point for your own evaluations. To set them up in your N8N account, follow these steps in order:

  1. Go to Overview > Data Tables and click Create Data Table from CSV, then import the CSV file and name the table FAQ evaluation
  2. Navigate to your workflows and click Import from File, then select the workflow JSON file
  3. Once imported, open the Evaluation Trigger node and update the data table ID to match your newly created table
  4. Update the OpenRouter credential in any nodes that reference it
  5. Update the model selections in the FAQ agent node and evaluation agent node to your preferred models

After completing these updates, you can replace the sample FAQ questions and answers with your own content and begin running evaluations against your actual workflow.

N8N interface showing the import workflow from file option and the data table import process

Troubleshooting

The evaluation run skips some rows or produces fewer results than expected.
This is almost always caused by rows in the data table that still have a value in the actual answer column from a previous run. N8N only evaluates rows where the actual answer field is empty. Go to the data table and manually delete all values in the actual answer column before running the evaluation again.

The Correctness scores vary unexpectedly between runs even when nothing has changed.
Check that the AI model selected in the Set Metrics node (the evaluator/judge model) has not been changed between runs. Changing the judge model will alter how responses are scored, making comparisons between runs unreliable. Always keep the evaluator model consistent across all runs for the same workflow.

The workflow was imported successfully but the evaluation trigger is not pulling any data.
After importing the workflow JSON, the data table ID inside the Evaluation Trigger node will still reference the original table from the source account. Open the Evaluation Trigger node, navigate to the data table selector, and choose your newly imported FAQ evaluation table from the list. Also confirm your OpenRouter credential is correctly assigned in all relevant nodes.

Need Help?

Contact us at hello@awesomate.ai or raise a ticket in your Teamwork Desk portal.