Introduction
Running evaluations in CI means every pull request gets a quality gate: if a code change degrades a metric beyond your threshold, the build fails before the change ships. This guide covers:
- Setting a baseline run - tie an evaluation run to your pipeline ID so you can compare against it
- Detecting regressions - call compare_runs() on every PR and exit non-zero on degradation
- GitHub Actions YAML - a copy-paste workflow that automates the full loop
This guide assumes you are already familiar with evaluate(). If not, start with the Experiments introduction first.
Worked example: evaluating a sentiment classifier
This section walks through evaluating a minimal sentiment classifier end to end: dataset, evaluators, baseline run, and regression gate. Swap in your own function and dataset - the evaluation flow only cares about the shape of the output.
The function under test
A small classifier that tags a product review as "positive" or "negative":
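As a stand-in for the function under test, here is a minimal sketch of such a classifier. The name classify_review, the keyword list, and the plain-string input are all assumptions made for illustration; a real classifier would typically call a model, and the exact argument shape evaluate() passes to your function may differ.

```python
# Hypothetical keyword-based classifier, used only to have something to evaluate.
NEGATIVE_CUES = ("bad", "broke", "terrible", "refund", "disappointed")


def classify_review(text: str) -> str:
    """Return "negative" if any negative cue appears in the review, else "positive"."""
    lowered = text.lower()
    if any(cue in lowered for cue in NEGATIVE_CUES):
        return "negative"
    return "positive"
```

What matters for the rest of the flow is only that the function returns one of the two label strings.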
The dataset
Three reviews with expected labels. A local list is enough for this example; for larger datasets see Upload Datasets.
Evaluator 1: exact_match
Programmatic evaluator. Returns 1.0 when the prediction matches the expected label, 0.0 otherwise. Follows the evaluator signature (outputs, inputs, ground_truth):
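A minimal sketch of this evaluator, applied to a three-review list like the one described in the dataset section. The review texts, the "review" and "label" keys, and the case-insensitive comparison are assumptions for illustration; only the (outputs, inputs, ground_truth) argument order comes from the text above.

```python
# Hypothetical dataset: three reviews with expected labels (key names are assumed).
dataset = [
    {"inputs": {"review": "Love it, works perfectly."}, "ground_truth": {"label": "positive"}},
    {"inputs": {"review": "Broke after two days."}, "ground_truth": {"label": "negative"}},
    {"inputs": {"review": "Decent value for the price."}, "ground_truth": {"label": "positive"}},
]


def exact_match(outputs, inputs, ground_truth):
    """Return 1.0 when the predicted label equals the expected label, else 0.0."""
    predicted = str(outputs).strip().lower()
    expected = str(ground_truth["label"]).strip().lower()
    return 1.0 if predicted == expected else 0.0
```

The inputs argument is unused here but kept so the function matches the evaluator signature.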
Evaluator 2: llm_judge
LLM-as-judge evaluator for a softer correctness signal - useful when the model returns a valid label that still disagrees with the expected one. Temperature 0 keeps scores reproducible across runs:
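One way to structure such a judge is sketched below. To keep the example provider-neutral and testable, the model call is injected as a call_llm(prompt, temperature) callable; that callable, the prompt wording, the make_llm_judge factory, and the "review" and "label" keys are all hypothetical, not part of any SDK.

```python
from typing import Callable

# Hypothetical grading prompt; adjust the wording to your own rubric.
JUDGE_PROMPT = (
    "You are grading a sentiment classifier.\n"
    "Review: {review}\n"
    "Expected label: {expected}\n"
    "Predicted label: {predicted}\n"
    "Reply with a single number between 0 and 1 for how defensible the prediction is."
)


def make_llm_judge(call_llm: Callable[[str, float], str]):
    """Build an evaluator around a caller-supplied LLM function (assumed interface)."""

    def llm_judge(outputs, inputs, ground_truth):
        prompt = JUDGE_PROMPT.format(
            review=inputs["review"],
            expected=ground_truth["label"],
            predicted=outputs,
        )
        reply = call_llm(prompt, 0.0)  # temperature 0 for reproducible scores
        try:
            score = float(reply.strip())
        except ValueError:
            return 0.0  # unparseable reply counts as a failed judgment
        return min(1.0, max(0.0, score))  # clamp into [0, 1]

    return llm_judge
```

Injecting the model call also makes it easy to stub the judge in unit tests, so the CI gate itself can be tested without network access.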
Wiring it into evaluate()
Feed the function, dataset, and evaluators into evaluate(). The run_id is derived directly from the commit SHA (see section 1), so the same commit always maps to the same run_id:
A drop in exact_match or llm_judge means the classifier started mislabeling reviews it used to handle correctly.
References
- Evaluator templates - patterns for programmatic and LLM-judge evaluators.
- honeyhive python-sdk - install and auth reference.
1. Setting a Baseline Run
A baseline run is an ordinary evaluate() call tagged with a run_id that your PR jobs can later reference. HoneyHive validates run_id as a strict UUIDv4, so plain strings like "ci-abc123" are rejected. Derive the run_id deterministically from the commit SHA, forcing the UUIDv4 version and variant bits so the result stays valid:
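A minimal sketch of that derivation using only the standard library. The helper name run_id_from_sha matches the one referenced in the workflow section; taking the first 32 hex characters of the SHA is an assumption about how the bits are chosen, not a confirmed detail of the original script.

```python
import uuid


def run_id_from_sha(sha: str) -> str:
    """Map a git commit SHA to a deterministic, UUIDv4-shaped run_id."""
    if len(sha) < 32:
        raise ValueError("expected a full commit SHA (at least 32 hex characters)")
    # Use the first 128 bits of the SHA. Passing version=4 makes uuid.UUID
    # overwrite the version nibble and the variant bits, so the result
    # passes strict UUIDv4 validation.
    return str(uuid.UUID(hex=sha[:32], version=4))
```

Only six of the 128 bits are overwritten, so distinct commits still map to distinct run_ids for all practical purposes.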
Because run_id is a pure function of the SHA, the baseline for any commit is the same across retries, and the PR job can reconstruct its base commit’s run_id without any state-passing - no cache, no artifact, no metadata lookup.
For more on evaluate() options, see the Experiments introduction.
2. Detecting Regressions
Once the PR’s evaluation has run (via the same baseline.py script above), call compare_runs() against the baseline to detect regressions. The PR workflow passes the PR head SHA and the base SHA in as PR_SHA and BASELINE_SHA; the script derives both run_ids directly from them.
Key methods on RunComparisonResult
| Method | Returns | Description |
|---|---|---|
| list_degraded_metrics() | list[str] | Metric names where at least one datapoint degraded |
| list_improved_metrics() | list[str] | Metric names where at least one datapoint improved |
| get_metric_delta(name) | dict | Delta dict with old_aggregate, new_aggregate, improved_count, degraded_count, improved (IDs), degraded (IDs) |
Calling sys.exit(1) on any degraded metric is the minimal threshold. For per-datapoint breakdowns and a richer comparison workflow, see Comparing Experiments.
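If failing on any degradation is too strict, a tolerance can be layered on top of the aggregates. The sketch below works on plain dicts shaped like the delta dict described in the table (old_aggregate, new_aggregate); the function names, the tolerance parameter, and the assumption that higher scores are better are all hypothetical.

```python
def find_regressions(deltas: dict, tolerance: float = 0.0) -> list:
    """Return metric names whose aggregate dropped by more than `tolerance`.

    `deltas` maps a metric name to a dict with at least `old_aggregate`
    and `new_aggregate`. Assumes higher scores are better.
    """
    regressed = []
    for name, delta in deltas.items():
        drop = delta["old_aggregate"] - delta["new_aggregate"]
        if drop > tolerance:
            regressed.append(name)
    return sorted(regressed)


def exit_code(deltas: dict, tolerance: float = 0.0) -> int:
    """Return 1 (fail the build) if any metric regressed beyond tolerance, else 0."""
    return 1 if find_regressions(deltas, tolerance) else 0
```

In a CI script, the result of exit_code() would be passed to sys.exit() after printing the regressed metric names for the PR comment.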
3. GitHub Actions Workflow
The workflow has two jobs:
- run-evaluation - runs on push to main (sets the baseline) and on every PR (sets the PR run)
- detect-regression - runs only on PRs, compares the two runs and posts a comment

How it works:
- Both jobs derive run_id from the commit SHA via the same run_id_from_sha() helper, so the same commit always produces the same run_id. No cache, no artifact, no metadata passing.
- On push to main: baseline.py runs on the pushed commit. That commit’s run_id becomes the baseline any future PR branching off it will look up.
- On pull_request: run-evaluation produces the PR head’s run, and detect-regression derives the baseline run_id from github.event.pull_request.base.sha and compares.
- github.event.pull_request.head.sha is used instead of github.sha on PR events because github.sha is GitHub’s temporary merge commit, not the PR head.
- The PR comment posts the full output whether the check passes or fails, so reviewers see exactly which metrics changed.
- baseline.py is the script that runs the evaluation - for a concrete function, dataset, and evaluators list, see the worked example above.
REST API (Non-Python CI)
If your CI doesn’t use Python, you can drive the same comparison via the REST API. Start a run, wait for it to complete, then call the comparison endpoint. For the full REST flow covering run creation and event logging, see Experiments via API.
The run comparison endpoint returns a metrics array where each entry includes metric_name, old_aggregate, new_aggregate, improved_count, and degraded_count - the same data surface as RunComparisonResult.
Summary
| Step | What happens |
|---|---|
| Push to main | evaluate() runs with run_id derived from github.sha; this is the baseline for any PR that branches off this commit |
| Pull request opens | evaluate() runs on the PR head with run_id derived from pr.head.sha; baseline run_id derived from pr.base.sha |
| compare_runs() called | Returns RunComparisonResult with degraded/improved metrics |
| Degraded metric found | sys.exit(1) - CI fails, PR blocked |
| All metrics stable | sys.exit(0) - CI passes, PR unblocked |
| PR comment posted | Reviewers see exact metric deltas inline |

