Evaluation API — measure model F1 before deploying

Evaluations measure how well a trained model performs against a labeled dataset before you deploy it. You can evaluate your own fine-tuned models or compare them against Pioneer’s baseline LLMs to quantify how much your training improved them. Results include overall F1, precision, and recall scores, plus per-entity breakdowns for NER tasks.

In evaluation requests, the base_model field accepts a training job ID. Training jobs, by contrast, require a HuggingFace model ID or checkpoint UUID. You can also pass a base model ID to evaluate an untuned model as a baseline.

Run an evaluation

POST /felix/evaluations

Starts an evaluation run that measures model performance against a labeled dataset. Returns an evaluation ID you can use to poll for results.

Request body

base_model

string

required

The model to evaluate. Accepts a training job ID (to evaluate your fine-tuned model) or a base model ID (to evaluate an untuned model as a baseline).

dataset_name

string

required

The name of the labeled dataset to evaluate against. The dataset must be in the ready state.

project_id

string

Associate this evaluation with a specific project for organizational purposes.

curl -X POST https://api.pioneer.ai/felix/evaluations \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "base_model": "YOUR_TRAINING_JOB_ID",
    "dataset_name": "YOUR_DATASET_NAME"
  }'

Response

string

UUID of the evaluation. Use this with GET /felix/evaluations/:id to retrieve results.

status

string

Initial evaluation status.
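
The same request can be built from Python. The following is a minimal sketch using only the standard library, not an official client: the function names are hypothetical, the host and header come from the curl example above, and the response is returned as decoded JSON without assuming any field names beyond those documented here.

```python
import json
import urllib.request

API_BASE = "https://api.pioneer.ai"  # host taken from the curl examples


def build_evaluation_request(api_key, base_model, dataset_name, project_id=None):
    """Build (but do not send) a POST /felix/evaluations request."""
    body = {"base_model": base_model, "dataset_name": dataset_name}
    if project_id is not None:
        # project_id is optional, so it is only included when provided.
        body["project_id"] = project_id
    return urllib.request.Request(
        f"{API_BASE}/felix/evaluations",
        data=json.dumps(body).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def start_evaluation(api_key, base_model, dataset_name, project_id=None):
    """Send the request and return the decoded JSON response."""
    request = build_evaluation_request(api_key, base_model, dataset_name, project_id)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Separating request construction from sending keeps the payload easy to inspect before anything goes over the network.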

List evaluations

GET /felix/evaluations

Returns all evaluations for your account. Supports filtering by project.

Query parameters

project_id

string

Filter results to evaluations associated with a specific project.

curl https://api.pioneer.ai/felix/evaluations \
  -H "X-API-Key: YOUR_API_KEY"
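
The curl call above lists every evaluation. To filter by project, project_id goes in the query string. A minimal Python sketch of that, assuming the same host and header as the curl example (build_list_request is a hypothetical helper name):

```python
import urllib.parse
import urllib.request

API_BASE = "https://api.pioneer.ai"  # host taken from the curl examples


def build_list_request(api_key, project_id=None):
    """Build a GET /felix/evaluations request, optionally filtered by project."""
    url = f"{API_BASE}/felix/evaluations"
    if project_id is not None:
        # urlencode escapes the value so it is safe to place in a query string.
        url += "?" + urllib.parse.urlencode({"project_id": project_id})
    return urllib.request.Request(url, headers={"X-API-Key": api_key})
```

Pass the returned object to urllib.request.urlopen to send it.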

Get evaluation results

GET /felix/evaluations/:id

Returns the status and, once complete, the full results of an evaluation run.

Path parameters

id

string

required

The evaluation UUID.

curl https://api.pioneer.ai/felix/evaluations/YOUR_EVALUATION_ID \
  -H "X-API-Key: YOUR_API_KEY"

Response

string

Evaluation UUID.

status

string

Current status of the evaluation. Values: queued, running, complete, failed.

metrics

object

Overall performance metrics. Only present when status is complete.

metrics properties

number

Overall F1 score.

precision

number

Overall precision score.

recall

number

Overall recall score.

per_entity

object

Per-entity-type breakdown of F1, precision, and recall for NER evaluations.
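
Because metrics is only present once status is complete, a client typically polls this endpoint until the status reaches complete or failed. The sketch below shows one way to do that. The function names, the 10-second interval, and the 60-poll cap are illustrative assumptions, not documented recommendations.

```python
import json
import time
import urllib.request

API_BASE = "https://api.pioneer.ai"  # host taken from the curl examples
TERMINAL_STATUSES = {"complete", "failed"}  # from the documented status values


def fetch_evaluation(api_key, evaluation_id):
    """GET /felix/evaluations/:id and return the decoded JSON body."""
    request = urllib.request.Request(
        f"{API_BASE}/felix/evaluations/{evaluation_id}",
        headers={"X-API-Key": api_key},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_evaluation(fetch, interval_seconds=10.0, max_polls=60, sleep=time.sleep):
    """Call fetch() until the evaluation is complete or failed.

    fetch is a zero-argument callable returning the evaluation as a dict.
    Raises TimeoutError if no terminal status is seen within max_polls calls.
    """
    for attempt in range(max_polls):
        evaluation = fetch()
        if evaluation.get("status") in TERMINAL_STATUSES:
            return evaluation
        if attempt < max_polls - 1:
            sleep(interval_seconds)  # wait before the next poll
    raise TimeoutError(f"evaluation not finished after {max_polls} polls")
```

A typical call would be wait_for_evaluation(lambda: fetch_evaluation(api_key, evaluation_id)). Check the returned status before reading metrics, since a failed evaluation will not include them.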

Delete an evaluation

DELETE /felix/evaluations/:id

Permanently deletes an evaluation and its results.

Path parameters

id

string

required

The evaluation UUID.

curl -X DELETE https://api.pioneer.ai/felix/evaluations/YOUR_EVALUATION_ID \
  -H "X-API-Key: YOUR_API_KEY"

Returns 204 No Content on success.
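
A Python equivalent of the delete call might look like the sketch below. The helper names are hypothetical. The only behavior it relies on is the documented 204 No Content success response; urllib raises an HTTPError for 4xx and 5xx responses, so callers may want to catch that.

```python
import urllib.request

API_BASE = "https://api.pioneer.ai"  # host taken from the curl examples


def build_delete_request(api_key, evaluation_id):
    """Build (but do not send) a DELETE /felix/evaluations/:id request."""
    return urllib.request.Request(
        f"{API_BASE}/felix/evaluations/{evaluation_id}",
        headers={"X-API-Key": api_key},
        method="DELETE",
    )


def delete_evaluation(api_key, evaluation_id):
    """Send the delete and return True if the server answered 204 No Content."""
    request = build_delete_request(api_key, evaluation_id)
    with urllib.request.urlopen(request) as response:
        return response.status == 204
```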

List baseline models

GET /felix/baseline-models

Returns the list of baseline LLMs available for evaluation. Use these to benchmark your fine-tuned model’s performance against general-purpose models and quantify the improvement from training.

curl https://api.pioneer.ai/felix/baseline-models \
  -H "X-API-Key: YOUR_API_KEY"

Response

Returns an array of baseline model objects, each with an id and display name you can pass as base_model in POST /felix/evaluations.
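
To compare a fine-tuned model against every baseline, one approach is to build one evaluation request body per model, all against the same dataset. This sketch assumes each baseline object exposes its identifier under the key "id"; the exact field name is not spelled out here, so adjust it to match the actual response. build_comparison_payloads is a hypothetical helper.

```python
def build_comparison_payloads(training_job_id, baseline_models, dataset_name):
    """Return one POST /felix/evaluations body per model to evaluate.

    The fine-tuned model (identified by its training job ID) comes first,
    followed by each baseline. Assumes baseline objects carry an "id" key.
    """
    model_ids = [training_job_id] + [model["id"] for model in baseline_models]
    return [
        {"base_model": model_id, "dataset_name": dataset_name}
        for model_id in model_ids
    ]
```

Each payload can then be sent to POST /felix/evaluations. Because every run uses the same dataset, the resulting scores are directly comparable.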
