Datasets in Pioneer are collections of labeled examples used to train and evaluate models. Each dataset has a name you define, and Pioneer versions it automatically as you add or regenerate data. You reference a dataset by name when starting a training job or running an evaluation, so the name you choose is the stable identifier you’ll use throughout your workflow.
How datasets are created
You create datasets in two ways: by generating synthetic data or by uploading your own labeled data.
Synthetic data generation: Use POST /generate to have Pioneer produce labeled examples from a description of your domain and the labels you care about. This is the fastest way to bootstrap a dataset without any existing labeled data.
Poll GET /generate/jobs/:job_id to check when generation is complete before starting training.
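The generate-then-poll flow above can be sketched as follows. This is a minimal illustration, not the official client: the base URL, the bearer-token header, the `job_id` and `status` response fields, and the terminal status values are all assumptions, since this page does not specify them.

```python
import json
import time
import urllib.request
from typing import Any, Callable, Dict, Optional, Sequence

# Placeholder: this page does not state the API base URL.
BASE_URL = "https://pioneer-api.example.invalid"


def call_api(method: str, path: str, body: Optional[dict] = None,
             api_key: str = "YOUR_API_KEY") -> Dict[str, Any]:
    """Send one JSON request and return the decoded JSON response."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(
        BASE_URL + path,
        data=data,
        method=method,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def generate_and_wait(payload: dict,
                      send: Callable[..., Dict[str, Any]] = call_api,
                      done_statuses: Sequence[str] = ("complete", "failed"),
                      interval_s: float = 5.0,
                      max_polls: int = 120,
                      sleep: Callable[[float], None] = time.sleep) -> Dict[str, Any]:
    """Start a generation job, then poll it until it reaches a terminal status."""
    job = send("POST", "/generate", payload)
    job_id = job["job_id"]  # assumed response field name
    for _ in range(max_polls):
        job_status = send("GET", f"/generate/jobs/{job_id}")
        # "status" and the values in done_statuses are hypothetical.
        if job_status.get("status") in done_statuses:
            return job_status
        sleep(interval_s)
    raise TimeoutError(f"generation job {job_id} did not finish after {max_polls} polls")
```

A call might look like `generate_and_wait({"description": "...", "labels": ["..."]})`, where the payload field names are placeholders for whatever the POST /generate reference defines. Passing `send` and `sleep` as parameters keeps the polling loop testable without network access.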
Uploading your own dataset
Uploading your own data: Use POST /felix/datasets/upload/url if you already have labeled data. This is a three-step process:
Step 1. Get a presigned upload URL
Step 2. Upload the file directly to S3
Step 3. Trigger processing
Poll GET /felix/datasets/:name/:version until status is ready before starting a training job.
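The three upload steps and the readiness poll can be sketched as one function. Treat this as an outline under stated assumptions: the request and response field names (`dataset_name`, `filename`, `upload_url`, `version`) are hypothetical, the presigned URL is assumed to accept an HTTP PUT, and `api` stands for any callable you supply that takes `(method, path, body)` and returns decoded JSON from Pioneer.

```python
import time
import urllib.request
from typing import Any, Callable, Dict, Optional

Api = Callable[[str, str, Optional[dict]], Dict[str, Any]]


def put_file_to_presigned_url(url: str, file_path: str) -> None:
    """Upload raw file bytes to a presigned URL (assumes the URL expects PUT)."""
    with open(file_path, "rb") as handle:
        request = urllib.request.Request(url, data=handle.read(), method="PUT")
    with urllib.request.urlopen(request) as response:
        response.read()


def upload_dataset(api: Api,
                   put_file: Callable[[str, str], None],
                   name: str,
                   file_path: str,
                   interval_s: float = 5.0,
                   max_polls: int = 60,
                   sleep: Callable[[float], None] = time.sleep) -> Dict[str, Any]:
    """Run the three upload steps, then poll until the version is ready."""
    # Step 1. Get a presigned upload URL (body fields are hypothetical).
    presigned = api("POST", "/felix/datasets/upload/url",
                    {"dataset_name": name, "filename": file_path})

    # Step 2. Upload the file directly to S3.
    put_file(presigned["upload_url"], file_path)  # assumed response field

    # Step 3. Trigger processing (body and response fields are hypothetical).
    processed = api("POST", "/felix/datasets/upload/process", {"dataset_name": name})
    version = processed["version"]

    # Poll the version until its status is "ready".
    for _ in range(max_polls):
        info = api("GET", f"/felix/datasets/{name}/{version}", None)
        if info.get("status") == "ready":
            return info
        sleep(interval_s)
    raise TimeoutError(f"dataset {name} version {version} was not ready after {max_polls} polls")
```

With a real transport in hand, a call would look like `upload_dataset(api, put_file_to_presigned_url, "support-tickets", "tickets.jsonl")`; the dataset name and file name here are illustrative only.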
Listing your datasets
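Retrieve all datasets in your account with GET /felix/datasets.
Inspecting a dataset
To see the versions and details of a specific dataset, pass its name: GET /felix/datasets/:name.
Deleting a dataset
To permanently delete a dataset and all its versions, call DELETE /felix/datasets/:name. To delete a single version, call DELETE /felix/datasets/:name/:version.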
Dataset storage is free. You are not charged for storing datasets in Pioneer, regardless of size or number of versions.
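The list, inspect, and delete calls share one path shape, so a single request builder can cover all of them. The sketch below only constructs the requests; the base URL, the bearer-token header, the dataset name, and the version value `"2"` are assumptions made for illustration.

```python
import urllib.request
from typing import Optional
from urllib.parse import quote

# Placeholder: this page does not state the API base URL.
BASE_URL = "https://pioneer-api.example.invalid"


def dataset_request(method: str,
                    name: Optional[str] = None,
                    version: Optional[str] = None,
                    api_key: str = "YOUR_API_KEY") -> urllib.request.Request:
    """Build a request for /felix/datasets, optionally scoped to a name and version."""
    path = "/felix/datasets"
    if name is not None:
        path += "/" + quote(name, safe="")
        if version is not None:
            path += "/" + quote(str(version), safe="")
    return urllib.request.Request(
        BASE_URL + path,
        method=method,
        headers={"Authorization": f"Bearer {api_key}"},  # assumed auth scheme
    )


# List every dataset in the account.
list_request = dataset_request("GET")
# Inspect the versions of one dataset.
inspect_request = dataset_request("GET", "support-tickets")
# Delete one version, or the dataset with all of its versions.
delete_version_request = dataset_request("DELETE", "support-tickets", "2")
delete_dataset_request = dataset_request("DELETE", "support-tickets")
```

Each request can then be sent with `urllib.request.urlopen(request)`. Because deleting a dataset is permanent, building the request separately from sending it leaves room for a confirmation step in between.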
Dataset endpoints summary
| Method | Endpoint | Description |
|---|---|---|
| GET | /felix/datasets | List all datasets |
| GET | /felix/datasets/:name | Get all versions for a dataset |
| GET | /felix/datasets/:name/:version | Get status and metadata for a specific version |
| DELETE | /felix/datasets/:name | Permanently delete a dataset and all its versions |
| DELETE | /felix/datasets/:name/:version | Delete a specific version |
| POST | /felix/datasets/upload/url | Get presigned S3 URL for direct upload |
| POST | /felix/datasets/upload/process | Trigger processing after S3 upload |
| POST | /generate | Start a synthetic data generation job |
| GET | /generate/jobs/:job_id | Poll generation job status |
| POST | /generate/ner/label-existing | Auto-label raw text for NER |
| POST | /generate/classification/label-existing | Auto-classify raw text |