Datasets in Pioneer are collections of labeled examples used to train and evaluate models. Each dataset has a name you define, and Pioneer versions it automatically as you add or regenerate data. You reference a dataset by name when starting a training job or running an evaluation — so the name you choose is the stable identifier you’ll use throughout your workflow.

How datasets are created

You can create datasets in two ways: synthetic data generation, or uploading your own labeled data.

Synthetic data generation: Use POST /generate to have Pioneer produce labeled examples from a description of your domain and the labels you care about. This is the fastest way to bootstrap a dataset when you have no existing labeled data.
curl -X POST https://api.pioneer.ai/generate \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "task_type": "ner",
    "dataset_name": "my-ner-dataset",
    "labels": ["person", "company", "product"],
    "num_examples": 100,
    "domain_description": "Tech industry news articles"
  }'
Once generated or labeled, examples are stored in your dataset automatically. Poll GET /generate/jobs/:job_id to check when generation is complete before starting training.
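The generate-then-poll flow above can also be scripted. The following is a minimal Python sketch using only the standard library, not an official client. It assumes the job endpoint returns JSON and that you already have a job ID (this page does not show where the ID appears in the POST /generate response). Because the completion value is not documented here, the caller supplies the is_done check. The names build_generate_request, fetch_json, and wait_for_generation are illustrative.

```python
import json
import time
import urllib.request

API_BASE = "https://api.pioneer.ai"


def build_generate_request(api_key, payload):
    """Build (without sending) the POST /generate call from the curl example."""
    return urllib.request.Request(
        f"{API_BASE}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def fetch_json(request):
    """Send a request and decode the body as JSON (assumes a JSON response)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_generation(api_key, job_id, is_done, fetch=fetch_json,
                        sleep=time.sleep, interval=5.0, max_polls=120):
    """Poll GET /generate/jobs/:job_id until is_done(job) returns True.

    is_done is supplied by the caller because the job's response schema
    is an assumption here, not something this page specifies.
    """
    request = urllib.request.Request(
        f"{API_BASE}/generate/jobs/{job_id}",
        headers={"X-API-Key": api_key},
    )
    for _ in range(max_polls):
        job = fetch(request)
        if is_done(job):
            return job
        sleep(interval)
    raise TimeoutError(
        f"generation job {job_id} did not finish after {max_polls} polls"
    )
```

A caller might pass is_done=lambda job: job.get("status") == "completed", where both the field name and the value are guesses to check against a real response.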

Uploading your own dataset

Uploading your own data: Use POST /felix/datasets/upload/url if you already have labeled data. This is a three-step process:

Step 1. Get a presigned upload URL

curl -X POST https://api.pioneer.ai/felix/datasets/upload/url \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "dataset_name": "my-ner-dataset",
    "dataset_type": "ner",
    "type": "training",
    "filename": "data.jsonl"
  }'
The response includes presigned_url, dataset_id, and version_number.
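As a rough Python equivalent of Step 1, the sketch below builds the same request and pulls the three documented fields out of the response body. It assumes the fields sit at the top level of a JSON object, which this page does not confirm. The function names are illustrative and not part of any Pioneer SDK.

```python
import json
import urllib.request

UPLOAD_URL_ENDPOINT = "https://api.pioneer.ai/felix/datasets/upload/url"


def build_upload_url_request(api_key, dataset_name, dataset_type, filename,
                             split="training"):
    """Build (without sending) the Step 1 request from the curl example."""
    body = {
        "dataset_name": dataset_name,
        "dataset_type": dataset_type,
        "type": split,  # sent as "type" in the JSON body, as in the curl call
        "filename": filename,
    }
    return urllib.request.Request(
        UPLOAD_URL_ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def parse_upload_url_response(raw_body):
    """Return (presigned_url, dataset_id, version_number).

    Assumes the three fields are top-level keys of a JSON object.
    """
    data = json.loads(raw_body)
    required = ("presigned_url", "dataset_id", "version_number")
    missing = [key for key in required if key not in data]
    if missing:
        raise KeyError(f"response is missing fields: {missing}")
    return data["presigned_url"], data["dataset_id"], data["version_number"]
```

Failing loudly on a missing field keeps a malformed response from surfacing later as a confusing error in Step 2 or Step 3.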

Step 2. Upload the file directly to S3

curl -X PUT "<presigned_url from step 1 response>" \
  --upload-file ./data.jsonl
This is a direct HTTP PUT to S3. Do not include your API key here. 
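The same upload can be sketched in Python with the standard library. The point of the example is that the request carries only the file bytes: no X-API-Key header is sent, and the query string of the presigned URL is passed through untouched because it holds the signature. The function name and the connection_factory parameter (there to allow testing without a network) are illustrative. The sketch also assumes the file fits in memory.

```python
import http.client
from pathlib import Path
from urllib.parse import urlsplit


def upload_to_presigned_url(presigned_url, file_path,
                            connection_factory=http.client.HTTPSConnection):
    """PUT a local file to a presigned URL and return the HTTP status code."""
    parts = urlsplit(presigned_url)
    # Keep path and query exactly as given; the query carries the signature.
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query

    # Reading into bytes lets http.client set Content-Length automatically.
    body = Path(file_path).read_bytes()

    connection = connection_factory(parts.netloc)
    try:
        # Deliberately no API key or other custom headers on this request.
        connection.request("PUT", target, body=body)
        response = connection.getresponse()
        response.read()  # drain the body so the connection closes cleanly
        return response.status
    finally:
        connection.close()
```

A 2xx status indicates the upload was accepted; anything else is worth inspecting before moving on to Step 3.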

Step 3. Trigger processing  

curl -X POST https://api.pioneer.ai/felix/datasets/upload/process \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "dataset_id": "<dataset_id from step 1>"
  }'
After this call, the dataset moves through the following statuses: initialized → uploading → converting → validating → ready. Poll GET /felix/datasets/:name/:version until the status is ready before starting a training job.
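The polling step can be sketched in Python as follows. The status names come from the sequence above. Everything else is an assumption: that the version endpoint returns JSON with a top-level status field, that the version_number from Step 1 is what goes in the URL, and that any status outside the documented list should be treated as an error. wait_until_ready and fetch_json are illustrative names.

```python
import json
import time
import urllib.request
from urllib.parse import quote

API_BASE = "https://api.pioneer.ai"
# Statuses in the order listed on this page; "ready" is the terminal one.
PIPELINE = ("initialized", "uploading", "converting", "validating", "ready")


def fetch_json(request):
    """Send a request and decode the body as JSON (assumes a JSON response)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_until_ready(api_key, name, version, fetch=fetch_json,
                     sleep=time.sleep, interval=5.0, max_polls=120):
    """Poll GET /felix/datasets/:name/:version until status is 'ready'."""
    url = f"{API_BASE}/felix/datasets/{quote(str(name), safe='')}/{version}"
    request = urllib.request.Request(url, headers={"X-API-Key": api_key})
    for _ in range(max_polls):
        info = fetch(request)
        status = info.get("status")  # assumed field name
        if status == "ready":
            return info
        if status not in PIPELINE:
            # Design choice: stop on anything undocumented rather than spin.
            raise RuntimeError(f"unexpected dataset status: {status!r}")
        sleep(interval)
    raise TimeoutError(f"dataset {name}/{version} not ready after {max_polls} polls")
```

Bounding the loop with max_polls means a stuck dataset ends in a clear timeout instead of an endless wait.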

Listing your datasets

Retrieve all datasets in your account:
curl https://api.pioneer.ai/felix/datasets \
  -H "X-API-Key: YOUR_API_KEY"
The response lists each dataset by name along with metadata such as creation time and version count.

Inspecting a dataset

To see the versions and details of a specific dataset, pass its name:
curl https://api.pioneer.ai/felix/datasets/my-ner-dataset \
  -H "X-API-Key: YOUR_API_KEY"
This returns version history and example counts, which is useful for confirming the dataset is ready before training.

Deleting a dataset

curl -X DELETE https://api.pioneer.ai/felix/datasets/my-ner-dataset \
  -H "X-API-Key: YOUR_API_KEY"
Deleting a dataset by name soft-deletes it and all its versions. Completed training jobs and evaluations are unaffected, but you can no longer start new jobs that reference it.

Dataset storage is free. You are not charged for storing datasets in Pioneer, regardless of size or number of versions.

Dataset endpoints summary

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /felix/datasets | List all datasets |
| GET | /felix/datasets/:name | Get all versions for a dataset |
| GET | /felix/datasets/:name/:version | Get status and metadata for a specific version |
| DELETE | /felix/datasets/:name | Soft-delete a dataset and all its versions |
| DELETE | /felix/datasets/:name/:version | Delete a specific version |
| POST | /felix/datasets/upload/url | Get presigned S3 URL for direct upload |
| POST | /felix/datasets/upload/process | Trigger processing after S3 upload |
| POST | /generate | Start a synthetic data generation job |
| GET | /generate/jobs/:job_id | Poll generation job status |
| POST | /generate/ner/label-existing | Auto-label raw text for NER |
| POST | /generate/classification/label-existing | Auto-classify raw text |
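The endpoint paths above use :name-style placeholders. If you call them from code, a small helper can expand those placeholders into full URLs. The Python sketch below is one way to do it. endpoint_url is a hypothetical helper, not part of a Pioneer SDK, and percent-encoding each value is a defensive choice rather than a documented requirement.

```python
import re
from urllib.parse import quote

API_BASE = "https://api.pioneer.ai"


def endpoint_url(template, **params):
    """Expand a path template such as '/felix/datasets/:name/:version'."""
    def substitute(match):
        key = match.group(1)
        if key not in params:
            raise KeyError(f"missing path parameter: {key}")
        # Encode each value so names with spaces or slashes stay one segment.
        return quote(str(params[key]), safe="")

    return API_BASE + re.sub(r":([A-Za-z_]+)", substitute, template)


# Example: the URL for a specific dataset version.
version_url = endpoint_url(
    "/felix/datasets/:name/:version", name="my-ner-dataset", version=2
)
```

Templates without placeholders, such as /generate, pass through unchanged apart from the base URL prefix.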