Pioneer datasets: create, version, inspect, and delete

Datasets in Pioneer are collections of labeled examples used to train and evaluate models. Each dataset has a name you define, and Pioneer versions it automatically as you add or regenerate data. You reference a dataset by name when starting a training job or running an evaluation, so the name you choose is the stable identifier you’ll use throughout your workflow.

How datasets are created

You can create a dataset in two ways: generate synthetic data, or upload data you have already labeled (see "Uploading your own dataset").

Synthetic data generation: Use POST /generate to have Pioneer produce labeled examples from a description of your domain and the labels you care about. This is the fastest way to bootstrap a dataset when you have no existing labeled data.

curl -X POST https://api.pioneer.ai/generate \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "task_type": "ner",
    "dataset_name": "my-ner-dataset",
    "labels": ["person", "company", "product"],
    "num_examples": 100,
    "domain_description": "Tech industry news articles"
  }'

Once generated or labeled, examples are stored in your dataset automatically. Poll GET /generate/jobs/:job_id to check when generation is complete before starting training.
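The polling step can be sketched in Python as below. This is a minimal illustration, not Pioneer's official client. It assumes the job response is JSON with a `status` field, and the values `"complete"` and `"failed"` are placeholders that may not match the real API. The helper names are hypothetical.

```python
import json
import time
import urllib.request

API_BASE = "https://api.pioneer.ai"


def fetch_generation_job(job_id, api_key):
    """Call GET /generate/jobs/:job_id and return the decoded JSON body."""
    req = urllib.request.Request(
        f"{API_BASE}/generate/jobs/{job_id}",
        headers={"X-API-Key": api_key},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def wait_for_generation(job_id, api_key, fetch=fetch_generation_job,
                        done=("complete",), failed=("failed",),
                        interval=5.0, max_polls=120, sleep=time.sleep):
    """Poll the job until it reports a finished status, then return it.

    ASSUMPTION: the "status" field name and the values in `done` and
    `failed` are placeholders; adjust them to what the API returns.
    """
    for _ in range(max_polls):
        job = fetch(job_id, api_key)
        status = job.get("status")
        if status in done:
            return job
        if status in failed:
            raise RuntimeError(f"job {job_id} ended with status {status!r}")
        sleep(interval)
    raise TimeoutError(f"job {job_id} not finished after {max_polls} polls")
```

The `fetch` and `sleep` parameters are injectable so the loop can be exercised without network access or real delays.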

Uploading your own dataset

Use POST /felix/datasets/upload/url if you already have labeled data. Uploading is a three-step process:

Step 1. Get a presigned upload URL

curl -X POST https://api.pioneer.ai/felix/datasets/upload/url \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "dataset_name": "my-ner-dataset",
    "dataset_type": "ner",
    "type": "training",
    "filename": "data.jsonl"
  }'

The response includes `presigned_url`, `dataset_id`, and `version_number`.

Step 2. Upload the file directly to S3

curl -X PUT "<presigned_url from step 1 response>" \
  --upload-file ./data.jsonl

This is a direct HTTP PUT to S3. Do not include your API key here.
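For callers scripting the upload in Python, the same PUT might look like the sketch below. It uses only the standard library, the function names are hypothetical, and the request deliberately carries no X-API-Key header. Unlike curl, `urllib` adds a default Content-Type header when a body is present; whether the presigned URL tolerates that is an assumption to verify.

```python
import urllib.request


def build_s3_put(presigned_url, path):
    """Build a PUT request with the file bytes and no X-API-Key header."""
    with open(path, "rb") as f:
        body = f.read()  # reads the whole file; fine for a sketch
    # No API key here: the presigned URL already carries the authorization.
    return urllib.request.Request(presigned_url, data=body, method="PUT")


def upload_to_presigned_url(presigned_url, path):
    """Send the file to the presigned URL and return the HTTP status code."""
    with urllib.request.urlopen(build_s3_put(presigned_url, path)) as resp:
        return resp.status
```

Splitting request construction from sending keeps the "no API key" property easy to check before anything goes over the network.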

Step 3. Trigger processing

curl -X POST https://api.pioneer.ai/felix/datasets/upload/process \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "dataset_id": "<dataset_id from step 1>"
  }'

After this call, the dataset moves through the following statuses: initialized → uploading → converting → validating → ready. Poll GET /felix/datasets/:name/:version until the status is ready before starting a training job.
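Waiting for the ready status could be sketched as follows. The status names come from the progression above. The `status` field name in the response is an assumption, as is treating any unlisted status as an error. The helper names are hypothetical.

```python
import json
import time
import urllib.request
from urllib.parse import quote

API_BASE = "https://api.pioneer.ai"
# Status progression as documented for uploaded datasets.
STATUSES = ("initialized", "uploading", "converting", "validating", "ready")


def fetch_version(name, version, api_key):
    """Call GET /felix/datasets/:name/:version and return the JSON body."""
    url = f"{API_BASE}/felix/datasets/{quote(name, safe='')}/{version}"
    req = urllib.request.Request(url, headers={"X-API-Key": api_key})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def wait_until_ready(name, version, api_key, fetch=fetch_version,
                     interval=5.0, max_polls=120, sleep=time.sleep):
    """Poll until the version is ready; return the statuses seen in order.

    ASSUMPTION: the response exposes the state under a "status" key.
    """
    seen = []
    for _ in range(max_polls):
        status = fetch(name, version, api_key).get("status")
        seen.append(status)
        if status == "ready":
            return seen
        if status not in STATUSES:
            # Design choice of this sketch: stop on anything undocumented.
            raise RuntimeError(f"unexpected dataset status {status!r}")
        sleep(interval)
    raise TimeoutError(f"{name} v{version} not ready after {max_polls} polls")
```

A bounded number of polls avoids waiting forever if processing stalls in an intermediate status.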

Listing your datasets

Retrieve all datasets in your account:

curl https://api.pioneer.ai/felix/datasets \
  -H "X-API-Key: YOUR_API_KEY"

The response lists each dataset by name along with metadata such as creation time and version count.

Inspecting a dataset

To see the versions and details of a specific dataset, pass its name:

curl https://api.pioneer.ai/felix/datasets/my-ner-dataset \
  -H "X-API-Key: YOUR_API_KEY"

This returns version history and example counts, which is useful for confirming the dataset is ready before training.

Deleting a dataset

curl -X DELETE https://api.pioneer.ai/felix/datasets/my-ner-dataset \
  -H "X-API-Key: YOUR_API_KEY"

Deleting a dataset by name soft-deletes it and all its versions. Completed training jobs and evaluations are unaffected, but you will no longer be able to start new jobs referencing it.

Dataset storage is free. You are not charged for storing datasets in Pioneer, regardless of size or number of versions.

Dataset endpoints summary

Method	Endpoint	Description
`GET`	`/felix/datasets`	List all datasets
`GET`	`/felix/datasets/:name`	Get all versions for a dataset
`GET`	`/felix/datasets/:name/:version`	Get status and metadata for a specific version
`DELETE`	`/felix/datasets/:name`	Soft-delete a dataset and all its versions
`DELETE`	`/felix/datasets/:name/:version`	Delete a specific version
`POST`	`/felix/datasets/upload/url`	Get presigned S3 URL for direct upload
`POST`	`/felix/datasets/upload/process`	Trigger processing after S3 upload
`POST`	`/generate`	Start a synthetic data generation job
`GET`	`/generate/jobs/:job_id`	Poll generation job status
`POST`	`/generate/ner/label-existing`	Auto-label raw text for NER
`POST`	`/generate/classification/label-existing`	Auto-classify raw text
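Because the dataset name and version appear in the URL path, client code needs to build these paths carefully. The helper below is a hypothetical sketch that covers the three GET/DELETE dataset routes in the table. It percent-encodes the name as a precaution; whether Pioneer permits names containing such characters is not stated here.

```python
from urllib.parse import quote

API_BASE = "https://api.pioneer.ai"


def dataset_url(name=None, version=None):
    """Return the URL for the dataset list, one dataset, or one version."""
    if name is None:
        if version is not None:
            raise ValueError("a version requires a dataset name")
        return f"{API_BASE}/felix/datasets"
    # safe='' also encodes "/", so a name cannot spill into extra segments.
    url = f"{API_BASE}/felix/datasets/{quote(name, safe='')}"
    if version is not None:
        url += f"/{quote(str(version), safe='')}"
    return url
```

For example, `dataset_url("my-ner-dataset", 2)` yields the path used to check a specific version's status.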

Data Privacy: If you would like to opt out of having your data used in Fastino’s model training, please email support@fastino.ai and we will ensure your data is excluded from our training pipelines.
