Synthetic data API — POST /generate and label-existing

Pioneer’s data generation API lets you produce high-quality labeled training examples without manually annotating data. You can generate synthetic examples from scratch for NER, classification, and decoder tasks, or bring your own unlabeled text and have Pioneer label it automatically. All generated data is saved directly to a named dataset ready for fine-tuning.

Generate endpoints are rate-limited to 120 requests per minute per user. For large datasets, consider batching your requests or using the job polling endpoint to monitor long-running generation jobs.
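The 120 requests per minute limit works out to an average of one request every 0.5 seconds. A minimal client-side sketch of pacing calls to stay under it is shown below. It assumes evenly spaced requests; the `Throttle` class is a hypothetical helper, not part of Pioneer's API, and the server may account for its rate window differently.

```python
import time


class Throttle:
    """Spaces calls at least 60 / max_per_minute seconds apart (hypothetical helper)."""

    def __init__(self, max_per_minute=120, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute  # 0.5 s for 120 requests/minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next call may start

    def wait(self):
        """Block until the next request is allowed, then reserve the following slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval
```

Call `wait()` on a shared `Throttle` instance immediately before each request to a generate endpoint.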

Start a generation job

POST /generate

Starts an asynchronous job that generates labeled training examples and stores them in a named dataset. Returns a job ID you can use to poll for completion.

Request body

task_type (string, required)

The type of task to generate data for. Accepted values: ner, classification, decoder.

dataset_name (string, required)

The name of the dataset to create or append to. If a dataset with this name already exists, new examples are added as a new version.

num_examples (number, required)

Number of labeled examples to generate.

labels (string[])

List of label strings for NER or classification tasks. For NER, these are entity type names (e.g. "person", "organization"). For classification, these are the class names.

domain_description (string)

A natural-language description of the domain or topic for the generated examples. Providing a detailed description improves example quality and relevance.

classified_examples (object[])

Few-shot examples with labels to guide generation for classification tasks.

prompt (string)

Custom instruction prompt to control generation style for decoder tasks.

curl -X POST https://api.pioneer.ai/generate \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "task_type": "ner",
    "dataset_name": "my-ner-dataset",
    "labels": ["person", "company", "product"],
    "num_examples": 100,
    "domain_description": "Tech industry news articles"
  }'
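The same request body can be assembled programmatically. The sketch below uses only the request fields documented for POST /generate. `build_generate_payload` is a hypothetical helper, and the check that `num_examples` is positive is an assumption rather than a documented validation rule.

```python
import json

TASK_TYPES = {"ner", "classification", "decoder"}


def build_generate_payload(task_type, dataset_name, num_examples,
                           labels=None, domain_description=None, prompt=None):
    """Build a POST /generate body from the documented fields (hypothetical helper)."""
    if task_type not in TASK_TYPES:
        raise ValueError(f"task_type must be one of {sorted(TASK_TYPES)}")
    if num_examples < 1:  # assumption: the API expects a positive count
        raise ValueError("num_examples must be positive")
    payload = {
        "task_type": task_type,
        "dataset_name": dataset_name,
        "num_examples": num_examples,
    }
    # Optional fields are sent only when the caller provides them.
    if labels is not None:
        payload["labels"] = list(labels)
    if domain_description is not None:
        payload["domain_description"] = domain_description
    if prompt is not None:
        payload["prompt"] = prompt
    return payload


# Same body as the curl example for an NER job.
ner_payload = build_generate_payload(
    "ner", "my-ner-dataset", 100,
    labels=["person", "company", "product"],
    domain_description="Tech industry news articles",
)

# A decoder job uses `prompt` instead of `labels` (illustrative values).
decoder_payload = build_generate_payload(
    "decoder", "my-decoder-dataset", 50,
    prompt="Write concise answers to customer support questions.",
)
print(json.dumps(decoder_payload, indent=2))
```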

Response

job_id (string)

Unique identifier for the generation job. Use this with GET /generate/jobs/:job_id to poll for status.

status (string)

Initial job status, typically queued.

Poll generation job status

GET /generate/jobs/:job_id

Returns the current status of a data generation job. Poll this endpoint until the status is complete or failed. Start a training job on the resulting dataset only once the status is complete.

Path parameters

job_id (string, required)

The job ID returned by POST /generate.

curl https://api.pioneer.ai/generate/jobs/JOB_ID \
  -H "X-API-Key: YOUR_API_KEY"

Response

job_id (string)

The generation job ID.

status (string)

Current job status. Values: queued, running, complete, failed.

dataset_name (string)

The dataset name that examples are being written to.

num_examples (number)

Number of examples generated so far.
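A polling loop over this endpoint might look like the sketch below. It assumes the response body is JSON containing the `status` field described above. The helper names, the 5-second interval, and the 30-minute timeout are illustrative choices, not documented values.

```python
import json
import time
import urllib.request

BASE_URL = "https://api.pioneer.ai"
TERMINAL_STATUSES = {"complete", "failed"}


def fetch_job(job_id, api_key):
    """GET /generate/jobs/:job_id and return the parsed body (assumed to be JSON)."""
    req = urllib.request.Request(
        f"{BASE_URL}/generate/jobs/{job_id}",
        headers={"X-API-Key": api_key},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def wait_for_job(job_id, fetch, interval=5.0, timeout=1800.0, sleep=time.sleep):
    """Poll until the status is complete or failed; raise TimeoutError otherwise."""
    waited = 0.0
    while True:
        job = fetch(job_id)
        if job["status"] in TERMINAL_STATUSES:
            return job
        if waited >= timeout:
            raise TimeoutError(f"job {job_id} still {job['status']} after {waited:.0f}s")
        sleep(interval)
        waited += interval
```

A call such as `wait_for_job("JOB_ID", lambda jid: fetch_job(jid, "YOUR_API_KEY"))` returns the final job object. Check that its `status` is `complete` before starting training. Each poll may count toward the rate limit, so keep the interval well above 0.5 seconds.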

Auto-label text for NER

POST /generate/ner/label-existing

Sends your own unlabeled text to Pioneer and returns NER annotations. Use this when you have existing text that you want to annotate rather than generating new synthetic examples.

Request body

labels (string[], required)

List of entity type names to detect. For example: ["person", "organization", "location"].

inputs (string[], required)

List of text strings to annotate. Accepts between 1 and 1,000 strings per request.

curl -X POST https://api.pioneer.ai/generate/ner/label-existing \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "labels": ["person", "organization", "location"],
    "inputs": [
      "Apple CEO Tim Cook spoke in Cupertino.",
      "Google hired 500 engineers in London."
    ]
  }'

Response

Returns an array of annotation objects, one per input string, each containing detected entities with their spans, labels, and confidence scores.
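Because `inputs` accepts at most 1,000 strings per request, a larger corpus has to be split across several calls. A minimal sketch of that batching step follows; `chunk_inputs` is a hypothetical helper, not part of Pioneer's API.

```python
def chunk_inputs(texts, max_per_request=1000):
    """Split texts into consecutive batches of at most max_per_request strings."""
    if not texts:
        raise ValueError("at least one input string is required")
    return [
        texts[i:i + max_per_request]
        for i in range(0, len(texts), max_per_request)
    ]


# 2,500 documents become three requests.
batches = chunk_inputs([f"document {i}" for i in range(2500)])
print([len(batch) for batch in batches])  # [1000, 1000, 500]
```

Each batch is then sent as the `inputs` array of a separate request, with the same `labels` list repeated in every call.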

Auto-classify text

POST /generate/classification/label-existing

Sends your own unlabeled text to Pioneer and returns classification labels. Use this when you have existing text that you want to classify rather than generating new synthetic examples.

Request body

labels (string[], required)

List of class names to classify text into. For example: ["positive", "negative", "neutral"].

inputs (string[], required)

List of text strings to classify. Accepts between 1 and 1,000 strings per request.

curl -X POST https://api.pioneer.ai/generate/classification/label-existing \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "labels": ["positive", "negative", "neutral"],
    "inputs": [
      "The product exceeded all my expectations.",
      "Shipping took three weeks and the box was damaged."
    ]
  }'

Response

Returns an array of classification results, one per input string, each containing the predicted label and a confidence score.
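The two label-existing endpoints take the same request body and differ only in their path, so a single request builder can serve both. The sketch below constructs the request without sending it. `label_existing_request` is a hypothetical helper, and its client-side checks mirror the documented 1 to 1,000 input limit rather than any documented error behavior.

```python
import json
import urllib.request

BASE_URL = "https://api.pioneer.ai"


def label_existing_request(task, labels, inputs, api_key):
    """Build a POST request for /generate/<task>/label-existing (hypothetical helper)."""
    if task not in ("ner", "classification"):
        raise ValueError("task must be 'ner' or 'classification'")
    if not labels:
        raise ValueError("labels must not be empty")
    if not 1 <= len(inputs) <= 1000:
        raise ValueError("inputs must contain between 1 and 1,000 strings")
    body = json.dumps({"labels": list(labels), "inputs": list(inputs)}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/generate/{task}/label-existing",
        data=body,
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


req = label_existing_request(
    "classification",
    ["positive", "negative", "neutral"],
    ["The product exceeded all my expectations."],
    "YOUR_API_KEY",
)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` sends it. The field names inside each result object are not specified here, so inspect a real response before relying on particular keys.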
