Pioneer’s data generation API lets you produce high-quality labeled training examples without manually annotating data. You can generate synthetic examples from scratch for NER, classification, and decoder tasks, or bring your own unlabeled text and have Pioneer label it automatically. All generated data is saved directly to a named dataset ready for fine-tuning.
Generate endpoints are rate-limited to 120 requests per minute per user. For large datasets, consider batching your requests or using the job polling endpoint to monitor long-running generation jobs.
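One way to stay under the 120 requests per minute limit is to throttle on the client. The sketch below is a minimal sliding-window limiter; the class name `MinuteRateLimiter` is hypothetical, and it assumes the limit applies to any rolling 60-second window, which the documentation does not specify.

```python
import time
from collections import deque
from typing import Callable, Deque


class MinuteRateLimiter:
    """Client-side throttle: at most `max_calls` calls start in any `period` seconds.

    Assumption: the server counts requests over a rolling window. This is a
    sketch, not part of the Pioneer API.
    """

    def __init__(self, max_calls: int = 120, period: float = 60.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.max_calls = max_calls
        self.period = period
        self._clock = clock
        self._sleep = sleep
        self._starts: Deque[float] = deque()  # start times of recent calls

    def acquire(self) -> float:
        """Block until a call is allowed; return the seconds spent waiting."""
        now = self._clock()
        # Forget calls that have already left the window.
        while self._starts and now - self._starts[0] >= self.period:
            self._starts.popleft()
        waited = 0.0
        if len(self._starts) >= self.max_calls:
            # Wait until the oldest recorded call leaves the window.
            waited = self.period - (now - self._starts[0])
            self._sleep(waited)
            self._starts.popleft()
            now = self._clock()
        self._starts.append(now)
        return waited
```

Call `acquire()` immediately before each request to a generate endpoint. The clock and sleep functions are injectable only so the limiter can be exercised without real waiting.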

Start a generation job

POST /generate
Starts an asynchronous job that generates labeled training examples and stores them in a named dataset. Returns a job ID you can use to poll for completion.
Request body
task_type (string, required): The type of task to generate data for. Accepted values: ner, classification, decoder.
dataset_name (string, required): The name of the dataset to create or append to. If a dataset with this name already exists, new examples are added as a new version.
num_examples (number, required): Number of labeled examples to generate.
labels (string[]): List of label strings for NER or classification tasks. For NER, these are entity type names (e.g. "person", "organization"). For classification, these are the class names.
domain_description (string): A natural-language description of the domain or topic for the generated examples. Providing a detailed description improves example quality and relevance.
classified_examples (object[]): Few-shot examples with labels to guide generation for classification tasks.
prompt (string): Custom instruction prompt to control generation style for decoder tasks.
curl -X POST https://api.pioneer.ai/generate \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "task_type": "ner",
    "dataset_name": "my-ner-dataset",
    "labels": ["person", "company", "product"],
    "num_examples": 100,
    "domain_description": "Tech industry news articles"
  }'
Response
job_id (string): Unique identifier for the generation job. Use this with GET /generate/jobs/:job_id to poll for status.
status (string): Initial job status, typically queued.
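The same request can be assembled from Python with only the standard library. This is a minimal sketch: the helper names `build_generate_request` and `start_generation` are hypothetical, the host is taken from the curl example, and the client-side check on task_type simply mirrors the accepted values listed above.

```python
import json
import urllib.request
from typing import List, Optional

BASE_URL = "https://api.pioneer.ai"  # host as shown in the curl example
TASK_TYPES = {"ner", "classification", "decoder"}


def build_generate_request(api_key: str, task_type: str, dataset_name: str,
                           num_examples: int,
                           labels: Optional[List[str]] = None,
                           domain_description: Optional[str] = None,
                           prompt: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a POST /generate request."""
    if task_type not in TASK_TYPES:
        raise ValueError(f"task_type must be one of {sorted(TASK_TYPES)}")
    body: dict = {
        "task_type": task_type,
        "dataset_name": dataset_name,
        "num_examples": num_examples,
    }
    # Optional fields are sent only when the caller provides them.
    if labels is not None:
        body["labels"] = labels
    if domain_description is not None:
        body["domain_description"] = domain_description
    if prompt is not None:
        body["prompt"] = prompt
    return urllib.request.Request(
        f"{BASE_URL}/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def start_generation(request: urllib.request.Request) -> dict:
    """Send the request and return the parsed JSON body (job_id, status)."""
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

Building the request separately from sending it makes the payload easy to inspect or log before any network call is made.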

Poll generation job status

GET /generate/jobs/:job_id
Returns the current status of a data generation job. Poll this endpoint until the status is complete or failed, and start a training job on the resulting dataset only once the status is complete.
Path parameters
job_id (string, required): The job ID returned by POST /generate.
curl https://api.pioneer.ai/generate/jobs/JOB_ID \
  -H "X-API-Key: YOUR_API_KEY"
Response
job_id (string): The generation job ID.
status (string): Current job status. Values: queued, running, complete, failed.
dataset_name (string): The dataset name that examples are being written to.
num_examples (number): Number of examples generated so far.
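A polling loop over this endpoint might look like the sketch below. The names `wait_for_job` and `make_http_fetcher` are hypothetical, and the 5-second interval and 10-minute timeout are arbitrary assumptions rather than documented recommendations. The loop stops on either terminal status listed above.

```python
import json
import time
import urllib.request
from typing import Callable

BASE_URL = "https://api.pioneer.ai"  # host as shown in the curl example
TERMINAL_STATUSES = {"complete", "failed"}


def make_http_fetcher(api_key: str) -> Callable[[str], dict]:
    """Return a function that calls GET /generate/jobs/:job_id."""
    def fetch(job_id: str) -> dict:
        req = urllib.request.Request(
            f"{BASE_URL}/generate/jobs/{job_id}",
            headers={"X-API-Key": api_key},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)
    return fetch


def wait_for_job(job_id: str, fetch_status: Callable[[str], dict],
                 interval: float = 5.0, timeout: float = 600.0,
                 sleep: Callable[[float], None] = time.sleep,
                 clock: Callable[[], float] = time.monotonic) -> dict:
    """Poll until the job is complete or failed, then return the last response."""
    deadline = clock() + timeout
    while True:
        job = fetch_status(job_id)
        if job["status"] in TERMINAL_STATUSES:
            return job
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} still {job['status']} after {timeout}s")
        sleep(interval)
```

A caller would typically run `wait_for_job(job_id, make_http_fetcher(api_key))` and check that the returned status is complete before training. Each poll counts as a request, so the interval should be chosen with the rate limit in mind.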

Auto-label text for NER

POST /generate/ner/label-existing
Sends your own unlabeled text to Pioneer and returns NER annotations. Use this when you have existing text that you want to annotate rather than generating new synthetic examples.
Request body
labels (string[], required): List of entity type names to detect. For example: ["person", "organization", "location"].
inputs (string[], required): List of text strings to annotate. Accepts between 1 and 1,000 strings per request.
curl -X POST https://api.pioneer.ai/generate/ner/label-existing \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "labels": ["person", "organization", "location"],
    "inputs": [
      "Apple CEO Tim Cook spoke in Cupertino.",
      "Google hired 500 engineers in London."
    ]
  }'
Response
Returns an array of annotation objects, one per input string, each containing detected entities with their spans, labels, and confidence scores.

Auto-classify text

POST /generate/classification/label-existing
Sends your own unlabeled text to Pioneer and returns classification labels. Use this when you have existing text that you want to classify rather than generating new synthetic examples.
Request body
labels (string[], required): List of class names to classify text into. For example: ["positive", "negative", "neutral"].
inputs (string[], required): List of text strings to classify. Accepts between 1 and 1,000 strings per request.
curl -X POST https://api.pioneer.ai/generate/classification/label-existing \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "labels": ["positive", "negative", "neutral"],
    "inputs": [
      "The product exceeded all my expectations.",
      "Shipping took three weeks and the box was damaged."
    ]
  }'
Response
Returns an array of classification results, one per input string, each containing the predicted label and a confidence score.
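Both label-existing endpoints accept between 1 and 1,000 strings per request, so a larger corpus has to be split across several calls. The sketch below shows one way to prepare the request bodies; the helper names `chunk_inputs` and `build_label_bodies` are hypothetical, and sending the bodies is left to the caller.

```python
from typing import Dict, Iterator, List

MAX_INPUTS_PER_REQUEST = 1000  # upper bound stated for the inputs field


def chunk_inputs(texts: List[str],
                 size: int = MAX_INPUTS_PER_REQUEST) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts`, each with at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(texts), size):
        yield texts[start:start + size]


def build_label_bodies(labels: List[str],
                       texts: List[str]) -> List[Dict[str, List[str]]]:
    """Build one JSON-serializable request body per batch of inputs."""
    return [{"labels": labels, "inputs": batch} for batch in chunk_inputs(texts)]
```

Because batches preserve input order, results can be concatenated in request order to line up with the original texts. Each body is one request, so a large corpus also consumes rate-limit budget in proportion to the number of batches.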