Labeling training data by hand is slow and expensive. Pioneer’s data generation API lets you produce high-quality labeled examples from a short description of your domain and the labels you care about. You can also pass in raw unlabeled text and have Pioneer annotate it automatically. Either way, the resulting dataset is ready to feed directly into a training job.
1. Decide on your task type

Pioneer generates training data for three task types:
| Task type | Use case |
| --- | --- |
| ner | Named entity recognition: extract spans of text with entity labels |
| classification | Text classification: assign one or more labels to each input |
| decoder | Generative LLM training: prompt-completion or conversation pairs |
Choose the task type that matches the model you plan to train. You’ll pass it as task_type in the request body.
2. Start a generation job

Send a POST /generate request with your task type, a dataset name, the labels you want annotated, a description of your domain, and the number of examples to generate.
curl -X POST https://api.pioneer.ai/generate \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "task_type": "ner",
    "dataset_name": "my-ner-dataset",
    "labels": ["person", "company", "product"],
    "num_examples": 100,
    "domain_description": "Tech industry news articles"
  }'
Required fields:
| Field | Description |
| --- | --- |
| task_type | "ner", "classification", or "decoder" |
| dataset_name | Name for the generated dataset (used when starting training) |
| num_examples | Number of labeled examples to generate |
Optional fields:
| Field | Description |
| --- | --- |
| labels | List of label strings (required for NER and classification) |
| domain_description | Short description of your content domain; improves output relevance |
| classified_examples | Seed examples with existing labels (classification only) |
| prompt | Additional instructions for the generation model |
The response includes a job ID you’ll use to poll status.
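The field rules above can be checked on the client before a request is sent. The following is a minimal Python sketch, not an official client: the function names `build_generate_payload` and `start_generation` are hypothetical, and `classified_examples` is left out because its structure is not shown here. The endpoint, headers, and field names are taken from the curl example.

```python
import json
import urllib.request

API_BASE = "https://api.pioneer.ai"  # base URL from the curl example
VALID_TASK_TYPES = {"ner", "classification", "decoder"}


def build_generate_payload(task_type, dataset_name, num_examples,
                           labels=None, domain_description=None, prompt=None):
    """Assemble a POST /generate body (hypothetical helper, not part of any SDK)."""
    if task_type not in VALID_TASK_TYPES:
        raise ValueError(f"unknown task_type: {task_type!r}")
    # labels sits in the optional table but is required for NER and classification.
    if task_type in ("ner", "classification") and not labels:
        raise ValueError(f"labels are required for task_type {task_type!r}")
    payload = {
        "task_type": task_type,
        "dataset_name": dataset_name,
        "num_examples": num_examples,
    }
    optional = {
        "labels": labels,
        "domain_description": domain_description,
        "prompt": prompt,
    }
    # Only send the optional fields that were actually provided.
    payload.update({key: value for key, value in optional.items() if value is not None})
    return payload


def start_generation(api_key, payload):
    """POST the payload to /generate and return the parsed JSON response."""
    request = urllib.request.Request(
        f"{API_BASE}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `start_generation("YOUR_API_KEY", build_generate_payload("ner", "my-ner-dataset", 100, labels=["person", "company", "product"]))` would send a request equivalent to the curl example, minus the domain description.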
3. Poll job status

Generation jobs run asynchronously. Poll GET /generate/jobs/:job_id until the status is "complete".
curl https://api.pioneer.ai/generate/jobs/YOUR_JOB_ID \
  -H "X-API-Key: YOUR_API_KEY"
Once complete, the dataset is available under the name you provided in dataset_name.
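A polling loop for this step might look like the sketch below. It assumes the job response is a JSON object with a top-level `status` field, which is not confirmed here. Since no failure statuses are documented, the loop stops only on "complete" or on a client-side timeout. The helper names, the 5-second interval, and the 10-minute timeout are illustrative choices.

```python
import json
import time
import urllib.request

API_BASE = "https://api.pioneer.ai"  # base URL from the curl example


def fetch_job(api_key, job_id):
    """GET /generate/jobs/:job_id and return the parsed JSON body."""
    request = urllib.request.Request(
        f"{API_BASE}/generate/jobs/{job_id}",
        headers={"X-API-Key": api_key},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_job(fetch, interval_s=5.0, timeout_s=600.0, sleep=time.sleep):
    """Call fetch() until it reports status "complete" or the timeout elapses.

    Assumes the response dict has a top-level "status" key (an assumption).
    """
    waited = 0.0
    while True:
        job = fetch()
        if job.get("status") == "complete":
            return job
        if waited >= timeout_s:
            raise TimeoutError(f"job not complete after {waited:.0f}s; last response: {job}")
        sleep(interval_s)
        waited += interval_s
```

Typical use would be `wait_for_job(lambda: fetch_job("YOUR_API_KEY", "YOUR_JOB_ID"))`. Passing `fetch` and `sleep` as arguments keeps the loop testable without network access.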
4. Use the dataset in a training job

Pass the dataset name directly to POST /felix/training-jobs:
curl -X POST https://api.pioneer.ai/felix/training-jobs \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "my-ner-model",
    "base_model": "fastino/gliner2-base-v1",
    "datasets": [{"name": "my-ner-dataset"}],
    "training_type": "lora",
    "nr_epochs": 5,
    "learning_rate": 5e-5
  }'
See the NER fine-tuning guide or LLM fine-tuning guide for full training walkthroughs.
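To show how the dataset name carries over from generation to training, here is a small Python sketch that builds the same request body as the curl example. The helper `build_training_payload` is hypothetical, and its defaults simply mirror the example values. Which other values the API accepts, and whether several datasets can be combined in one job, are not stated here.

```python
import json


def build_training_payload(model_name, dataset_names,
                           base_model="fastino/gliner2-base-v1",
                           training_type="lora", nr_epochs=5, learning_rate=5e-5):
    """Build a POST /felix/training-jobs body (hypothetical helper)."""
    if not dataset_names:
        raise ValueError("at least one dataset name is required")
    return {
        "model_name": model_name,
        "base_model": base_model,
        # Each entry uses the dataset_name given when the generation job started.
        "datasets": [{"name": name} for name in dataset_names],
        "training_type": training_type,
        "nr_epochs": nr_epochs,
        "learning_rate": learning_rate,
    }


if __name__ == "__main__":
    body = build_training_payload("my-ner-model", ["my-ner-dataset"])
    print(json.dumps(body, indent=2))
```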

Auto-label existing text

If you already have raw text and want Pioneer to annotate it, rather than generating new examples from scratch, use the label-existing endpoints. This is useful when you have a corpus of real documents but haven’t labeled them yet.
Auto-label for NER:
curl -X POST https://api.pioneer.ai/generate/ner/label-existing \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "labels": ["person", "organization", "location"],
    "inputs": [
      "Apple CEO Tim Cook spoke in Cupertino.",
      "Google hired 500 engineers in London."
    ]
  }'
Auto-label for classification:
curl -X POST https://api.pioneer.ai/generate/classification/label-existing \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "labels": ["positive", "negative", "neutral"],
    "inputs": [
      "This product exceeded all my expectations.",
      "The battery life is disappointingly short."
    ]
  }'
Both endpoints accept 1–1,000 strings per request and return annotations synchronously. Required fields are labels and inputs.
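Because each request accepts at most 1,000 strings, a larger corpus has to be split before it is sent. The sketch below shows one way to do that in Python. The helper names are hypothetical, and the request bodies use only the two documented fields, labels and inputs.

```python
MAX_INPUTS_PER_REQUEST = 1000  # documented upper bound per label-existing request


def batch_inputs(texts, batch_size=MAX_INPUTS_PER_REQUEST):
    """Split texts into consecutive lists of at most batch_size items."""
    if not 1 <= batch_size <= MAX_INPUTS_PER_REQUEST:
        raise ValueError(f"batch_size must be between 1 and {MAX_INPUTS_PER_REQUEST}")
    texts = list(texts)
    return [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]


def label_existing_payloads(labels, texts, batch_size=MAX_INPUTS_PER_REQUEST):
    """Build one request body per batch for a label-existing endpoint."""
    return [
        {"labels": list(labels), "inputs": batch}
        for batch in batch_inputs(texts, batch_size)
    ]
```

An empty corpus yields no payloads, so the lower bound of one string per request is never violated.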
Generation endpoints are rate-limited to 120 requests per minute per user. For large annotation jobs, batch your inputs and add a short delay between requests. If you need higher throughput, contact the Pioneer team about an enterprise plan.
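A limit of 120 requests per minute works out to one request every 0.5 seconds on average. One simple way to stay under it is to space requests at least that far apart, as in the Python sketch below. The `Throttle` class is a hypothetical client-side helper. It also assumes the limit can be treated as an even spacing requirement, which is stricter than the API may actually enforce.

```python
import time


class Throttle:
    """Enforce a minimum gap between calls (client-side, single-threaded)."""

    def __init__(self, max_per_minute=120, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute  # 0.5 s for 120 requests/minute
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Call `throttle.wait()` immediately before each request. The clock and sleep functions are injectable so the spacing logic can be checked without real delays.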

Next steps