Labeling training data by hand is slow and expensive. Pioneer’s data generation API lets you produce high-quality labeled examples from a short description of your domain and the labels you care about. You can also pass in raw unlabeled text and have Pioneer annotate it automatically. Either way, the resulting dataset is ready to feed directly into a training job.
Decide on your task type
Pioneer generates training data for three task types:

| Task type | Use case |
|---|---|
| ner | Named entity recognition: extract spans of text with entity labels |
| classification | Text classification: assign one or more labels to each input |
| decoder | Generative LLM training: prompt-completion or conversation pairs |

Choose the task type that matches the model you plan to train. You’ll pass it as task_type in the request body.
Start a generation job
Send a POST /generate request with your task type, a dataset name, the labels you want annotated, a description of your domain, and the number of examples to generate.

Required fields:

| Field | Description |
|---|---|
| task_type | "ner", "classification", or "decoder" |
| dataset_name | Name for the generated dataset (used when starting training) |
| num_examples | Number of labeled examples to generate |

Optional fields:

| Field | Description |
|---|---|
| labels | List of label strings (required for NER and classification) |
| domain_description | Short description of your content domain; improves output relevance |
| classified_examples | Seed examples with existing labels (classification only) |
| prompt | Additional instructions for the generation model |

The response includes a job ID you’ll use to poll status.
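As a rough illustration of the request described in this section, the following Python sketch assembles a POST /generate body from the documented fields and submits it. The base URL, the bearer-token Authorization header, and the JSON response shape are assumptions not stated on this page, and the dataset name and labels are hypothetical; check the API reference for the actual values.

```python
import json
import urllib.request

VALID_TASK_TYPES = {"ner", "classification", "decoder"}


def build_generate_request(task_type, dataset_name, num_examples,
                           labels=None, domain_description=None, prompt=None):
    """Assemble the JSON body for POST /generate from the documented fields."""
    if task_type not in VALID_TASK_TYPES:
        raise ValueError(f"unknown task_type: {task_type!r}")
    # The field table marks labels as required for NER and classification.
    if task_type in ("ner", "classification") and not labels:
        raise ValueError("labels are required for ner and classification")
    body = {
        "task_type": task_type,
        "dataset_name": dataset_name,
        "num_examples": num_examples,
    }
    # Include optional fields only when the caller supplied them.
    if labels:
        body["labels"] = list(labels)
    if domain_description:
        body["domain_description"] = domain_description
    if prompt:
        body["prompt"] = prompt
    return body


def start_generation_job(base_url, api_key, body):
    """POST the body to /generate and return the parsed JSON response.

    ASSUMPTION: bearer-token auth and a JSON response; neither is confirmed here.
    """
    request = urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


body = build_generate_request(
    "ner",
    dataset_name="support-tickets-ner",  # hypothetical dataset name
    num_examples=200,
    labels=["product", "error_code"],  # hypothetical labels
    domain_description="Customer support tickets for a SaaS product",
)
print(json.dumps(body, indent=2))
```

Validating the body locally before sending it catches a missing labels list early, without spending a request on a job that cannot succeed.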
Poll job status
Generation jobs run asynchronously. Poll GET /generate/jobs/:job_id until the status is "complete". Once complete, the dataset is available under the name you provided in dataset_name.
Use the dataset in a training job
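Before handing a dataset to training, the generation job has to finish. The sketch below polls until the job reports "complete" and then builds a training request that references the dataset by name. The fetch_status callable stands in for a GET /generate/jobs/:job_id call; the "status" response key, the "running" intermediate value, the timeout handling, and the "dataset_name" key in the training body are all assumptions rather than documented behavior.

```python
import time


def wait_for_completion(fetch_status, job_id, interval_s=5.0, timeout_s=600.0,
                        sleep=time.sleep):
    """Poll until the job reports "complete" or the timeout elapses.

    fetch_status(job_id) should call GET /generate/jobs/:job_id and return the
    parsed JSON. ASSUMPTION: the job state is exposed under a "status" key.
    """
    waited = 0.0
    while True:
        job = fetch_status(job_id)
        if job.get("status") == "complete":
            return job
        if waited >= timeout_s:
            raise TimeoutError(f"job {job_id} not complete after {timeout_s}s")
        sleep(interval_s)
        waited += interval_s


def build_training_request(dataset_name):
    """Body for POST /felix/training-jobs.

    ASSUMPTION: the dataset is referenced with a "dataset_name" key. Other
    training fields are omitted because this page does not list them.
    """
    return {"dataset_name": dataset_name}


# Usage with a stubbed status endpoint that completes on the third poll.
# "running" is a hypothetical intermediate status value.
responses = iter([
    {"status": "running"},
    {"status": "running"},
    {"status": "complete"},
])
job = wait_for_completion(lambda job_id: next(responses), "job-123",
                          sleep=lambda seconds: None)
training_body = build_training_request("support-tickets-ner")
print(job["status"], training_body)
```

Passing the status fetcher and the sleep function as parameters keeps the loop testable without network access. A production version would likely also stop early on a failure status, which this page does not describe.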
Pass the dataset name directly to POST /felix/training-jobs. See the NER fine-tuning guide or LLM fine-tuning guide for full training walkthroughs.
Auto-label existing text
If you already have raw text and want Pioneer to annotate it, rather than generating new examples from scratch, use the label-existing endpoints. This is useful when you have a corpus of real documents but haven’t labeled them yet. Auto-label requests for NER take labels and inputs.
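As a sketch of what an auto-label request body might contain, the helper below pairs a label set with the raw texts to annotate. Only the labels and inputs field names come from this page. The endpoint path is not shown here, so sending the request is left to the caller, and the batching is an illustrative choice rather than a documented requirement.

```python
def build_auto_label_batches(labels, inputs, batch_size=100):
    """Split raw texts into request bodies carrying `labels` and `inputs`.

    ASSUMPTION: batching is a convenience for large corpora; this page does
    not state a per-request limit.
    """
    if not labels:
        raise ValueError("at least one label is required")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Drop empty or whitespace-only documents before building requests.
    texts = [text.strip() for text in inputs if text and text.strip()]
    return [
        {"labels": list(labels), "inputs": texts[start:start + batch_size]}
        for start in range(0, len(texts), batch_size)
    ]


# Hypothetical labels and documents, for illustration only.
documents = [
    "The X200 router returns error E42 after the firmware update.",
    "   ",
    "Billing page times out when exporting invoices.",
    "Error E17 appears on the mobile app login screen.",
]
batches = build_auto_label_batches(["product", "error_code"], documents, batch_size=2)
print(len(batches), [len(batch["inputs"]) for batch in batches])
```

Each returned dictionary can be serialized as the JSON body of one auto-label request.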
Next steps
- Fine-tune a NER model — use your generated dataset to train a custom GLiNER model
- Fine-tune an LLM — train a decoder model on generated prompt-completion pairs
- Adaptive Inference — let Pioneer generate training data from live inference traffic automatically