sft), GRPO (grpo), and DPO (dpo) — are created through the same POST /felix/training-jobs endpoint and selected with the training_algorithm parameter.
Choose an algorithm
| Algorithm | training_algorithm | What it optimizes | Dataset signal |
|---|---|---|---|
| Supervised fine-tuning | sft (default) | Imitate the assistant turns in your examples | Chat messages |
| GRPO | grpo | Maximize a reward over sampled completions | prompt + answer (+ a reward function) |
| DPO | dpo | Prefer chosen responses over rejected ones | prompt + chosen + rejected |
training_algorithm is equivalent to sft, so existing requests keep working unchanged.
SFT
You have example outputs you want the model to imitate (conversations, instruction-response pairs). The default and simplest path.
GRPO
“Good” is a programmatic check — exact answers, numeric correctness, JSON validity, a rubric. The model explores and is reinforced toward higher reward.
DPO
You have preference pairs — a better and a worse response per prompt — rather than a single gold answer.
All three algorithms are LoRA-based. A completed job produces a low-rank adapter that is hot-swapped onto the shared base model at serve time and exposed behind the same inference endpoints as base models — reference the training job’s
id as the model_id at inference time. training_type defaults to "lora" and is the only supported value for decoder LLMs; "full" is reserved for GLiNER encoder models.End-to-end walkthrough
Choose a decoder base model
Use The table below shows a selection of popular options. Context window size matters if your training examples or inference prompts are long.
Choosing a model size: Smaller models (1B–8B) train and respond faster and cost less. Larger models (30B–70B) handle complex reasoning and longer inputs more reliably. Start with
GET /base-models to see the full current catalog, filtered to models that support training:| Model ID | Label | Context |
|---|---|---|
Qwen/Qwen3-32B | Qwen3 32B | 131K |
Qwen/Qwen3-30B-A3B-Instruct-2507 | Qwen3 30B A3B Instruct | 262K |
Qwen/Qwen3-8B | Qwen3 8B | 131K |
Qwen/Qwen3-4B-Instruct-2507 | Qwen3 4B Instruct | 262K |
Qwen/Qwen2.5-7B-Instruct | Qwen2.5 7B Instruct | 131K |
Qwen/Qwen2.5-14B-Instruct | Qwen2.5 14B Instruct | 131K |
meta-llama/Llama-3.3-70B-Instruct | Llama 3.3 70B Instruct | 131K |
meta-llama/Llama-3.1-8B-Instruct | Llama 3.1 8B Instruct | 131K |
meta-llama/Llama-3.1-70B-Instruct | Llama 3.1 70B Instruct | 131K |
meta-llama/Llama-3.2-3B-Instruct | Llama 3.2 3B Instruct | 131K |
deepseek-ai/DeepSeek-V3.1 | DeepSeek V3.1 | 163K |
google/gemma-4-31b-it | Gemma 4 31B IT | 128K |
openai/gpt-oss-120b | GPT-OSS 120B | 131K |
Qwen/Qwen3-8B or meta-llama/Llama-3.1-8B-Instruct for most tasks and scale up if needed.Not every model supports every algorithm — see Supported models below for the SFT/GRPO/DPO matrix.Prepare your training data
The dataset shape depends on the algorithm you picked. Pick the matching tab.See the Synthetic Data guide for the full set of
/generate options, including auto-labelling existing text. Once generated or uploaded, wait until the dataset status is ready before starting training.Start a training job
Submit your training job with Pioneer routes your job automatically to the best available provider. The response includes your job ID:
POST /felix/training-jobs. The training_algorithm parameter selects SFT, GRPO, or DPO.rl_config is required when training_algorithm is grpo or dpo and must be omitted for sft. Every key inside rl_config is optional and falls back to a TRL-aligned server default except reward_type, which is required for GRPO.Poll until training is complete
Check job status by polling Status transitions:
GET /felix/training-jobs/:id.requested → running → complete → deployed (or failed / stopped). The terminal success state is deployed, reached automatically once the adapter is live behind the inference endpoints.You can also stream training logs while the job is running:Run inference on your fine-tuned model
Once the job status is OpenAI-compatible endpoint — drop-in replacement for the OpenAI SDK:Anthropic-compatible endpoint:Streaming is supported on all three interfaces.
deployed, use your job ID as the model_id (or model) on any of the three inference interfaces.Pioneer native API — use "task": "generate" for decoder models:Downloading your trained model weights is available on the Pro plan and above. Use
GET /felix/training-jobs/:id/download to retrieve the weights once training is complete.LoRA hyperparameters
LoRA capacity and the core optimization settings are configurable; the defaults are sensible starting points for SFT and the RL algorithms alike.| Field | Default | Purpose |
|---|---|---|
lora_r | 16 | LoRA rank — adapter capacity. Raise it for harder tasks or larger datasets. |
lora_alpha | 32 | LoRA scaling factor (typically ~2× lora_r). |
lora_dropout | 0.1 | Dropout applied to the adapter during training. |
learning_rate | 2e-5 | Peak AdamW learning rate. |
batch_size | 4 | Per-step batch size. |
nr_epochs | 100 | Epoch ceiling; early stopping usually halts well before this. |
validation_data_percentage | 0.2 | Fraction of the dataset held out for validation. |
GRPO reward functions
GRPO (Group Relative Policy Optimization) samples multiple completions per prompt and reinforces the ones that score highest against a reward function. Setrl_config.reward_type to one of:
reward_type | Scores a completion as correct when… |
|---|---|
exact_match | the normalized completion equals answer |
contains_substring | answer appears anywhere in the completion |
numeric_match | the extracted number matches answer (handles #### 42, \boxed{42}, “the answer is 42”) |
choice_match | the final multiple-choice letter matches answer |
regex_match | the completion matches the supplied pattern |
json_match | the parsed JSON deep-equals answer |
json_loose_match | the parsed JSON loosely matches answer |
rougeL_match | ROUGE-L against the answer reference(s) |
llm_as_judge | a judge model scores the completion against a rubric |
When
reward_type is llm_as_judge, Pioneer mints and manages the judge credential for you — you never supply an API key. Optional judge knobs include llm_judge_model, llm_judge_rubric, and llm_judge_score_scale.Supported models
The canonical, live list is alwaysGET /base-models?supports_training=true. As of this writing:
| Base model | SFT | GRPO | DPO |
|---|---|---|---|
Qwen/Qwen3-8B | ✅ | ✅ | ✅ |
Qwen/Qwen3-32B | ✅ | ✅ | ✅ |
Qwen/Qwen3-4B-Instruct-2507 | ✅ | ✅ | ✅ |
Qwen/Qwen3-4B-Base | ✅ | ✅ | ✅ |
Qwen/Qwen3-1.7B-Base | ✅ | ✅ | ✅ |
meta-llama/Llama-3.1-8B-Instruct | ✅ | ✅ | ✅ |
HuggingFaceTB/SmolLM3-3B-Base | ✅ | ✅ | ✅ |
google/gemma-4-31b-it | ✅ | — | — |
meta-llama/Llama-3.2-3B-Instruct | ✅ | — | — |
Qwen/Qwen2.5-7B-Instruct | ✅ | — | — |
GRPO and DPO are available on the subset of models that have been verified end-to-end for reinforcement learning. Every trainable decoder supports SFT. Models marked
— for RL accept sft only; submitting grpo/dpo for them returns a 422.fastino/gliner2-base-v1, fastino/gliner2-large-v1, fastino/gliner2-multi-v1, fastino/gliner2-multi-large-v1) are also trainable through the same endpoint — see the encoder fine-tuning guides for NER, classification, and structured extraction.
Serverless inference for base models
If you want to run inference on a base model without fine-tuning, several models are available as serverless endpoints with no startup latency:| Model ID | Label | Context |
|---|---|---|
Qwen/Qwen3-235B-A22B-Instruct-2507 | Qwen3 235B A22B Instruct | 262K |
Qwen/Qwen3-8B | Qwen3 8B | 131K |
deepseek-ai/DeepSeek-V3.1 | DeepSeek V3.1 | 163K |
openai/gpt-oss-120b | GPT-OSS 120B | 131K |
meta-llama/Llama-3.3-70B-Instruct | Llama 3.3 70B Instruct | 131K |
moonshotai/Kimi-K2.6 | Kimi K2.6 | 262K |
GET /base-models?task_type=decoder&supports_inference=true to see the current serverless catalog.
Next steps
- Synthetic Data — generate training data without manual annotation
- Adaptive Inference — automatically retrain on live production data
- Agent Skills — let an AI coding agent manage training and inference for you
- Training Jobs API — every endpoint, parameter, and response field