Choose a base model
Pioneer offers four GLiNER base models. For most tasks, `fastino/gliner2-base-v1` is the right starting point: it’s fast, accurate, and supports LoRA and full fine-tuning. If your documents include non-English text, use a multi variant instead.

| Model ID | Use case | Training |
|---|---|---|
| `fastino/gliner2-base-v1` | English, general purpose | LoRA, Full |
| `fastino/gliner2-large-v1` | English, higher accuracy | LoRA, Full |
| `fastino/gliner2-multi-v1` | Multilingual | LoRA, Full |
| `fastino/gliner2-multi-large-v1` | Multilingual, higher accuracy | LoRA, Full |

You can always fetch the latest catalog from the API.
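The choice among the four model IDs listed above comes down to two questions: is the text multilingual, and is higher accuracy worth a larger model? A minimal sketch of that decision follows. The helper `pick_base_model` is hypothetical and not part of any Pioneer SDK; only the model IDs come from this guide.

```python
# Model IDs are copied from this guide; the lookup helper is illustrative only.
BASE_MODELS = {
    # (multilingual, higher_accuracy) -> model ID
    (False, False): "fastino/gliner2-base-v1",
    (False, True): "fastino/gliner2-large-v1",
    (True, False): "fastino/gliner2-multi-v1",
    (True, True): "fastino/gliner2-multi-large-v1",
}


def pick_base_model(multilingual: bool = False, higher_accuracy: bool = False) -> str:
    """Return the base model ID matching the two requirements."""
    return BASE_MODELS[(multilingual, higher_accuracy)]


print(pick_base_model())                   # default: English, general purpose
print(pick_base_model(multilingual=True))  # documents with non-English text
```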
Design your structure
A structure is a named record made up of fields. Each field has a name and a type. The two supported types are:

| Type | Use when | Stored as |
|---|---|---|
| `str` | Single value extracted from the text (an amount, a date, a party name). | A single string. |
| `list` | Multiple values for the same field (line items, bullet points, attendees). | An array of strings. |

A field can also carry an optional `choices` list (a closed enum like `["USD", "EUR", "GBP"]`) and an optional `description` to nudge the model on what counts. Both are visible at inference time only; at training time the model just sees the values you actually labelled.

A practical example is extracting invoices: the structure name (`invoice`) groups related fields. You can train multiple structures on the same dataset (for instance, `invoice` and `shipping_label` from the same documents) by including each one separately in the row.

Prepare your training data
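As a rough sketch of the row format used in this step, the snippet below builds one row with a `text` column and a `json_structures` column, then checks locally that every labelled value is an exact substring of the text. The invoice text, the field names (`invoice_number`, `total`, `line_items`), and the helper `find_non_verbatim_values` are made up for illustration. Pioneer's own validation may differ in detail, so treat this as a pre-upload sanity check rather than a replica of it.

```python
from typing import Any


def find_non_verbatim_values(row: dict[str, Any]) -> list[tuple[str, str, str]]:
    """Return (structure, field, value) triples whose value is not a substring of row["text"]."""
    text = row["text"]
    problems = []
    for instance in row["json_structures"]:  # one dict per structure instance
        for structure_name, fields in instance.items():
            for field_name, value in fields.items():
                # A list field is checked entry by entry; a str field is one entry.
                values = value if isinstance(value, list) else [value]
                for v in values:
                    if v not in text:
                        problems.append((structure_name, field_name, v))
    return problems


# Hypothetical example row: every value is copied verbatim from the text.
row = {
    "text": "Invoice INV-1042 from Acme Corp. Items: Widget x 10, Premium support. Total due: USD 1,250.",
    "json_structures": [
        {
            "invoice": {
                "invoice_number": "INV-1042",
                "total": "USD 1,250",
                "line_items": ["Widget x 10", "Premium support"],
            }
        }
    ],
}

# Same text, but the total was normalized instead of copied, so it should be flagged.
bad_row = {"text": row["text"], "json_structures": [{"invoice": {"total": "$1,250.00"}}]}

print(find_non_verbatim_values(row))
print(find_non_verbatim_values(bad_row))
```

Running a check like this before uploading catches paraphrased or normalized values early, which is cheaper than waiting for dataset validation to reject the row.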
Structured extraction needs labeled examples. Pioneer’s synthetic data generator covers NER, classification, and decoder tasks but does not generate structures, so plan on bringing your own labeled examples (typically dozens to a few hundred to start).

Each row needs a `text` column and a `json_structures` column. The `json_structures` value is a list of `{structure_name: {field: value}}` dicts, one per structure instance found in the text. Field values must appear verbatim in the text: validation rejects rows with values that aren’t exact substrings of `text`.

A few rules to keep in mind:

- Every value (including each entry in a `list` field) must be a verbatim span from `text`. Don’t paraphrase or normalize.
- A row can carry more than one instance of the same structure if the document contains multiples (e.g. two invoices in one email): just add more dicts to the `json_structures` list.
- `json_structures` can be combined with `entities`, `label`/`labels`, and `relations` in the same row to train a multi-head model in one job.

Wait until the dataset status is `ready` before proceeding.

Start a training job
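A minimal sketch of the JSON body for this step's `POST /felix/training-jobs` request is shown below. Only `base_model` and `training_type` are named in this guide. The `dataset_name` key, the dataset name itself, and the helper `build_training_job_body` are assumptions for illustration, so check the API Reference for the real way to reference a dataset. The snippet only builds the body and does not send anything.

```python
import json

TRAINING_JOBS_PATH = "/felix/training-jobs"  # endpoint path named in this guide


def build_training_job_body(base_model: str, dataset_name: str, training_type: str = "lora") -> str:
    """Serialize a training-job request body as JSON."""
    body = {
        "base_model": base_model,        # the GLiNER model chosen in step 1
        "training_type": training_type,  # "lora" is the value this guide recommends
        "dataset_name": dataset_name,    # ASSUMPTION: hypothetical key, not confirmed by the guide
    }
    return json.dumps(body)


payload = build_training_job_body("fastino/gliner2-base-v1", "my-invoice-dataset")
print("POST", TRAINING_JOBS_PATH)
print(payload)
```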
Submit your training job with `POST /felix/training-jobs`. Set `base_model` to the GLiNER model you chose in step 1 and `training_type` to `"lora"`. The training endpoint is shared with NER and classification; Pioneer infers the task heads from the dataset columns.

The response includes your job ID and initial status. Save the `id`: you’ll use it to poll status, run evaluations, and call inference.

Poll job status and review metrics
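Polling for this step can be sketched as a small loop that stops on any terminal status. The statuses `complete`, `failed`, and `stopped` come from this guide; the helper `wait_for_job`, its parameters, and the stubbed responses are hypothetical. The function takes a `fetch_job` callable instead of making HTTP calls itself, so the exact job URL and authentication, which this guide does not spell out, stay outside the sketch.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"complete", "failed", "stopped"}  # status values listed in this guide


def wait_for_job(
    fetch_job: Callable[[], dict],
    poll_seconds: float = 30.0,
    max_polls: int = 1000,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_job() until the job reaches a terminal status, then return the job dict."""
    for _ in range(max_polls):
        job = fetch_job()
        if job.get("status") in TERMINAL_STATUSES:
            return job
        sleep(poll_seconds)
    raise TimeoutError("job did not reach a terminal status")


# Usage with stubbed responses in place of real API calls.
_responses = iter([{"status": "requested"}, {"status": "running"}, {"status": "complete"}])
final = wait_for_job(lambda: next(_responses), poll_seconds=0, sleep=lambda _s: None)
print(final["status"])
```

In real use, `fetch_job` would wrap a GET request for the job `id` you saved; treating `failed` and `stopped` as terminal keeps the loop from spinning forever on a job that will never complete.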
Training typically takes a few minutes to a few hours depending on dataset size and epoch count. Poll the job endpoint until `status` is `"complete"`.

Job status values: `requested` → `running` → `complete` (or `failed` / `stopped`).

When the job reaches `"complete"`, the response includes evaluation metrics. For extraction, the metrics are computed per field across all structure instances. A high F1 score (above 0.85) generally indicates a model ready for production. If a particular field is dragging the score down, the most common fix is adding more training examples that contain that field, especially examples where the value is phrased differently from what you’ve already labelled.

Run an evaluation
Evaluate your trained model against a held-out dataset for a more rigorous read on performance before deploying.

Retrieve evaluation results with `GET /felix/evaluations/:id`. Results include `f1`, `precision`, `recall`, and a per-field breakdown so you can see which fields are accurate and which need more training data.

Run inference with your trained model
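A hedged sketch of an inference request body for this step follows. The keys `model_id`, `schema`, and `structures`, and the field keys `name`, `dtype`, `choices`, and `description`, are taken from this guide. The `text` key, the exact nesting (each structure name mapped directly to its list of fields), the sample text, and the helpers `field_problems` and `build_inference_body` are assumptions; check the API Reference before relying on them. The snippet builds and validates the body locally and sends nothing.

```python
ALLOWED_KEYS = {"name", "dtype", "choices", "description"}
ALLOWED_DTYPES = {"str", "list"}


def field_problems(field: dict) -> list[str]:
    """Return a list of human-readable problems with one field definition."""
    problems = []
    unknown = sorted(set(field) - ALLOWED_KEYS)
    if unknown:
        problems.append(f"unknown keys: {unknown}")
    if not field.get("name"):
        problems.append("missing name")
    if field.get("dtype") not in ALLOWED_DTYPES:
        problems.append(f"dtype must be one of {sorted(ALLOWED_DTYPES)}")
    return problems


def build_inference_body(model_id: str, text: str, structures: dict[str, list[dict]]) -> dict:
    """Validate every field, then assemble the request body."""
    for structure_name, fields in structures.items():
        for field in fields:
            problems = field_problems(field)
            if problems:
                raise ValueError(f"{structure_name}.{field.get('name')}: {problems}")
    # ASSUMPTION: "text" key and this nesting are illustrative, not confirmed.
    return {"model_id": model_id, "text": text, "schema": {"structures": structures}}


invoice_fields = [
    {"name": "invoice_number", "dtype": "str"},
    {"name": "date", "dtype": "str", "description": "Date the invoice was issued, not the due date"},
    {"name": "currency", "dtype": "str", "choices": ["USD", "EUR", "GBP"]},
    {"name": "line_items", "dtype": "list"},
]

body = build_inference_body(
    model_id="YOUR_JOB_ID",  # placeholder: the id saved from the training job
    text="Invoice INV-1042 issued 2024-03-01, due 2024-03-31. Total: USD 1,250.",
    structures={"invoice": invoice_fields},
)
print(body["schema"]["structures"]["invoice"][1])
```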
Use your job ID as the `model_id` to run predictions. Extraction lives under the `structures` key of the `schema` field: at inference time you describe each structure with a list of typed fields, and the model returns the values it extracts from the text.

Field options
Each entry in `fields` accepts these keys:

| Key | Type | Description |
|---|---|---|
| `name` | string | Field name returned in the response. |
| `dtype` | string | `"str"` for a single value, `"list"` for multiple values. |
| `choices` | string[] | Optional closed enum. Predictions outside the list are dropped. |
| `description` | string | Optional natural-language hint about what to extract. |

You can request multiple structures in a single call by adding more entries to the `structures` dict, and you can combine extraction with NER (`entities`), classification (`classifications`), or relations (`relations`) in the same request; the response carries each head independently.

You can also call inference using the OpenAI-compatible endpoint. Set `base_url` to `https://api.pioneer.ai/v1` and pass Pioneer fields via `extra_body`.

Tips for higher-quality extractions
- Cover the surface forms. If a field can appear in multiple ways (`$1,250.00`, `USD 1,250`, `one thousand two hundred fifty dollars`), include examples of each. The model extracts spans verbatim, so it can only learn the patterns it has seen.
- Use `choices` for closed enums. Currency codes, status flags, country codes: anything with a fixed vocabulary benefits from `choices`. Predictions outside the list are dropped at inference time.
- Write descriptions for ambiguous fields. `{"name": "date", "dtype": "str", "description": "Date the invoice was issued, not the due date"}` is materially more accurate than a bare `{"name": "date", "dtype": "str"}` when both dates appear in the document.
- Treat list fields as their own labelling decisions. Each entry in a `list` field has to be a verbatim span. Splitting a comma-separated string yourself (“Widget x 10, Premium support, 1 month” → `["Widget x 10", "Premium support, 1 month"]`) is more reliable than asking the model to do the splitting.
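One way to act on the first tip is to tally the coarse "shape" of the values you have labelled for a field and look for formats with few or no examples. The sketch below is an illustrative heuristic, not a Pioneer feature: `surface_shape` is a hypothetical helper that collapses letter runs to `a` and digit runs to `9`, and the sample values are made up.

```python
import re
from collections import Counter


def surface_shape(value: str) -> str:
    """Collapse runs of letters to 'a' and runs of digits to '9', keeping punctuation."""
    shape = re.sub(r"[A-Za-z]+", "a", value)
    return re.sub(r"\d+", "9", shape)


# Hypothetical labelled values for a single "total" field.
labelled_totals = [
    "$1,250.00",
    "$3,400.00",
    "USD 1,250",
    "one thousand two hundred fifty dollars",
]

shape_counts = Counter(surface_shape(v) for v in labelled_totals)
for shape, count in shape_counts.most_common():
    print(f"{count:>3}  {shape}")
```

Shapes that appear only once or twice are candidates for more labelled examples, since the model can only learn formats it has seen.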
Next steps
- Fine-tune a NER model — extract entities with the same GLiNER base model
- Fine-tune a classification model — assign labels to text with the same GLiNER base model
- Adaptive Inference — let Pioneer retrain your extractor automatically on live traffic
- API Reference — full endpoint documentation