What's the easiest way to fine-tune an SLM?
Pioneer is designed to make fine-tuning small language models (SLMs) as simple as possible. The entire process takes four steps:
- Create a dataset — Upload your own data or generate synthetic examples with Felix, Pioneer’s built-in synthetic data tool. See Datasets.
- Start a training job — Pick a base model, point it at your dataset, and submit. All hyperparameters have sensible defaults so you don’t need to tune anything to get started. See Training.
- Wait for completion — Your job moves through pending → running → complete. Small datasets typically finish in a few minutes.
- Run inference — Use your job ID as the model identifier. Encoder models accept a text and schema; decoder models are OpenAI-compatible. See Inference.
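The job lifecycle in step 3 can be sketched as a simple polling loop. This is a minimal simulation, not Pioneer's SDK: the class and method names here are hypothetical, while the state sequence (pending → running → complete) comes from the docs above.

```python
# States a training job moves through, per the docs.
JOB_STATES = ["pending", "running", "complete"]

class FakeTrainingJob:
    """Stand-in for a Pioneer training job; for illustration only."""

    def __init__(self, job_id: str):
        self.job_id = job_id
        self._step = 0

    def status(self) -> str:
        # Each poll advances the fake job one state, stopping at "complete".
        state = JOB_STATES[min(self._step, len(JOB_STATES) - 1)]
        self._step += 1
        return state

def wait_for_completion(job: FakeTrainingJob) -> str:
    # Poll until the job reports "complete". A real client would sleep
    # between polls and handle failure states as well.
    while (state := job.status()) != "complete":
        pass
    return state

print(wait_for_completion(FakeTrainingJob("job-123")))  # complete
```

A real integration would replace `FakeTrainingJob.status()` with a call to Pioneer's job-status endpoint and add a backoff between polls.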
Do you charge for storage?
No. Storage is free for all datasets on every plan. You won’t be charged for the datasets you create or upload to Pioneer.
Which plan is best for me?
It depends on your workload:
- Free — best if you want to experiment with new use cases or explore Pioneer before committing.
- Pro — best for production workloads where you need uncapped inference and higher rate limits.
- Enterprise (Custom) — best for organizations with compliance requirements such as HIPAA, or those that need private networking or VPC deployment.
How can I create synthetic data for training an LLM?
Describe your domain and the labels you want to train for, and Pioneer’s Felix pipeline generates realistic labeled examples at scale. This lets you bootstrap a training dataset without any manual annotation — useful when you’re starting from scratch or need to expand coverage for edge cases.
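To make the output concrete, here is a hypothetical shape for one synthetic labeled example, along with a sanity check worth running over generated data. The field names are illustrative, not Felix's actual schema.

```python
# Hypothetical shape of one synthetic labeled example for an NER task.
# Field names ("text", "entities", "span", "label") are illustrative.
example = {
    "text": "Acme Corp hired Dana Lee in March.",
    "entities": [
        {"span": "Acme Corp", "label": "organization"},
        {"span": "Dana Lee", "label": "person"},
        {"span": "March", "label": "date"},
    ],
}

def spans_are_grounded(ex: dict) -> bool:
    # A cheap quality check: every labeled span must actually occur
    # in the example's text, or the label is useless for training.
    return all(e["span"] in ex["text"] for e in ex["entities"])

print(spans_are_grounded(example))  # True
```

Checks like this are a useful first filter on any synthetic dataset before you submit a training job.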
How do I call models like Kimi-K2, Qwen or Deepseek?
This is a placeholder.
Do you offer any special pricing for students, non-profits, and open source projects?
Yes. The following groups qualify for a discounted Pro plan:
- Open source projects
- 501(c)(3) nonprofit organizations
- Students working on research projects
Do you train on our data?
Yes. By default, Pioneer may use your data to improve models, but you can opt out on the Pro and Custom plans. Custom plans also let you run fine-tuning privately inside your own VPC, so your data never leaves your infrastructure. Contact the team to learn more about Custom plan options.
Can I share models with teammates using Teams?
What's the difference between fine-tuning an encoder vs a decoder model?
Encoder models (like GLiNER) are trained to understand and extract structured information from text, making them ideal for NER, classification, and JSON extraction. They're fast, efficient, and run on CPU, so they're cheap to serve at scale. Decoder models (like Llama or Qwen) are generative: they produce text, which suits summarization, Q&A, chat, and instruction-following tasks.

Pioneer supports both. If your task has a defined output structure (extract these entity types, classify into these categories), use an encoder. If your task requires generating free-form text, use a decoder. When in doubt, start with an encoder: it trains faster, costs less, and is easier to evaluate.
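The rule of thumb above can be sketched as a tiny lookup. The task names and helper function here are illustrative, not part of Pioneer's API:

```python
# Tasks with a defined output structure -> encoder;
# free-form generation tasks -> decoder.
ENCODER_TASKS = {"ner", "classification", "json-extraction"}
DECODER_TASKS = {"summarization", "qa", "chat", "instruction-following"}

def recommend_model_family(task: str) -> str:
    if task in ENCODER_TASKS:
        return "encoder"
    if task in DECODER_TASKS:
        return "decoder"
    # When in doubt, the docs suggest starting with an encoder:
    # faster to train, cheaper, easier to evaluate.
    return "encoder"

print(recommend_model_family("ner"))            # encoder
print(recommend_model_family("summarization"))  # decoder
```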
How do I evaluate whether my fine-tuned model is good enough?
Pioneer runs evaluations automatically after training and reports F1, precision, and recall on your held-out validation set.
If your score is lower than expected, run a manual evaluation against a separate dataset for a cleaner signal. You can also inspect per-example predictions to identify where the model is failing, then use those gaps to generate targeted synthetic data. See Evaluations and Synthetic Data.
| F1 Score | What it means |
|---|---|
| Above 0.85 | Production-ready for most NER and classification tasks |
| 0.70 – 0.85 | Needs more training data or better label quality |
| Below 0.70 | Model hasn’t learned the task well enough yet |
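The bands in the table above are easy to encode as a helper if you want to triage models programmatically. The thresholds are taken from the table; the function name is ours:

```python
def f1_readiness(f1: float) -> str:
    """Interpret a validation F1 score using the bands from the docs."""
    if f1 > 0.85:
        return "production-ready"
    if f1 >= 0.70:
        return "needs more data or better labels"
    return "task not learned yet"

print(f1_readiness(0.91))  # production-ready
print(f1_readiness(0.78))  # needs more data or better labels
print(f1_readiness(0.55))  # task not learned yet
```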
What is GLiNER and when should I use it?
GLiNER is an open-source encoder model architecture designed specifically for named entity recognition and structured extraction. Unlike decoder models that generate text token by token, GLiNER classifies spans of text directly, making it significantly faster and more accurate for extraction tasks.

Use GLiNER when you need to extract specific entity types (people, organizations, products, dates), classify text into predefined categories, or run high-volume inference where latency and cost matter. Pioneer's fine-tuning pipeline is built around GLiNER: you can go from a domain description to a production-ready extraction model in minutes, with no GPU required on your end.
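To illustrate the span-classification idea (as opposed to token-by-token generation), here is a conceptual toy, not GLiNER itself: it enumerates candidate spans and "classifies" each against a label set. A trivial keyword gazetteer stands in for GLiNER's learned scorer.

```python
# Toy gazetteer standing in for a learned span scorer; illustration only.
GAZETTEER = {
    "person": {"dana lee"},
    "organization": {"acme corp"},
}

def extract(text: str, labels: list[str]) -> list[tuple[str, str]]:
    tokens = text.split()
    found = []
    # Enumerate candidate spans (here, up to 2 tokens) and classify
    # each one against the requested labels -- the core idea behind
    # span-based extraction.
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + 3, len(tokens) + 1)):
            span = " ".join(tokens[i:j])
            for label in labels:
                if span.lower().strip(".,") in GAZETTEER.get(label, set()):
                    found.append((span, label))
    return found

print(extract("Dana Lee joined Acme Corp.", ["person", "organization"]))
# [('Dana Lee', 'person'), ('Acme Corp.', 'organization')]
```

A real span-classification model scores every candidate span against the label descriptions with a learned encoder instead of a lookup, which is what lets it generalize to entities it has never seen.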

