01 The Model Tiers — Think in Large, Medium, and Small
First, take the AI models available today and sort them roughly into three tiers. It is not a strict taxonomy, but it gives you a map for choosing tools. Remember: the higher the tier, the smarter the model — and the greater the cost and the wait.
Large models (flagship)
Strong at complex reasoning, reading long documents, and judging subtle nuance. Higher cost per call and slower to respond. Use them only for the tough spots.
Medium models (balanced)
Summarizing, classifying, drafting — they handle most practical tasks. A good balance of cost and accuracy, and often the workhorse of the system.
Small models (lightweight, fast)
Good at routine extraction, format conversion, and easy sorting. Above all, fast and cheap. Suited to preprocessing run at high volume.
The key point is that "always use the smartest model" is not the right rule. Assigning the top-tier model to a job that only tidies up formatting is like hiring a specialist to write the address on an envelope. Match the model tier to the difficulty of the task. This obvious mapping is the starting point of cost optimization.
02 The Accuracy–Cost Trade-off — What to Measure Before Choosing
When choosing a model, there is more than one axis to compare. Lay these four side by side and the decision becomes concrete. Decide by looking at only one of them and you will pay for it later.
| Axis to compare | What it means | What happens if you overlook it |
|---|---|---|
| Accuracy | How correctly the task is done | Choose on price alone and human effort goes into cleaning up the errors |
| Cost | Unit price per volume of input and output (= tokens) | Choose on cleverness alone and the budget collapses the moment you scale |
| Speed (latency) | Time from query to answer | In interactive uses, slowness alone becomes the reason it goes unused |
| Stability | How little the result varies for the same input | In work that requires review and records, a lack of reproducibility hinders verification |
Most vendors bill "input tokens" and "output tokens" separately. A token is a small unit into which text is divided; in English, one token corresponds to roughly four characters. Feed in a long document as-is and the input side swells; ask for a long answer and the output side does. Unit prices are stated in each vendor's official price list, so always verify against primary sources before you choose. Second-hand figures are often out of date.
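As a rough illustration of separate input and output billing, the sketch below estimates the cost of a single call. It assumes the four-characters-per-token approximation from the text; the prices are placeholder values rather than any vendor's real rates, and the function names `estimate_tokens` and `call_cost` are hypothetical.

```python
import math

CHARS_PER_TOKEN = 4  # rough approximation for English text


def estimate_tokens(text: str) -> int:
    """Approximate the token count from the character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def call_cost(input_tokens: int, output_tokens: int,
              input_price_per_mtok: float, output_price_per_mtok: float) -> float:
    """Cost of one call, with input and output billed at separate rates.

    Prices are per million tokens; take real values from the vendor's price list.
    """
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000


# Placeholder prices for illustration only (not real vendor rates).
document = "x" * 40_000                  # a 40,000-character document
in_tok = estimate_tokens(document)       # about 10,000 tokens
cost = call_cost(in_tok, 500, input_price_per_mtok=3.0, output_price_per_mtok=15.0)
print(f"{in_tok} input tokens, estimated cost ${cost:.4f}")
```

For actual budgeting, a vendor's own tokenizer or token-counting facility gives exact counts; the character-based estimate is only good enough for a first comparison.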
A common trap here is the assumption that "a model with a lower unit price is cheaper overall." If a cheap model lacks accuracy and you re-query it repeatedly, or a person has to fix its output by hand, the total cost ends up higher. What you should watch is not the unit price but the "total cost to complete one job."
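The "total cost to complete one job" can be made concrete with a little arithmetic. The sketch below adds the model charges (including re-queries) to the expected cost of manual clean-up. Every number is invented for illustration, and `ModelProfile` and `total_cost_per_job` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    name: str
    cost_per_call: float      # model charge for one call
    avg_calls_per_job: float  # includes re-queries after a bad answer
    manual_fix_rate: float    # share of jobs a person still has to correct


def total_cost_per_job(m: ModelProfile, manual_fix_cost: float) -> float:
    """Model charges plus the expected cost of human clean-up for one job."""
    return m.cost_per_call * m.avg_calls_per_job + m.manual_fix_rate * manual_fix_cost


# Invented figures: the cheap model is 10x cheaper per call but needs more fixes.
cheap = ModelProfile("small", cost_per_call=0.002, avg_calls_per_job=1.8, manual_fix_rate=0.15)
strong = ModelProfile("large", cost_per_call=0.020, avg_calls_per_job=1.1, manual_fix_rate=0.02)

for m in (cheap, strong):
    print(m.name, round(total_cost_per_job(m, manual_fix_cost=2.00), 4))
```

With these made-up inputs the model with the lower unit price ends up more expensive per completed job, because the manual fix cost dominates. Real values for re-query and fix rates have to come from measurement.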
03 Routing — Sorting the Easy Work From the Hard Work
Once you understand the tiers, the next step is to use them selectively. Rather than sending every query to a single model, use routing (= a mechanism that dispatches work by content) to send easy items to small models and only the hard ones to large models. It is the same idea as factory inspection, where clearly normal items pass automatically and only the doubtful ones go to an expert.
Here are three routing patterns commonly used in practice.
- Routing by difficulty — Have a small model solve it first, and hand off to a large model only when confidence is low or the answer is ambiguous. Most inputs are easy, so the bulk can be processed by the cheap model.
- Routing by task type — Decide the model per task upfront by the nature of the work, as in "small for extraction, medium for summarizing, large for the final judgment." Simple to design and easy to operate.
- Routing by stage (cascade) — Try small → medium → large in order, and stop the moment an earlier stage produces sufficient quality. Only hard inputs reach the final tier.
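The cascade pattern can be sketched as a loop over tiers that stops at the first sufficiently confident answer. This is a minimal sketch under one large assumption: each model returns a confidence value between 0 and 1 alongside its answer. How that confidence is obtained is not covered here, and the `small` and `large` functions are stand-ins rather than real model calls.

```python
from typing import Callable, Sequence, Tuple

# A model here is any function returning (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


def cascade(query: str, tiers: Sequence[Tuple[str, Model]],
            threshold: float = 0.8) -> Tuple[str, str]:
    """Try tiers from cheapest to most capable; stop at the first confident answer.

    Returns (tier_name, answer). If no tier is confident, the last tier's
    answer is returned as a final resort.
    """
    if not tiers:
        raise ValueError("at least one tier is required")
    name, answer = "", ""
    for name, model in tiers:
        answer, confidence = model(query)
        if confidence >= threshold:
            break
    return name, answer


# Stand-in models for the demo: the small one is only confident on short inputs.
def small(q: str) -> Tuple[str, float]:
    return "small:" + q, 0.9 if len(q) < 20 else 0.4


def large(q: str) -> Tuple[str, float]:
    return "large:" + q, 0.95


tiers = [("small", small), ("large", large)]
print(cascade("reformat this date", tiers))
print(cascade("compare these two clinical study designs", tiers))
```

The threshold is the main tuning knob: set it too low and weak answers slip through, too high and nearly everything escalates to the expensive tier.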
The benefit of routing is not cost alone. Because light models return easy work instantly, overall responses get faster too. Do not forget, though, that the routing decision itself carries a cost. If the decision is too complex, the cost of deciding eats up the savings. It is safest to start with simple routing by task type and add difficulty-based decisions as needed.
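Routing by task type can be as small as a lookup table, which is why it is a reasonable starting point. The sketch below assumes the caller already knows the task type; the tier names and the `route` function are hypothetical, and the default tier is a design choice rather than a rule.

```python
# Hypothetical tier names; map them to real model identifiers in one place.
ROUTES = {
    "extraction": "small",
    "format_conversion": "small",
    "summarization": "medium",
    "classification": "medium",
    "final_judgment": "large",
}
DEFAULT_TIER = "medium"  # unknown task types go to the workhorse tier


def route(task_type: str) -> str:
    """Pick a tier from a static table; the lookup adds no model cost."""
    return ROUTES.get(task_type, DEFAULT_TIER)


for task in ("extraction", "summarization", "final_judgment", "translation"):
    print(task, "->", route(task))
```

Because the decision is a dictionary lookup, the routing step itself adds no model cost. Difficulty-based escalation can be layered on later for the task types where it pays off.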
04 Caching — Don't Pay Twice for the Same Question
In real-world systems, similar queries repeat again and again. The same explanation of an internal policy, the same summary of the same product information, the same boilerplate email draft — re-asking the model every time wastes both cost and time. Caching (= a mechanism that temporarily stores a result once produced and reuses it) reduces this duplication.
Caching for AI comes in two broad kinds. Understanding both widens your design options.
| Kind | How it works | Where it helps |
|---|---|---|
| Response caching | Store the whole answer to a given query, and next time return it without calling the model | Boilerplate work where queries repeat. Cost and wait can both drop to nearly zero |
| Prompt caching | Have the provider cache the long, always-shared preamble (policies, instructions, documents) so that each request effectively pays full price only for the parts that change | Uses that reuse a long shared context many times. Greatly cuts the cost of input tokens |
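A response cache can be sketched as a dictionary keyed by a hash of the query. The version below is a minimal in-memory sketch, not a production design. It assumes that the model name and a document version belong in the key, so that a revised source document never serves an old answer. `ResponseCache` and its methods are hypothetical names.

```python
import hashlib
from typing import Callable, Dict


class ResponseCache:
    """In-memory response cache keyed by model, document version, and query."""

    def __init__(self, call_model: Callable[[str], str], model: str, doc_version: str):
        self._call_model = call_model
        self._model = model
        self._doc_version = doc_version
        self._store: Dict[str, str] = {}
        self.model_calls = 0  # how many times the model was actually called

    def _key(self, query: str) -> str:
        normalized = " ".join(query.lower().split())  # ignore case and extra spaces
        raw = "\x1f".join((self._model, self._doc_version, normalized))
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def ask(self, query: str) -> str:
        key = self._key(query)
        if key not in self._store:
            self.model_calls += 1
            self._store[key] = self._call_model(query)
        return self._store[key]

    def set_doc_version(self, doc_version: str) -> None:
        """A revised source changes every key, so old answers are no longer served."""
        self._doc_version = doc_version


cache = ResponseCache(lambda q: f"answer to: {q}", model="medium", doc_version="2025-01")
cache.ask("What is the leave policy?")
cache.ask("what is the  leave policy?")   # served from the cache
print(cache.model_calls)
cache.set_doc_version("2025-06")          # policy revised
cache.ask("What is the leave policy?")    # asked again against the new version
print(cache.model_calls)
```

A real deployment would also need expiry and eviction, and a decision about which queries are safe to store at all.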
Prompt caching is a feature offered by the major model vendors, where the input unit price for the shared portion is discounted. When you ask dozens of questions against a long internal manual as the preamble, the effect is large. Specific discount rates and retention times are in each vendor's official documentation, so verify there.
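The effect of prompt caching on input cost can be estimated before committing to a design. The sketch below compares the input-side cost with and without a cached preamble. The read and write multipliers are hypothetical parameters to be filled in from the vendor's documentation, and the model assumes every later request arrives before the cached preamble expires.

```python
def input_cost_without_cache(shared_tokens: int, question_tokens: int,
                             n_questions: int, price_per_mtok: float) -> float:
    """Every request pays full price for the shared preamble and the question."""
    return n_questions * (shared_tokens + question_tokens) * price_per_mtok / 1_000_000


def input_cost_with_cache(shared_tokens: int, question_tokens: int,
                          n_questions: int, price_per_mtok: float,
                          cached_read_multiplier: float,
                          cache_write_multiplier: float = 1.0) -> float:
    """First request writes the preamble to the cache; later ones read it at a discount."""
    if n_questions <= 0:
        return 0.0
    write = shared_tokens * cache_write_multiplier
    reads = (n_questions - 1) * shared_tokens * cached_read_multiplier
    questions = n_questions * question_tokens
    return (write + reads + questions) * price_per_mtok / 1_000_000


# Placeholder numbers: a 50,000-token manual and 40 questions of 100 tokens each.
args = dict(shared_tokens=50_000, question_tokens=100, n_questions=40, price_per_mtok=3.0)
plain = input_cost_without_cache(**args)
cached = input_cost_with_cache(**args, cached_read_multiplier=0.1)
print(f"without cache ${plain:.3f}, with cache ${cached:.3f}")
```

The saving grows with the length of the shared preamble and the number of questions asked against it, which matches the manual-plus-many-questions case described above.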
05 Designing Evals — Making "It Got Better" Measurable
Even after selecting models by tier and adding routing and caching, if you cannot measure "whether quality is actually being held," your judgment rests on intuition. What you need here is evaluation (= evals, a systematic mechanism for measuring whether a system's output is as expected). The crucial thing is to put it in a form you can compare with numbers, not impressions.
There are several patterns for building evals. Combine them according to the use.
Matching against ground truth
Prepare a question set with "correct answers" in advance and measure how closely the model's output matches. Suited to work with definite answers, like classification and extraction.
Human scoring
Have people score, against criteria, outputs with no single correct answer, such as summaries and explanatory text. Labor-intensive, but it captures subtle quality.
AI scoring
Put another model in the scorer's seat to evaluate large volumes quickly. Periodically reconcile against human scoring and correct for drift as you use it.
Regression checking
Run the same question set before and after changing a model or setting to confirm quality has not dropped. It is the basis for deciding whether to switch to a cheaper model.
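Reconciling AI scoring against human scoring can start with two simple numbers: how often the two agree, and in which direction the AI scorer leans. The sketch below assumes both scorers rated the same audit sample on the same integer scale. The scores are invented, and `agreement_rate` and `mean_bias` are hypothetical helper names.

```python
from typing import Sequence


def _check(human: Sequence[int], ai: Sequence[int]) -> None:
    if len(human) != len(ai):
        raise ValueError("score lists must be the same length")
    if not human:
        raise ValueError("at least one scored item is required")


def agreement_rate(human: Sequence[int], ai: Sequence[int], tolerance: int = 0) -> float:
    """Share of items where the AI score is within `tolerance` of the human score."""
    _check(human, ai)
    hits = sum(abs(h - a) <= tolerance for h, a in zip(human, ai))
    return hits / len(human)


def mean_bias(human: Sequence[int], ai: Sequence[int]) -> float:
    """Positive means the AI scorer is more lenient than the human scorers."""
    _check(human, ai)
    return sum(a - h for h, a in zip(human, ai)) / len(human)


# Invented 1-5 scores on a small audit sample rated by both scorers.
human_scores = [4, 3, 5, 2, 4, 3]
ai_scores = [4, 4, 5, 3, 4, 3]
print(agreement_rate(human_scores, ai_scores))               # exact agreement
print(agreement_rate(human_scores, ai_scores, tolerance=1))  # within one point
print(mean_bias(human_scores, ai_scores))
```

Tracking these two figures over time is one way to notice drift: a falling agreement rate or a growing bias signals that the AI scorer needs recalibrating against human judgment.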
The order for designing an eval is "gather representative inputs → decide the scoring criteria → always run it before and after a change." What matters here is building the eval question set from the queries that actually arrive. Collect only ideal examples and you cannot tell whether it will withstand the awkward inputs of the field. Whether you may switch to a cheaper model can only be decided with justification once this regression check shows "quality has not dropped." Cost reduction without evaluation is indistinguishable from a quiet decline in quality.
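The regression check described above can be sketched as an exact-match accuracy function plus a gate that compares the current and candidate models. The eval set and both "models" below are toy stand-ins, and the `max_drop` tolerance is an assumed parameter to be set per task, not a recommended value.

```python
from typing import Callable, List, Tuple

EvalSet = List[Tuple[str, str]]  # (input, expected label)


def accuracy(model: Callable[[str], str], eval_set: EvalSet) -> float:
    """Exact-match accuracy against ground truth."""
    if not eval_set:
        raise ValueError("eval set is empty")
    correct = sum(model(x).strip().lower() == y.strip().lower() for x, y in eval_set)
    return correct / len(eval_set)


def passes_regression(baseline: float, candidate: float, max_drop: float = 0.01) -> bool:
    """Allow the switch only if accuracy fell by no more than `max_drop`."""
    return candidate >= baseline - max_drop


# Toy eval set; a real one is built from the queries that actually arrive.
eval_set = [
    ("Please refund my order", "refund"),
    ("Where is my parcel?", "shipping"),
    ("refund pls!!", "refund"),
    ("parcel arrived broken, want money back", "refund"),
]


def current_model(text: str) -> str:
    t = text.lower()
    return "refund" if "refund" in t or "money back" in t else "shipping"


def cheaper_model(text: str) -> str:
    return "refund" if "refund" in text.lower() else "shipping"


base, cand = accuracy(current_model, eval_set), accuracy(cheaper_model, eval_set)
print(base, cand, passes_regression(base, cand))
```

In this toy case the cheaper stand-in misses the awkward fourth input, so the gate blocks the switch. An eval set made only of clean examples would have hidden that failure.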
06 Handling Your Own Data — Train It, or Reference It
"I want it to answer based on our own documents" is a very common request in day-to-day pharma work. The important thing is not to confuse the options: there are two broad paths, and they differ fundamentally in nature.
- Reference it (RAG) — Leave the model as-is and, for each question, search for relevant documents and pass them along (= RAG, retrieval-augmented generation). Swap the documents and the answers change, so updating is fast. Easy to cite sources, and a good fit for regulated work. The first candidate to consider.
- Train it (fine-tuning) — Give the model itself additional training so it memorizes a particular style or format. Effective when you want to stabilize tone or form, but updating the content requires rebuilding, and it is hard to see what was learned.
In a domain like pharma, where accuracy of information, speed of updating, and explicit sourcing are required, it is safest to start with the reference approach (RAG). When a package insert or guideline is revised, the reference approach keeps up with the latest simply by swapping the documents. The training approach makes it hard to trace after the fact "on what basis it answered that way," and calls for caution in work that requires review and records.
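The reference approach can be sketched in two steps: retrieve the relevant documents, then pass them to the model with an instruction to cite them. The sketch below uses simple word overlap as a stand-in for real search (production systems typically use a search index or embeddings). The document IDs and contents are invented, and the model call itself is left out.

```python
import re
from typing import Dict, List, Set, Tuple


def tokenize(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, documents: Dict[str, str], top_k: int = 2) -> List[Tuple[str, str]]:
    """Rank documents by word overlap with the question (a stand-in for real search)."""
    q = tokenize(question)
    scored = [(len(q & tokenize(body)), doc_id, body) for doc_id, body in documents.items()]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(doc_id, body) for score, doc_id, body in scored[:top_k] if score > 0]


def build_prompt(question: str, sources: List[Tuple[str, str]]) -> str:
    """Pass the retrieved text to the model and ask it to cite source IDs."""
    context = "\n".join(f"[{doc_id}] {body}" for doc_id, body in sources)
    return ("Answer using only the sources below and cite their IDs. "
            "If the sources do not contain the answer, say so.\n\n"
            f"Sources:\n{context}\n\nQuestion: {question}")


# Invented documents; replacing an entry is all an update requires.
documents = {
    "insert-v3": "Storage: keep the product below 25 degrees and away from light.",
    "faq-12": "Orders ship within two business days.",
}
question = "What is the storage temperature for the product?"
sources = retrieve(question, documents)
print(build_prompt(question, sources))
```

Because each source carries an ID into the prompt, the answer can point back to the document it relied on, and revising a document means replacing one dictionary entry rather than retraining anything.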
07 Operational Essentials — Choosing Isn't the End
Model selection is not something you decide once and are done with. Models are updated frequently, and both price and performance move. The optimal answer of six months ago is not necessarily optimal now. Here are four essentials for avoiding trouble in operation.
- Build so you can swap — Don't depend deeply on a particular model; consolidate the call site in one place. When a new model appears, you can switch after simply passing the evals.
- Make cost visible — Make it visible day by day which processes are eating cost. Invisible cost is something you only notice once the scale has ballooned.
- Set limits and fallbacks — Cap the cost and volume per process, and provide an alternative path for when the model does not respond. Prepare for both runaway and stoppage.
- Keep monitoring at fixed points — Run the eval question set periodically and track changes in quality. A model-side update can quietly change behavior.
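Three of the points above (one call site, visible cost, limits with a fallback) can live in a single small wrapper. This is a minimal sketch under simplifying assumptions: a flat cost per call, a cap expressed as a number of calls per process, and any exception from the primary model triggering the fallback. `ModelGateway` and `BudgetExceeded` are hypothetical names.

```python
from typing import Callable, Dict


class BudgetExceeded(RuntimeError):
    """Raised when a process has used up its allowed number of calls."""


class ModelGateway:
    """Single call site: swap models, cap volume, and fall back in one place."""

    def __init__(self, primary: Callable[[str], str], fallback: Callable[[str], str],
                 cost_per_call: float, max_calls_per_process: int):
        self._primary = primary
        self._fallback = fallback
        self._cost_per_call = cost_per_call
        self._max_calls = max_calls_per_process
        self.calls: Dict[str, int] = {}  # call count per process

    def call(self, process: str, prompt: str) -> str:
        used = self.calls.get(process, 0)
        if used >= self._max_calls:
            raise BudgetExceeded(f"call limit reached for process {process!r}")
        self.calls[process] = used + 1
        try:
            return self._primary(prompt)
        except Exception:
            # Primary model failed or timed out: take the alternative path.
            return self._fallback(prompt)

    def cost_report(self) -> Dict[str, float]:
        """Estimated spend per process, for a daily cost view."""
        return {p: n * self._cost_per_call for p, n in self.calls.items()}


def flaky_primary(prompt: str) -> str:
    raise TimeoutError("model did not respond")


gateway = ModelGateway(flaky_primary, lambda p: "fallback: " + p,
                       cost_per_call=0.01, max_calls_per_process=2)
print(gateway.call("summaries", "summarize A"))
print(gateway.call("summaries", "summarize B"))
print(gateway.cost_report())
```

Since every call passes through `call`, replacing the primary model is a one-line change that can be gated on the eval results.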
In pharmaceutical work especially, keeping a record of when each output was produced, with which model, and what it contained pays off later. When you are asked for the basis of an output, being able to reproduce the configuration used at the time supports accountability. The more you chase speed and low cost, the more this unglamorous record-keeping comes to matter.
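A record of "when, with which model, and what was produced" can be as simple as one JSON line per call. The sketch below is one possible shape, not a prescribed format. The field names are assumptions, and storing a hash of the prompt rather than the prompt itself is a design choice that suits cases where the prompt text is archived elsewhere.

```python
import hashlib
import json
from datetime import datetime, timezone


def audit_record(model: str, settings: dict, prompt: str, output: str) -> str:
    """One JSON line recording when, with which model, and what was produced."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "settings": settings,
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "output": output,
    }
    return json.dumps(record, ensure_ascii=False, sort_keys=True)


# Hypothetical model identifier and settings.
line = audit_record("medium-2025-01", {"temperature": 0},
                    "Summarize section 4.", "Section 4 covers storage.")
print(line)
```

Appending such lines to a write-once log makes it possible to answer later which model version and settings produced a given output.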
08 Connections to Other Chapters on This Site
This installment reads better alongside the following chapters. Model selection is not a standalone technique; it acquires meaning only when placed within the design of the whole system and within regulation.
- AI Programming Vol. 9 — Architecture Design — Into what kind of structure do you embed the model you chose? The craft of using AI for the design consultation itself.
- Ad Regulations Vol. 1 — The Pharmaceutical Act — Text generated by AI is also subject to regulation. Exaggeration falls under Article 66 of the Pharmaceutical Act, unapproved products under Article 68, and information provision under Article 68-2.
- Material Review series — The practice of review that ultimately receives the model's output. The decision to switch to a cheaper model also passes through the reviewer's eye.
The core of model selection is not "always use the smartest tool" but "match the tool tier to the difficulty of the task." Hard judgments to large models, everyday work to medium models, simple sorting to small models. On top of that, use routing to separate easy work from hard work, and use caching to reuse repeating queries. That alone can bring both cost and wait time down sharply.
But always pair cost-cutting measures with evals. "It got cheaper" and "quality is being held" are separate facts, and the latter can only be confirmed by measuring. In pharma work, a missed cache update or an untraceable basis leads directly to misinformation or a gap in accountability. The faster and cheaper a mechanism makes things, the more important it is to put the unglamorous safeguards of updating and record-keeping in place first. An eye for choosing tools and a mechanism for choosing them continuously together form the foundation for rooting AI in your work. Next time, we move on to the structure into which the chosen model is embedded: how to use AI in architecture design.
- See models in three tiers — large, medium, small — and match the tier to the difficulty of the task. The axes to compare are not unit price alone but four: accuracy, cost, speed, and stability. What to watch is not the unit price but the "total cost to complete one job," judged including the fix-up cost of a cheap model.
- Use routing to separate easy work from hard work, and caching (response caching / prompt caching) to reuse repeating queries. In pharma, though, two safeguards apply: do not leave personal or clinical data in the cache, and reliably invalidate old answers when regulatory documents are revised.
- Pair cost reduction with evals. Build a question set from representative inputs, run regression checks before and after a change, and switch to a cheaper model only once you can show with numbers that "quality has not dropped." For your own data, start with the reference approach (RAG), and confirm in advance whether the data is used for the provider's training.
- Anthropic. Pricing / Prompt caching (official documentation). Anthropic, 2025. (Referenced as primary information on model unit prices and prompt caching specifications.)
- OpenAI. API Pricing / Models (official documentation). OpenAI, 2025. (Referenced as primary information on model tiers and input/output token billing.)
- Google. Gemini API Pricing (Google AI for Developers official documentation). Google, 2025. (Referenced as primary information for neutral model price comparison.)
- Lewis, P. et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems (NeurIPS), 2020. (Referenced as the original source for the reference approach, RAG.)
- Liang, P. et al. Holistic Evaluation of Language Models (HELM). Transactions on Machine Learning Research, 2023. (Referenced as a general framework for eval design in language-model evaluation.)
- Ministry of Health, Labour and Welfare, Director-General of the Pharmaceutical Safety and Environmental Health Bureau. Guideline for Sales Information Provision Activities for Prescription Drugs. MHLW, 2018. (Referenced as the basis that the same yardstick extends to generated outputs.)