AI Material Review 06 — What AI Review Tools Can and Cannot Do: Finding the Right Distance Between Over-Trust and Underestimation | AI Material Review | Pharmaceutical Advertising Regulation: Material Creation, Review & Use in Japan

What AI Review Tools Can and Cannot Do ── Finding the right distance between over-trust and underestimation

Across the first five volumes we traced a path: feeding materials to AI, checking them against the approved scope, giving it rules as a frame, and standardizing review. This time we step back and look calmly at the tools themselves ── what can an AI review-support tool actually do, and what can it not do. Take the sales pitch at face value and it sounds like "review becomes fully automatic." Try one and feel let down, and you want to write it off as "unreliable in the end." Both are extremes. Trust the tool too much and you invite missed violations; underrate it and you throw away real capability. The right stance is in between ── keeping just the right distance from the tool. This volume works through, in order: the kinds of tools, what they can do, what they cannot, the danger of over-trust, how to decide on adoption, and the discipline of checking before you use them.

01 Types of tools ── don't lump "AI review support" into one thing

First, let's unpack the words. "AI review-support tool" sounds like one thing, but the insides differ. Different mechanisms mean different strengths and different limits. Asking "is AI usable?" while mixing these together is why the conversation goes nowhere. Broadly, there are four kinds.

Type 01

Rule-based matcher

Catches "set words"

Mechanically matches banned words and fixed patterns against a dictionary. The banned-word dictionary you built in Vol. 4 works exactly as it is. Inflexible, but it reliably catches whatever you defined.

Type 02

Machine-learning classifier

Guesses "looks risky"

Learns from past pass/fail data and gives a probability that "this is likely to get flagged." It can catch phrasings absent from the dictionary, but why it judged that way is hard to explain.

Type 03

Generative AI (LLM) type

Writes the "comment"

A large language model (= AI trained on huge volumes of text) drafts the reasoning for a comment and rewrite suggestions in prose. Easy to read, but it mixes in plausible-sounding errors.

Type 04

Search / reference (RAG) type

Pulls up the "basis"

Uses RAG (= a mechanism that searches external documents and uses them in the answer) to retrieve approved information and past review cases. It leaves judgment to people and speeds up gathering the material.

Real products combine these four. Banned words are screened out by the rule base, delicate expressions are caught by the classifier, comments are drafted by the LLM, and the basis is pulled by the search type ── and so on. So when you evaluate a tool, don't ask "is it AI?"; look at "which type is used for which part." How much you can trust it differs by type. A rule base that catches set words works almost with certainty. A classifier that guesses by probability, or an LLM that writes prose, must be treated on the assumption that it will get things wrong.
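As a minimal sketch of the most dependable of the four types, the rule-based matcher can be pictured as a dictionary of patterns run over the text. The pattern entries, rule IDs, and the names `BANNED_PATTERNS`, `Hit`, and `match_banned` are hypothetical illustrations, not the contents of any real dictionary or product.

```python
import re
from dataclasses import dataclass

# Hypothetical banned-pattern dictionary. Real entries would come from
# the rule set the organization itself maintains.
BANNED_PATTERNS = {
    "absolute-claim": r"\b(?:100% effective|guaranteed cure)\b",
    "no-side-effects": r"\bno side effects\b",
}


@dataclass(frozen=True)
class Hit:
    rule_id: str
    start: int
    end: int
    text: str


def match_banned(material: str) -> list[Hit]:
    """Return every dictionary hit, ordered by position in the text."""
    hits = []
    for rule_id, pattern in BANNED_PATTERNS.items():
        for m in re.finditer(pattern, material, flags=re.IGNORECASE):
            hits.append(Hit(rule_id, m.start(), m.end(), m.group(0)))
    return sorted(hits, key=lambda h: h.start)


if __name__ == "__main__":
    for hit in match_banned("This drug is 100% effective and has no side effects."):
        print(hit.rule_id, repr(hit.text))
```

The sketch catches exactly what was defined and nothing else. A bare "effective" is deliberately absent from the dictionary, because whether it is appropriate depends on context, which this type cannot judge.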

02 What they can do ── tireless, thorough, fast

Before turning to limits, let's give fair credit to what they can do, because underestimation is also a mistake. AI support tools have strengths that human reviewers structurally cannot match.

First, coverage and consistency. People lose focus in the back half of a long document, and the same banned word may be judged differently depending on the day. Machines don't tire. They check hundreds of pages against the same yardstick to the very end, and their conclusion doesn't change between morning and evening. The "variation between reviewers" we saw in Vol. 5 can be erased in the machine-matching portion.

Second, speed and search. How was a similar expression judged in the past? Where might it touch the approved information? Work that takes a person tens of minutes of flipping through documents, a search-type tool surfaces as candidates in seconds. The tool "retrieves" faster than a reviewer can "recall."

Third, draft generation. Writing out the reasoning for a comment from scratch is hard labor. The LLM type can produce that first draft. In many situations, fixing the draft that comes out is faster than writing from zero.

What the strengths really are: Every one of these strengths has to do with "handling volume." Many, long, repetitive ── the very territory where people get bored and sloppy is where machines deliver uniform force. Put the other way, the work of judging quality rather than volume ── deciding "is this expression going too far in this context?" ── lies outside the machine, as we see in the next section.

03 What they cannot do ── context, figures, and final responsibility

So what lies outside the tool? Getting this wrong leads to over-trust, so let's lay it out concretely. The wall of "it's decided by context," which we touched on in Vol. 4, is the very core.

What tools are weak at	Why they are weak at it
Drawing the line by context	The same "effective" is appropriate within the approved scope, but exaggerated (Article 66) where there is no basis. It isn't decided by the word alone; judgments that shift with the surrounding meaning slip through
Implications of figures and layout	How a graph's axis is cut, the impression of a photo, the smallness of a note ── exaggeration that never becomes text can't be caught by matching against the wording
Expressions without precedent	New phrasings absent from the training data, or clever expressions that exploit gaps in the regulation, have no model in the dictionary or in past cases
The final pass/fail decision and responsibility	The decision that "this material may go out into the world" is borne by people, both legally and ethically. The machine only supplies the material for judgment; it cannot take on responsibility

Especially dangerous is the second: figures and layout. Exaggeration does not reside in words alone. Shift the origin of the vertical axis on an efficacy graph and a slight difference looks dramatic. Shrink a safety note to an unreadable size and the risk stops standing out. Even when the machine returns "pass" on the wording, the impression the whole material gives may be going too far ── this is a territory current tools can barely touch.
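The axis effect can be made concrete with a little arithmetic. The sketch below compares the ratio of two drawn bar heights with the ratio of the underlying values. The function names and the sample values (50 and 52, axis cut at 48) are made up for illustration and do not come from any real material.

```python
def visual_ratio(a: float, b: float, axis_origin: float) -> float:
    """Ratio of drawn bar heights (b over a) when the axis starts at axis_origin."""
    if axis_origin >= min(a, b):
        raise ValueError("axis origin must lie below both values")
    return (b - axis_origin) / (a - axis_origin)


def exaggeration(a: float, b: float, axis_origin: float) -> float:
    """How many times larger the drawn ratio is than the true ratio b / a."""
    if a <= 0:
        raise ValueError("a must be positive to form the true ratio")
    return visual_ratio(a, b, axis_origin) / (b / a)


if __name__ == "__main__":
    # Hypothetical values: control = 50, product = 52.
    print(visual_ratio(50, 52, 0))    # axis starts at zero
    print(visual_ratio(50, 52, 48))   # axis cut at 48
    print(exaggeration(50, 52, 48))
```

With the axis at zero the second bar is drawn 1.04 times as tall as the first. With the axis cut at 48 it is drawn twice as tall, although the wording of the material has not changed by a single character. A check that reads only the text has nothing to react to.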

And the fourth. Final responsibility can only be borne by people. This is not a problem of immature technology but of how regulation and ethics are built. Both the Pharmaceutical and Medical Device Act and the Guidelines for Sales Information Provision Activities (= the Sales-Info Guidelines, discussed later) place responsibility on people and organizations. "The AI passed it" is not a defense that holds up in any review. This one point won't move no matter how smart the tools get.

04 The risk of over-trust ── the trap of "the tool passed it, so it's safe"

Once you grasp what they can and cannot do, the failure mode to guard against most comes into view. It is automation bias (= the tendency to assume, without checking, that the answer a machine produces is correct). When a tool displays "no problem," people ease off looking with their own eyes. This is not weakness of will; it has long been known as a property of human attention.

What happens when this occurs on the review floor? The exaggeration the tool missed, the person waves through too. What should be a double fence ends up with a gap at the same spot in both layers. And the tricky part is that the more you use the tool, the duller the reviewer's eye becomes. Keep leaning on the machine and the muscle of judging for yourself weakens, a phenomenon long noted as the classic irony of automation. The reviewer who routinely leaves it to the tool is the one who fails to notice when the machine gets it wrong.

Don't shift where responsibility lies: The worst form of over-trust is the illusion that, when something fails, "the tool passed it" will stand as an excuse. But as Section 3 shows, responsibility lies only with people. A tool's "pass" is not an indulgence but the starting point for a person to check. Unless you write this line into the organization's operating rules, convenience itself becomes a breeding ground for missed violations.

The reverse ── underestimation ── also exists. One off-target comment from the tool and you write the whole thing off as "useless." Then you let go of even the coverage and consistency the machine is good at. Over-trust and underestimation are two sides of the same "failure of distance." The right distance lies in dividing the labor part by part: leave the parts it's good at, doubt the parts it's weak at.

05 The adoption decision ── decide first what, and how much, to delegate

So how do you actually decide whether to bring a tool in? The approach to avoid above all is deciding by how flashy the pitch is or how low the price is. Set the axes of judgment in this order.

Decide the purpose first ── Do you want thorough screening, a draft of the comment, or a search of past cases? Narrow the desired function to one, and the type you need (Section 1) is determined
Limit the use to screening ── Don't delegate the pass decision. Position the tool as a device for narrowing down "candidates worth a person's look." Leave the pass/fail decision with people
Try it small ── Don't roll it out company-wide at once; test it on materials whose judgment is already known. Check the actual missed violations and false alarms with your own eyes before widening
Look at maintenance and cost ── Every time the approved information changes, the dictionary or model needs updating. The cost and effort of continued use bite harder than the cost of adoption

There is one more axis you can't skip on the pharma floor: handling of confidential data. Materials sent to review contain unpublished product information and internal judgments. If the tool sends them to an outside cloud, check where the information is stored, whether it will be used for training, and how the contract protects it. If confidential information leaks in exchange for convenience, the tool has done the review more harm than good.

Fine to leave to the tool	People keep holding
Machine matching of banned words and fixed patterns	The final judgment of exaggerated vs. appropriate given context
Search and candidate listing of approved information and past cases	Evaluating the overall impression figures and layout give
Drafting the reasoning for a comment (on the premise a person fixes it)	The "may go out into the world" pass/fail decision, and its responsibility

The right side of this table ── the column people keep holding ── must not be surrendered when a tool is adopted. As long as you hold this, whichever product you choose you won't go far wrong. Conversely, a product premised on handing even the right side to the machine should be passed over, however high its spec.
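One way to keep the right-hand column with people is to make it structurally impossible for the tool's output to stand in for a decision. The sketch below is a hypothetical data model, not any product's API: the tool can only report `FLAGGED` or `NOTHING_FOUND`, and a pass/fail record cannot be created without a named reviewer. All class, field, and function names are assumptions for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class ToolResult(Enum):
    FLAGGED = "flagged"              # candidates a person should look at
    NOTHING_FOUND = "nothing_found"  # not a pass, only "the tool saw nothing"


class Decision(Enum):
    PASS = "pass"
    FAIL = "fail"


@dataclass(frozen=True)
class ReviewRecord:
    material_id: str
    tool_result: ToolResult
    decision: Decision
    reviewer: str  # the person who made the decision


def decide(material_id: str, tool_result: ToolResult,
           decision: Decision, reviewer: str) -> ReviewRecord:
    """Create a pass/fail record; refuse if no human reviewer is named."""
    if not reviewer.strip():
        raise ValueError("a pass/fail decision needs a named human reviewer")
    return ReviewRecord(material_id, tool_result, decision, reviewer)


if __name__ == "__main__":
    record = decide("M-001", ToolResult.NOTHING_FOUND, Decision.FAIL, "reviewer-a")
    print(record)
```

Note that the example reviewer fails a material the tool found nothing in. The two values are stored separately, so the record never lets "nothing found" be read as "pass."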

06 The discipline of validation ── measure missed violations before you use it

Even once adoption is decided, don't put it straight into production. A tool is validated before use. And where you place the weight of validation is crucial. To put the conclusion first: measure missed violations as the top priority.

First, prepare a bundle of materials whose correct answers are known ── ones people have already settled the judgment on. Mix in both appropriate and problematic ones. Run these through the tool and count two kinds of error separately. Missed violations (= returning "pass" on a problematic material) and over-detection (= stopping an appropriate material as "fail"). These two are utterly different in character.
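Counting the two kinds of error separately can be sketched as follows. The `Case` structure and field names are assumptions for illustration. `is_violation` stands for the judgment people have already settled, and `tool_flagged` for whether the tool stopped the material.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    material_id: str
    is_violation: bool  # the settled human judgment (the "correct answer")
    tool_flagged: bool  # True if the tool stopped the material


def error_counts(cases: list[Case]) -> dict:
    """Count missed violations and over-detections separately."""
    violations = sum(1 for c in cases if c.is_violation)
    appropriate = len(cases) - violations
    missed = sum(1 for c in cases if c.is_violation and not c.tool_flagged)
    over = sum(1 for c in cases if not c.is_violation and c.tool_flagged)
    return {
        "missed_violations": missed,
        "over_detections": over,
        # Rates are None when the set contains no material of that kind.
        "miss_rate": missed / violations if violations else None,
        "over_detection_rate": over / appropriate if appropriate else None,
    }


if __name__ == "__main__":
    sample = [
        Case("A", True, True), Case("B", True, False),
        Case("C", False, False), Case("D", False, True),
    ]
    print(error_counts(sample))
```

Keeping the two denominators apart matters. The miss rate is taken over problematic materials only, and the over-detection rate over appropriate ones only, so a set dominated by appropriate materials cannot hide a poor miss rate.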

Missed violations and over-detection don't weigh the same: Over-detection is handled when a person pushes back with "this is an error." The effort rises, but the danger doesn't go out into the world. A missed violation, however, leads straight to publication. If an exaggerated or unapproved expression slips through the tool and the person overlooks it out of over-trust, a material in violation of the regulation goes out into the world. So in validation, check how few violations are missed before you check how few false alarms are raised. Err on the safe side and design the workflow so that anything in doubt is routed to a person.
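For a tool that outputs a risk score, "swing to the safe side" can be read as choosing the flagging threshold from the miss rate first and accepting whatever over-detection follows. The sketch below assumes a hypothetical score between 0 and 1 where higher means riskier, and a material is routed to a person when its score is at or above the threshold. The function name, the scores, and the ceiling values are illustrative only.

```python
def pick_threshold(scored: list[tuple[float, bool]], max_miss_rate: float):
    """Pick the highest threshold whose miss rate stays within max_miss_rate.

    scored: (risk_score, is_violation) pairs from a correct-answer set.
    Returns (threshold, miss_rate, over_detection_rate) on that set.
    """
    violations = [s for s, v in scored if v]
    appropriate = [s for s, v in scored if not v]
    if not violations or not appropriate:
        raise ValueError("the set needs both problematic and appropriate materials")
    best = None
    # Candidate thresholds are the observed scores, in ascending order.
    # The miss rate can only grow as the threshold rises, so the last
    # candidate that satisfies the ceiling is the highest acceptable one.
    for t in sorted({s for s, _ in scored}):
        miss = sum(1 for s in violations if s < t) / len(violations)
        if miss <= max_miss_rate:
            over = sum(1 for s in appropriate if s >= t) / len(appropriate)
            best = (t, miss, over)
    return best


if __name__ == "__main__":
    scored = [(0.9, True), (0.8, True), (0.4, True),
              (0.7, False), (0.3, False), (0.2, False), (0.1, False)]
    print(pick_threshold(scored, max_miss_rate=0.0))
```

A threshold tuned this way only reflects the correct-answer set it was measured on, so it is a starting point for operation rather than a guarantee about new materials.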

Validation isn't finished in one pass. When approved information is revised or new forms of violation appear, the tool's real capability changes too. So re-measure periodically. In the same spirit as the version control of rules described in Vol. 4, keep a record of "when, on which correct-answer set, and with what result." Do this and you can later explain that "review in this period was supported by a tool of this capability." When there's an audit or an inquiry, this record is what answers for you.
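A record of "when, on which correct-answer set, and with what result" can be as simple as one line per validation run. The sketch below is one possible shape, with hypothetical field names, identifiers, and a hypothetical 90-day interval. None of these come from the source or from any regulation.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date


@dataclass(frozen=True)
class ValidationRun:
    run_date: str        # ISO date of the measurement
    answer_set_id: str   # which correct-answer set was used
    tool_version: str    # which dictionary/model version was measured
    total_cases: int
    missed_violations: int
    over_detections: int


def to_log_line(run: ValidationRun) -> str:
    """Serialize one run as a single JSON line for an append-only log."""
    return json.dumps(asdict(run), sort_keys=True)


def needs_remeasure(last_run: date, today: date,
                    max_age_days: int, approved_info_changed: bool) -> bool:
    """Re-measure when approved information changed or the last run is too old."""
    return approved_info_changed or (today - last_run).days > max_age_days


if __name__ == "__main__":
    run = ValidationRun("2025-04-01", "answers-v3", "tool-1.2.0", 200, 1, 14)
    print(to_log_line(run))
    print(needs_remeasure(date(2025, 4, 1), date(2025, 5, 1), 90, False))
```

Appending lines rather than overwriting keeps the history, so a later question about a given period can be answered from the run that was current at that time.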

One more point. A tool's score only has meaning when compared against human review. Set the rate at which people miss and the rate at which the tool misses side by side, and see how far the overall miss rate drops when the two are combined. The number you really want to know is not the tool's standalone score but how firm the fence becomes with the "human + tool" combination.
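The combined figure can be measured directly on the known violations in the correct-answer set. A violation slips through the double fence only when both the person and the tool miss it. The sketch below also prints the product of the two individual rates, which is what the combined rate would be if the two misses were independent. That independence is an assumption, and over-trust tends to break it, so the measured value is the one to rely on. The function name and sample data are hypothetical.

```python
def miss_rates(cases: list[tuple[bool, bool]]) -> dict:
    """cases: one (human_missed, tool_missed) pair per known violation."""
    n = len(cases)
    if n == 0:
        raise ValueError("need at least one known violation")
    human = sum(1 for h, _ in cases if h) / n
    tool = sum(1 for _, t in cases if t) / n
    combined = sum(1 for h, t in cases if h and t) / n  # missed by both
    return {
        "human": human,
        "tool": tool,
        "combined": combined,
        "if_independent": human * tool,  # reference value only
    }


if __name__ == "__main__":
    # Hypothetical set of 10 violations: the person missed two,
    # the tool missed three, and one was missed by both.
    sample = [(True, False), (True, True), (False, True), (False, True)]
    sample += [(False, False)] * 6
    print(miss_rates(sample))
```

When the measured combined rate sits above the independent reference, the person and the tool are missing the same items more often than chance would suggest, which is the pattern the section on over-trust warns about.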

07 Connections to other chapters ── place the tool inside the design of review

"How to keep distance from the tool," seen in this volume, connects to the series' other volumes as follows. A tool doesn't work on its own; it works only when placed inside the design of the review as a whole.

AI Material Review Vol. 4 ── Rule Design ── What this volume's rule-based matcher uses is the banned words and approved-information dictionary built in Vol. 4. The insides of the tool are the very rules you designed
AI Material Review Vol. 5 ── Standardizing Review and AI ── The tool's coverage and consistency are a means to erase variation between reviewers. Read this volume together as the equipment of standardization
AI Programming Vol. 1 ── Foundations of Code Generation ── The validation idea that "it ran" is not "it's correct" has exactly the same skeleton as the discipline of measuring a tool's score against a correct-answer set
AI Marketing Vol. 1 ── Marketing Redefined ── The equipment for absorbing the mass-generated promotional materials. The side that makes and the side that reviews use the same tool, AI, from opposite faces

In closing

An AI review-support tool is neither magic nor useless. In the work of handling volume (coverage, consistency, search) it delivers force people can't match. The work of deciding quality (drawing the line by context, the implications of figures, the final pass/fail and responsibility) stays outside the tool and with people. Get this boundary wrong and two failures await: the over-trust that delegates too much and misses violations, and the underestimation that writes the tool off and discards the benefits. Both come from getting the distance from the tool wrong.

The right distance lies in dividing the labor part by part. Leave the screening it's good at; doubt the judgment it's weak at. A "pass" is not an indulgence but the starting point for a person to check. Before use, measure missed violations as the top priority, and re-measure every time the approved information changes. And never entrust the final responsibility to the tool. As long as you don't break this stance, AI makes the reviewer faster and surer. Next time we take up how to preserve "when, by whom, and how it was judged" ── the review record and audit trail, and on through corrective action.

Key Points ── three to take away

"AI review-support tool" is not monolithic. Rule-based matching, the machine-learning classifier, the generative-AI (LLM) type, and the search-reference (RAG) type differ in how far you can trust them. The rule base that catches set words is almost certain; the classifier that guesses by probability and the LLM that writes prose are handled on the assumption they will miss. Evaluate not by "is it AI?" but by "which type is used for which part."
What tools are good at is the work of handling volume (coverage, consistency, search, drafting). What they're weak at is the work of deciding quality ── drawing the line of exaggerated vs. appropriate by context (Article 66), the implications of figures and layout, expressions without precedent, and the final pass/fail and responsibility. Responsibility lies only with people, legally and ethically, and "the AI passed it" holds up nowhere. The line that an MR can only go as far as providing information ── price, stock, delivery and other transaction terms are out of scope (they belong to the wholesaler and the hospital) ── is one you don't let the tool cross either.
Over-trust (automation bias) and underestimation are two sides of a failure of distance. The right distance lies in dividing the labor part by part. Adopt by narrowing the purpose, limiting the use to screening, trying it small, and checking maintenance cost and the handling of confidentiality. Before use, measure missed violations as the top priority on a correct-answer set, give more weight to keeping missed violations low than to keeping over-detection low, and re-measure periodically while recording the version.

Sources & references

Ministry of Health, Labour and Welfare. Act on Securing Quality, Efficacy and Safety of Products Including Pharmaceuticals and Medical Devices (Pharmaceutical and Medical Device Act), Articles 66, 68, and 68-2. (The provisions on prohibition of exaggerated advertising, prohibition of advertising pre-approval pharmaceuticals, and the appropriate provision of information in sales information provision activities, respectively.)
Director-General, Pharmaceutical Safety and Environmental Health Bureau, MHLW. Guidelines for Sales Information Provision Activities for Prescription Drugs. Yakusei-hatsu 0925 No. 1, September 25, 2018 (applied April 1, 2019). (The primary source defining the scope, methods, and structure of information provision activities.)
Director, Compliance and Narcotics Division, Pharmaceutical Safety and Environmental Health Bureau, MHLW. On the Revision of the Standards for Fair Advertising of Pharmaceuticals. Yakusei-kanma-hatsu 0929 No. 5, September 29, 2017. (A notice translating the Act's advertising regulation into practical standards. Issued by the Director of the Compliance and Narcotics Division.)
Parasuraman, R. & Manzey, D. H. Complacency and Bias in Human Use of Automation: An Attentional Integration. Human Factors, Vol. 52, No. 3, 2010. (The seminal work systematically laying out automation bias = the tendency to believe a machine's output without checking.)
Bainbridge, L. Ironies of Automation. Automatica, Vol. 19, No. 6, 1983. (The classic arguing the "irony of automation" ── that the further automation advances, the duller human judgment becomes.)
Ji, Z. et al. Survey of Hallucination in Natural Language Generation. ACM Computing Surveys, Vol. 55, No. 12, 2023. (A survey systematically summarizing how generative AI produces plausible-sounding errors.)
