Production QM Vol. 6 — AI Checking | Production QM

AI Checking: Running AI Checks and Handling the Output. Use AI as a tool that surfaces candidates; keep the final call human.

An AI check is not a device that pronounces "right" or "wrong." It plants flags on suspect passages and narrows down where a human should look. With too many flags or too few, the review collapses. This sixth installment uses procedures and templates to fix the order in which AI checks run, how to read the output, how to handle false positives and false negatives, how to hand off to human judgment, and what to record. The decision-maker remains human throughout. AI produces candidates, not conclusions.

01 Where the AI check sits: generating candidates, not replacing judgment

Within the regulatory map laid out in Part 3 and its relationship to the QC gate (C) and QA gate (D), the AI check belongs upstream of human review, as a pre-processing step. The AI scans the full draft and plants flags on passages that warrant attention from regulatory, scientific, and consistency angles. Human reviewers can then concentrate their scrutiny on the flagged passages, which cuts both misses and review time.

One principle must not be crossed. The AI's output is a candidate finding, not a verdict. Even if the AI writes "constitutes exaggerated advertising under Article 66," that is only a flag indicating possibility. Whether it actually applies is decided by a human reviewer who reads the norm and weighs the context.

Principle: An AI check is a candidate generator, not a verdict engine. The AI's "applies" is a flag that triggers human confirmation; it never becomes a conclusion on its own.

This installment covers how to plant those flags (execution), read them (interpretation), sort them (false positives and negatives), hand them to people (connection), and keep them (recordkeeping). The adjacent Part 5: Check design (in preparation) designs the check items themselves; this piece runs them; Part 7: Human review (in preparation) takes over the decision.

02 Prerequisites: fix the input, the check lenses, and the thresholds

Before running an AI check, prepare three fixed items. A check missing any of them does not reproduce and cannot be recorded.

Fixed item	Contents	What happens without it
Input scope	Draft version number, coverage (body / figures / footnotes / references), edition of the source documents referenced	Which version was reviewed becomes unclear; findings float free
Check-lens set	Regulatory (§66/§68/§68-2, ad standards, MSA guidelines, JPMA), scientific accuracy, in-house wording rules	Lenses drift run to run; the cause of a miss cannot be pinned down
Thresholds	Minimum confidence to flag, severity tiers (high/mid/low), the severity line that must always reach a human	Too many flags go unread, or too few and things slip through

Reduce the lenses from norms like the MSA guidelines and the Standards for Proper Advertising of Drugs into items granular enough for the AI to match against. "Not exaggerated" cannot be matched. Break it down to "no wording suggesting an unapproved indication (Article 68)" and "are the data's limiting conditions (patient population, concomitant use, observation period) stated alongside?"
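The three fixed items can be written down as a small configuration object that refuses to run when one of them is missing. The following is a minimal sketch in Python, not a prescribed format: the class `CheckConfig`, its field names, and the example values (version string, edition label, threshold of 0.3) are all assumptions made for illustration.

```python
from __future__ import annotations
from dataclasses import dataclass

SEVERITIES = ("high", "mid", "low")


@dataclass(frozen=True)
class CheckConfig:
    """The three fixed items for one AI-check run (hypothetical field names)."""
    draft_version: str
    coverage: tuple[str, ...]        # e.g. body, figures, footnotes, references
    source_editions: dict[str, str]  # source document -> edition referenced
    lenses: tuple[str, ...]          # matchable check items, not broad norms
    min_confidence: float            # minimum confidence to flag
    always_human_severity: str       # severity line that must always reach a human

    def missing_items(self) -> list[str]:
        """Return the names of fixed items that are empty or invalid."""
        problems = []
        if not (self.draft_version and self.coverage and self.source_editions):
            problems.append("input scope")
        if not self.lenses:
            problems.append("check-lens set")
        if (not 0.0 <= self.min_confidence <= 1.0
                or self.always_human_severity not in SEVERITIES):
            problems.append("thresholds")
        return problems


# Illustrative values only.
config = CheckConfig(
    draft_version="v0.3",
    coverage=("body", "figures", "footnotes", "references"),
    source_editions={"package insert": "edition A"},
    lenses=(
        "no wording suggesting an unapproved indication (Article 68)",
        "limiting conditions of the data stated alongside",
    ),
    min_confidence=0.3,
    always_human_severity="high",
)
print(config.missing_items())
```

A run would start only when `missing_items()` returns an empty list, so that every finding can be tied to a known draft version, lens set, and threshold.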

03 Execution: don't do it in one pass; split into three

Running all lenses at once mixes the output and makes false positives hard to trace. Split execution into three passes by the nature of each lens.

Pass 1: Regulatory compliance — scan against the Act, the ad standards, the MSA guidelines, and the JPMA code. Flag suggestions beyond the approved scope, assertions of comparative superiority, exaggeration of efficacy or safety, and missing conflict-of-interest disclosures.
Pass 2: Scientific accuracy — match the draft's numbers and claims against source documents (package insert, papers, CSR). Detect misattributed citations, significant-figure errors, dropped confidence intervals, and confusion between relative and absolute risk.
Pass 3: Consistency and wording — check term unification, brand and generic name notation, footnote-to-body correspondence, and numeric agreement between figures and text.

Splitting passes organizes the output by lens and eases downstream sorting and recording. Always keep an execution log for each pass (model, prompt version, input version, timestamp). This becomes the audit trail used in Part 10: Traceability (in preparation).
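As a rough illustration of the three-pass structure and the per-pass execution log, the sketch below runs each lens separately and records model, prompt version, input version, and timestamp. The function `run_three_passes`, the stub `stub_lens`, and the values "model-x" and "prompt-v1" are hypothetical; the real model call would be supplied by whatever tool is in use.

```python
from datetime import datetime, timezone

PASSES = ("regulatory", "scientific", "consistency")


def run_three_passes(draft_text, draft_version, model, prompt_version, run_lens):
    """Run each lens as its own pass and return (flags, execution_log).

    run_lens(lens, text) is a caller-supplied function standing in for the
    AI call; it returns a list of dicts describing candidate findings.
    """
    flags, log = [], []
    for lens in PASSES:
        timestamp = datetime.now(timezone.utc).isoformat()
        found = run_lens(lens, draft_text)
        # Tag every finding with the lens that produced it.
        flags.extend({**finding, "lens": lens} for finding in found)
        log.append({
            "pass": lens,
            "model": model,
            "prompt_version": prompt_version,
            "input_version": draft_version,
            "timestamp": timestamp,
            "flag_count": len(found),
        })
    return flags, log


def stub_lens(lens, text):
    """Stand-in for a model call: flags one fixed phrase on the regulatory pass."""
    if lens == "regulatory" and "best in class" in text:
        return [{"location": "body",
                 "ai_wording": "possible assertion of comparative superiority"}]
    return []


flags, log = run_three_passes(
    "Our product is best in class.", "v0.3", "model-x", "prompt-v1", stub_lens
)
print([entry["pass"] for entry in log])
print(flags[0]["lens"])
```

Because each flag carries the lens that produced it, a later miss can be traced to a specific pass and its logged prompt version.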

Principle: The default is not one run per draft but three passes by lens. Mix them and you can no longer trace, after the fact, which lens let something slip.

04 Reading the output: sort flags on three axes

Do not pass the AI's candidate findings to reviewers as-is. First reorder the received flags on three axes.

Axis	Categories	Use
Lens	Regulatory / scientific / consistency-wording	Routes to the right reviewer (regulatory to the regulatory affairs lead)
Severity	High (cannot publish) / mid (must fix) / low (recommended)	Process from high down; no next step while a high remains
Confidence	The likelihood the AI assigned	Low-confidence × high-severity gets top-priority human check (guards against missed catches)

The key is not to trust confidence at face value. The AI's "95% confidence" does not mean "certainly applies." Confidence decides the order in which a human looks; it is never grounds for ignoring something because confidence is low. A high-severity flag is always opened by a human, even at low confidence.
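The three-axis ordering can be sketched as a sort key. In this Python example, severity decides the main order and, within a severity tier, lower confidence comes first, so that low-confidence, high-severity flags are opened before anything else. Applying the "lowest confidence first" rule to the mid and low tiers as well is an assumption of this sketch, as are the function names and the dictionary keys.

```python
SEVERITY_ORDER = {"high": 0, "mid": 1, "low": 2}


def review_order(flags):
    """Sort by severity (high first), then by confidence (lowest first).

    No flag is dropped: confidence only changes the order of review.
    """
    return sorted(
        flags,
        key=lambda f: (SEVERITY_ORDER[f["severity"]], f["confidence"]),
    )


def route_by_lens(flags):
    """Group the ordered flags into one queue per lens, for routing to reviewers."""
    queues = {}
    for flag in review_order(flags):
        queues.setdefault(flag["lens"], []).append(flag)
    return queues


example_flags = [
    {"id": "a", "lens": "regulatory", "severity": "high", "confidence": 0.90},
    {"id": "b", "lens": "consistency", "severity": "low", "confidence": 0.99},
    {"id": "c", "lens": "scientific", "severity": "high", "confidence": 0.20},
    {"id": "d", "lens": "regulatory", "severity": "mid", "confidence": 0.50},
]

print([f["id"] for f in review_order(example_flags)])  # ['c', 'a', 'd', 'b']
```

Note that there is deliberately no filtering step: a flag with 0.20 confidence and high severity ends up at the top of the queue, not outside it.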

05 Handling false positives: classify and feed back, don't just delete

False positives (the AI flagged it but there is no real problem) eat reviewer time. But raising thresholds to chase zero false positives breeds false negatives (missed catches). Understanding the trade-off, treat false positives as something to classify and keep, not erase.

Don't: Silently delete a flag judged a false positive without recording it. The same false positive returns next time and costs the same time again, and the reasoning for why it was a false positive is gone.

Do: Record the false positive with a one-line rejection reason (e.g., "indication within approved scope, §68 does not apply; confirmed on p.X of the package insert"). When the same type recurs, feed threshold and prompt adjustments back to Part 5 (in preparation).

Rejecting a false positive is a judgment a reviewer made against the norm. That is exactly why it deserves a record. As rejection-reason logs accumulate, you see which lens is over-detecting, and the check itself grows more accurate.
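One way to make the rejection-reason log usable for feedback is to count rejections per lens. The Python sketch below is only an illustration: `reject_flag`, `over_detecting_lenses`, the reviewer labels, and the cutoff of three rejections are hypothetical choices, not values from this series.

```python
from collections import Counter


def reject_flag(fp_log, flag, reviewer, reason):
    """Append a rejected flag to the false-positive log; a reason is mandatory."""
    if not reason.strip():
        raise ValueError("a rejection needs a one-line reason")
    fp_log.append({
        "lens": flag["lens"],
        "location": flag["location"],
        "reviewer": reviewer,
        "reason": reason,
    })


def over_detecting_lenses(fp_log, min_count=3):
    """Return lenses with at least min_count rejections, most frequent first."""
    counts = Counter(entry["lens"] for entry in fp_log)
    return [lens for lens, n in counts.most_common() if n >= min_count]


fp_log = []
for page in ("p.2", "p.5", "p.9"):
    reject_flag(
        fp_log,
        {"lens": "regulatory", "location": page},
        reviewer="QC lead",
        reason="indication within approved scope, §68 does not apply",
    )
reject_flag(
    fp_log,
    {"lens": "consistency", "location": "p.4"},
    reviewer="QC lead",
    reason="brand name notation matches the in-house rule",
)

print(over_detecting_lenses(fp_log))  # ['regulatory']
```

A lens that keeps appearing in this list is a candidate for a threshold or prompt revision in check design, while the individual reasons remain available for audit.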

06 Guarding against false negatives: the silence is where the danger is

False positives are visible; false negatives (the AI did not flag it but a problem exists) are not. So the false negative carries the higher risk. The worst failure pattern is relaxing because the draft "passed" the AI check.

Principle: The AI saying nothing is not proof there is no problem. The AI check is a tool to reduce human review, not to eliminate it.

To prevent false negatives structurally, always run an independent check line that does not rely on the AI.

Full visual review of high-risk areas — efficacy statements, safety wording, comparative expressions, and suggestions of unapproved information are reviewed in full by a human regardless of whether the AI flagged them.
Reviewers re-run the source matching — for final number checks, do not trust the AI output; the reviewer returns to the package insert and papers.
A ledger of known false-negative patterns — log types the AI has missed before (efficacy hinted via euphemism, comparative superiority shown only in a figure) and have humans inspect them closely.

When one false negative surfaces, do not end it with a single fix. Analyze why the AI missed it and add a new detection rule to the check lenses. One false negative signals that misses of the same type exist elsewhere.
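The independent check line can be expressed as a rule for which passages a human must open. In the sketch below, every passage tagged as high-risk is included whether or not the AI flagged it, and each discovered miss is logged together with the detection rule added in response. The area tags, the function names, and the assumption that a human has tagged each passage with an area are all hypothetical.

```python
# Areas reviewed in full by a human, regardless of AI output.
HIGH_RISK_AREAS = {"efficacy", "safety", "comparison", "unapproved"}


def human_review_set(passages, ai_flagged_ids):
    """Return ids a human must open: all high-risk passages plus all AI flags."""
    return [
        p["id"] for p in passages
        if p["area"] in HIGH_RISK_AREAS or p["id"] in ai_flagged_ids
    ]


def record_miss(ledger, pattern, new_rule):
    """Log a false negative together with the detection rule added for it."""
    if not new_rule.strip():
        raise ValueError("a miss must lead to a new detection rule")
    ledger.append({"pattern": pattern, "new_rule": new_rule})


passages = [
    {"id": "p1", "area": "efficacy"},    # not flagged by the AI
    {"id": "p2", "area": "background"},  # not flagged, not high-risk
    {"id": "p3", "area": "background"},  # flagged by the AI
]
print(human_review_set(passages, ai_flagged_ids={"p3"}))  # ['p1', 'p3']

ledger = []
record_miss(
    ledger,
    pattern="comparative superiority shown only in a figure",
    new_rule="inspect every comparative figure by eye",
)
```

The efficacy passage `p1` is reviewed even though the AI stayed silent on it, which is the point of the independent line.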

07 Connecting to the human final call: the handoff SOP

The point where AI output meets human judgment is the heart of this installment. When it is vague, groundless decisions creep in — "fixed it / didn't fix it because the AI said so." Fix the connection with the following SOP.

Stage	Owner	Action	What is kept
1 Receipt	QC lead	Sort the three-pass output on three axes; order from high severity	Classified finding list
2 First call	QC lead	Mark each flag accept / reject / hold; rejection requires a reason	Decisions and reason log
3 Expert call	Regulatory / medical	Scrutinize high-severity regulatory-scientific flags and holds; final decision	Final decision with the cited norm
4 Reflection check	QA lead	Verify accepted findings were correctly reflected in the draft	Before/after diff

For each decision, keep who decided, against which point of which norm, and when. Whether the AI's flag was accepted or rejected, the decider was a human, and the fact that the human applied the norm stays on record. This is the line between "merely followed the AI" and "used the AI as a tool."

Don't: Record only "deleted because the AI flagged §68" or "passed because the AI found nothing." The decision-maker has become the AI, and the validity cannot be verified later.

Do: Record "AI flagged §68. Regulatory matched against the package insert's indication field, judged it an unapproved suggestion, deleted (name/date)." The decider is human and the basis ties to the norm.
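The difference between the two records can be checked mechanically before a decision is accepted into the log. The following Python sketch is one possible shape, under the assumption that a "hold" needs a decider and a date but not yet a norm reference; the class `Decision`, its fields, and the example date are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

OUTCOMES = {"accept", "reject", "hold"}


@dataclass
class Decision:
    flag_id: str
    outcome: str                 # accept / reject / hold
    decider: str                 # the human who decided
    norm_reference: str          # which point of which norm was applied
    decided_on: Optional[date]
    reason: str = ""


def record_problems(d):
    """Return what is missing for this decision to be verifiable later."""
    problems = []
    if d.outcome not in OUTCOMES:
        problems.append("unknown outcome")
    if not d.decider.strip():
        problems.append("no human decider named")
    if d.outcome != "hold" and not d.norm_reference.strip():
        problems.append("no norm reference")
    if d.decided_on is None:
        problems.append("no decision date")
    if d.outcome == "reject" and not d.reason.strip():
        problems.append("rejection without a reason")
    return problems


good = Decision(
    flag_id="f1",
    outcome="accept",
    decider="Regulatory affairs lead",
    norm_reference="package insert, indication field; Article 68",
    decided_on=date(2024, 1, 15),  # illustrative date
)
bad = Decision(flag_id="f2", outcome="reject", decider="",
               norm_reference="", decided_on=None)

print(record_problems(good))  # []
print(record_problems(bad))
```

A record such as "deleted because the AI flagged §68" would fail this check, since it names neither a human decider nor the point of the norm that was applied.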

08 Keeping the record: minimum checklist and audit trail

Decide the minimum record set kept for each AI-check run. Without records, you cannot reproduce why a decision was made, and you cannot withstand an audit.

Execution log: draft version, model name, prompt version, three-pass timestamps, operator
Finding list: per flag — lens, severity, confidence, location, the AI's wording
Decision log: per flag — accept/reject/hold, decider, basis (norm reference point)
False-positive log: rejected flags and rejection reasons (input to check improvement)
False-negative log: misses humans found later, and the detection rule added
Reflection diff: before/after of accepted findings

With these six in place, a third party can later reconstruct what checks a material passed, and who decided what against which norm. The value of the AI check lies less in planting flags than in shaping human judgment into a form recorded and tied to the norms.
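A completeness check over the six records can be kept very small. The sketch below assumes each run's records are gathered in a dictionary keyed by record name; the key names and the function `missing_records` are hypothetical. Logs that may legitimately be empty, such as a false-positive log with no rejections, only need to be present, whereas the execution log must contain at least one entry.

```python
REQUIRED_RECORDS = (
    "execution_log",
    "finding_list",
    "decision_log",
    "false_positive_log",
    "false_negative_log",
    "reflection_diff",
)


def missing_records(run_records):
    """Return the names of required records that are absent or unusable."""
    missing = [name for name in REQUIRED_RECORDS if name not in run_records]
    # An empty execution log means the run itself cannot be reconstructed.
    if "execution_log" in run_records and not run_records["execution_log"]:
        missing.append("execution_log (empty)")
    return missing


run_records = {
    "execution_log": [{"pass": "regulatory", "input_version": "v0.3"}],
    "finding_list": [],
    "decision_log": [],
    "false_positive_log": [],
    "false_negative_log": [],
}
print(missing_records(run_records))  # ['reflection_diff']
```

Run at the end of each check, a non-empty result signals that the material's audit trail has a gap before it moves to the next gate.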

Principle: An AI check left unrecorded is the same as one never run. Keep all six records (execution, findings, decisions, false positives, false negatives, reflection) every time, and feed them back into the accuracy of the next run.

Key Points: three to carry home

The AI check is a candidate generator, not a verdict engine. "Applies" is only a flag that triggers human confirmation; deciding applicability against the norm stays human to the end.
Classify false positives and keep their rejection reasons; prevent false negatives structurally with an AI-independent check line (full visual review of high-risk areas, human source-matching, a miss ledger). The silence is where the danger is.
Fix the AI-to-human connection with a four-stage SOP and record who decided, against which norm, and when, every time. An unrecorded AI check is the same as one never run.

Sources and references

Ministry of Health, Labour and Welfare, "Guidelines for Sales Information Provision Activities for Prescription Drugs," 2018 (applied April 2019). (MSA guidelines text and commentary)
Ministry of Health, Labour and Welfare, "Standards for Proper Advertising of Drugs and Medical Devices" (Notification Yakuseihatsu 0929 No.4, Sept 29, 2017). (Criteria for exaggeration and comparative ads)
Pharmaceutical Affairs Study Group, ed., "Article-by-Article Commentary on the Pharmaceuticals and Medical Devices Act," Yakuji Nippo. (Interpretation of §66/§68/§68-2)
Japan Pharmaceutical Manufacturers Association, "JPMA Code of Practice" and "Guidelines for Producing Promotional Printed Materials." (Self-regulatory norms for material production)
Yoshinori Iizuka, "Quality Control and Quality Assurance, and Their Future," Japanese Standards Association. (Process design of QC/QA and the gate concept)
ISO/IEC, "ISO 9001 Quality Management Systems — Requirements," Japanese Standards Association. (Framework for records and traceability requirements)
