01 How Diagnostic Imaging AI Works: What It Learns, and What It Outputs
At the heart of diagnostic imaging AI is deep learning (= a kind of machine learning that automatically learns features from large volumes of data), and in particular the convolutional neural network (= CNN, a design that scans an image through small windows to pick up features piece by piece), which is well suited to images. It learns by matching tens of thousands to hundreds of thousands of images against the "correct answer" attached to each one (= whether this image contains a lesion, and where it is).
The key point is that what the AI learns is not "disease itself" but the patterns common to images that carry a correct label. Because those labels come from diagnoses made by radiologists and pathologists, the ceiling of the AI is basically set by the quality of those diagnoses. If the labels are biased, the AI inherits that bias.
The form of the output also varies by product. Some return "abnormality suspected / not suspected," some mark the location of a lesion with a box or color, and some give the probability of malignancy as a number; the design diverges by use. What they share is that the AI returns a probability-like score, not a definitive assertion. Even a yes/no output is typically that score cut at a chosen threshold. Miss this single point and the whole discussion of sensitivity and specificity that follows falls apart.
02 Sensitivity and Specificity: Two Kinds of "Correctness" That Are Hard to Have Together
When we talk about the performance of diagnostic imaging AI, the two most fundamental measures are sensitivity and specificity. The words sound technical, but the substance is simple.
Sensitivity
The proportion of people who truly have the disease who are correctly picked up as "positive" (= the true-positive rate). The higher the sensitivity, the fewer the misses (false negatives). It is especially emphasized in cancer screening.
Specificity
The proportion of people who truly do not have the disease who are correctly judged "negative" (= the true-negative rate). When specificity is low, more false positives arise: healthy people being called "abnormal."
Trade-off
Lower the decision threshold and sensitivity rises but specificity falls. The reverse is equally true. You cannot maximize both at once; you decide which to prioritize according to the use.
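The trade-off can be seen in a few lines of Python. This is a minimal sketch on made-up scores, not output from any real product: the function name `sensitivity_specificity`, the score values, and the thresholds are all assumptions chosen for illustration. It treats a score at or above the threshold as "positive."

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    scores: Sequence[float], has_disease: Sequence[bool], threshold: float
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) when score >= threshold is called positive.

    Assumes the data contain at least one diseased and one healthy case.
    """
    tp = fn = tn = fp = 0
    for score, diseased in zip(scores, has_disease):
        positive = score >= threshold
        if diseased and positive:
            tp += 1  # disease present, correctly flagged
        elif diseased:
            fn += 1  # disease present, missed
        elif positive:
            fp += 1  # healthy, wrongly flagged
        else:
            tn += 1  # healthy, correctly cleared
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical AI scores: the first four cases have the disease, the rest do not.
scores = [0.95, 0.80, 0.60, 0.40, 0.70, 0.30, 0.20, 0.10, 0.55, 0.05]
has_disease = [True, True, True, True, False, False, False, False, False, False]

for threshold in (0.75, 0.50, 0.25):
    sens, spec = sensitivity_specificity(scores, has_disease, threshold)
    print(f"threshold={threshold:.2f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```

On this toy data, lowering the threshold from 0.75 to 0.25 moves sensitivity from 0.50 to 1.00 while specificity falls from 1.00 to 0.50. The same model gives very different numbers depending only on where the line is drawn.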
The advertising line "the AI is 95% accurate" often blurs these two. Sensitivity of 95% and specificity of 95% mean entirely different things. On top of that, how many people in the target population actually have the disease (= prevalence) dramatically changes the real counts of correct calls, misses, and false alarms, even at the same sensitivity and specificity. In low-prevalence screening, even a slightly lower specificity can produce a flood of false positives. When you look at a number, always check "what was measured, in which population."
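The effect of prevalence is easiest to see as arithmetic. The sketch below is a hypothetical worked example: the population of 100,000, the two prevalence values, and the function name `screening_outcome` are assumptions for illustration, and the 95% figures are simply the ones from the advertising line above.

```python
def screening_outcome(
    population: int, prevalence: float, sensitivity: float, specificity: float
) -> dict:
    """Expected counts when a test with the given sensitivity and specificity
    is applied to a population with the given prevalence."""
    diseased = population * prevalence
    healthy = population - diseased
    tp = diseased * sensitivity   # diseased and flagged
    fn = diseased - tp            # diseased and missed
    tn = healthy * specificity    # healthy and cleared
    fp = healthy - tn             # healthy and flagged
    # ppv: of everyone called positive, the share who truly have the disease
    return {"tp": tp, "fn": fn, "fp": fp, "tn": tn, "ppv": tp / (tp + fp)}


# Same sensitivity and specificity, two illustrative prevalence values.
for prevalence in (0.01, 0.20):
    r = screening_outcome(100_000, prevalence, sensitivity=0.95, specificity=0.95)
    print(
        f"prevalence={prevalence:.0%}  true positives={r['tp']:.0f}  "
        f"false positives={r['fp']:.0f}  share of positives that are real={r['ppv']:.1%}"
    )
```

Under these assumptions, at 1% prevalence about 950 true positives sit among about 4,950 false positives, so only around 16% of "positive" calls are real cases. At 20% prevalence the same test gives around 83%. Nothing about the AI changed; only the population did.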
03 The Weight of a False Negative: Why a Miss Is Especially Heavy
There are two kinds of error: the false positive, calling a healthy person diseased, and the false negative, missing a person who has the disease. In medicine, the weight of these two is entirely different.
| Type of error | What happens | The consequence that follows |
|---|---|---|
| False positive (over-calling positive) | A healthy person is judged "abnormality suspected" | There remains room to rule it out in the end through additional tests and closer examination. It creates anxiety and cost, but is mostly reversible. |
| False negative (a miss) | A person with disease is judged "no abnormality" | It strips away the very chance to seek care and be treated. The condition may progress before the next test, sometimes past the point of recovery. |
A false negative is heavy because the chance to notice the error is lost. With a false positive, it often ends with "we took a closer look just in case and there was no problem." But with a false negative, the flow of testing stops the moment "no abnormality" is stated, and time passes with no one noticing the mistake. That is why diagnostic imaging AI, especially the kind used in screening, is often designed for high sensitivity even at the cost of accepting some false positives.
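"Designed for high sensitivity" can be read as a rule for choosing the threshold. The following is a minimal sketch of one such rule, not how any particular product is tuned: the function name `threshold_for_sensitivity`, the target of 90%, and all score values are hypothetical. It picks the highest threshold that still flags the required share of known diseased cases, then shows what that costs in specificity.

```python
import math
from typing import Sequence


def threshold_for_sensitivity(diseased_scores: Sequence[float], target: float) -> float:
    """Highest threshold at which calling `score >= threshold` positive
    still flags at least `target` of the diseased cases."""
    ranked = sorted(diseased_scores, reverse=True)
    # Number of diseased cases that must be flagged; the small epsilon
    # guards against floating-point noise in target * len(ranked).
    needed = max(1, math.ceil(target * len(ranked) - 1e-9))
    return ranked[needed - 1]


# Hypothetical validation scores
diseased_scores = [0.97, 0.91, 0.88, 0.80, 0.74, 0.66, 0.59, 0.45, 0.31, 0.12]
healthy_scores = [0.62, 0.40, 0.35, 0.28, 0.22, 0.18, 0.15, 0.10, 0.07, 0.03]

threshold = threshold_for_sensitivity(diseased_scores, target=0.90)
false_positives = sum(score >= threshold for score in healthy_scores)
specificity = 1 - false_positives / len(healthy_scores)
print(f"threshold={threshold}  false positives={false_positives}  specificity={specificity:.2f}")
```

On this toy data, reaching 90% sensitivity pushes the threshold down to 0.31 and accepts three false positives among ten healthy cases. That is the price the text describes: some false positives are tolerated so that fewer cases leave the testing flow unnoticed.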
04 Approved Cases: What Has Been Recognized, and How
Diagnostic imaging AI is no longer a fantasy. In Japan, software including AI can be treated as a medical device in its own right under the framework of the program medical device (= SaMD, software approved or certified as a medical device). Under that framework, products have been approved and certified that support the detection of diabetic retinopathy, polyps in colonoscopy, aneurysms on brain MRI, and more. The U.S. FDA has also authorized AI/machine-learning-enabled medical devices in the hundreds, many of them in the radiology, ophthalmology, and cardiology imaging fields.
What must be grasped here is the scope that "approved" implies. Approval is a limited permission that says "this product may be used, in this way (= intended use), for this target." An AI approved for judging diabetic retinopathy cannot be repurposed to judge a different eye disease. Using it outside the intended use is use beyond the scope of approval. Where this distinction bites in pharmaceutical practice is, for example, when a diagnostic AI is discussed together with one's own product. Suggest efficacy even a single step beyond the approved intended use, and you run up against the Pharmaceutical and Medical Devices Act discussed below.
05 Operational Cautions: The Gap Between Approval and the Field
Approved performance is, in the end, performance on the data submitted for review. Bring it into the field and the same numbers do not necessarily appear. Several factors create the gap.
Different equipment / facilities
When the images used for training differ from your own institution's equipment and imaging conditions, accuracy drops (= domain shift, a performance decline caused by a mismatch between the training data and the data seen in use). A change in manufacturer or generation alone can have an effect.
Different patient population
When age, ethnicity, and prevalence differ from the training data, the same AI's pattern of correct calls and errors shifts. Trusting a product trained on overseas data as-is within Japan is risky.
Over-reliance and automation bias
When the AI outputs "no abnormality," people are more likely to be pulled along and overlook it too (= automation bias, the tendency to defer to a machine's judgment). What was meant to be support turns into offloading the judgment entirely.
Performance drift over time
With equipment updates and changes in clinical practice, accuracy can drift over time. It is not "install and done"; post-deployment (post-market) performance monitoring is required.
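Post-deployment monitoring can start very simply. The sketch below is one hypothetical way to watch for drift, not a regulatory requirement or a vendor feature: the class name `SensitivityMonitor`, the window size, and the alert level are all assumptions. It tracks, over the most recent confirmed-disease cases, how often the AI had flagged them.

```python
from collections import deque
from typing import Optional


class SensitivityMonitor:
    """Rolling sensitivity over the most recent confirmed-disease cases."""

    def __init__(self, window: int, alert_below: float) -> None:
        # True = the AI had flagged a case later confirmed as disease
        self.outcomes = deque(maxlen=window)
        self.alert_below = alert_below

    def record(self, ai_flagged: bool) -> None:
        self.outcomes.append(ai_flagged)

    def sensitivity(self) -> Optional[float]:
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_review(self) -> bool:
        # Only judge once the window is full, to avoid alarms on tiny samples.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.sensitivity() < self.alert_below


# Hypothetical usage: window of 5 cases, review if sensitivity falls below 90%.
monitor = SensitivityMonitor(window=5, alert_below=0.90)
for ai_flagged in (True, True, True, True, True, False):
    monitor.record(ai_flagged)
print(monitor.sensitivity(), monitor.needs_review())
```

A real window would be far larger than five cases, and confirmed diagnoses arrive with delay, so a check like this lags behind the drift it is meant to catch. The point is only that "monitoring" means comparing field results against the approved numbers on a schedule, not assuming they still hold.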
Automation bias in particular is easily overlooked. Once AI is introduced, people unconsciously trust it and check less carefully themselves. Then they no longer catch the false negatives the AI missed, and the double net shrinks to a single one. Unless you set an operational rule at the time of introduction, such as "even a negative from the AI is independently confirmed by a person," you can even end up with the absurd result that misses increase after introducing AI.
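How misses can increase after introduction becomes clear with a small calculation. This is a deliberately simplified model under stated assumptions, not a measured result: the miss rates of 15% for the AI and 10% for the reader are invented, the AI's and the reader's errors are assumed independent, and any case the AI flags is assumed to be confirmed by the reader. The parameter `deference` is a hypothetical name for the probability that the reader accepts an AI negative without looking.

```python
def miss_rate_with_ai(ai_miss: float, human_miss: float, deference: float) -> float:
    """Overall miss rate when a reader works with an AI.

    deference = 0.0: every AI negative is independently reviewed (double net).
    deference = 1.0: every AI negative is accepted unread (single net).
    """
    # A case is lost only when the AI misses it AND the reader either
    # defers to the AI or reviews the image and misses it as well.
    return ai_miss * (deference + (1 - deference) * human_miss)


ai_miss, human_miss = 0.15, 0.10  # illustrative values only

print(f"reader alone:            {human_miss:.3f}")
for deference in (0.0, 0.5, 1.0):
    rate = miss_rate_with_ai(ai_miss, human_miss, deference)
    print(f"with AI, deference={deference:.1f}: {rate:.3f}")
```

With independent confirmation the miss rate in this model falls from 0.100 to 0.015. With full deference it rises to 0.150, which is worse than the reader working alone. The operational rule decides which of these two outcomes an institution gets.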
06 The Relationship With the Physician: Where Does Responsibility Remain
"If AI does the diagnosis, who bears responsibility when it is wrong?" — this question always comes up in the field. The answer under the current framework is clear. The responsibility for the final diagnosis lies with the physician. One reason most approved diagnostic imaging AI is positioned as "support" is precisely to keep this locus of responsibility from moving.
So the desirable relationship between physician and AI has a clear order of precedence. AI is a tool that raises candidates, prompts attention, and reduces oversights. The physician is the subject who takes that as reference and judges comprehensively, together with the patient's background and other findings. The framing is that the AI's output is "one opinion," not "the conclusion." When this relationship breaks down and the AI's output is taken directly as the conclusion, it becomes a breeding ground for automation bias, and the locus of responsibility grows blurred as well.
When speaking to this point from a position connected to pharma, it is essential to stay with neutral fact. Elevating a particular diagnostic AI product with a tone like "more accurate than a doctor" or "with this you can rest easy" can be received as promotion or endorsement. It is safest to state only, as fact, that "it is a support tool and the final judgment rests with the physician," and to avoid ranking products by name.
07 Verification: On What Basis Can We Say It "Works"
Whether a diagnostic imaging AI truly works is decided not by advertising numbers but by the quality of verification. Trustworthy verification has several conditions.
- Measured on data separate from the training data (= external validation): Scoring well on the images used for training is a given. Only when performance holds on data from an entirely different facility and a different population can we call it real ability.
- Prospective clinical evaluation: Not merely analyzing past images in a batch after the fact, but examining, within the actual flow of care, how the numbers of misses and diagnoses change with and without the AI.
- Sensitivity and specificity shown with the target population: Check that the result is reported not as "95% accurate" but as "in which population, at what prevalence, with what sensitivity and specificity."
- Who reviewed it: Whether it is reported in a peer-reviewed paper (= examined in advance by independent expert third parties). Do not swallow whole the numbers from a manufacturer's own announcement.
It matters greatly that a pharmaceutical medical affairs function carries this verifying eye. The perspective for evaluating a diagnostic AI's performance is continuous with the perspective for reading one's own clinical data. "In which population, compared with what, verified by whom" — the habit of asking these three is the foundation for not being swayed by AI's numbers.
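The question "in which population" can be made concrete by reporting results per population instead of as one pooled figure. The sketch below is illustrative only: the record layout, the labels "development site" and "external site", and every count are invented, and `report_by_population` is a hypothetical helper, not part of any validation standard.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (population label, truly diseased, AI called positive)
Record = Tuple[str, bool, bool]


def report_by_population(records: Iterable[Record]) -> Dict[str, dict]:
    """Prevalence, sensitivity, and specificity for each population label."""
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
    for label, diseased, positive in records:
        if diseased:
            key = "tp" if positive else "fn"
        else:
            key = "fp" if positive else "tn"
        counts[label][key] += 1

    report = {}
    for label, c in counts.items():
        n_diseased = c["tp"] + c["fn"]
        n_healthy = c["fp"] + c["tn"]
        report[label] = {
            "n": n_diseased + n_healthy,
            "prevalence": n_diseased / (n_diseased + n_healthy),
            # None when a group has no cases of that kind to measure on
            "sensitivity": c["tp"] / n_diseased if n_diseased else None,
            "specificity": c["tn"] / n_healthy if n_healthy else None,
        }
    return report


# Invented counts: strong results where the AI was developed, weaker elsewhere.
records = (
    [("development site", True, True)] * 19
    + [("development site", True, False)] * 1
    + [("development site", False, False)] * 76
    + [("development site", False, True)] * 4
    + [("external site", True, True)] * 14
    + [("external site", True, False)] * 6
    + [("external site", False, False)] * 160
    + [("external site", False, True)] * 20
)

for label, row in report_by_population(records).items():
    print(label, row)
```

In this invented data the development site shows 95% sensitivity and 95% specificity, while the external site shows 70% sensitivity at half the prevalence. A single headline number would hide exactly the gap that external validation exists to reveal.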
08 Connections to Other Chapters on This Site
This installment gains depth when read together with the following chapters and areas.
- AI Medical Vol. 4 — Electronic Health Records and AI — After the "seeing" information of images, how AI handles the "writing and keeping" information of records, summaries, and voice input.
- AI Marketing Vol. 5 — AI-Generated Content Strategy — Designing to deliver AI's output to the field while keeping regulation. The idea of leaving the final judgment to a person runs through it, just as with the "support" role of diagnostic AI.
- Material Review series — The practical safeguard of the discipline of not speaking of efficacy beyond the scope of the approved intended use.
Diagnostic imaging AI is the most fully implemented AI in medicine. Approved cases have grown, and it genuinely helps as a double net that reduces oversights. But its real power can only be spoken of once you correctly understand these things: the two hard-to-reconcile forms of correctness that are sensitivity and specificity, the especially heavy error that is a false negative, and the gap between approval data and the field. AI is a support tool that returns a probability, and the final judgment and responsibility remain with the physician. Do not be carried along by automation bias; have a person independently confirm even the AI's negatives. Only with this discipline does AI become an ally that reduces oversights.
From a position connected to pharma, there is one more fence. Do not suggest efficacy beyond the scope of the approved intended use, and do not elevate a particular product in a form that can be read as promotion or endorsement. When speaking of a diagnostic AI's performance too, stay with neutral fact. Next time we move from the "seeing" information of images to the "writing and keeping" information of the electronic health record — how AI is changing the clinical record.
- "95% accurate" cannot be relied on. Sensitivity (the power not to miss) and specificity (the power not to raise false alarms) are hard to have together, and the real balance of correct calls and errors changes dramatically with the target population's prevalence. Always check "what was measured, in which population."
- A false negative (a miss) is heavier than a false positive, because it strips away the very chance to notice the error. That is why most approved diagnostic imaging AI is "support," and the final judgment and responsibility remain with the physician. An operation in which a person independently confirms even the AI's negatives is required.
- Approval is a limited permission that says "for this intended use, for this target." Suggesting efficacy beyond that scope touches the Pharmaceutical and Medical Devices Act (exaggeration under Article 66, unapproved products under Article 68, information provision under Article 68-2). From a pharma position, do not speak of a particular product in a form that can be read as promotion or endorsement; stay with neutral fact.
- Ministry of Health, Labour and Welfare. Act on Securing Quality, Efficacy and Safety of Pharmaceuticals, Medical Devices, etc. (Pharmaceutical and Medical Devices Act). Articles 66, 68, and 68-2. (Primary statutory text on advertising regulation and information provision.)
- Pharmaceuticals and Medical Devices Agency (PMDA). Approval and Certification Information for Program Medical Devices (SaMD). PMDA Medical Device Information Search. (Domestic approval cases of diagnostic imaging support AI.)
- U.S. Food and Drug Administration (FDA). Artificial Intelligence and Machine Learning (AI/ML)-Enabled Medical Devices. FDA, 2024. (List of AI-enabled medical devices authorized in the United States.)
- Gulshan V, Peng L, Coram M, et al. Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs. JAMA, 2016;316(22):2402-2410. (A representative validation study of diabetic retinopathy detection AI.)
- McKinney SM, Sieniek M, Godbole V, et al. International Evaluation of an AI System for Breast Cancer Screening. Nature, 2020;577:89-94. (External validation of breast cancer screening AI, reporting sensitivity and specificity.)
- Ministry of Health, Labour and Welfare, Pharmaceutical Safety and Environmental Health Bureau. Guidelines for Sales Information Provision Activities for Prescription Drugs. Bureau Director-General Notice, 2018. (The yardstick for pharmaceutical information provision activities.)
- Ministry of Health, Labour and Welfare, Director of the Compliance and Narcotics Division, Pharmaceutical Safety and Environmental Health Bureau. Standards for Fair Advertising of Drugs and Related Products. Division Director Notice. (The criteria for judging advertising expression.)