AI Programming 01 — The Foundations of Code Generation: What an LLM Can and Cannot Write | AI Programming | Pharmaceutical Advertising Regulation: Material Creation, Review & Use in Japan

The Foundations of Code Generation ── What an LLM Can and Cannot Write

Mapping the principles and the limits of AI-written code, before anything else

Over the past two years, "letting AI write code" has become ordinary. You ask ChatGPT to write a function, GitHub Copilot (= a code-completion service) suggests the rest of a line, and a single instruction seems to make a screen come alive. Yet few people have accurately mapped the boundary between what AI can actually "write" and what it cannot. As the starting point of this ten-part AI Programming series, this installment first lays out the principles by which a large language model (= an AI trained on vast amounts of text, hereafter LLM) generates code, and the limits that necessarily follow from those principles. It is the map that grounds a single judgment call in pharmaceutical marketing, development, and medical affairs: how far can you trust the code and scripts that AI writes?

01 How an LLM Writes Code ── It Is Only Guessing "the Next Word"

Let's clear up the most common misunderstanding first. An LLM does not write by understanding the meaning of a program. What it does is remarkably simple ── given the string so far, it estimates how likely each possible next token (= a word or fragment of a word) is, picks a likely one, appends it, and repeats. This mechanism rests on a structure called the Transformer (= an architecture, published in 2017, that computes how much attention each token in a text should pay to the other tokens).

To an LLM, code is just another kind of text. Write def, and a function name is likely to follow. Write for i in, and the target of the loop tends to come next. Because the model has learned from an enormous volume of public code, these "continuations" come out looking like surprisingly natural programs.

The crucial point is that this prediction optimizes not for "correctness" but for "plausibility." If a piece of code is grammatically natural and resembles a common way of writing, the LLM will output it with full confidence ── even when it does not run. In a single sentence, the principle is this.

The core principle: An LLM is not writing "correct code"; it is writing "the code most likely to appear in this context, given its training data." In many cases that coincides with correct code, but nothing inside the model guarantees the match.
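As a toy illustration of this principle, the sketch below generates "code" by repeatedly choosing the most probable next token from a small lookup table. Everything in it is an assumption made for illustration: the table `NEXT_TOKEN_PROBS`, its probabilities, and the function names are invented, and a real LLM computes such probabilities with a neural network rather than a dictionary.

```python
from typing import Dict, List, Optional

# Toy illustration only: a hand-written table of "next token" probabilities.
# Every token and number here is invented for the example.
NEXT_TOKEN_PROBS: Dict[str, Dict[str, float]] = {
    "def": {"main": 0.5, "load_csv": 0.3, "to_wareki": 0.2},
    "main": {"(": 0.9, ":": 0.1},
    "(": {")": 0.8, "args": 0.2},
    ")": {":": 1.0},
}


def most_plausible_next(token: str) -> Optional[str]:
    """Return the highest-probability continuation, or None if none is known."""
    candidates = NEXT_TOKEN_PROBS.get(token)
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


def generate(start: str, max_tokens: int = 10) -> List[str]:
    """Greedily append the most plausible next token, one at a time."""
    tokens = [start]
    while len(tokens) < max_tokens:
        next_token = most_plausible_next(tokens[-1])
        if next_token is None:
            break
        tokens.append(next_token)
    return tokens


print(" ".join(generate("def")))  # prints: def main ( ) :
```

Note that nothing in the loop checks whether the result runs or does what was asked. The only criterion is which continuation is most probable, which is the sense in which plausibility rather than correctness is being optimized.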

02 Strengths and Weaknesses ── The Boundary Is Set by "How Much Was in the Training Data"

From the principle of "predicting the continuation," the boundary between strengths and weaknesses follows directly. Patterns that appear abundantly in the training data are strengths; rare patterns are weaknesses. That explains almost everything.

Strength 01: Routine implementation ── writes "the common shape"
API calls, data shaping, reading and writing CSV, regular expressions, standard algorithms. For code with countless precedents in the world, both accuracy and speed are high.

Strength 02: Translation and conversion ── writes "the restatement"
From Python to JavaScript, from pseudocode to real code, from an error message to a proposed fix. Moving one format into another is well suited to prediction.

Weakness 01: Highly novel design ── writes "the unprecedented shape"
Specifications unique to one organization, in-house APIs, architectures that do not yet exist anywhere. With no model to imitate in the training data, it drifts easily into plausible fabrication (discussed below).

Weakness 02: Strict computation and counting ── where "precision" is required
Multi-digit arithmetic, boundary conditions, off-by-one judgments (= whether a count or index is wrong by exactly one, as when a loop runs one time too many). Probabilistic prediction tends to miss problems like these, which have a single exact answer.

The evaluation study of Codex (= a GPT model fine-tuned on publicly available code), published by Chen et al. in 2021, put this boundary into numbers. On a benchmark called HumanEval (= 164 hand-written programming problems, each checked by unit tests), Codex solved 28.8 percent of the problems when given a single attempt per problem. But when the model is allowed to attempt the same problem 100 times and is counted as correct if even one attempt passes, the reported success rate rises to 70.2 percent. "Misses on the first shot, but hits if it fires enough rounds" ── this is the raw nature of LLM code generation, and while later models improved the accuracy, the property itself has not changed.
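The "counted as correct if even one attempt succeeds" measure is called pass@k in Chen et al. The sketch below implements the estimator given in that paper: with n samples generated for a problem, of which c pass the tests, pass@k = 1 - C(n-c, k) / C(n, k). The function name and the sample counts in the usage lines are invented for illustration and are not figures from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: total samples generated for the problem
    c: how many of those samples passed the tests
    k: how many attempts are allowed
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) counts the ways to draw k samples that are all wrong;
    # it is 0 when fewer than k wrong samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Invented example: 10 samples were generated and 3 of them passed.
print(round(pass_at_k(10, 3, 1), 3))  # 0.3   (a single attempt)
print(round(pass_at_k(10, 3, 5), 3))  # 0.917 (five attempts)
```

The same model with the same per-sample accuracy looks very different at k = 1 and k = 5, which is why a success rate quoted for code generation needs to say how many attempts were allowed.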

03 The Hallucination Trap ── Calling Functions That Don't Exist, With Full Confidence

The most dangerous form of an LLM's limits is hallucination (= plausible fabrication). Nonexistent libraries, unimplemented functions, wrong argument order ── the LLM writes these with exactly the same confidence as correct code. As a rule, it will not tell you, "This is probably wrong."

Why does it happen? Go back to the principle and it is obvious. Because the LLM writes "the plausible continuation," "the function you wish existed" gets output as if it really existed. If the model decides "there is probably a function that converts a date to the Japanese imperial calendar," it will calmly write a call to a nonexistent to_wareki(). The code looks flawless, and it breaks only when you run it.

How it shows up in practice: Hallucination does not appear as "obviously strange code." It appears as code that looks perfect but fails when executed, with an error saying that the function or module it calls does not exist. That is exactly why reading it is not enough to feel safe. Until you run it and verify, AI-written code is a "hypothesis."
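One inexpensive first check is to ask the interpreter whether a name actually exists before trusting code that calls it. The sketch below does this with the standard library's `importlib`; the helper `name_exists` and the module name `no_such_module_for_this_example` are made up for the illustration, and `to_wareki` is the invented function from the text above.

```python
import importlib


def name_exists(module_name: str, attr_name: str) -> bool:
    """Return True only if the module imports and exposes the attribute."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The module itself does not exist (or cannot be imported).
        return False
    return hasattr(module, attr_name)


print(name_exists("datetime", "date"))       # True: a real standard-library class
print(name_exists("datetime", "to_wareki"))  # False: the invented function
print(name_exists("no_such_module_for_this_example", "convert"))  # False
```

A check like this catches only names that are missing. It says nothing about wrong argument order or wrong behavior, which still require running the code against known inputs and expected outputs.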

04 What Is the Human Role ── From Author to Verifier and Designer

Taking all of this together, the human role in AI-era programming comes into focus. The more AI takes on the "writing," the more human work shifts to "what to have it write" and "whether what was written is correct." The human moves from being the person who types the code to being the person who sets the frame and verifies the result.

Concretely, the work that remains for humans comes down to three things.

Define the requirements ── What do you want to build, and what output should it return for what input? When this is vague, AI fills the gaps with "plausible" code, and the vagueness stays in the result.
Verify correctness ── Does the code that came out really run, does it break at boundary conditions, does it meet the requirements? In principle this cannot be left entirely to AI.
Bear responsibility ── The decision to publish, deliver, or put into production. "Because AI wrote it" is never a defense, in any situation.

The idea that "if AI writes the code, people get an easier job" is only half right. The effort of writing goes down, but the responsibility for verification and design grows heavier instead. We will confirm this asymmetry again and again throughout the series.

05 Why Verification Is Mandatory ── "It Ran" Is Not "It's Correct"

You run AI-written code once, it produces the expected result, and you relax ── that is the most common pitfall. "It ran" is not "it's correct." It merely happened to work for the input you gave; any number of other inputs may still break it.

Verifying traditional code and verifying AI-generated code place their emphasis differently. Laid out, it looks like this.

When a human writes it all	When AI writes the code
The author knows the intent	You first have to read it to confirm intent and implementation have not diverged
Bugs appear as "typos"	Bugs slip in as "plausible fabrications," in a form that looks correct
The existence of the libraries used is self-evident	You have to check, one by one, whether the libraries and functions actually exist
Testing is a finishing step	Testing becomes the central step that determines whether it can be trusted

So in AI programming, testing (= a mechanism that lines up inputs and expected outputs and checks them automatically) is upgraded from "nice to have" to "mandatory." Checks that an author might skip for code they wrote and understand cannot be skipped for AI-generated code ── the faster the code gets written, the more rigorous the checking step has to be. That is how you strike the balance.
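The gap between "it ran" and "it's correct" can be made concrete with a small table of inputs and expected outputs. The sketch below is an invented example, not taken from any real project: `page_count_plausible` stands in for code that looks natural and works for a casual trial input, `page_count` is a corrected version, and `CASES` deliberately includes the boundary values where off-by-one mistakes hide.

```python
from typing import Callable, List, Tuple


def page_count_plausible(total_items: int, page_size: int) -> int:
    """Looks natural and is right for 11 items, but wrong at exact multiples."""
    return total_items // page_size + 1


def page_count(total_items: int, page_size: int) -> int:
    """Number of pages needed to show total_items, page_size per page."""
    if total_items < 0 or page_size <= 0:
        raise ValueError("total_items must be >= 0 and page_size must be > 0")
    return (total_items + page_size - 1) // page_size  # ceiling division


# (total_items, page_size, expected pages), including boundary values
CASES: List[Tuple[int, int, int]] = [
    (0, 10, 0),
    (1, 10, 1),
    (9, 10, 1),
    (10, 10, 1),
    (11, 10, 2),
    (100, 10, 10),
]


def run_cases(func: Callable[[int, int], int]) -> List[str]:
    """Return one message per failing case; an empty list means all passed."""
    failures = []
    for total, size, expected in CASES:
        actual = func(total, size)
        if actual != expected:
            failures.append(
                f"{func.__name__}({total}, {size}) = {actual}, expected {expected}"
            )
    return failures


print(run_cases(page_count))                 # []
print(len(run_cases(page_count_plausible)))  # 3 (fails for 0, 10, and 100 items)
```

Trying the plausible version once with 11 items gives the right answer, 2, and would invite the conclusion that it works. Only the table, with its cases at 0, 10, and 100, shows that it does not.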

06 Cautions for Medical Software ── Domains Where "Can Generate" Doesn't Mean "Can Use As-Is"

For readers approaching this series from the pharmaceutical and medical fields, there is one point worth stressing especially hard. AI being able to "generate" code, and that code being permissible to "use" in a medical context, are entirely separate matters.

Software that handles patient data, calculates dosages, or bears on diagnostic or treatment decisions may qualify as medical device software, depending on its intended use. Where it does, the international standard IEC 62304 (= the life-cycle standard for medical device software) applies. This standard requires traceability (= records linking requirements, tests, and risk control measures), so that "who verified what, and how" can be shown at each stage of development. "We can't explain the basis for the contents because AI generated it" does not fly under this framework.
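To make "who verified what, and how" tangible, here is a minimal sketch of a verification record and a check for requirements that have no record yet. This is an illustration only: IEC 62304 does not prescribe this data format, and the class, field names, requirement IDs, reviewer labels, and dates below are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class VerificationRecord:
    requirement_id: str  # what was verified (hypothetical ID scheme)
    method: str          # how it was verified, e.g. "unit test" or "code review"
    verified_by: str     # who verified it
    verified_on: date    # when it was verified
    ai_generated: bool   # whether the code under verification was AI-generated


def unverified(requirement_ids: List[str],
               records: List[VerificationRecord]) -> List[str]:
    """Return the requirement IDs that have no verification record."""
    covered = {record.requirement_id for record in records}
    return [rid for rid in requirement_ids if rid not in covered]


# Invented example data
requirements = ["REQ-001", "REQ-002", "REQ-003"]
records = [
    VerificationRecord("REQ-001", "unit test", "reviewer A", date(2025, 1, 15), True),
    VerificationRecord("REQ-003", "code review", "reviewer B", date(2025, 1, 16), True),
]

print(unverified(requirements, records))  # ['REQ-002']
```

The design point is that the record is attached to the requirement, not to whoever or whatever wrote the code. AI-generated code goes through the same record as any other code, with the origin noted as one more field.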

Further, when you build software that outputs information related to medicine and pharmaceuticals, the advertising regulations of the Pharmaceutical and Medical Devices Act (PMD Act) also apply. Under the Act, the prohibition of exaggerated advertising is in Article 66, the prohibition of advertising unapproved drugs is in Article 68, and the proper conduct of information provision in sales information provision activities is in Article 68-2. Even for explanatory text or reports generated by AI, these provisions are not relaxed in the slightest. Judgment is made not on "who wrote it" but on "what is written."

The boundary line: For in-house aggregation scripts or automating document formatting, using AI-generated code is realistic. For the parts that bear on patient safety or regulated information provision, however, even if the code can be generated, it cannot be used unless it passes the heavy stages of verification, recording, and review. Between "can build" and "can use" stands the wall of regulation.

07 The Order of Adoption ── Start Where Risk Is Low, Verify, Then Expand

So how do you bring AI programming into the field? There is one principle ── start where a failure does little harm, build up the verification machinery, and expand little by little into more responsibility-heavy territory. The reverse order ── putting AI into critical production processing from the outset ── is the approach most to be avoided.

Stage 1 ── throwaway work: one-off data shaping, log aggregation, draft generation. Begin in territory where a mistake just means doing it over.
Stage 2 ── repeated internal work: automating routine reports, assisting in-house tools. Add tests and confirm it can be used repeatedly.
Stage 3 ── production with verification: build it into business processes on the firm premise of always passing human review and tests.
Stage 4 ── regulated targets: territory bearing on patient safety, medical devices, and regulated information. Only through frameworks such as IEC 62304 and formal review.

The safeguard common to every stage is the verification described in Section 5. The higher the stage, the more rigorous you make the checking step ── if you think of the purpose of raising speed as freeing up time to spend on verification, you will not get the order wrong.
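The staged approach can be written down as a simple gate: each stage lists the checks that must be complete before AI-generated code is used at that stage. The sketch below is one possible reading of the four stages, not a prescribed scheme. The check names such as `manual_check` and `traceability_record` are hypothetical, as is the assumption that each stage inherits every check from the stage before it.

```python
from typing import Dict, List, Set

# Hypothetical check names; the stage numbers follow the four stages above.
REQUIRED_CHECKS: Dict[int, Set[str]] = {
    1: {"manual_check"},
    2: {"manual_check", "automated_tests"},
    3: {"manual_check", "automated_tests", "human_review"},
    4: {"manual_check", "automated_tests", "human_review",
        "formal_review", "traceability_record"},
}


def missing_checks(stage: int, completed: Set[str]) -> List[str]:
    """Return the checks still required at this stage, in alphabetical order."""
    if stage not in REQUIRED_CHECKS:
        raise ValueError(f"unknown stage: {stage}")
    return sorted(REQUIRED_CHECKS[stage] - completed)


print(missing_checks(2, {"manual_check"}))
# ['automated_tests']
print(missing_checks(4, {"manual_check", "automated_tests"}))
# ['formal_review', 'human_review', 'traceability_record']
```

Code is cleared for a stage only when `missing_checks` returns an empty list, so moving the same script from Stage 2 to Stage 4 visibly adds work rather than silently inheriting trust.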

08 The Map of All Ten Installments

Here is the structure of the ten installments this series covers, mapped out in advance. Use it as your compass while reading.

Vol. 01
The Foundations of Code Generation ── What an LLM Can and Cannot Write (this piece)
The whole map of principles and limits; the starting point of the series
Vol. 02
Using Copilot ── Practical Craft for Completion-Type AI
AI used inside the editor; writing without over-trusting completion
Vol. 03
Conversational Coding ── Working With ChatGPT / Claude
How to give instructions, pass context, and design the back-and-forth
Vol. 04
Prompt Design ── Writing Instructions That Get Through to AI
The technique of passing requirements as a frame; reducing ambiguity
Vol. 05
Testing and Verification ── The Machinery for Trusting AI Code
Automated tests, boundary conditions, review patterns
Vol. 06
Debugging ── Hunting the Cause Together With AI
How to read errors, how to hand them to AI, how to chase root cause
Vol. 07
Agentic Development ── AI Running Multiple Steps on Its Own
The light and shadow of autonomous execution; designing the scope you delegate
Vol. 08
Security and Confidentiality ── How to Protect Code and Information
Handling confidential information; vulnerabilities in generated output
Vol. 09
Implementation in Pharma ── Using AI Under Regulation
IEC 62304, the PMD Act, and coexistence with internal review
Vol. 10
Integration ── Making AI Programming Take Root in the Team
Operating rules, division of responsibility, design as an organization

09 Connections to Other Chapters ── Reading Alongside AI Marketing and Material Review

The AI Programming series connects to the other chapters on this site as follows. Reading them together makes your understanding of AI three-dimensional.

AI Marketing Vol. 1 ── Marketing Redefined ── The whole map of an era in which content is mass-generated by AI. This series covers the "technology on the making side" of that.
AI Marketing Vol. 5 ── Balancing Speed and Review ── How to review generated output with a human-in-the-loop (= a mechanism where a person intervenes partway). The verification philosophy is shared with Vol. 5 of this series.
Material Review series ── The practice of review that receives generated output at the end. Whether code or promotional material, review stands between "can build" and "can use."

In Closing

The era of AI writing code has genuinely arrived. But that "writing" is not done by understanding meaning; it is only lining up the most plausible continuation within the training data. That is precisely why it is astonishingly fast at routine work, and quietly wrong at unprecedented design and strict computation. It writes nonexistent functions with the same confidence as correct code ── this property will not disappear as models get smarter.

The conclusion this map points to is simple. Let AI write, and have humans verify. The faster the code gets written, the more rigorous the checking step has to be. In the pharmaceutical and medical fields especially, between what can be generated and what can be used stand the walls of verification, recording, and regulation. Next time, map in hand, we move to the nearest entry point ── the practical craft of Copilot-type AI that completes lines inside the editor.

Key Points ── Three to Take Away

An LLM does not write code by understanding meaning; it only predicts, by probability, "the word likely to come next in this context." That is why it is good at routine, precedent-rich code but poor at unprecedented design and strict computation, and why it can write nonexistent functions with the same confidence as correct code (hallucination).
The more AI takes on the "writing," the more human work shifts to "what to have it write (requirements definition)" and "whether what was written is correct (verification)." "It ran" is not "it's correct," and testing is upgraded from optional to mandatory. Responsibility is not waived by "because AI wrote it."
In pharma and medicine, being able to generate and being able to use are different things. Code that bears on patient safety or regulated information sits under the traceability of IEC 62304 and the provisions of the PMD Act (exaggeration Art. 66 / unapproved Art. 68 / information provision Art. 68-2), and judgment is made not on "who built it" but on "what is written." Adopt from low-harm territory first, making verification more rigorous as you expand.

Sources & References

Chen, M. et al. Evaluating Large Language Models Trained on Code. arXiv:2107.03374, 2021. (The original source on the code-generation LLM "Codex" and HumanEval evaluation; shows the gap between first-try accuracy and repeated sampling.)
Vaswani, A. et al. Attention Is All You Need. Advances in Neural Information Processing Systems 30 (NeurIPS), 2017. (The original paper on the Transformer architecture; the basis for next-word prediction.)
Ji, Z. et al. Survey of Hallucination in Natural Language Generation. ACM Computing Surveys, Vol. 55, No. 12, 2023. (A survey that systematically organizes hallucination in generative AI.)
OpenAI. OpenAI Platform Documentation. OpenAI, accessed 2026. (Official documentation on the capabilities and constraints of each model.)
Anthropic. Claude Documentation. Anthropic, accessed 2026. (Official documentation on how to use Claude and its constraints.)
International Electrotechnical Commission. IEC 62304:2006 Medical device software — Software life cycle processes. IEC, 2006 (Amendment 1: 2015). (The life-cycle standard for medical device software; the primary source for traceability requirements.)
Ministry of Health, Labour and Welfare. Act on Securing Quality, Efficacy and Safety of Products Including Pharmaceuticals and Medical Devices (PMD Act), Articles 66, 68, and 68-2. (The respective provisions on exaggerated advertising, unapproved advertising, and sales information provision activities.)
