AI Programming 04: Writing Tests With AI, Between Generation and TDD

Writing Tests With AI: Between Generation and TDD

The real power of test generation, the coverage trap, and verifying medical software

If you can have AI write your code, you can have it write your tests too. In fact, generating test code is one of the tasks today's generative AI (= AI that automatically produces text or code) does best. Hand it a function of a few dozen lines, and in seconds you get back dozens of tests. But "having lots of tests" and "having quality protected" are not the same thing. This installment measures the real power and the limits of letting AI write tests, examines the trap hidden in coverage (= the proportion of your code that was executed during testing), and goes all the way to how verification should be designed in domains where lives are at stake, such as medical software.

01 What Are Tests Actually For?

Before we get into test automation, let's restate exactly what tests are for. Leave this vague and rush into "let AI make life easy," and you end up piling up tests that grow in number but thin in meaning. Tests serve two broad roles.

One is to confirm that the code you just wrote behaves as intended. The other, and the more important one, is to guarantee that when you later change the code, nothing has broken. The first pays off once; the second pays off for a long time. Software is not built once and finished — it is fixed continuously, and precisely because of that, tests become the safety net that grants permission to change. Without tests, people grow too afraid to touch the code.

From this vantage point, the conditions for a good test come into focus. It fails when it should fail, and does not fail when it should not. In other words, it catches real bugs and stays silent for unrelated changes. Whether the tests AI mass-produces satisfy this condition is the question running through this whole installment.

02 Letting AI Write Tests: Its Real Power

To put the conclusion first: AI test generation is genuinely useful in practice. Especially in the following situations, it is faster and less prone to omissions than writing by hand.

Strength 01: Covering the routine ("filling the gaps")

Mechanically laying out normal cases, error cases, and boundary values (= input values right at the dividing line) is AI's home turf. It won't forget the "empty string," "0," or "maximum value" that people find tedious and tend to skip.

Strength 02: Building the scaffolding ("producing the template")

The test framework, the setup of mocks (= fake parts used in place of the real ones), the repetitive configuration code: it prepares all the tedious groundwork at once. People can then scrutinize what goes on top of it.

Strength 03: Aiding comprehension ("putting intent into words")

Hand it existing code and it articulates, in the form of tests, an understanding like "this function works on these assumptions." Where the specifications are thin, this alone becomes valuable.
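As a minimal sketch of the first two strengths, the tests below lay out normal, boundary, and error inputs in a table and use a mock in place of real storage. The functions `normalize_label` and `save_label`, the `MAX_LEN` limit, and the `store.write` method are all hypothetical, invented only for this illustration.

```python
import unittest
from unittest.mock import Mock

MAX_LEN = 20  # hypothetical limit, chosen only for illustration


def normalize_label(text: str) -> str:
    """Hypothetical function under test: trim whitespace, reject empty or over-long labels."""
    cleaned = text.strip()
    if not cleaned:
        raise ValueError("label must not be empty")
    if len(cleaned) > MAX_LEN:
        raise ValueError("label too long")
    return cleaned


def save_label(text: str, store) -> None:
    """Hypothetical caller that writes the normalized label to a storage object."""
    store.write(normalize_label(text))


class NormalizeLabelTest(unittest.TestCase):
    def test_accepts_values_at_the_boundary(self):
        # Normal case, minimum length, and exactly the maximum length.
        cases = [(" ok ", "ok"), ("a", "a"), ("x" * MAX_LEN, "x" * MAX_LEN)]
        for raw, expected in cases:
            with self.subTest(raw=raw):
                self.assertEqual(normalize_label(raw), expected)

    def test_rejects_values_just_outside_the_boundary(self):
        # Empty string, whitespace only, and one character over the maximum.
        for raw in ["", "   ", "x" * (MAX_LEN + 1)]:
            with self.subTest(raw=raw):
                with self.assertRaises(ValueError):
                    normalize_label(raw)

    def test_save_writes_normalized_label_to_the_store(self):
        store = Mock()  # stand-in for the real storage
        save_label("  aspirin  ", store)
        store.write.assert_called_once_with("aspirin")
```

This is the kind of groundwork a generator produces quickly. Whether `MAX_LEN` and the rejection rules are the right ones is still a question only the specification can answer.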

The limits, on the other hand, are just as clear. AI derives tests from what the code is doing, so if the code has a bug, the generated test records the buggy behavior as the expected result. This installment calls that ratifying the implementation: hand AI a buggy function, it writes a test that reproduces the bug, and because that test passes, the mistake gets a seal of approval. Tests should be written against the specification, yet AI writes them against the code in front of it. This is the biggest trap.
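A small sketch may make the trap concrete. It assumes a hypothetical specification saying that a person aged 18 counts as an adult, and a hypothetical `is_adult` function that gets the boundary wrong. The two checks differ only in where the expected value comes from.

```python
def is_adult(age: int) -> bool:
    """Hypothetical implementation with an off-by-one bug (assumed spec: 18 is an adult)."""
    return age > 18  # bug: the assumed specification requires age >= 18


def check_derived_from_code() -> bool:
    # Expected value copied from what the code currently returns for 18.
    return is_adult(18) is False


def check_derived_from_spec() -> bool:
    # Expected value taken from the assumed specification.
    return is_adult(18) is True


if __name__ == "__main__":
    print("code-derived check passes:", check_derived_from_code())  # True
    print("spec-derived check passes:", check_derived_from_spec())  # False
```

The code-derived check is green and the bug is now written down as expected behavior. Only the spec-derived check goes red.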

03 The Coverage Trap: When Numbers Lie

The classic metric for measuring the quantity of tests is coverage. Hear "90% coverage" and it feels as if nine-tenths of the code is protected. But this number often deceives. What coverage measures is only "whether that line of code was executed during testing," not "whether that behavior was verified to be correct."

What coverage shows	What coverage does not show
That the line was passed through during the test run	Whether that line's output is correct
Which parts were never tried at all	Whether combinations of branches were tried
A rough gauge of the test's "breadth"	The test's "depth" or "rigor"
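The "combinations of branches" row can be illustrated with a short sketch. The function `ratio` and its flags are hypothetical. The two passing checks together execute every line, yet the one combination that fails is never tried.

```python
def ratio(total: float, subtract_baseline: bool, use_small_unit: bool) -> float:
    """Hypothetical function with two independent branches."""
    denominator = 10.0
    if subtract_baseline:
        denominator -= 5.0
    if use_small_unit:
        denominator -= 5.0
    return total / denominator


def checks_with_full_line_coverage() -> bool:
    # Each branch body is executed once, so line coverage reports 100%.
    return ratio(100.0, True, False) == 20.0 and ratio(100.0, False, True) == 20.0


def untested_combination_fails() -> bool:
    # With both flags set the denominator becomes zero.
    try:
        ratio(100.0, True, True)
    except ZeroDivisionError:
        return True
    return False
```

Both functions return `True`: the covered checks pass, and the combination they skipped raises `ZeroDivisionError`.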

Here is an extreme example. A test that merely calls a function and never checks the result at all (= has no assertion) still executes the code line, so coverage goes up. Ask AI to "raise the coverage," and this is what happens: it fills the number with hollow tests. The figure reads 90%, yet in reality nothing is protected — that is the state that gets created.
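The hollow test described above can be sketched as follows. The function `apply_discount` is hypothetical and deliberately buggy, and the expected value of 180.0 comes from an assumed specification (10% off 200). Both checks execute exactly the same lines.

```python
def apply_discount(price: float, percent: float) -> float:
    """Hypothetical function with a bug: it adds the discount instead of subtracting it."""
    return price + price * percent / 100  # bug: should subtract


def hollow_check() -> None:
    # Executes every line of apply_discount but verifies nothing.
    apply_discount(200.0, 10.0)


def asserting_check() -> None:
    # Same call, but the result is compared with the value the assumed spec requires.
    if apply_discount(200.0, 10.0) != 180.0:
        raise AssertionError("discounted price does not match the specification")


def passes(check) -> bool:
    """Return True if the check runs without raising AssertionError."""
    try:
        check()
    except AssertionError:
        return False
    return True
```

Here `passes(hollow_check)` is `True` and `passes(asserting_check)` is `False`, while a coverage tool would report the same figure for both.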

Coverage is not a target; it is a map: Martin Fowler warns against setting a coverage figure itself as a goal. The useful way to use it is as a tool for finding "the parts that are not tested at all." The moment you turn it into a quota — "achieve 90%" — both people and AI rush to pad lines with no substance. Raising the number is easy; raising the quality is hard. Not confusing the two is the dividing line in the age of mass production.

04 TDD and AI: The Point of Reversing the Order

The old answer to avoiding the trap of ratifying the implementation is TDD (= Test-Driven Development. An approach where you write the tests before the code). Systematized by Kent Beck in 2002, its order goes like this. First write a failing test, then write the minimum code to make it pass, and finally tidy up. You cycle this "red → green → refactor" in small steps.

The crucial thing here is the order. In TDD the test comes first, so the test expresses the specification of "this is how it should behave." Since the implementation does not yet exist, there is nothing to ratify. This becomes all the more effective in an age where AI writes the code. A person fixes the specification in the form of tests first, and has AI write code within that frame. AI then cannot freely "embellish"; it can only move inside the fence of the tests.

There are two styles of combining the two:

Human writes tests, AI implements — a person writes the specification as tests first, and has AI write code that satisfies them. Because the person holds the initiative over the specification, it is easier to prevent ratification. Suited to demanding situations such as medical systems.
AI writes tests, human reviews — when adding tests to existing code after the fact, have AI produce a first draft and let a person scrutinize "is this correct as a specification?" Fast, but the danger of ratification remains, so the rigor of the review is the lifeline.

The principle common to both is one and the same. It is the human, not the AI, who decides "what is correct." AI writes fast, but it holds no standard of correctness. Only a person who understands the specification holds that standard.
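The "human writes tests, AI implements" style can be sketched like this. The function `clamp` and its rules are a hypothetical specification chosen for brevity. The test class is what the person writes first, and the implementation below it stands in for what an AI (or anyone) would then be asked to supply.

```python
import unittest


# Step 1 (red): the person writes the specification as tests before any code exists.
class ClampSpec(unittest.TestCase):
    def test_value_inside_range_is_unchanged(self):
        self.assertEqual(clamp(5, 0, 10), 5)

    def test_value_below_range_becomes_lower_bound(self):
        self.assertEqual(clamp(-3, 0, 10), 0)

    def test_value_above_range_becomes_upper_bound(self):
        self.assertEqual(clamp(42, 0, 10), 10)

    def test_bounds_themselves_are_allowed(self):
        self.assertEqual(clamp(0, 0, 10), 0)
        self.assertEqual(clamp(10, 0, 10), 10)

    def test_inverted_range_is_rejected(self):
        with self.assertRaises(ValueError):
            clamp(5, 10, 0)


# Step 2 (green): the minimum implementation that satisfies the tests above.
def clamp(value, low, high):
    if low > high:
        raise ValueError("low must not exceed high")
    return max(low, min(value, high))
```

Because the expected values were fixed before `clamp` existed, there was no implementation for the tests to ratify. Any generated implementation either satisfies them or goes red.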

05 Regression Testing: Protecting "Change It Without Breaking It"

The second role of tests, guaranteeing that future changes have not broken anything, is carried by regression testing (= re-running existing tests after every change to confirm that behavior that used to work, including previously fixed defects, still works). You fix one spot in the code, and a distant, seemingly unrelated place breaks along with it. Verifying this by hand every time is unrealistic. So you run the tests you wrote once, automatically and repeatedly.

Here too AI is useful, but caution is needed. When you find a bug and fix it, adding one test that reproduces that bug is the standard practice of regression testing. Have AI fix only the code, and it will often skip this "add a test to prevent recurrence" step. It fixes, but does not erect the fence that stops it happening again. As a result, the same bug returns again and again in a changed form.
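A minimal sketch of "one test per fixed bug" follows. The parser `parse_quantity`, the bug it is assumed to have had, and the ticket ID `BUG-123` are all hypothetical.

```python
import unittest


def parse_quantity(text: str) -> int:
    """Hypothetical parser. Assume an earlier version raised ValueError on "1,000"."""
    return int(text.replace(",", ""))  # the fix: drop thousands separators before parsing


class ParseQuantityRegressionTest(unittest.TestCase):
    def test_bug_123_thousands_separator(self):
        # Reproduces the original report (hypothetical ID BUG-123)
        # so the defect cannot return unnoticed.
        self.assertEqual(parse_quantity("1,000"), 1000)

    def test_plain_number_still_parses(self):
        # Guards the behavior that worked before the fix.
        self.assertEqual(parse_quantity("42"), 42)
```

Naming the test after the report keeps the link between the fence and the incident that made it necessary.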

Another trap is the flaky test (= an unstable test that passes on some runs and fails on others even though the code has not changed). Among AI-generated tests, some slip in that depend on the current time or on execution order and occasionally fail. As flaky tests increase, people start dismissing warnings as "another false alarm," and end up overlooking the warnings of real bugs too. An alarm that sounds too often is more dangerous than one that never sounds. For regression testing, reliability, meaning the alarm sounds only when it should, matters more than quantity.
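One common source of flakiness, dependence on the wall clock, can be sketched as below, along with one way to remove it by passing the time in explicitly. The function `is_business_hours` and its 09:00 to 17:00 rule are hypothetical.

```python
from datetime import datetime
from typing import Optional


def is_business_hours(now: Optional[datetime] = None) -> bool:
    """Hypothetical function: True from 09:00 up to, but not including, 17:00."""
    current = now if now is not None else datetime.now()
    return 9 <= current.hour < 17


def flaky_check() -> bool:
    # Depends on the wall clock: the outcome changes with the time of day the suite runs.
    return is_business_hours() is True


def stable_check() -> bool:
    # Pins the clock by passing fixed times, so the outcome is the same on every run.
    return (
        is_business_hours(datetime(2024, 1, 15, 9, 0)) is True
        and is_business_hours(datetime(2024, 1, 15, 16, 59)) is True
        and is_business_hours(datetime(2024, 1, 15, 17, 0)) is False
        and is_business_hours(datetime(2024, 1, 15, 8, 59)) is False
    )
```

The stable version also happens to test the boundaries, which the clock-dependent version could never do on purpose.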

06 Verifying Medical Software: IEC 62304 as the Foundation

So far this has been general theory. But if you handle software in a pharmaceutical or medical setting, testing becomes not a "good habit" but a regulatory obligation. At the center sits IEC 62304 (= the international standard defining the life cycle of medical device software. Medical device software — Software life cycle processes). It defines the framework for developing and verifying software embedded in a medical device, or software that is itself a medical device (= SaMD).

The core idea of this standard is that it varies the required rigor by how much harm reaches the patient when the software breaks. The safety classes divide into three levels.

Safety class	Consequence a failure could bring	Rigor of verification required
Class A	No possibility of injury or damage to health	Basic quality management suffices
Class B	Possibility of non-serious injury	Systematic verification of design, unit, and integration
Class C	Possibility of death or serious injury	Rigorous documentation, tracing, and verification across all processes

Here the relationship with AI-generated tests comes into question. What IEC 62304 requires is not that tests "exist," but that each and every requirement is verified by a corresponding test, and that this correspondence can be traced (= traceability). Even if AI spits out a mass of tests, if you cannot show which requirement each one ties to, they are not subject to evaluation under the regulation. Sheer quantity does not count as evidence.
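The traceability idea can be sketched as a simple two-way check. This is only an illustration and not a format prescribed by IEC 62304. The requirement IDs, their wording, and the test names are all hypothetical.

```python
from typing import Dict, List

# Hypothetical requirements, keyed by ID.
REQUIREMENTS: Dict[str, str] = {
    "REQ-001": "Reject a dose of zero or less",
    "REQ-002": "Reject a dose above the configured maximum",
    "REQ-003": "Log every rejected dose",
}

# Hypothetical trace: which requirement(s) each test claims to verify.
TEST_TO_REQUIREMENTS: Dict[str, List[str]] = {
    "test_rejects_zero_dose": ["REQ-001"],
    "test_rejects_negative_dose": ["REQ-001"],
    "test_rejects_dose_above_maximum": ["REQ-002"],
    "test_generated_misc_17": [],  # generated test with no stated requirement
}


def unverified_requirements(requirements: Dict[str, str],
                            trace: Dict[str, List[str]]) -> List[str]:
    """Requirement IDs that no test claims to verify."""
    covered = {req for reqs in trace.values() for req in reqs}
    return sorted(set(requirements) - covered)


def untraceable_tests(requirements: Dict[str, str],
                      trace: Dict[str, List[str]]) -> List[str]:
    """Tests that cite no known requirement."""
    return sorted(
        name for name, reqs in trace.items()
        if not any(req in requirements for req in reqs)
    )
```

With the data above, `unverified_requirements` returns `["REQ-003"]` and `untraceable_tests` returns `["test_generated_misc_17"]`: one requirement has no evidence, and one test proves nothing that was asked for.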

"It worked" is not verification: The U.S. FDA's "General Principles of Software Validation" (2002) also does not regard a test having passed as, in itself, the completion of verification. What is required is confirming against pre-defined acceptance criteria, in a planned way, with records kept. If you use AI-generated tests, a person must be able to explain on what grounds those tests can be said to be correct. Even for generated artifacts, the responsibility for verification remains with the developer.

07 Operation: The Mechanism That Keeps Tests Running

Tests are not written once and done; they gain meaning only when run automatically on every change. This "run automatically" mechanism is CI (= Continuous Integration. A mechanism that automatically runs the tests every time the code changes). In an age when AI produces code fast, this automated checkpoint becomes all the more important, because changes arrive faster than a person can follow by eye.

Here are a few principles to uphold in operation.

If a test is red, do not move forward: do not add features while a failed test is left unaddressed. Every change stacked on a broken foundation adds another candidate cause, so isolating the real one gets steadily harder.
Crush flaky tests the moment you find them: leave false alarms alone and trust in every warning is lost. Fix unstable tests, or quarantine them.
Pass AI-generated tests through human review too: being generated does not make them correct. Especially in medical systems, the record of the review itself becomes regulatory evidence.
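The first principle, "red means stop," is what a CI gate enforces mechanically. Real CI services are configured in their own formats, so the sketch below shows only the core decision, using Python's standard `unittest` runner. The two check functions are hypothetical stand-ins for a real suite.

```python
import unittest


def check_sum():
    # Stand-in for a healthy test.
    if sum([1, 2, 3]) != 6:
        raise AssertionError("sum is wrong")


def check_broken():
    # Stand-in for a test that a recent change has turned red.
    raise AssertionError("regression detected")


def gate_exit_code(*checks) -> int:
    """Run the checks as one suite; return 0 only if every one of them passes."""
    suite = unittest.TestSuite(unittest.FunctionTestCase(c) for c in checks)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return 0 if result.wasSuccessful() else 1
```

`gate_exit_code(check_sum)` returns 0, while `gate_exit_code(check_sum, check_broken)` returns 1. A pipeline step that passes this value to `sys.exit` would stop the change from moving forward while anything is red.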

What matters is the idea of using the automation of CI not to spare human judgment, but to concentrate human judgment where it weighs the most. Leave the simple "is it broken?" check to the machine, and let people face the hard question "is this specification really correct?" This division of labor is the key to keeping tests running without being swallowed by speed.

08 Connections to Other Chapters on This Site

This installment gains depth when read together with the following chapters.

AI Programming Vol. 5 — Refactoring — only with the safety net of tests can AI-assisted code improvement be done safely. This installment firms up that prerequisite.
Material Review series — the structure of a person reviewing a generated artifact and passing or stopping it runs in common between code review and material review.
Diary — a set of essays depicting, as a human endeavor, the attitude of not settling for "it worked" but continuing to ask whether it is correct.

In Closing

AI has made it possible to write tests fast and in bulk. This is an opportunity. But the abundance of tests is not the same as high quality. Because AI derives tests from the code in front of it, it locks in the bug as "correct," bug and all — the trap of ratifying the implementation. The coverage figure, too, shows that a line was executed but not that it is correct. The moment you make the number a target, both people and AI rush to pad lines with no substance.

That is exactly why we take the order back into our own hands. A person fixes the specification in the form of tests first and has AI write code inside that fence; the TDD mindset pays off precisely in the age of mass production. For medical software, what IEC 62304 requires is not the number of tests but that the correspondence between requirements and tests can be traced and explained in the record. "It worked" is not verification. It is the human, not the AI, who holds the standard of correctness. The power to write fast should be used precisely to create the time for people to concentrate on correctness. Next time, with this safety net in place, we look at how to carry out AI-assisted refactoring safely.

Key Points — Three to Take Away

AI test generation is genuinely useful for covering the routine and building scaffolding. But by its "derive from the code" nature, its biggest trap is ratifying the implementation — locking in the bug as correct behavior. It is the human, not the AI, who decides the standard of correctness.
Coverage measures only "whether it was executed," not "whether it is correct." Make the number a target and you get line-padding with hollow tests. Use coverage not as a quota to hit but as a map for finding untested spots.
For medical software, IEC 62304 is the foundation. What is required is not the number of tests but that the correspondence between requirements and tests can be traced and explained in the record. "It worked" is not verification. With TDD, a person fixes the specification first and uses AI inside that fence.

Sources & References

Kent Beck. Test-Driven Development: By Example. Addison-Wesley, 2002. (The origin of TDD's "red → green → refactor")
International Electrotechnical Commission. IEC 62304:2006 Medical device software — Software life cycle processes (including Amendment 1: 2015). IEC, 2006/2015. (Safety classes and verification requirements for medical device software)
U.S. Food and Drug Administration. General Principles of Software Validation; Final Guidance for Industry and FDA Staff. FDA, 2002. (The acceptance-criteria view that does not treat "it worked" as verification)
Glenford J. Myers, Corey Sandler, Tom Badgett. The Art of Software Testing, 3rd ed. Wiley, 2011. (The classic of test design, including boundary values and error cases)
Martin Fowler. Test Coverage (bliki). martinfowler.com, 2012. (The guidance not to make a coverage figure a target)
ISO/IEC/IEEE. ISO/IEC/IEEE 29119 Software and systems engineering — Software testing. ISO/IEC/IEEE, 2013– (revisions ongoing). (The international standard for software testing)
