AI Programming 05 — Refactoring: Safe Improvement with AI Assistance | AI Programming | Pharmaceutical Advertising Regulation: Material Creation, Review & Use in Japan

Refactoring ── Safe Improvement with AI Assistance ── Fixing the internals in small diffs while preserving behavior

Why go out of your way to rework code that already runs? The answer is simple: code that is left at "it runs" is exactly the code that becomes impossible to fix next time. Refactoring (= changing the internal structure without changing the externally observable behavior) is not the work of adding features. You return the same output for the same input, while improving readability and ease of change. AI (= generative AI, a tool that suggests code) makes this work faster. But speed cuts both ways, because AI will casually break the very promise of preserving behavior. This installment maps out where AI suggestions help and which fences let you use them safely and to the full. Throughout, we keep in mind software where the record matters: analysis code and report generation of the kind handled in pharmaceutical settings.

01 What Refactoring Is ── Fowler's Definition

First, let us be precise about the word. The term "refactoring" was popularized by Martin Fowler in his book Refactoring (1999, 2nd edition 2018). His definition is plain: improving the internal structure of software without changing its externally observable behavior. The crucial part is the condition "without changing behavior." You are neither fixing a bug nor adding a feature. The code does the same thing as before; only its insides are cleaner. This single point separates refactoring from every other kind of work.

Why devote a whole book to such unglamorous work? Because code is not "written once and done." Analysis programs for medicines, and the machinery that produces reports, get reworked every time a regulation or a data format changes. When that happens, a tangled structure makes change difficult and breeds new bugs with every fix. Refactoring is, in effect, grading the land so that those future changes come cheap and safe.

Fowler likened adding features and refactoring to two separate hats. Be clearly aware of which hat you are wearing at any moment. If you mix work that changes behavior (adding features, fixing bugs) with work that does not (refactoring), you lose track of what caused the break. This discipline of "not mixing the hats" is the foundation for AI-assisted work as well.

02 How to Use AI's Suggestions

Generative AI is good at suggesting refactorings. Split a long function, fold duplicates into one, rename variables to be clearer ── it shows these routine improvements in seconds. It is also skilled at mechanically catching duplication a human would miss. This is a place to accept its help without reservation.

But there is an order to how you use it. The most dangerous approach is to hand the AI "clean up this code" and adopt whatever comes back as is. When the AI tidies the structure, it will sometimes quietly change the behavior in passing: it shifts a boundary condition slightly, changes how an exception is handled, or changes how rounding is done. The code looks clean, yet the output is subtly different. If this happens in pharmaceutical analysis code, the resulting numbers change.
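
As a hedged illustration of how a tidy-looking rewrite can shift rounding, the sketch below compares two hypothetical functions. The names round_original and round_tidied, and the two-decimal format, are assumptions made for this example and are not taken from any real analysis system.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_original(value: str) -> Decimal:
    """Hypothetical 'before' code: rounds ties upward to two decimals."""
    return Decimal(value).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def round_tidied(value: str) -> Decimal:
    """Hypothetical 'after' code: shorter, but built-in round() sends ties
    to the nearest even digit, so the rounding rule has silently changed."""
    return Decimal(str(round(float(value), 2)))
```

For most inputs the two agree, for example "0.126" gives 0.13 from both. For the exact tie "0.125", round_original returns 0.13 while round_tidied returns 0.12. Nothing in the shape of the rewritten code signals the difference, which is why the change has to be read and tested rather than trusted.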

Use 01

Take suggestions as "candidates"

"Consider, don't adopt"

The AI's output is a draft, not a finished product. Read what it changed one item at a time, understand the intent, and only then decide to accept or reject. Do not adopt without reading.

Use 02

Make it state "what changed"

"Explain the diff"

Have the AI explain the essence of the change in words. Ask explicitly, "Is behavior preserved?" Distrust any change it cannot explain.

Use 03

One kind at a time

"Don't mix"

Do not ask for name cleanups and structural splits at once. Narrowing to one kind of improvement keeps both review and verification easy to follow.

In short, the AI is a proposer, not a decider. The person decides. When this ordering collapses, you take on invisible behavioral changes in exchange for speed.
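
One way to make "read what it changed" concrete is to put the suggestion next to the current code as a diff before deciding anything. The sketch below uses Python's standard difflib for this. The function names show_diff and changed_line_count are hypothetical helpers for illustration, and the line count is only a rough prompt for splitting a suggestion, not a measure of risk.

```python
import difflib


def show_diff(before: str, after: str) -> list[str]:
    """Return unified-diff lines between the current source and the suggestion."""
    return list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile="current",
            tofile="suggested",
            lineterm="",
        )
    )


def changed_line_count(diff_lines: list[str]) -> int:
    """Count added and removed lines, skipping the two file-header lines
    ('--- current' and '+++ suggested') that open a non-empty diff."""
    return sum(1 for line in diff_lines[2:] if line.startswith(("+", "-")))
```

Printing the lines from show_diff gives the person something to read item by item. If changed_line_count comes back large, that is a cue to ask the AI for one kind of change at a time instead.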

03 Was Behavior Really Preserved? ── Verification

The promise "do not change behavior" cannot be kept by willpower or by eyeballing. You need a mechanism to confirm that it is being kept. At its center is automated testing (= a mechanism where one program automatically checks whether another program works correctly).

Kent Beck, in Test-Driven Development (2002), called tests a safety net (= a net that catches you when you fall). In refactoring, this net becomes your lifeline. Confirm that all tests pass before you touch the code, and confirm once more that they all pass after. If the results are the same before and after, you can say behavior was preserved, at least within the range the tests cover. Without tests, refactoring becomes a "probably fine" gamble.

Stage	What to do	The question to confirm
Before starting	Run all existing tests and confirm green (all pass)	Is behavior currently pinned down by tests?
During work	Run the tests frequently, each time you make a small fix	Has this one move turned anything red (failing)?
After finishing	Confirm again that all tests are green	Is the externally observed result the same as before?

If you are refactoring code with thin tests, or none, the order reverses. First write tests that capture the current behavior. Michael Feathers, in Working Effectively with Legacy Code (2004), called these characterization tests (= tests that record the current behavior exactly as it is). Set aside for now whether it is correct; pin down "this is how it behaves now," and only then touch the structure. Having the AI write a first draft of the tests is useful, but whether those tests truly probe the boundaries is something a person inspects.
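
A minimal sketch of characterization tests follows, assuming a hypothetical legacy helper named mean_response. The tests record what the function does today, including behavior that may well be questionable, so that any later structural change can be checked against it. The test functions follow the plain-assert style that pytest collects, which is an assumption about tooling.

```python
def mean_response(values):
    """Hypothetical legacy helper whose current behavior we want to pin down."""
    present = [v for v in values if v is not None]
    if not present:
        return 0.0  # arguably wrong for "no data", but it is what the code does today
    return round(sum(present) / len(present), 1)


def test_ignores_missing_values():
    assert mean_response([1.0, None, 2.0]) == 1.5


def test_empty_and_all_missing_return_zero():
    # Recorded as is; whether 0.0 is the right answer is a separate question.
    assert mean_response([]) == 0.0
    assert mean_response([None, None]) == 0.0


def test_tie_rounds_to_even_digit():
    # 0.25 is an exact tie, and round() sends it to 0.2 rather than 0.3.
    assert mean_response([0.25]) == 0.2
```

Note that none of these tests claims the behavior is correct. If the empty-list case later turns out to be a bug, fixing it is a separate change under the other hat.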

04 The Principle of Small Diffs

The discipline that works best in refactoring is to fix small, one step at a time. Do not rewrite large and test everything at the end. Make a small fix and confirm, then make another small fix and confirm. The finer you make these increments, the easier it is to pinpoint the moment something broke.

The reason lies in isolating failure. If you change 100 lines at once and a test goes red, the cause is somewhere in those 100 lines. It takes time to find. If you change only 5 lines and it goes red, the cause is within those 5 lines. You can revert at once and fix at once. The smaller the diff, the cheaper the mistake.
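
To show what a single small step can look like, here is a sketch of one extract-function move with an immediate check. All names (report_line_before, report_line_after, _mean, SAMPLES) are hypothetical, and in real work the "before" version would live in version control instead of sitting beside the new one.

```python
def report_line_before(name, values):
    """Hypothetical original: formatting and arithmetic tangled in one function."""
    total = 0.0
    for v in values:
        total += v
    avg = total / len(values) if values else 0.0
    return f"{name}: n={len(values)}, mean={avg:.2f}"


def _mean(values):
    """Step 1 only: the arithmetic, moved out without altering it."""
    total = 0.0
    for v in values:
        total += v
    return total / len(values) if values else 0.0


def report_line_after(name, values):
    """Same output as before; the only change is the call to _mean()."""
    return f"{name}: n={len(values)}, mean={_mean(values):.2f}"


# A few inputs, including the empty list, to check right after the step.
SAMPLES = [("A", [1.0, 2.0, 4.0]), ("B", []), ("C", [0.125])]


def step_preserves_output():
    """Return True when both versions produce identical lines for every sample."""
    return all(
        report_line_before(name, vals) == report_line_after(name, vals)
        for name, vals in SAMPLES
    )
```

The step deliberately leaves the loop alone. Replacing it with a built-in or changing the formatting would be a second step with its own check, so that a red result always points at one small move.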

With AI assistance, this principle matters all the more. Ask, and the AI hands you a large rewrite in one breath. It looks highly finished, and you are tempted to adopt the whole thing. But with a large diff, both review and verification tend to drift into "glance over the whole and OK." Small behavioral changes hide in that bulk and become invisible. Even when the AI produces a large change, the person takes it in small pieces: split the block, and bring it in one meaningful unit at a time, running the tests between units. This work of translation is precisely what the person carries.

"It ran" is not "it is correct": When you put in a large diff at once and the tests happen to pass, you slip into the illusion that "it ran, so it is correct." But passing tests only speak to the range the tests look at. Behavior the tests cover thinly (rounding, missing values, boundary dates, and the like) can change and still pass. Keeping diffs small is also about keeping the region beyond the tests' reach small enough to follow with the human eye.

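The point about thinly covered boundaries can be sketched with a deliberately tiny example. The functions below are hypothetical: a rule with a threshold at 18, a rewrite that silently shifts it, a thin test set that passes for both, and a boundary test set that tells them apart.

```python
def is_adult_original(age: int) -> bool:
    """Hypothetical original rule: 18 counts as adult."""
    return age >= 18


def is_adult_refactored(age: int) -> bool:
    """Hypothetical rewrite with a silently shifted boundary."""
    return age > 18


def thin_tests(fn) -> bool:
    """Checks only values far from the boundary."""
    return fn(30) is True and fn(10) is False


def boundary_tests(fn) -> bool:
    """Checks the values just below, exactly at, and just above the boundary."""
    return fn(17) is False and fn(18) is True and fn(19) is True
```

Both versions satisfy thin_tests, so a green run there says nothing about the value 18. Only boundary_tests fails for is_adult_refactored. The same pattern applies to cut-off dates and rounding ties in analysis code.
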
05 The Role of Review ── Separate the Writer from the Reviewer

The AI writes the code, and the same AI evaluates it as "no problem" ── this is not verification. If the writer and the evaluator are the same, the writer's blind spot is the evaluator's blind spot. It is the same logic as, in the human world, not letting the person who created a material clear their own review. Separate the maker's role from the reviewer's role.

In a refactoring review, what the reviewer should ask differs a little from a feature-addition review. The central question is not "what new thing can it now do," but "does it truly do the same thing as before." Concretely, confirm the following.

Preservation of behavior ── Does the correspondence between input and output match what it was before the change? Is it the same, including behavior at boundaries, exceptions, and errors?
Size of the diff ── Does a single change stay within a size you can follow? Has too much been mixed in?
Backing by tests ── Do tests exist that guard this change, and are they green? Has something with thin tests been passed through silently?
Explanation of intent ── Can you read from the commit or comments why this structure was chosen? Has an AI suggestion been taken in without even knowing the reason?
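
For the first item on this list, a reviewer can compare the old and new code on the same inputs, treating a raised exception as part of the behavior. The sketch below is one way to do that under stated assumptions: outcome and find_mismatches are hypothetical helpers, and parse_dose_old and parse_dose_new are invented stand-ins for the code under review.

```python
def outcome(fn, arg):
    """Capture what a caller observes: a return value or an exception type."""
    try:
        return ("value", fn(arg))
    except Exception as exc:  # broad on purpose: we are recording behavior
        return ("raised", type(exc).__name__)


def find_mismatches(old_fn, new_fn, inputs):
    """Return (input, old outcome, new outcome) for every input where they differ."""
    mismatches = []
    for arg in inputs:
        old, new = outcome(old_fn, arg), outcome(new_fn, arg)
        if old != new:
            mismatches.append((arg, old, new))
    return mismatches


def parse_dose_old(text):
    """Hypothetical original: raises ValueError on blank input."""
    return float(text)


def parse_dose_new(text):
    """Hypothetical rewrite: quietly turns blank input into 0.0."""
    return float(text) if text.strip() else 0.0
```

Calling find_mismatches(parse_dose_old, parse_dose_new, ["2.5", "", "abc"]) reports only the blank string, where an error has become a silent zero. Values that do not compare equal to themselves, such as NaN, would need special handling that this sketch omits.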

Using the AI to assist review is useful. Have it list candidate oversights, and let the person make the final judgment. But keep the order ── separate the generation pass from the verification pass. Not letting the same context write and self-approve is the fence that breaks the chain of blind spots.

06 Dangerous Patterns ── Failures Common Under AI Assistance

As the flip side of all this, we gather the patterns to avoid. Every one of them happens for the same reason: it is fast.

Anti 01

Wholesale adoption

Taking in the AI's output without reading it. Because you proceed without grasping what changed, you cannot notice the shift in behavior.

Anti 02

Mixed hats

Fixing a feature while you are at it, in the course of refactoring. When it breaks, you cannot tell whether the cause is the structural change or the feature change.

Anti 03

Starting without tests

Touching the structure without stretching a safety net. You have no means to confirm sameness before and after, and it becomes a "probably fine" gamble.

Anti 04

Huge bulk change

Putting in a large diff at once. Both review and verification come down to glancing over the whole, and small deviations hide within.

What the four share is that they prioritize speed and skip the steps that confirm. The AI strengthens these temptations. It hands you a large, plausibly finished block in an instant. That is exactly why the side receiving the speed needs fences that do not give way.

07 Operations ── In Settings Where the Record Matters

In software whose results bear on regulation or on the record, such as pharmaceutical analysis code and report generation, refactoring operations call for one more level of care. The "Guideline on Management of Computerized Systems for Marketing Authorization Holders of Drugs and Quasi-drugs" (2010), issued by Japan's Ministry of Health, Labour and Welfare, requires that changes to a system be controlled and left in the record. Even for an improvement that does not change behavior, the fact that a change was made, and the record confirming that it was appropriate, must be kept. This is the foundation of operations.

Concretely, put the following three practices in place. First, keep evidence that the results match before and after the change. Records of tests passing, and output comparisons on the same data, serve this end. Second, make it possible to trace who changed what, and why. Do not drop the AI's suggestion straight into version control (= a mechanism that keeps the entire history of changes); have the person add the intent in writing and record it. Third, have someone other than the maker confirm it. Build the review of Section 5 in as a formal step in the process.
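
The three practices can be sketched as a small record builder. This is an illustration only: the function names, field names, and JSON layout are assumptions made for this example and are not prescribed by the guideline or by any particular quality system.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_hex(data: bytes) -> str:
    """Fingerprint an output's bytes so two runs can be compared."""
    return hashlib.sha256(data).hexdigest()


def build_change_record(before: bytes, after: bytes, author: str,
                        reviewer: str, reason: str) -> dict:
    """Assemble a record of one refactoring; all field names are illustrative."""
    if reviewer == author:
        raise ValueError("reviewer must be someone other than the author")
    if not reason.strip():
        raise ValueError("a written reason for the change is required")
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "author": author,
        "reviewer": reviewer,
        "reason": reason,
        "output_sha256_before": sha256_hex(before),
        "output_sha256_after": sha256_hex(after),
        "outputs_match": before == after,
    }


def to_json(record: dict) -> str:
    """Serialize the record so it can be stored alongside the change."""
    return json.dumps(record, indent=2, sort_keys=True)
```

Here before and after would be the bytes of a report generated from the same input data by the old and new code. The reason text is meant to be written by the person, even if an AI drafted it, and the record is only useful if someone has actually read what it says.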

AI assistance also fits well with these operations in some respects. Have it draft the summary of a change, produce a first draft of tests, and surface the points of review ── it can lighten the labor of building the record. But submitting a generated record without a person reading it is the same failure as "wholesale adoption" in Section 6. A record has value in that someone can read it later and the meaning holds. Being able to make it fast, and being able to leave it correctly, are separate goals to be met together.

08 Connections to Other Chapters on This Site

This installment deepens when read alongside the following chapters.

AI Programming Vol. 6 ── Documentation Generation ── How to leave the structure you tidied through refactoring as documents. It treats the limits of auto-generating documents from code.
Material Review Series ── The idea of separating the maker's role from the reviewer's role is the same in code review and in material review. The practice of person-to-person confirmation.
AI Marketing Vol. 5 ── AI-Generated Content Strategy ── A design that builds verification inside the speed of generation. The structure is shared, whether in code or in content.

Conclusion

The core of refactoring is exactly Fowler's definition: improving the internals without changing the externally observed behavior. The AI makes this work faster, but that speed comes with the risk of breaking the promise to "preserve behavior." So the fences to keep are clear. Confirm sameness before and after with tests. Cut the diff into small increments. Separate the maker's role from the reviewer's role. And leave the fact of the change and its validity in the record.

The AI is a proposer, not a decider. Even when it hands you a large, clean block in one breath, the person takes it in small pieces and confirms them one at a time. As long as you do not collapse this ordering and this hierarchy, AI-assisted refactoring becomes the grading of the land that makes future change cheap and safe. Next time, we leave the tidied structure as documents, moving on to the attempt to auto-generate documents from code, and its limits.

Key Points ── 3 to Take Away

Refactoring is the work of tidying the internals without changing the externally observed behavior (Fowler's definition). Keep its "hat" separate from adding features and fixing bugs, and do not mix them. The AI is a proposer, not a decider ── read what it changed, understand it, and only then decide to accept or reject.
"Behavior was preserved" is confirmed by tests, not by willpower. Confirm all tests are green before you start, and confirm again after you finish. For code with thin tests, first write tests that capture the current behavior (characterization tests), then begin.
Cut the diff into small increments, separate the maker's role from the reviewer's role, and leave the fact of the change and its validity in the record. Being able to make it fast and being able to leave it correctly are separate goals. Do not let the same AI write and self-approve; separate the generation pass from the verification pass.

Sources & References

Martin Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley, 1999 (2nd ed. 2018). (The original source for the definition and procedures of "refactoring.")
Kent Beck. Test-Driven Development: By Example. Addison-Wesley, 2002. (Source for the idea of tests as a safety net.)
Michael Feathers. Working Effectively with Legacy Code. Prentice Hall, 2004. (Approaches to code with thin tests, such as characterization tests.)
Robert C. Martin. Clean Code: A Handbook of Agile Software Craftsmanship. Prentice Hall, 2008. (Principles of readability, such as naming and function splitting.)
Ministry of Health, Labour and Welfare (Japan). Guideline on Management of Computerized Systems for Marketing Authorization Holders of Drugs and Quasi-drugs. 2010. (Primary source on change control for software where the record matters.)
