Two reviewers can look at the same work and split between "L2" and "L3." That doesn't always mean one of them is wrong. More often it means the evidence behind the judgment is solid for visible skills and thin for hidden ones. This installment takes the question head-on: how far can a verdict itself be trusted?
Test numbers versus a doctor's touch
Think of a medical check-up. Weight and blood pressure come out as numbers from a machine, so almost anyone gets the same reading. But when a doctor presses your belly and says "a little tight here," that touch can be read differently by different people. Both tell us about the body, yet they differ in how sure we can be.
The same thing happens when we measure a creator's skill. In the previous piece we built a table linking what we can see in a deliverable to a level. But even with the table, if the evidence we fit into it is thin, the verdict wobbles. How far a judgment can be trusted depends first on how visible the evidence is, and only then on how fine the ruler is. Call this observability: how readily a skill can be checked from the outside.
A judgment is the product of a sound ruler and visible evidence. A good ruler with thin evidence still yields an unsure verdict.
Visible skills and hidden skills
Picture a cook. The taste and plating of the dish in front of you can be checked by eating it. That is easy to observe. But whether this cook can adapt when a new ingredient arrives cannot be seen from one plate. It shows only after several dishes under different conditions.
The creator's eight skills are just as uneven in observability. Whether a creator returns to the source can be settled clearly by matching each quote against the original. Whether claims and cautions are balanced can be seen by measuring their proportions. But the reasoning behind a design choice is hard to see without the person's own words and a record across several jobs. The less visible a skill, the more a verdict based on a single deliverable swings.
| Kind of skill | Visibility | Trust in verdict | How to check |
|---|---|---|---|
| Return to the source | High (can match the original) | High | Match each quote against the source one by one |
| Balance of claims | Mid-high (can measure proportion) | Mid-high | Compare lines/area of benefit versus caution |
| Why it was made (adaptation, judgment) | Low (reasons in the head) | Low (needs evidence) | Person's account plus repeats across jobs |
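The two more visible checks in the table can be sketched in a few lines. This is a minimal illustration, not a prescribed tool: the function names `unmatched_quotes` and `caution_share` are hypothetical, the quote check assumes plain-text input and verbatim quotation, and counting lines is only one of the ways the table suggests for measuring balance.

```python
import re


def unmatched_quotes(quotes, source_text):
    """Return the quotes that do not appear verbatim in the source.

    Hypothetical helper: whitespace is collapsed so that a line break
    alone does not count as a mismatch. Paraphrases are NOT detected.
    """
    def normalize(text):
        return re.sub(r"\s+", " ", text).strip()

    source = normalize(source_text)
    return [q for q in quotes if normalize(q) not in source]


def caution_share(benefit_lines, caution_lines):
    """Share of lines given to cautions, from 0.0 to 1.0.

    Hypothetical helper: what share counts as "balanced" is left to
    the evaluator; this function only produces the proportion.
    """
    total = benefit_lines + caution_lines
    if total == 0:
        raise ValueError("no benefit or caution lines to compare")
    return caution_lines / total


source = "The drug lowered blood pressure\nin 60% of patients. Dizziness was reported."
quotes = ["lowered blood pressure in 60% of patients", "works for everyone"]

print(unmatched_quotes(quotes, source))
print(caution_share(benefit_lines=6, caution_lines=2))
```

Nothing comparable can be written for the third row. The reasons behind a design choice are not in the deliverable, which is exactly why its verdict needs the person's account and repeated jobs.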
Don't erase the wobble — name it and handle it
When two proofreaders read the same proof, the errors they catch differ a little. Cutting down to one reader to avoid that is counterproductive. The disagreement itself flags where judgment is genuinely hard. The same holds for our verdicts: when raters split on a level, don't erase the split — record it and handle it.
Three moves help. First, look across several jobs, not one deliverable; whether one success was skill or luck becomes clear over two or three jobs. Second, ask the person to say briefly why they designed it that way; if they can give a reason, that is evidence of adaptation (L3). Third, never rely on a single rater; have two or more look, and when they split, return to the source to align. Raters rarely split on source-grounding, the necessary floor, so it can serve as the shared reference point.
When verdicts split, go back to the source before deciding a winner. Grounding is a floor no one can move, so aligning there lets the discussion move forward.
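The third move, keeping the split on record instead of erasing it, can be sketched as follows. This is an illustration under assumptions: the function name `review_panel`, the dictionary layout, and the wording of the next step are all hypothetical, and levels are assumed to be written as strings such as "L2".

```python
def review_panel(ratings):
    """Summarize ratings from two or more raters without hiding a split.

    ratings: mapping of rater name -> level string, e.g. {"A": "L2"}.
    Hypothetical helper; the returned keys are illustrative only.
    """
    if len(ratings) < 2:
        raise ValueError("use two or more raters")

    levels = sorted(set(ratings.values()))
    split = len(levels) > 1
    return {
        "ratings": dict(ratings),  # every rater's view stays on record
        "levels": levels,
        "split": split,
        "next_step": (
            "return to the source and align"
            if split
            else "record the agreed level"
        ),
    }


result = review_panel({"rater_a": "L2", "rater_b": "L3"})
print(result["split"], result["next_step"])
```

The sketch deliberately does not average the levels or pick a winner. It only records the disagreement and points back to the source, which is the step the text recommends before any decision.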
Attach confidence to the verdict
A forecast doesn't just say "rain tomorrow"; it says "70% chance." That number lets you decide for yourself whether to carry an umbrella. A creator's verdict is the same: "L3 (confidence: medium; one job only, with the person's account)" is far more honest and usable than a bare "L3."
A low-confidence verdict is not a mistake; it is the state of "not enough evidence yet." That makes the next step clear: gather more evidence. Conversely, declaring "L3" with no confidence noted lets thin evidence walk around as settled fact. That undermines the foundation for the next piece's multiple-viewpoint evaluation and the final piece's pass/fail decision and development plan. Attaching confidence is therefore also a promise that protects the later steps.
| How it is written | What the reader can do | Danger |
|---|---|---|
| Just "L3" | Can only take it at face value | Thin evidence turns into settled fact |
| "L3 (confidence: medium, one job)" | Add evidence or hold | Low (limits are stated) |
One thing must not be forgotten: high confidence only means "measured well," which is separate from "high skill." If we settle for measuring only the visible floor skills because they can be measured reliably, we neglect the hidden adaptation skills. Confidence (how surely it was measured) and level (how much skill there is) are two separate axes; watch both. Mixing them slides into the bias of judging people only by what is easy to measure.
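One way to keep the two axes apart is to store them as separate fields, so a level cannot be written down without its confidence and evidence. The sketch below is an assumption-laden illustration: the class name `Verdict`, the three-step confidence scale, and the output format (modeled on the example wording above) are hypothetical choices, not a fixed scheme.

```python
from dataclasses import dataclass

# Assumed three-step scale; the article itself only shows "medium".
CONFIDENCE_LEVELS = ("low", "medium", "high")


@dataclass(frozen=True)
class Verdict:
    level: str       # how much skill there is, e.g. "L3"
    confidence: str  # how surely it was measured
    evidence: str    # what the confidence rests on

    def __post_init__(self):
        if self.confidence not in CONFIDENCE_LEVELS:
            raise ValueError(f"confidence must be one of {CONFIDENCE_LEVELS}")
        if not self.evidence.strip():
            raise ValueError("state the evidence behind the verdict")

    def __str__(self):
        return f"{self.level} (confidence: {self.confidence}; {self.evidence})"


verdict = Verdict("L3", "medium", "one job only, with the person's account")
print(verdict)
```

Because `level` and `confidence` are independent fields, a verdict such as L1 with high confidence or L3 with low confidence is representable, which matches the point that measuring surely and having high skill are different things.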
Measuring Skill from Work and Behavior ── Map of all 10 episodes
- Vol. 1: Measure by the Materials Actually Made, Not by Impressions or Self-Report ── A material maker's skill is measured from the actual deliverables and observable conduct, not from self-report or others' impressions.
- Vol. 2: Tracing the Brief, the Choices, and the Result — In Order ── Read a creator's skill from evidence by walking through one real project in order: the brief, the thinking, the actions, and the result.
- Vol. 3: Reading "Faithfulness to the Facts" and "Craft of Delivery" Out of the Work Itself ── This installment shows how to break a finished piece down along two axes, faithfulness to the facts and the craft of getting it across, by reading concrete clues, not impressions.
- Vol. 4: The Rules That Keep Measurement Honest ── Six ground rules that keep the evaluator from drifting when measuring an author's real skill.
- Vol. 5: Three Rulers: Accuracy, Clarity, and Balance ── Defines three rulers for grading material-making skill and scores each on a four-step scale: accuracy as the floor, clarity as the reach, and balance as the adjustment between too much and too little.
- Vol. 6: How to Decide the Level — Returning to the Source Sets the Ceiling ── Work that cannot be traced back to its source cannot earn a higher level, however polished it looks. Grounding sets the ceiling.
- Vol. 7: What Deliverables Signal Which Level ── An anchor table that reads a creator's level (L1-L4) from visible deliverables and behavior patterns.
- Vol. 8 (this episode): How Far Can We Trust a Judgment? ── How sure a level judgment is depends on how visible the evidence is; less observable skills produce shakier judgments, so we attach a confidence to each verdict.
- Vol. 9: Combine More Than Self-Assessment: Add the Reviewer's and Requester's View ── Layering four viewpoints — self, reviewer, requester, and AI — surfaces the deviations of omission that a single pair of eyes cannot see.
- Vol. 10 (final): Connecting the Measurement to Pass/Fail and a Development Plan ── The finale links the score to the pass floor and a plan for what to grow next.
Whether a verdict can be trusted depends not only on the rater's skill but on how visible the evidence is. Visible floor skills can be measured firmly; hidden adaptation skills wobble. Rather than erasing that wobble, we thicken the evidence with multiple jobs, the person's account, and multiple eyes, then align at the unmovable floor of the source.
And every verdict carries a confidence: not just "L3" but "L3, medium confidence, one job of evidence." This is a matter of honesty and, at the same time, the foundation that supports the next piece's multiple evaluations and the final piece's pass/fail and development. Never confuse having measured surely with having high skill — that is the dividing line that makes a judgment worthy of trust.
- Observability sets how much a verdict can be trusted. Even a sound ruler wobbles when evidence is invisible. Returning to the source is visible; why it was made is hidden.
- Don't erase the wobble — handle it with thicker evidence. Add multiple jobs, the person's account, and multiple raters; when they split, align at the unmovable floor of the source.
- Attach confidence to the verdict. Note confidence and amount of evidence alongside "L3." Measured surely and high skill are separate axes; watch both.