The Hazard of Impression and Self-Report ── We Measure Only Demonstrated Behavior

When we evaluate people, the least reliable inputs are impression and self-report. "She is excellent" and "I am an L4" drift in meaning depending on who says them and who hears them, much as calling a dish "tasty" leaves open what exactly is good about it. To remove this wobble, this series narrows what we measure to a single point: not the ability someone holds, not the results they posted, but only the behavior they actually demonstrated. Part 1 explains, plainly and in line with the source's rules, why this narrowing is what makes a level genuinely measurable.

Why "excellent" cannot be measured

Words about a material reviewer's skill (a material reviewer checks whether ads and promotional materials meet regulations) usually start from the conclusion: "Her risk detection is sharp," "I can already judge from principles." Each may be true in the speaker's mind, but neither works as evaluation material. The same word "sharp" means, to one person, "never missing a wording violation," and to another, "seeing through how a chart is dressed up to mislead." When the same word points to different things, two evaluations cannot be lined up, just as "good communicator" on an interview sheet means something different to each interviewer.

Self-report has the same weakness. "I am an L4" tells you how accurate the person's self-image is, not what skill they can actually deliver. The calibration gap Δ ── the gap between self-rating and reality, treated in a later part ── exists precisely to record this self-image error in a separate column, so that it is never mixed into the value of skill itself. So from the very start, the foundation of evaluation must be moved off self-report.

Why do this? Unless we measure with the same stick, evaluations in Tokyo and London cannot even be compared. Build evaluation on words and impressions, and it ends up turning on who said it.

Narrowing what we measure to behavior actually performed

The source's idea is simple. We measure neither the ability held nor the outcomes, but only behavior actually demonstrated (called "behavioral evidence" ── proof that something was done). The basis is a long-standing premise of behavioral interviewing: how a person acted in the past is the best available predictor of how they will act in the future. However much motivation, character, or latent ability is discussed, if it has not shown up as behavior, it is not counted as evidence.

Once narrowed to behavior, the nature of evaluation changes. Break behavior into facts anyone would agree on, check them against a shared measuring stick, and evaluation becomes a question of how well the behavior matches the stick, not of how the evaluator views the person. It is like a medical check-up ── instead of a doctor saying "looks healthy" on a hunch, the verdict rests on shared numbers like blood pressure and blood sugar. So even across nationality, language, function, and rank, people can reach the same judgment. Looking at the same record of behavior, an evaluator in Tokyo and one in London arrive at the same reading ── that is what the design aims at.

STAR as the tool that extracts the "two axes"

So how is behavior drawn out? The AI digs into a single "behavioral event" through the layers of Situation (S), Task (T), Action (A), Result (R), and Thinking (+). This is the STAR method common in interviews ── asking, in order, "in what situation (S), what was asked of you (T), what did you do (A), and what happened (R)." This work adds "why you judged that way (+ Thinking)."

What matters is which part of the extracted fact corresponds to which quantity. This work measures skill on two axes. One is abstraction α ── the "depth of thinking": does the ground of judgment merely rely on the written wording, or does it trace back to the intent and principle behind the regulation? The other is scope σ ── the "breadth of reach": does the behavior stay within one's own remit, or does it extend to other domains and unprecedented cases? Alongside the two axes sits grounding g ── the "backing": did the behavior actually happen, rather than merely being claimed? Grounding is not counted as a third axis here; it backs the readings on the two axes. Scope σ appears in Action A, abstraction α in Thinking +, and grounding g in Result R. STAR thus becomes the tool that extracts both axes, together with their backing, directly. The table maps the layers heard to the objects measured.

Layer heard (STAR)	Object that surfaces	Role in evaluation
Situation S / Task T	Context (the circumstances)	A premise that prevents misreading. It does not act on L directly but frames how to read the behavior correctly
Action A (most important)	Scope σ (breadth of reach)	What was concretely done. Did it reach another domain or an unprecedented case
Result R	Grounding g (backing)	The outcome and impact. Confirming it actually happened
+ Thinking (motive)	Abstraction α (depth of thinking)	Why it was judged so. Reliance on wording, or building up from principle and intent

In short, Action A brings scope σ to the surface, Thinking + brings abstraction α, and Result R backs up grounding g. Why? Because the score must not shift with how smoothly the story is told. Only the behavior actually performed fixes the coordinates.
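The mapping in the table can be written down as a small data structure. The following is a minimal sketch, not the source's implementation: the names `Layer`, `SURFACES`, `BehavioralEvent`, and `by_measured_object` are hypothetical, chosen only to show how one behavioral event splits into context, scope σ, abstraction α, and grounding g.

```python
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    SITUATION = "S"
    TASK = "T"
    ACTION = "A"
    RESULT = "R"
    THINKING = "+"


# Which measured object each layer surfaces, following the table above.
# Situation and Task only frame the reading; they feed no measured quantity.
SURFACES = {
    Layer.SITUATION: "context",
    Layer.TASK: "context",
    Layer.ACTION: "scope_sigma",
    Layer.RESULT: "grounding_g",
    Layer.THINKING: "abstraction_alpha",
}


@dataclass
class BehavioralEvent:
    """One past event, recorded layer by layer (hypothetical record shape)."""
    situation: str
    task: str
    action: str
    result: str
    thinking: str

    def by_measured_object(self) -> dict[str, list[str]]:
        """Group the recorded statements by the object they are read for."""
        texts = {
            Layer.SITUATION: self.situation,
            Layer.TASK: self.task,
            Layer.ACTION: self.action,
            Layer.RESULT: self.result,
            Layer.THINKING: self.thinking,
        }
        grouped: dict[str, list[str]] = {}
        for layer, text in texts.items():
            grouped.setdefault(SURFACES[layer], []).append(text)
        return grouped


event = BehavioralEvent(
    situation="A campaign flyer arrived for review two days before print.",
    task="Check the flyer against the advertising rules.",
    action="Flagged the chart's truncated axis and asked the design team to redraw it.",
    result="The redrawn chart went to print; the change is logged in the review record.",
    thinking="The rule exists to prevent a misleading impression, not only wrong wording.",
)
grouped = event.by_measured_object()
```

In this sketch the Action statement is the only input read for scope, and the Thinking statement the only input read for abstraction, which mirrors the point that fluent storytelling elsewhere in the account does not move the coordinates.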

Six rules that keep listening clean (BEI axioms)

To make behavior alone the evidence, the listening itself needs brakes. The source sets six rules that the AI dialogue ── a Behavioral Event Interview, BEI, a method that digs into concrete past events ── must observe. In cooking terms, it is like washing your hands and the cutting board before the taste test. Break the rules and the measured value itself is contaminated. The six follow in order.

Behaviorism ── Count as evidence only what was "actually done," not "can do / could do." Motivation and character are not measured.
Past fact ── No hypothetical "what if" questions. Ask only about events that actually happened. This is the source of the backing (grounding g).
Subject isolation ── Replace "we" with "you / that person," and carve out the one individual's contribution.
Evidence ── Record the concrete behavior (verbs, "did X") that supports a conclusion, not the conclusion (adjectives like "excellent").
Reproducibility ── A one-off fluke, or a recurring habit? Corroborate across multiple events (check the same thing happened in other cases too).
Non-leading ── Do not hint at the desirable answer. The moment the questioner's expectation enters, the value is fouled.
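Three of the six rules (Past fact, Subject isolation, Non-leading) concern the wording of a question, so they can be illustrated with a rough check on draft questions. This is only a sketch under loud assumptions: the cue phrases below are hypothetical examples, not a list from the source, and keyword matching is far cruder than what a real interviewer would need. The function name `check_question` is likewise invented for illustration.

```python
import re

# Hypothetical cue phrases, for illustration only.
HYPOTHETICAL = re.compile(r"\b(what if|would you|suppose|imagine)\b", re.IGNORECASE)
COLLECTIVE = re.compile(r"\b(we|our team)\b", re.IGNORECASE)
LEADING = re.compile(r"\b(don't you think|surely|wouldn't it be better)\b", re.IGNORECASE)


def check_question(question: str) -> list[str]:
    """Return the names of the rules a draft question appears to break."""
    broken = []
    if HYPOTHETICAL.search(question):
        broken.append("Past fact")          # asks about a hypothetical, not an event
    if COLLECTIVE.search(question):
        broken.append("Subject isolation")  # asks about "we", not the individual
    if LEADING.search(question):
        broken.append("Non-leading")        # hints at the desirable answer
    return broken


clean = check_question("What did you do when the chart looked misleading?")
hypothetical = check_question("What would you do if a chart looked misleading?")
collective = check_question("What did we decide about the chart?")
leading = check_question("Surely you escalated it?")
```

The other three rules (Behaviorism, Evidence, Reproducibility) govern what is recorded and how it is corroborated across events, so a wording check like this one does not cover them.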

Backing supports the level ── the "grounding ceiling"

Once behavior is read, how is L (the skill rank) decided? The key is the rule that "talk alone is not counted." A claim with no backing (grounding g=0) does not raise the level. On each axis, take only the highest band that is backed by evidence. The source calls this the grounding ceiling ── the "ceiling" of how far the backing reaches. As with focus in a photo, only what is sharply in focus counts as skill; a blurry claim is left out of the count.

Here is the idea in words. The grounding ceiling of abstraction (written A-hat) reads as "the highest band where the total backing gathered by evidence at or above that band reaches the required amount (threshold τ_g, default 2)." Put simply: you can only claim up to the band where enough backing has accumulated. The grounding ceiling of scope (S-hat) works the same way, but it closes one loophole ── no matter how many same-type cases you pile up, scope is capped at "band 1," and only when two or more cases from different kinds of domains are present is "band 2 or higher" allowed. The trick of racking up case counts on experience alone is blocked here. Why? Because doing the same job 100 times does not widen your "breadth." It looks intricate, but the point is one sentence ── "only when the behavior actually performed satisfies that band on both depth (α) and breadth (σ) may you claim that level." You can skip the formulas; that one sentence carries the point.
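For readers who do want the mechanics, the two ceilings described above can be sketched in a few lines. This is an illustration under stated assumptions, not the source's scoring algorithm: bands are taken to be integers starting at 1, a result of 0 is used to mean "no band is backed," the top band (`max_band=3`) is a guess, and the different-domain condition is read as "at least two distinct domain kinds among backed cases." The names `Evidence`, `abstraction_ceiling`, and `scope_ceiling` are hypothetical. The later step that turns the two ceilings into L is not shown.

```python
from dataclasses import dataclass

TAU_G = 2  # required amount of backing (threshold τ_g, default 2 per the text)


@dataclass(frozen=True)
class Evidence:
    alpha: int   # abstraction band this event reaches
    sigma: int   # scope band this event reaches
    g: float     # grounding: 0 means claimed but not backed
    domain: str  # kind of domain the case belongs to (hypothetical label)


def abstraction_ceiling(evidence: list[Evidence], tau_g: float = TAU_G, max_band: int = 3) -> int:
    """Highest band where the summed g of evidence at or above it reaches tau_g."""
    for band in range(max_band, 0, -1):
        if sum(e.g for e in evidence if e.alpha >= band) >= tau_g:
            return band
    return 0  # assumption: 0 stands for "no backed band"


def scope_ceiling(evidence: list[Evidence], tau_g: float = TAU_G, max_band: int = 3) -> int:
    """Same accumulation rule on sigma, then the same-type cap at band 1."""
    ceiling = 0
    for band in range(max_band, 0, -1):
        if sum(e.g for e in evidence if e.sigma >= band) >= tau_g:
            ceiling = band
            break
    # Assumption: domain variety is counted over backed cases only (g > 0).
    backed_domains = {e.domain for e in evidence if e.g > 0}
    if len(backed_domains) < 2:
        ceiling = min(ceiling, 1)
    return ceiling


same_type = [
    Evidence(alpha=3, sigma=2, g=1, domain="ads"),
    Evidence(alpha=2, sigma=2, g=1, domain="ads"),
    Evidence(alpha=3, sigma=3, g=0, domain="web"),  # claimed only, no backing
]
mixed = same_type + [Evidence(alpha=1, sigma=2, g=1, domain="packaging")]
```

With `same_type`, abstraction is capped at band 2 because only one backed event reaches band 3, and scope is capped at band 1 because every backed case comes from the same domain. Adding one backed case from a different domain, as in `mixed`, lifts the scope ceiling to band 2 while leaving abstraction unchanged.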

Binding three documents into one

This measurement design pulls three separate documents into one operation. Each answers a different question. In recipe terms, it is like one booklet covering "how to choose ingredients," "the plating standard," and "the taste score sheet." The table shows the division of roles.

Document	Question it answers	Place in this series
BEI blueprint (the listening plan)	How to listen so that behavior alone becomes evidence	STAR and the six rules. The design of uncontaminated listening
Framework (Series 1, the stick)	What to measure, on which coordinates	The two axes and L1–L4. The map of what is measured
Scoring algorithm (the scoring steps)	How to compute L from evidence	The grounding ceiling and "projection onto the main diagonal." The rules of calculation

This work makes the three actually runnable as an AI dialogue with the person being assessed plus several third parties. Series 1, "Framework," fixed the two axes and L1–L4; Series 2, "Qualifying Line," fixed the non-compensatory gate (a pass/fail line where a weakness on one side fails the whole and cannot be offset by strength on another). This series fills in the "how to measure" between them: using behavioral evidence and multiple eyes, it closes the gap between what is measured (Series 1) and how pass or fail is decided (Series 2), namely how to derive L from evidence.

Measurement Design ── Map of all 10 episodes

Vol. 2: Listening Through STAR ── Situation, Task, Action, Result, Thought ── Pick just one thing that actually happened in the past and ask about it in five parts: the setting (Situation), what was assigned (Task), what the person did (Action), what came of it (Result), and why they decided as they did (Thinking). Spend more than half the time on the Action, write down what they did as verbs, check through the Result that it really happened, and draw out the root of the judgment through the Thinking.
Vol. 3: Encoding to Two Axes ── Action Reveals Scope, Thought Reveals Abstraction ── Turns one "what they actually did" story heard in an interview into three readings: how widely they moved (scope σ), what reasoning they used (abstraction α), and whether it really happened (grounding g). Worked through with a concrete material-review example.
Vol. 4: The Six BEI Principles ── Axioms That Keep the Measurement Clean ── What a person actually did, told through a four-point way of asking, gets converted into three rulers: depth of thinking, reach of action, and whether a real episode backs it up. The person's reading is then the highest level that the episodes actually support. This installment explains, with everyday examples, the six interview manners that keep that conversion from getting muddied.
Vol. 5: Three Bands ── The Scales of Abstraction α, Scope σ, and Grounding g ── Before any level verdict, this issue sets the three rulers for measuring the behavior we heard: how high the reasoning goes, how far the action reached, and how firmly it is backed by fact. Measured in steps, not scores.
Vol. 6: How L Is Decided ── The Grounding Ceiling and Projection to the Diagonal ── Talk without backing does not raise the level. Take only the reach that real behavior confirms, even out the two measures, and read L.
Vol. 7: The Behaviors That Separate Levels ── Eight-Dimension Anchors and Boundaries ── Using a sample book of "what they actually did" (the anchor table), we match a person's account to the closest sample to decide the level (L1 to L4). All eight abilities are measured by the same method.
Vol. 8: Confidence and Observability ── How Far to Trust a Reading ── An episode about putting a number on how sure a rating is. Confidence C comes from how much evidence there is, whether the story holds together, and whether the rater could see it; observability o comes from being well placed and actually producing evidence; their product, weight w, feeds the final tally.
Vol. 9: Multi-Party AI Dialogue ── Corroboration for Others' Level, Divergence for Calibration ── One pair of eyes cannot measure a person. The subject and several colleagues take the same structured interview (BEI); each vote is weighted by how well that person actually saw the scene, and only readings that other votes back up are bound into an outside view of the level. The gap from the subject's self-rating is kept in a separate column as "how accurately they see themselves," not as ability.
Vol. 10 (final): From Integrated Output to the Qualifying Line ── The Record and the Operating Procedure ── The closing piece of Series 3 on measurement design. In plain terms it explains how the per-person, per-item score sheet hands each number to the right checkpoint in the pass/fail decision, and walks through the seven steps for actually running the measurement.

In closing

Part 1 comes down to one claim. L (the skill rank) is decided neither by self-report nor by impression. We measure only behavior actually demonstrated, break that behavior into facts anyone would agree on, and read it as how well it matches a shared measuring stick. Evaluation thus moves from a clash of personal impressions to a matching task against the stick.

From the next part, this idea is brought down into actual computation: how the evidence heard through STAR is sorted into depth α, breadth σ, and backing g, and how L is derived from the grounding ceiling. We advance from the "why" of the design to the "how" of the measurement.

Key Points ── Three to take with you

Measure only behavior actually performed ── Neither held ability nor outcomes; past behavior is treated as "the evidence that best predicts future behavior." Motivation and character are not measured.
STAR is the tool that extracts the two axes ── Action A surfaces breadth σ, Thinking + surfaces depth α, Result R surfaces backing g, and the six rules keep listening clean.
Backing supports the level ── Talk alone (g=0) is not counted; the "grounding ceiling" reads the highest backed band per axis, with same-type stacking capped at scope band 1.

Sources & references

McClelland, D. C. Testing for Competence Rather Than for Intelligence. American Psychologist, 1973. (Origin of the competency movement: measure behavior rather than aptitude tests.)
Boyatzis, R. E. The Competent Manager: A Model for Effective Performance. Wiley, 1982. (Systematized competence measurement via the Behavioral Event Interview, BEI.)
Smith, P. C., & Kendall, L. M. Retranslation of Expectations: An Approach to the Construction of Unambiguous Anchors for Rating Scales. Journal of Applied Psychology, 1963. (Origin of Behaviorally Anchored Rating Scales, BARS; theoretical support for utterance anchors.)
Spencer, L. M., & Spencer, S. M. Competence at Work: Models for Superior Performance. Wiley, 1993. (Practical standard for BEI and encoding.)
