In the earlier parts we set up a coordinate grid built from two rulers (the line of reasoning, and how far a person can move), the four levels L1 through L4, and a bottom line below which someone fails. One question remains: from a person's own account, how do we actually read those coordinates? This piece deals only with that. The key is not to confuse which part of a "what they concretely did in the past" story maps to which ruler. The way of asking itself becomes the measuring device.
Do not confuse the story you hear with the ruler you measure
An interview is like a recipe. A list of ingredients tells you nothing about the cook's skill. Only by seeing how the hands actually moved, and what came out of the oven, do you learn the skill. The source makes the same point first. You listen across the four STAR layers, situation (what kind of scene), task (what was asked), action (what they concretely did), and result (how it turned out), and one addition, plus-thinking (why they judged that way). This way of asking is called STAR for short. But these layers do not all feed the score. The glamour of the scene and the difficulty of the task (situation and task) are only background for understanding the story; they do not move the score. Only three things move it: action, result, and plus-thinking. Mix this up, and you overrate a story that has a flashy stage but an ordinary core. Why fix the mapping first? Because it is the foundation that keeps you from drifting later.
| The story you hear (STAR) | The ruler it maps to | What to look at |
|---|---|---|
| Situation / Task (the scene and what was asked) | Background (does not move the score directly) | A premise that prevents misjudgment. It only frames how to read the story. |
| Action (what they concretely did / most important) | Scope sigma = how far they can move | How far the person reached. Did their hands extend to other fields or unprecedented cases? |
| Result (how it turned out) | Grounding g = backing for the story | Whether the action really happened. The source of the evidence. |
| Plus-thinking (why they judged that way) | Abstraction alpha = line of reasoning | What the judgment rested on. "Because the text says so" or "reasoning from the aim"? |
In one sentence: what they did shows how far they can move, why they did it shows their line of reasoning, and how it turned out backs up the story. The fixed way of asking becomes the very tool for sampling all three. We now unpack them in turn.
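The mapping in the table can be sketched as a small record type. This is only an illustration: the class and field names (`StarAccount`, `Reading`, `action_band` and so on) are hypothetical, and the bands themselves are still assigned by the rater. The point it shows is that situation and task are carried along as background but never enter the reading.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StarAccount:
    """One past event as heard in the interview (field names are hypothetical)."""
    situation: str       # background only, frames how to read the story
    task: str            # background only
    action_band: int     # rater's reading of the action, 0-3
    thinking_band: int   # rater's reading of the plus-thinking, 0-3
    result_step: int     # rater's reading of the result, 0-2


@dataclass(frozen=True)
class Reading:
    sigma: int  # scope: how far they can move
    alpha: int  # abstraction: line of reasoning
    g: int      # grounding: backing for the story


def to_reading(account: StarAccount) -> Reading:
    """Map the three scoring layers to the three rulers; ignore situation and task."""
    return Reading(
        sigma=account.action_band,
        alpha=account.thinking_band,
        g=account.result_step,
    )


plain = StarAccount("routine review", "check one leaflet", 1, 1, 1)
flashy = StarAccount("company-wide crisis", "save a product launch", 1, 1, 1)
print(to_reading(plain) == to_reading(flashy))  # True: the stage does not move the reading
```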
What they did shows how far they can move
Scope sigma (how far they can move) is the band that measures how far the person reached. Just as a health check reports a number in bands (normal, watch, needs follow-up), we read it in four bands. 0 is staying inside one familiar case, 1 is several cases but of the same type, 2 is applying it to a different field, and 3 is reaching an unprecedented, cross-cutting structure. The key point: this reach shows up only in "what they actually did." However fine a theory the person recites, if their hands stayed inside a familiar case, the reach stays 0. Why? Because you can dress up the telling, but you cannot dress up where the action actually landed.
There is a device here that seals off the experience-reliance trap. Handle dozens of cases of the same type, and the reach is still capped at band 1. Cook the same dish a hundred times and you are still "good at one staple"; you have not shown you can move from Japanese cooking to baking bread. Put the idea in words: only once a certain amount of evidence (by default, two pieces' worth) has piled up at a given reach or above do we credit that reach. And band 2 or higher counts only when two or more different fields are present. As a formula, the scope ceiling S-hat is the largest s that meets two conditions: the backing mass of evidence at reach s or above hits the threshold tau_g (default 2), and, for s of 2 or above, that evidence spans at least two distinct fields. For s of 1 or below the field condition does not apply. In short, not the quantity of experience but its quality, that is, whether it moved to a structurally different place, is the condition for band 2.
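The scope ceiling described above can be sketched in a few lines. Several things here are assumptions rather than the source's definition: "backing mass" is taken to be the sum of the grounding values g of the episodes at reach s or above, the field count is taken over those same episodes, the ceiling falls back to 0 when no band qualifies, and the names `Evidence` and `scope_ceiling` are hypothetical.

```python
from dataclasses import dataclass

TAU_G = 2  # default threshold named in the text


@dataclass(frozen=True)
class Evidence:
    sigma: int   # scope band the episode reached, 0-3
    g: int       # grounding step of the episode, 0-2
    field: str   # label for the field the action took place in (hypothetical)


def scope_ceiling(evidence, tau_g=TAU_G):
    """Largest band s whose backing mass meets tau_g; bands 2+ also need 2 distinct fields."""
    for s in (3, 2, 1):
        backed = [e for e in evidence if e.sigma >= s and e.g > 0]
        mass = sum(e.g for e in backed)          # assumption: mass = sum of g
        fields = {e.field for e in backed}
        if mass >= tau_g and (s <= 1 or len(fields) >= 2):
            return s
    return 0  # assumption: nothing qualified, so the reach stays at the lowest band


same_type = [Evidence(sigma=1, g=1, field="promo")] * 30
two_fields = [Evidence(2, 1, "promo"), Evidence(2, 1, "patient booklet")]
print(scope_ceiling(same_type))   # 1: dozens of same-type cases stay at band 1
print(scope_ceiling(two_fields))  # 2: two distinct fields open band 2
```

Under these assumptions, a single grounded episode at band 1 is not yet enough (its mass is 1), and two band-2 episodes in the same field fall back to band 1.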
Why they did it shows their line of reasoning
Abstraction alpha (line of reasoning) measures what the judgment rested on. 0 is leaning on text and procedure, 1 is linking several requirements, 2 is reasoning from a principle or aim, 3 is forming a new principle or type yourself — four bands. This shows up in the layer where the person says why they judged as they did. With the same action, the line of reasoning shifts sharply depending on whether the basis was "because the article says so" or "because the aim of the regulation is such." Why ask about the motive at all? Because the same right answer reached by rote versus by reasoning gives wildly different odds of repeating it in a new setting.
The easy mistake here is to misread a smooth reciter of principles as having a high line of reasoning. A salesperson with a slick pitch does not necessarily understand the product. Likewise, the source treats a principle recited without a concrete event behind it as an "ungrounded claim" (grounding g equals 0) and does not raise the level. Not the polish of the telling, but the principle tied to "what they concretely did in the past," is what makes the reasoning stand. The next section is that brake.
How it turned out is the backing for the story
Grounding g (backing) is the weight that keeps the reasoning and reach readings from being castles in the air. It has three steps: 0 is claim only (no concrete event), 1 is a concrete past event where you can say "when, who, what," and 2 is backed by a counter-case or confirmed by repetition across several events. The core of the rule is simple: an ungrounded claim does not raise the level. On each ruler take the highest band that the backing supports, then combine the two to read L.
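The rule that an ungrounded claim does not raise the level can be illustrated on the reasoning ruler. The text does not spell out a formula for the reasoning ceiling A-hat, so this sketch assumes it mirrors the scope rule: the backing mass (sum of g) at band a or above must reach tau_g, with no field condition. The names `Claim` and `abstraction_ceiling` are hypothetical.

```python
from dataclasses import dataclass

TAU_G = 2  # assumption: same default threshold as on the scope ruler


@dataclass(frozen=True)
class Claim:
    alpha: int  # abstraction band of the stated reasoning, 0-3
    g: int      # grounding step, 0 means claim only


def abstraction_ceiling(claims, tau_g=TAU_G):
    """Largest band a whose backing mass (sum of g at band a or above) meets tau_g."""
    for a in (3, 2, 1):
        mass = sum(c.g for c in claims if c.alpha >= a)
        if mass >= tau_g:
            return a
    return 0


recital = Claim(alpha=3, g=0)                       # fluent principle-talk, no event
events = [Claim(alpha=1, g=1), Claim(alpha=1, g=1)]  # two concrete past events
print(abstraction_ceiling([recital]))            # 0
print(abstraction_ceiling([recital] + events))   # 1
```

Because the recital carries g = 0, it adds nothing to the mass at any band, so the ceiling is set entirely by the grounded events.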
Combining them is like focusing a camera. Call the reasoning ceiling A-hat and the reach ceiling S-hat. Put the idea in words: take the midpoint of the two ceilings; if it is low it reads L1, and as it rises, L2, L3, L4. As a formula, compute the average p equals (A-hat plus S-hat) divided by 2; below 0.5 is L1, 0.5 up to 1.5 is L2, 1.5 up to 2.5 is L3, 2.5 or above is L4. The gap b equals A-hat minus S-hat tells you the direction of what is unfinished: positive means theory-heavy (top-heavy), negative means experience-reliant (the hands move but cannot put it into words), near zero means both wheels are present. In one sentence: take the highest L that grounded behavior satisfies on both the reasoning and the reach; do not count claims alone. Why require both? Because high on only one side is not yet real competence.
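The midpoint rule and the gap can be written out directly from the thresholds given above. The function names are hypothetical, and treating only an exact zero as "both wheels present" is an assumption that fits integer ceilings; a tolerance would be needed if the ceilings were fractional.

```python
def level(a_hat, s_hat):
    """Read L from the midpoint p of the reasoning ceiling and the reach ceiling."""
    p = (a_hat + s_hat) / 2
    if p < 0.5:
        return "L1"
    if p < 1.5:
        return "L2"
    if p < 2.5:
        return "L3"
    return "L4"


def gap_direction(a_hat, s_hat):
    """Name the direction of what is unfinished from the gap b = A-hat minus S-hat."""
    b = a_hat - s_hat
    if b > 0:
        return "theory-heavy"
    if b < 0:
        return "experience-reliant"
    return "balanced"


print(level(2, 2), gap_direction(2, 2))  # L3 balanced
print(level(3, 1), gap_direction(3, 1))  # L3 theory-heavy
```

The second call shows why the gap is reported alongside L: two people can land on the same level with very different shapes.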
Breaking one event into three: a material-review example
Let us bring it down to the concrete. The scene is this. In a trial of a drug, the main measure showed no difference. An asset (a promotional piece) takes that graph, adds an arrow, and emphasizes the point where the lines cross — a borderline sheet that makes a non-difference look superior. Here are four testimonies about it, encoded along the source's anchors (model utterance examples). Just as several doctors read the same X-ray, competence splits by the reaction to the same asset.
| L / coordinate | Utterance anchor (testimony heard in interview) | Encoding |
|---|---|---|
| L1 (0,0) | Judged "no problem, since it doesn't say superior." Looking only at the surface of the text. | alpha0 sigma0 g1 |
| L2 (1,1) | Reacted to a familiar emphasis pattern, such as "an overblown golden-cross-grade display." | alpha1 sigma1 g1 |
| L3 (2,2) | Saw the intent of the construction — "the axis, arrow, and layout make a non-difference look superior" — and caught the same trick in a figure-free patient booklet. | alpha2 sigma2 g1 (2 fields) |
| L4 (3,3) | Defined "even objective material can manipulate impressions through presentation" as a new review lens, and other reviewers now use that lens. | alpha3 sigma3 g2 (adopted by others) |
The L1-to-L2 difference is whether they only read the surface of the text or matched it to a familiar pattern. The L2-to-L3 difference is whether, having put the intent into words, they could apply it to a separate field (the figure-free booklet). The L3-to-L4 difference is whether others use the review lens the person built. Every boundary is decided not by the person's character or impression but by which utterance anchor the testimony sits closest to. All the evaluator does is match the testimony to the nearest utterance anchor in the middle column and assign the encoding to its right. Why this way? To stop each evaluator's standard from drifting.
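As a quick consistency check, the encodings in the table can be parsed and run through the midpoint thresholds stated earlier (below 0.5 is L1, up to 1.5 is L2, up to 2.5 is L3, otherwise L4). The dictionary layout and the function names are hypothetical; choosing the nearest anchor remains the rater's judgment and is not automated here.

```python
import re

# Encodings copied from the anchor table, keyed by the level in the first column.
ANCHORS = {
    "L1": "alpha0 sigma0 g1",
    "L2": "alpha1 sigma1 g1",
    "L3": "alpha2 sigma2 g1",
    "L4": "alpha3 sigma3 g2",
}


def parse_encoding(text):
    """Turn 'alpha2 sigma2 g1' into the integer triple (2, 2, 1)."""
    match = re.fullmatch(r"alpha([0-3]) sigma([0-3]) g([0-2])", text)
    if match is None:
        raise ValueError(f"not a valid encoding: {text!r}")
    alpha, sigma, g = (int(part) for part in match.groups())
    return alpha, sigma, g


def level_from_midpoint(alpha, sigma):
    """Midpoint thresholds as stated in the text."""
    p = (alpha + sigma) / 2
    if p < 0.5:
        return "L1"
    if p < 1.5:
        return "L2"
    if p < 2.5:
        return "L3"
    return "L4"


for label, encoding in ANCHORS.items():
    alpha, sigma, g = parse_encoding(encoding)
    print(label, (alpha, sigma), level_from_midpoint(alpha, sigma) == label)
```

Each row prints `True`, so the coordinate in the first column and the encoding in the third agree with the midpoint rule.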
Four places where encoding easily slips
Here are the typical encoding errors in the field, set against the source's six principles for listening. Each follows from the single line: make only behavior the evidence. It is like sharing the common misjudgments up front so several referees do not miss the same foul.
- Confusing how hard the scene was with how far they can move. Having handled a hard case does not raise the reach. Only whether, in that case, they actually reached a separate field counts. Why? Because "was present at a tough site" and "got something done at a tough site" are different things.
- Reading smooth principle-talk as a high line of reasoning. Stating a principle counts as 0 unless it ties to a concrete past behavior (backing of 1 or more). Eloquence is not measured.
- Mistaking many same-type cases for reach 2. Stacking the same type, however many times, caps the reach at band 1. Two or more different fields is the condition for band 2.
- Failing to separate out "we." Unless team outcomes are brought down to that person's individual action, whose credit it is stays undefined. Pinning the subject is exactly what decides whose reach and reasoning it was.
What the four share is mistaking the surface impressiveness of testimony — difficulty, fluency, count, team outcome — for the substance of the behavior. Encoding always returns to a concrete event that can be backed by how it turned out.
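Three of the four slips can be caught mechanically once a draft encoding is written down. The sketch below is a hypothetical checklist, not a procedure from the source: the `Draft` fields `n_fields` and `individual_action` are invented for illustration, and the first slip (scene difficulty) has no check because situation and task never enter the encoding in the first place.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Draft:
    alpha: int               # drafted abstraction band, 0-3
    sigma: int               # drafted scope band, 0-3
    g: int                   # drafted grounding step, 0-2
    n_fields: int            # distinct fields the action actually reached (hypothetical)
    individual_action: bool  # False if the account only ever says "we" (hypothetical)


def slips(draft):
    """Return warnings for the encoding slips that can be checked on the record itself."""
    found = []
    if draft.g == 0 and draft.alpha > 0:
        found.append("principle-talk without backing: reasoning counts as 0")
    if draft.sigma >= 2 and draft.n_fields < 2:
        found.append("same-type cases only: reach is capped at band 1")
    if not draft.individual_action:
        found.append("'we' not separated: pin down the person's own action first")
    return found


clean = Draft(alpha=2, sigma=2, g=1, n_fields=2, individual_action=True)
shaky = Draft(alpha=3, sigma=2, g=0, n_fields=1, individual_action=False)
print(slips(clean))       # []
print(len(slips(shaky)))  # 3
```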
Measurement Design ── Map of all 10 episodes
- Vol. 2: Listening Through STAR ── Situation, Task, Action, Result, Thought ── Pick just one thing that actually happened in the past and ask about it in five parts: the setting (Situation), what was assigned (Task), what the person did (Action), what came of it (Result), and why they decided as they did (Thinking). Spend more than half the time on the Action, write down what they did as verbs, check through the Result that it really happened, and draw out the root of the judgment through the Thinking.
- Vol. 3 (this episode): Encoding to Two Axes ── Action Reveals Scope, Thought Reveals Abstraction ── Turning one "what they actually did" story heard in an interview into three readings — how widely they moved (scope sigma), what reasoning they used (abstraction alpha), and whether it really happened (grounding g) — worked through a concrete material-review example.
- Vol. 4: The Six BEI Principles ── Axioms That Keep the Measurement Clean ── What a person actually did, told through a four-point way of asking, gets converted into three rulers: depth of thinking, reach of action, and whether a real episode backs it up. The person's reading is then the highest level that the episodes actually support. This installment explains, with everyday examples, the six interview manners that keep that conversion from getting muddied.
- Vol. 5: Three Bands ── The Scales of Abstraction α, Scope σ, and Grounding g ── Before any level verdict, this issue sets the three rulers for measuring the behavior we heard: how high the reasoning goes, how far the action reached, and how firmly it is backed by fact. Measured in steps, not scores.
- Vol. 6: How L Is Decided ── The Grounding Ceiling and Projection to the Diagonal ── Talk without backing does not raise the level. Take only the reach that real behavior confirms, even out the two measures, and read L.
- Vol. 7: The Behaviors That Separate Levels ── Eight-Dimension Anchors and Boundaries ── Using a sample book of "what they actually did" (the anchor table), we match a person's account to the closest sample to decide the level (L1 to L4). All eight abilities are measured by the same method.
- Vol. 8: Confidence and Observability ── How Far to Trust a Reading ── An episode about putting a number on how sure a rating is. Confidence C comes from how much evidence there is, whether the story holds together, and whether the rater could see it; observability o comes from being well placed and actually producing evidence; their product, weight w, feeds the final tally.
- Vol. 9: Multi-Party AI Dialogue ── Corroboration for Others' Level, Divergence for Calibration ── One pair of eyes cannot measure a person. The subject and several colleagues take the same structured interview (BEI); each vote is weighted by how well that person actually saw the scene, and only readings that other votes back up are bound into an outside view of the level. The gap from the subject's self-rating is kept in a separate column as "how accurately they see themselves," not as ability.
- Vol. 10 (final): From Integrated Output to the Qualifying Line ── The Record and the Operating Procedure ── The closing piece of Series 3 on measurement design. In plain terms it explains how the per-person, per-item score sheet hands each number to the right checkpoint in the pass/fail decision, and walks through the seven steps for actually running the measurement.
Part 3 comes down to one thing: the fixed way of asking (STAR) is at once a way of listening and a way of measuring — a tool that samples how far they can move from what they did, the line of reasoning from why they did it, and the backing from how it turned out. Fix this mapping, and evaluation changes shape, from "how do I see that person" to "which model utterance does the testimony sit closest to." From character judgment to pattern matching, you could say.
Next time we move to how far a single reading like this may be treated as "final" — separating provisional from final by the quantity of evidence, the coherence of the account, and whether the rater could even see it. It is the entrance to the multi-rater design where one set of eyes is backed up by several.
- Each part of the story maps one-to-one to a ruler. What they did samples how far they can move, why they did it samples the line of reasoning, how it turned out samples the backing; the scene and the task are background and do not move the score.
- An ungrounded claim does not raise the level. On each ruler take the highest band that has backing, look at the midpoint of the two, and read L. Actual behavior, not eloquence, decides it.
- Encoding is just picking the closest model utterance. The evaluator matches testimony to a model utterance and assigns the encoding, so L is decided by closeness to a behavior pattern, not by an impression of the person.
- McClelland, D. C. Testing for Competence Rather Than for "Intelligence". American Psychologist, 1973. (Origin of treating past behavior as a predictor of competence.)
- Boyatzis, R. E. The Competent Manager: A Model for Effective Performance. Wiley, 1982. (Systematization of the Behavioral Event Interview, BEI.)
- Smith, P. C., & Kendall, L. M. Retranslation of Expectations: An Approach to the Construction of Unambiguous Anchors for Rating Scales. Journal of Applied Psychology, 1963. (Source of behaviorally anchored rating scales, BARS, and anchoring.)
- Spencer, L. M., & Spencer, S. M. Competence at Work: Models for Superior Performance. Wiley, 1993. (Design mapping competency levels to behavioral indicators.)