How L Is Decided ── The Grounding Ceiling and Projection to the Diagonal

L is not a single point read off one line. It is one reading derived from two measures, depth of thinking and breadth of application, each taken from what a person actually did and then evened out into one. Earlier parts fixed what the two measures mean and the rule that the pass line cannot be cleared by one measure alone. This part covers how those two scales actually become a number. One rule matters most: talk without backing, however well told, does not raise the level by even a single step.

The Grounding Ceiling — Not "the Highest Claimed" but "the Highest Backed"

First, the words. Three measures show up in this part. "Depth of thinking" (alpha in the jargon) is how deep the reasoning behind an action went: did the person act on the literal wording, or reason from principle? "Breadth of application" (sigma) is how wide a range their judgment reached: only one type of case, or another field entirely. "Backing" (g, also called grounding) is whether the talk is supported by an event that actually happened: a bare claim has zero backing; a concrete past event told in full has backing. In a checkup, saying "I run every day" is easy; what counts is the weight and blood-pressure numbers. Take facts, not words.

When the interview ends, you are left with several "pieces of behavior" about one ratee, each tagged with depth, breadth, and backing. It is tempting to take the highest depth that appeared and be done. That is the trap, because people speak most fluently about the big ideas they cannot back up.

So before evening out the two measures, fix per axis the highest level that actual events support. Call it the grounding ceiling. The depth ceiling (A-hat) is set this way: for each level, add up the backing of the pieces showing a depth at or above that level, and take the highest level whose sum reaches the threshold (tau_g, default 2). In words: if someone claims to have reasoned from principle, that reasoning must be backed by concrete past events worth two points in total before that depth is accepted as reached. Why two points? Because we want to confirm a repeatable real ability, not a one-off accident. The formula is just a footnote: A-hat is the highest level a for which the backing summed over pieces of depth a or more is at least tau_g (2 by default).
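The depth-ceiling rule above can be sketched in a few lines. This is a minimal illustration, not the series' official procedure: the `Piece` class, its attribute names, and the fallback to level 0 when no level meets the threshold are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Piece:
    """One tagged piece of behavior (names are illustrative)."""
    depth: int      # alpha: depth-of-thinking level tagged on this piece, 0..3
    backing: float  # g: 0 for a bare claim, points for backed concrete events


def depth_ceiling(pieces, tau_g=2.0, max_level=3):
    """Highest level a whose backing, summed over pieces with depth >= a,
    reaches tau_g. Returning 0 when nothing qualifies is an assumption."""
    for level in range(max_level, 0, -1):
        backing_sum = sum(p.backing for p in pieces if p.depth >= level)
        if backing_sum >= tau_g:
            return level
    return 0


pieces = [
    Piece(depth=1, backing=1),  # backed event at depth 1
    Piece(depth=2, backing=1),  # a single backed event at depth 2
    Piece(depth=3, backing=0),  # fluent claim with no backing
]
# Level 3 sums to 0, level 2 sums to 1, level 1 sums to 2.
print(depth_ceiling(pieces))  # prints 1
```

The loop walks from the top level down so that the first level meeting the threshold is also the highest one; the unbacked depth-3 claim never enters any sum with a nonzero value.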

The breadth ceiling (S-hat) works the same way, with one extra fence. No matter how many similar-type cases are stacked, breadth is capped at level 1. To grant level 2 (applied to another field) or higher, at least two backed fields are required. Why the fence? A veteran who has handled a hundred similar cases is certainly fast, but that is repetition of one type, not proof of reaching another field. In cooking: making the same curry a hundred times does not prove you can make stewed dishes in general. Count does not widen breadth; only a difference of field does.
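The breadth ceiling with its fence might look like the sketch below. Several things here are assumptions for illustration: the `Piece` class, the `fields` attribute naming where an event was confirmed, and the choice to count distinct fields only among backed pieces that reach the level being tested. The source states the fence but not this exact bookkeeping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Piece:
    breadth: int       # sigma: breadth-of-application level, 0..3
    backing: float     # g: backing points for this piece
    fields: frozenset  # hypothetical tag: fields where the event was confirmed


def breadth_ceiling(pieces, tau_g=2.0, max_level=3):
    """Highest backed breadth level, with the fence: level 2 or higher
    also needs at least two distinct backed fields."""
    for level in range(max_level, 0, -1):
        reaching = [p for p in pieces if p.breadth >= level and p.backing > 0]
        if sum(p.backing for p in reaching) < tau_g:
            continue
        distinct_fields = set().union(*(p.fields for p in reaching))
        if level >= 2 and len(distinct_fields) < 2:
            continue  # fence: repetition within one field does not widen breadth
        return level
    return 0


# A hundred curries, even if optimistically tagged breadth 2, stay at level 1.
curries = [Piece(breadth=2, backing=1, fields=frozenset({"curry"}))] * 100
print(breadth_ceiling(curries))  # prints 1

# Two backed fields clear the fence.
two_fields = [
    Piece(breadth=2, backing=1, fields=frozenset({"curry"})),
    Piece(breadth=2, backing=1, fields=frozenset({"stew"})),
]
print(breadth_ceiling(two_fields))  # prints 2
```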

Projection onto the Main Diagonal — Two Readings into One

Once A-hat and S-hat are set, L comes out almost automatically. The ideal is for depth and breadth to grow equally: (depth 0, breadth 0), (1,1), (2,2), (3,3), a diagonal climbing in step. Call it the main road. A real person need not sit neatly on it; depth can be high while breadth lags. So we "even out" the coordinate onto the main road. That is the projection, and the method is plain: take the average of the two ceilings, then cut it into four bands to get L. An average below 0.5 is L1; from 0.5 up to but not including 1.5 is L2; from 1.5 up to but not including 2.5 is L3; 2.5 or more is L4. Because the ceilings are whole levels, the average often lands exactly on a boundary, and a boundary value belongs to the higher band.
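The projection is small enough to write out directly. The sketch below assumes that an average landing exactly on a boundary goes to the higher band, which is how "below 0.5 is L1" and "2.5 or more is L4" read; the function name is illustrative.

```python
def project_to_level(a_hat, s_hat):
    """Average the two ceilings and cut the result into four bands."""
    mean = (a_hat + s_hat) / 2
    if mean < 0.5:
        return "L1"
    if mean < 1.5:
        return "L2"
    if mean < 2.5:
        return "L3"
    return "L4"


# Points on the main road, then two off-diagonal coordinates.
for a_hat, s_hat in [(0, 0), (1, 1), (2, 2), (3, 3), (2, 0), (3, 1)]:
    print((a_hat, s_hat), project_to_level(a_hat, s_hat))
```

The four diagonal points map to L1 through L4 in order, while (2, 0) averages to 1 and lands in L2, and (3, 1) averages to 2 and lands in L3.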

Why the average? It matters. Someone high on depth but not breadth, or wide on breadth but thin on depth, lands in the middle once averaged. So a spike on one axis cannot carry L to the top of that axis unless the other axis supports it. This embeds the pass line's rule that one axis alone cannot clear it a second time, inside the L computation. Like baseball, where it takes several umpires agreeing to call an out, the level rises only when both axes line up. In one sentence: L is the band that backed depth and backed breadth reach together; claims alone do not count.

Stage	What you do	Failure it blocks
Encoding (translating)	Tag each piece with depth, breadth, backing	Evaluation by impression or adjectives
Grounding ceiling	Highest band whose backing-sum reaches the threshold (2) becomes A-hat/S-hat	Adopting unbacked big talk
Breadth cap	Same-type repetition capped at level 1; level 2+ needs two different fields	Padding by count (experience-reliance)
Even onto main road	Average the two ceilings, cut into a band	Over-rating someone high on one axis only
Wing (lean)	Depth minus breadth records the direction of incompleteness	Mis-prescribing development

The Wing — Same L, Different Substance

Rounding L to one number sheds part of the two measures. At the same L2, one person can state principle but has not yet applied it to another field, while another has seen many sites but is thin on reasoning; their next steps are opposite. The wing keeps what the rounding sheds, written as depth minus breadth (b in the jargon). Positive means depth leads: a tilt toward armchair reasoning. Negative means breadth leads: a tilt toward experience-reliance and over-detection (catching too much). A wing near zero sits neatly on the main road.

The wing is not a measure of high or low ability; it is an arrow pointing at the direction of incompleteness. In a photo the brightness may be fine while the focus is off, and the wing shows which way it is off. For a positive wing, push hands-on application in another field; for a negative one, push articulation from principle. It is not the single number L that decides the next move but the pair of L and the wing. So even after L is produced, keep the raw A-hat and S-hat coordinates as the primary output rather than throwing them away.

Wing (depth − breadth)	Coordinate example	Reading	Next move
Positive (armchair)	Depth 2, breadth 0	States principle but no evidence of applying it to another field	Assign hands-on application in another field
Near 0 (main road)	Depth 2, breadth 2	Depth and breadth balanced	Go meet the backing condition of the next level
Negative (experience-reliant)	Depth 0, breadth 2	Broad exposure, thin on turning it into principle	Have them put the basis of judgment into words
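The wing table can be read as a small lookup on the sign of depth minus breadth. In the sketch below the `tolerance` parameter for "near zero" is an assumption (the text does not say how near is near), and the function name and return shape are illustrative.

```python
def wing(a_hat, s_hat, tolerance=0):
    """Return (b, reading, next_move) for the wing b = depth - breadth.
    `tolerance` is a hypothetical cutoff for what counts as near zero."""
    b = a_hat - s_hat
    if b > tolerance:
        return b, "armchair", "assign hands-on application in another field"
    if b < -tolerance:
        return b, "experience-reliant", "have them put the basis of judgment into words"
    return b, "main road", "go meet the backing condition of the next level"


# The three coordinate examples from the table.
for a_hat, s_hat in [(2, 0), (2, 2), (0, 2)]:
    print((a_hat, s_hat), wing(a_hat, s_hat))
```

Returning the signed value alongside the reading keeps the raw difference available, in line with the advice to keep the A-hat and S-hat coordinates rather than only the rounded L.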

A Worked Example — Run It on Risk Detection

The material: a graph from a trial with no difference on the primary endpoint (the most important comparison in that trial), with an arrow highlighting the crossing point. Suppose three testimonies are all heard as real past events. First, the ratee reacted to a "golden-cross-grade" overstatement (depth 1, breadth 1, backed). Second, they saw that the axis choice, arrow, and layout made a non-different result look superior — reading the intent of the presentation — and caught the same trick even in a figureless patient leaflet (depth 2, breadth 2, backed, two fields). Third is a claim — "I defined impression manipulation as a new lens and others now use it" — but no concrete event of who used it when was produced (depth 3, breadth 3, zero backing).

Now compute the grounding ceiling. On depth, backing at level 3 or above is zero, so it cannot be taken. Backing at level 2 or above is only the second testimony, summing to 1 — short of the threshold of 2. Here per-piece corroboration kicks in: if that second testimony is confirmed across two different fields, each field counts as backing 1, the level-2 sum reaches 2, and the threshold is met. So the depth ceiling is 2. On breadth, level 2 or above needs two different fields, which the second testimony supplies, so the breadth ceiling is also 2. The third has zero backing, so neither its depth nor its breadth raises any ceiling. The average is (2+2)/2 = 2, which bands to L3, with the wing at zero, on the main road. The third testimony, however fluent, contributed nothing because it had no backing. This is where "talk without backing does not raise the level" actually bites.

If even one concrete event were attached to the third ("in the April review, reviewer Tanaka used my lens to send a piece back"), things change. Depth 3 gains 1 point of backing, and if adoption by others were also confirmed, backing reaches 2 and the depth ceiling could move to 3. Same testimony, but the presence or absence of backing decides a whole step. What decides L is not the polish of the telling but the facts of who, when, and what.

The Strictness Knobs — Where to Set Them

Two knobs adjust the strictness of this rule. One is the backing threshold (tau_g), how much backing a level requires. The default of 2 demands "a repeatable behavior, not a one-off accident"; loosen it to 1 and a single event raises a level, tighten it to 3 and three points of corroboration are required. The other, covered in later parts, is the corroboration threshold (theta, default 0.5) for combining several people; it acts on "how many eyes agreed." So the backing threshold sets strictness on the depth of evidence within one person, and the corroboration threshold on the agreement of several eyes, and the two are tuned independently. Like a checkup: how far to trust one test result (depth) versus how many doctors gave the same finding (agreement), tuned separately. The table below also lists the saturation count (n*, default 3), the number of pieces at which a reading's confidence tops out.

Knob	Where it acts	Loosen (lower the value)	Tighten (raise the value)
Backing threshold (tau_g, default 2)	Depth of backing within one person	A single behavior raises a level	Demands repetition or further corroboration
Saturation count (n*, default 3)	Count where a reading's confidence tops out	Treated as "settled" on few pieces	Treated as "settled" only on many
Corroboration threshold (theta, default 0.5)	Agreement among several people (later part)	A few high ratings lift it	Demands a majority of corroboration

The knobs are not turned at will on the floor. If the values differ between raters, the same testimony yields a different L and the measurement falls apart. At the calibration session all raters share the same values, and they move them only by discussion, and only when regulations are updated or the anchors grow stale. If umpires each changed the strike zone mid-game, no call could be trusted.
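One way to picture shared, fixed knobs is an immutable settings object that every rater's computation receives. The sketch below is an illustration only: the `Knobs` class and its attribute names are assumptions, the defaults are the ones given in the table, and the sample data is made up to show how the backing threshold moves the depth ceiling.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class Knobs:
    tau_g: float = 2.0  # backing threshold
    n_star: int = 3     # saturation count
    theta: float = 0.5  # corroboration threshold (used in a later part)


def depth_ceiling(backed, knobs):
    """backed: list of (depth, backing) pairs for one ratee."""
    for level in range(3, 0, -1):
        if sum(g for depth, g in backed if depth >= level) >= knobs.tau_g:
            return level
    return 0


# One backed event at depth 1 and one at depth 2 (made-up sample data).
backed = [(1, 1), (2, 1)]
for tau_g in (1, 2, 3):
    print(tau_g, depth_ceiling(backed, Knobs(tau_g=tau_g)))
# prints "1 2", then "2 1", then "3 0"

shared = Knobs()  # the values agreed at the calibration session
try:
    shared.tau_g = 1  # a rater trying to loosen the knob on the floor
except FrozenInstanceError:
    print("knobs are fixed between calibration sessions")
```

The same two pieces of evidence yield ceilings of 2, 1, and 0 as the threshold tightens, which is why differing values between raters would make their readings incomparable.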

Measurement Design ── Map of all 10 episodes

Vol. 2: Listening Through STAR ── Situation, Task, Action, Result, Thought ── Pick just one thing that actually happened in the past and ask about it in five parts: the setting (Situation), what was assigned (Task), what the person did (Action), what came of it (Result), and why they decided as they did (Thinking). Spend more than half the time on the Action, write down what they did as verbs, check through the Result that it really happened, and draw out the root of the judgment through the Thinking.
Vol. 3: Encoding to Two Axes ── Action Reveals Scope, Thought Reveals Abstraction ── Turning one "what they actually did" story heard in an interview into three readings — how widely they moved (scope sigma), what reasoning they used (abstraction alpha), and whether it really happened (grounding g) — worked through a concrete material-review example.
Vol. 4: The Six BEI Principles ── Axioms That Keep the Measurement Clean ── What a person actually did, told through a four-point way of asking, gets converted into three rulers: depth of thinking, reach of action, and whether a real episode backs it up. The person's reading is then the highest level that the episodes actually support. This installment explains, with everyday examples, the six interview manners that keep that conversion from getting muddied.
Vol. 5: Three Bands ── The Scales of Abstraction α, Scope σ, and Grounding g ── Before any level verdict, this issue sets the three rulers for measuring the behavior we heard: how high the reasoning goes, how far the action reached, and how firmly it is backed by fact. Measured in steps, not scores.
Vol. 6 (this episode): How L Is Decided ── The Grounding Ceiling and Projection to the Diagonal ── Talk without backing does not raise the level. Take only the reach that real behavior confirms, even out the two measures, and read L.
Vol. 7: The Behaviors That Separate Levels ── Eight-Dimension Anchors and Boundaries ── Using a sample book of "what they actually did" (the anchor table), we match a person's account to the closest sample to decide the level (L1 to L4). All eight abilities are measured by the same method.
Vol. 8: Confidence and Observability ── How Far to Trust a Reading ── An episode about putting a number on how sure a rating is. Confidence C comes from how much evidence there is, whether the story holds together, and whether the rater could see it; observability o comes from being well placed and actually producing evidence; their product, weight w, feeds the final tally.
Vol. 9: Multi-Party AI Dialogue ── Corroboration for Others' Level, Divergence for Calibration ── One pair of eyes cannot measure a person. The subject and several colleagues take the same structured interview (BEI); each vote is weighted by how well that person actually saw the scene, and only readings that other votes back up are bound into an outside view of the level. The gap from the subject's self-rating is kept in a separate column as "how accurately they see themselves," not as ability.
Vol. 10 (final): From Integrated Output to the Qualifying Line ── The Record and the Operating Procedure ── The closing piece of Series 3 on measurement design. In plain terms it explains how the per-person, per-item score sheet hands each number to the right checkpoint in the pass/fail decision, and walks through the seven steps for actually running the measurement.

In closing

Deciding L comes down to two steps: per axis, take the highest level that real behavior backed up; then even those two into one reading. Throughout, the fence holds: talk without backing does not raise the level by a single step. So the fluent talker gains nothing, and the person who left behind concrete behavior, however unglamorous, is correctly placed higher.

Even rounded to one number, the wing (depth minus breadth) keeps the direction of incompleteness. L passes to the pass/fail decision as material; the wing becomes the entry point for how to develop the person. The next part moves to the confidence C that measures how far a single reading may be treated as settled, and to observability — whether that person could even see the behavior in the first place.

Key Points ── Three to take with you

The grounding ceiling binds the level. Per axis, take only the highest level whose backing from real events reaches the threshold (default 2); unbacked big talk does not count.
Read L by evening out the two scales. Average depth and breadth, cut into L1-L4, and build in the rule that one axis spiking alone cannot lift it.
The wing splits how to develop a person. The sign of depth minus breadth records armchair (positive) versus experience-reliant (negative), pointing the next move in opposite directions even at the same L.

Sources & references

McClelland, D. C. Testing for Competence Rather Than for "Intelligence". American Psychologist, 1973. (The founding idea of measuring by behavioral evidence)
Boyatzis, R. E. The Competent Manager: A Model for Effective Performance. Wiley, 1982. (The prototype of behavioral event interviewing, BEI)
Smith, P. C., & Kendall, L. M. Retranslation of Expectations: An Approach to the Construction of Unambiguous Anchors for Rating Scales. Journal of Applied Psychology, 1963. (BARS, basis for anchored band definitions)
Spencer, L. M., & Spencer, S. M. Competence at Work: Models for Superior Performance. Wiley, 1993. (Level-distinguishing indicators and competency banding)
