The Behaviors That Separate Levels ── Eight-Dimension Anchors and Boundaries

Someone says "that person is an L3." But what did they look at to decide L3? Through part six we settled that a level (the assessment step we call L) is not decided by self-report but computed from evidence of actions actually performed. This time we turn that evidence into a table: which statement corresponds to which level. What separates levels is neither title nor a smooth way with words. Only the type of "what they actually did," drawn out concretely in the interview, decides the level.

A level is a reading off a position

First, a quick check of terms. This assessment measures ability with two rulers. One is abstraction, alpha: how deep the thinking behind the action was. Did the person follow the wording, or reason from principle? The other is scope, sigma: how wide the reach was. Just the case at hand, or applied to another field too? Through part six, we encoded a person's account into bands (steps) on these two rulers and derived a position.

The level L is simply the single value you read off by placing that position on a diagonal line (the "main road," where both rulers have grown about equally). It is like a health check producing one index from height and weight. So before debating "what is an L3," we first fix the link between position and level. We call this link the generic schema: a common reference table usable for any ability. With one common form, we avoid measuring the eight abilities with eight different rulers. Why standardize? Because if the ruler differs per ability, the standard drifts from assessor to assessor.

The reference points are four spots on the diagonal. Alpha and sigma both at step 0 give L1, both at 1 give L2, both at 2 give L3, and both at 3 give L4; in general, both at step n give L(n+1). Real people sometimes fall off this diagonal (all theory and little experience, or the reverse: experience-bound but unable to state a principle). Handling that "lean" belongs to part six; here we first internalize the plain types on the diagonal. With the plain types in mind, an off-diagonal person can be read as "which ruler grew and which stalled."
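The diagonal rule is small enough to write as a function. This is a minimal sketch, assuming integer bands from step 0 to step 3; the names `diagonal_level` and `describe_lean` are hypothetical, and off-diagonal positions are deliberately not given an L here, because the projection that handles them belongs to part six.

```python
def diagonal_level(alpha: int, sigma: int) -> int:
    """Plain on-diagonal type: alpha == sigma == n gives L = n + 1."""
    if not (0 <= alpha <= 3 and 0 <= sigma <= 3):
        raise ValueError("bands run from step 0 to step 3")
    if alpha != sigma:
        # Off the diagonal: the part-six projection decides L, not this sketch.
        raise ValueError("off-diagonal position; use the part-six projection")
    return alpha + 1


def describe_lean(alpha: int, sigma: int) -> str:
    """Read a position as 'which ruler grew and which stalled' (no L assigned)."""
    if alpha == sigma:
        return f"on the diagonal: L{diagonal_level(alpha, sigma)}"
    if alpha > sigma:
        return f"abstraction ahead by {alpha - sigma}: theory outruns experience"
    return f"scope ahead by {sigma - alpha}: experience outruns stated principle"


for position in [(0, 0), (2, 2), (3, 1), (1, 2)]:
    print(position, describe_lean(*position))
```

Raising an error instead of guessing an L for (3, 1) is the design choice here: the sketch only names the lean and leaves the level reading to the part-six rule.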

The generic schema: types of action actually performed

On its own, the generic schema is still abstract, so each level is told apart by the "type of action actually performed" listed below. As a cooking analogy: you judge skill not by whether someone can read out the recipe, but by tasting the dish they actually made. The judge checks whether at least one concrete action fitting the type can be drawn out in the interview. The crux: the person goes to a level only when the verb of "what they actually did" matches its type; adjectives such as "it went well" or "she was excellent" do not count. Why? Because anyone can voice a fine impression, but actions are hard to fake.

L1 (0,0) carries out a fixed procedure

Handled just that one case by the manual and its literal wording. No reasoning from principle, no applying it elsewhere. One concrete event is enough as evidence.

L2 (1,1) bundles same-type cases

Noticed "this is that pattern I saw before" and handled several similar cases together. One concrete example plus a remark referring to the pattern is enough.

L3 (2,2) applies a principle to another field

Reasoned from principle ("the underlying intent is this") and applied it to a structurally different field. States the principle when explaining why they judged as they did, and shows the application in the actual action. Must be backed in two or more different fields.

L4 (3,3) creates a new standard others use

Created a new principle or judgment standard itself, and others are using it. Reaches problems with no precedent. Backed by the fact that "others use it" and by artifacts (standard documents, teaching materials).

L / position	How to tell (type of action actually performed)	Backing / boundary condition
L1 (0,0)	Handled just that one case per wording and procedure; no principle, no application	One concrete example suffices
L2 (1,1)	Noticed "this is that pattern" and bundled several similar cases	One concrete example, pattern mentioned
L3 (2,2)	Reasoned from principle and intent and applied it to a structurally different field	Backed in two or more different fields (g ≥ 1)
L4 (3,3)	Created a new principle or standard, and others are using it	Backed by others' adoption / artifacts (standards, materials)
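The four rows can also be read as a checklist that climbs only as far as the backing allows. Below is a minimal sketch of that reading, assuming the interview has already been summarized into a few yes/no facts and a set of field names. `Evidence`, its fields, and `generic_level` are hypothetical names, and treating the ladder as cumulative (L4 also requires the L3 conditions) is an assumption of this sketch.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Evidence:
    """Hypothetical summary of what the interview drew out."""
    concrete_events: int = 0           # events that actually happened (g >= 1 needs > 0)
    pattern_mentioned: bool = False    # "this is that pattern I saw before"
    principle_stated: bool = False     # "the underlying intent is this"
    fields_applied: set[str] = field(default_factory=set)  # structurally different fields
    created_standard: bool = False     # made a new principle or judgment standard
    adopted_by_others: bool = False    # others are actually using it
    artifacts: list[str] = field(default_factory=list)     # standards, teaching materials


def generic_level(ev: Evidence) -> int | None:
    """Highest generic-schema level whose action type and backing are both met."""
    if ev.concrete_events < 1:
        return None  # nothing actually happened: no level can be read
    if ev.principle_stated and len(ev.fields_applied) >= 2:
        # L3 backing holds; L4 additionally needs outside adoption and an artifact.
        if ev.created_standard and ev.adopted_by_others and ev.artifacts:
            return 4
        return 3
    if ev.pattern_mentioned:
        return 2
    return 1


promo_and_leaflet = Evidence(concrete_events=2, principle_stated=True,
                             fields_applied={"promotional piece", "patient leaflet"})
print(generic_level(promo_and_leaflet))  # 3
```

Checking from the top down means a stated principle in only one field falls back to the lower rows, which mirrors the rule that a person goes to a level only when its backing condition is met.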

The anchor table: fixing one ability by "sample statements"

What turns the schema into a tool you can decide with, without hesitation, is a per-ability anchor table (a book of sample statements). An anchor is a reference point of the form "a statement like this means this level." Take "risk detection" as the example. The scene: in a drug trial, the primary endpoint showed no difference, yet a promotional piece uses an arrow to highlight the point where the curves on that graph cross. Seeing the same material, how does a reviewer at each level react? For each level we fix the type of statement and its matching numbers (alpha, sigma, g) as samples.

The use is simple, much like several umpires judging the same play. Match the person's testimony to the closest sample statement in the left column of the table (nearest-neighbor matching: fitting to the closest sample) and take the numbers on the right as-is. The boundary is set by which sample the testimony is closest to, not by the assessor's mood. Why? Because deciding by a feeling of "probably L3-ish" makes the answer change from person to person.

Concretely: L1 looks only at the wording and judges "it doesn't say superior, so no problem." L2 reacts to a familiar emphasis trick, such as a "golden-cross-grade superlative." L3 sees through the presentation trick ("the axis, arrow, and layout make a non-difference look superior") and catches the same trick even in a patient leaflet with no figures, reaching two different fields. L4 defines a new lens, "even objective material can be slanted by presentation," and other reviewers now use that lens.

L / position	Sample statement (type of testimony heard in interview)	Numbers (alpha, sigma, g)
L1 (0,0)	Judged from wording alone: "it doesn't say superior, so no problem"	alpha=0, sigma=0, g=1
L2 (1,1)	Reacted to a familiar emphasis trick such as a "golden-cross-grade superlative"	alpha=1, sigma=1, g=1
L3 (2,2)	Saw through the axis/arrow/layout trick; caught the same trick in a figure-free leaflet	alpha=2, sigma=2, g=1 (2 fields)
L4 (3,3)	Defined "slanting by presentation" as a new lens; other reviewers use it	alpha=3, sigma=3, g=2 (others adopt)
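Nearest-neighbor matching can be sketched in a few lines if we assume the assessor first codes the testimony into a handful of observed features. The feature names and the cumulative coding of the four risk-detection anchors below are hypothetical, invented for illustration. Closeness is the number of features that differ (a Hamming distance), and ties go to the lower anchor, an assumption in the spirit of "the upper step only when backed."

```python
from __future__ import annotations

FEATURES = ("checks_wording", "known_trick", "presentation_trick",
            "second_field", "new_lens", "others_use")

# Hypothetical coding of the risk-detection anchors. Each higher anchor keeps
# the lower anchor's features. Numbers are (alpha, sigma, g) from the table.
ANCHORS: dict[str, tuple[frozenset[str], tuple[int, int, int]]] = {
    "L1": (frozenset({"checks_wording"}), (0, 0, 1)),
    "L2": (frozenset({"checks_wording", "known_trick"}), (1, 1, 1)),
    "L3": (frozenset({"checks_wording", "known_trick", "presentation_trick",
                      "second_field"}), (2, 2, 1)),
    "L4": (frozenset(FEATURES), (3, 3, 2)),
}


def nearest_anchor(observed: set[str]) -> tuple[str, tuple[int, int, int]]:
    """Return the closest anchor and its numbers, taken over as-is."""
    unknown = set(observed) - set(FEATURES)
    if unknown:
        raise ValueError(f"unknown features: {sorted(unknown)}")
    # Distance = size of the symmetric difference between feature sets.
    # min() keeps the first of equal candidates, and ANCHORS runs L1..L4,
    # so a tie resolves to the lower level.
    best = min(ANCHORS, key=lambda name: len(ANCHORS[name][0] ^ observed))
    return best, ANCHORS[best][1]


# Saw the arrow trick in the promotional piece only: L2 and L3 tie, so L2.
print(nearest_anchor({"checks_wording", "known_trick", "presentation_trick"}))
# Also caught the same trick in the figure-free leaflet: closest is L3.
print(nearest_anchor({"checks_wording", "known_trick", "presentation_trick",
                      "second_field"}))
```

With this coding, a reviewer who conceived the new lens but has no other reviewers using it ties between L3 and L4 and resolves to L3, which agrees with the L3/L4 boundary described in this part.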

How to tell apart all eight abilities

The method seen for risk detection is prepared the same way for the other seven abilities. For each ability, the minimum action that must be heard to place someone at a level is fixed in one line. The premise throughout is that the action is backed by a concrete past event (g ≥ 1, meaning "an event that actually happened"). The table below lists the distinguishing points for the eight abilities across the four levels. The assessor keeps it at hand and fits the testimony into the closest cell. Why a list? Because with samples in hand, judgment runs on the same standard instead of the mood of the moment.

Ability	L1	L2	L3	L4
01 Knowledge	Notices a requirement only after opening the rule	States key requirements without checking the document	Connects to a separate issue via intent	Criteria/materials they drafted are in use
02 Intelligence	Classifies per label and stops	Notices by recalling a similar case	Drops the label and judges reality from principle	Others use the judgment principle they made
03 Risk detection	Flags only clear wording violations	Reacts to a familiar emphasis trick	Caught the presentation trick or omission in another field too	Defined a new risk type and made it stick
04 Sixth sense	No instance of sensing "something is off"	Sensed "something is off" in a familiar area	Could explain that unease with logic afterward	Taught others where the unease lives and grew the instinct
05 Communication	Just handed over the article as-is	Supplied the intent so it got through	Translated into the other's situation so it landed	Designed terms/standards by which anyone reaches the same judgment
06 Driving behavior change	Instructed each time; it recurred	That case was fixed with understanding	The other party raised first-draft quality on their own	Made it stick as the culture of several teams
07 Relationship building	Seen as an enemy; consultations stopped coming	Consultation comes at the needed stage	Consulted "first" precisely for their independence	Built early cross-unit consultation into a system
08 Density of trust	Judgment is pushed back and overturned	Passes within the owned area	Respected across units; no objection stands	Judgment becomes the company-wide standard/precedent

Where to draw the boundaries

Even with the table done, what causes hesitation in the field is the line between adjacent levels. L2 versus L3, and L3 versus L4, are the hard ones. As with focusing a camera, clearly near and clearly far are easy, but the middle is hard to tell. So we fix these in words.

The L2/L3 line is "did they state a principle and then apply it to a structurally different field?" L2 stops at reacting to a familiar same type. No matter how many same-type cases are piled up, as long as they are repetitions, scope sigma is capped at step 1 (the backing formula is built so that stacking same-type cases does not count as a step-up). Why seal it off? To avoid mistaking "experience-bound, just did the volume" for real ability. Rising to L3 happens only when an action that states the principle or intent and applies it to a structurally different field is backed in two or more different fields. "Noticed" is not enough; "caught the same trick in another field" is the condition.
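The cap on repetition can be illustrated with a simplified scope reading. This is not the part-six backing formula, which is not reproduced here; it is a hypothetical stand-in that assumes each episode is tagged with a field label, whether it actually happened, and whether a principle was stated. Repeats inside one field never lift the reading past step 1, and step 2 needs principled, grounded episodes in two different fields.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    field: str              # hypothetical label for the structural field of the event
    grounded: bool          # g >= 1: the event actually happened
    principle_stated: bool  # the principle or intent behind the action was stated


def scope_step(episodes: list[Episode]) -> int:
    """Illustrative sigma reading (steps 0-2) in which repetition does not count."""
    real = [e for e in episodes if e.grounded]  # unbacked talk is dropped first
    principled_fields = {e.field for e in real if e.principle_stated}
    if len(principled_fields) >= 2:
        return 2  # principle applied in two or more different fields
    if len(real) >= 2:
        return 1  # several grounded cases, no principle carried across fields: capped
    return 0      # a single grounded case, or none at all


same_type = [Episode("promotional piece", True, True)] * 10
two_fields = [Episode("promotional piece", True, True),
              Episode("patient leaflet", True, True)]
print(scope_step(same_type), scope_step(two_fields))  # 1 2
```

Ten cases in one field read the same as two, because the reading counts distinct principled fields rather than episodes.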

The L3/L4 line is "are others using the judgment standard the person made?" L3 is the stage where the person can judge reality from principle. L4 is the stage where that judgment framework has left the person's hands and settled as the standard lens, materials, or criteria of other reviewers. The backing here is not the person's narrative but the external facts of "others adopted it" and "it remains as an artifact." "Conceived a new lens" alone does not reach L4. A concrete fact of "who uses what, and when" is required. Why insist on external facts? Because self-assessment alone can be inflated without limit.
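The "who uses what, and when" requirement can be expressed as a record that must be complete and must come from outside the person's own account. A minimal sketch under stated assumptions: `AdoptionFact`, its `source` field, and the convention that a source of "self-report" does not count are hypothetical, not part of the original method.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class AdoptionFact:
    user: str           # who uses it
    artifact: str       # what: the standard, lens, or material
    since: date | None  # when it came into use
    source: str         # where the fact comes from, e.g. "self-report" or a review log


def crosses_l3_l4_line(subject: str, facts: list[AdoptionFact]) -> bool:
    """True only if one complete, external 'who uses what, and when' fact exists."""
    for f in facts:
        external = f.source != "self-report" and f.user not in ("", subject)
        complete = bool(f.artifact) and f.since is not None
        if external and complete:
            return True
    return False


claimed = AdoptionFact("subject", "presentation-slant lens", None, "self-report")
observed = AdoptionFact("reviewer B", "presentation-slant lens",
                        date(2024, 4, 1), "review log")  # hypothetical record
print(crosses_l3_l4_line("subject", [claimed]),
      crosses_l3_l4_line("subject", [claimed, observed]))  # False True
```

The person's own claim never satisfies the check by itself; only a record naming another user, the artifact, and a date does.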

At either line, the person goes to the upper step only when backed by a concrete past event rather than a claim. This is the aim of nearest-neighbor matching: fit the testimony to the closest sample statement and take that sample's numbers. The assessor's feeling that "this is probably L3" is not entered into the judgment, because a hunch is not evidence.

Measurement Design ── Map of all 10 episodes

Vol. 2: Listening Through STAR ── Situation, Task, Action, Result, Thought ── Pick just one thing that actually happened in the past and ask about it in five parts: the setting (Situation), what was assigned (Task), what the person did (Action), what came of it (Result), and why they decided as they did (Thought). Spend more than half the time on the Action, write down what they did as verbs, check through the Result that it really happened, and draw out the root of the judgment through the Thought.
Vol. 3: Encoding to Two Axes ── Action Reveals Scope, Thought Reveals Abstraction ── Turns one "what they actually did" story heard in an interview into three readings, worked through a concrete material-review example: how widely they moved (scope sigma), what reasoning they used (abstraction alpha), and whether it really happened (grounding g).
Vol. 4: The Six BEI Principles ── Axioms That Keep the Measurement Clean ── What a person actually did, told through a four-point way of asking, gets converted into three rulers: depth of thinking, reach of action, and whether a real episode backs it up. The person's reading is then the highest level that the episodes actually support. This installment explains, with everyday examples, the six interview manners that keep that conversion from getting muddied.
Vol. 5: Three Bands ── The Scales of Abstraction α, Scope σ, and Grounding g ── Before any level verdict, this issue sets the three rulers for measuring the behavior we heard: how high the reasoning goes, how far the action reached, and how firmly it is backed by fact. Measured in steps, not scores.
Vol. 6: How L Is Decided ── The Grounding Ceiling and Projection to the Diagonal ── Talk without backing does not raise the level. Take only the reach that real behavior confirms, even out the two measures, and read L.
Vol. 7 (this episode): The Behaviors That Separate Levels ── Eight-Dimension Anchors and Boundaries ── Using a sample book of "what they actually did" (the anchor table), we match a person's account to the closest sample to decide the level (L1 to L4). All eight abilities are measured by the same method.
Vol. 8: Confidence and Observability ── How Far to Trust a Reading ── An episode about putting a number on how sure a rating is. Confidence C comes from how much evidence there is, whether the story holds together, and whether the rater could see it; observability o comes from being well placed and actually producing evidence; their product, weight w, feeds the final tally.
Vol. 9: Multi-Party AI Dialogue ── Corroboration for Others' Level, Divergence for Calibration ── One pair of eyes cannot measure a person. The subject and several colleagues take the same structured interview (BEI); each vote is weighted by how well that person actually saw the scene, and only readings that other votes back up are bound into an outside view of the level. The gap from the subject's self-rating is kept in a separate column as "how accurately they see themselves," not as ability.
Vol. 10 (final): From Integrated Output to the Qualifying Line ── The Record and the Operating Procedure ── The closing piece of Series 3 on measurement design. In plain terms it explains how the per-person, per-item score sheet hands each number to the right checkpoint in the pass/fail decision, and walks through the seven steps for actually running the measurement.

In closing

What separates levels is neither title, nor self-report, nor the assessor's feeling. It is whether the concrete action heard in the interview fits a type in the generic schema (the common reference table) and, with an event that actually happened, meets the boundary condition against the neighboring level (application to another field, adoption by others). The eight-ability anchor table turns that judgment into a single question: which sample statement is it closest to?

Next time we connect this single-person reading from sample-matching to a weighted integration of the person and several third parties. A level seen by one pair of eyes is still only midway through the measurement.

Key Points ── Three to take with you

One common reference table measures eight abilities the same way. The plain diagonal types, where abstraction alpha and scope sigma both at step n give level n+1, serve as the reference, and each level is told apart by the "type of action actually performed."
The anchor table (book of sample statements) lets you decide without hesitation. Just fit the person's testimony to the closest sample statement and take the numbers on the right (alpha, sigma, g); the boundary is not left to the assessor's mood.
The hard boundaries are fixed by "events that actually happened." L2/L3 checks "stated a principle and applied it to another field (two different fields)," and L3/L4 checks "others are using the standard the person made (adoption by others)," each via concrete past facts.

Sources & references

McClelland, D. C. Testing for Competence Rather Than for Intelligence. American Psychologist, 1973. Origin of measuring demonstrated behavior rather than intelligence.
Boyatzis, R. E. The Competent Manager: A Model for Effective Performance. Wiley, 1982. Framework for identifying competencies from behavioral evidence via the behavioral event interview (BEI).
Smith, P. C., & Kendall, L. M. Retranslation of Expectations: An Approach to the Construction of Unambiguous Anchors for Rating Scales. Journal of Applied Psychology, 1963. The source of behaviorally anchored rating scales (BARS), the lineage of this part's anchor table.
Spencer, L. M., & Spencer, S. M. Competence at Work: Models for Superior Performance. Wiley, 1993. Systematization of level scaling and behavioral indicators.
