Confidence and Observability ── How Far to Trust a Reading

Even once the L number has been computed, the job is not done. Two L3 readings can differ wildly: one is backed by five concrete past events told without contradiction, while another rests on a single event and only the person's own word. Unless that gap is expressed as a number, a thin reading and a thick one are combined at equal weight. It is like a health check: you would not treat a blood pressure measured many times the same as one taken once. Episode 8 defines confidence C, which says how sure a reading is, and observability o, which says whether the rater could see that item at all, then passes their product into the tally as the weight.

Readings you may finalize, and readings to hold

In Episodes 6 and 7, each rater read the highest level a person actually showed (we call this level L). But readings differ in thickness. Some are backed by many concrete examples; others are shaky. The source (Part 3.2) names this certainty confidence C. When C is low, the L number is not finalized even if it has been computed. It is held, and you either add evidence or have someone else look.

An analogy: judging a cooking contest by tasting one dish and declaring someone a master is risky. Only after several dishes, all consistently good, can you say it with confidence. Why separate final from held? Because mixing a thinly supported verdict into the final result drags the whole thing toward it. Three conditions let you finalize: enough evidence, a story that holds together, and a rater who was placed to see it. Miss any one and C falls, leaving the reading provisional. Let us take them in turn.

The three questions behind confidence C

A formal-looking formula appears below, but it comes down to three easy questions. C scores three things, each from 0 to 1, and adds them up with a weight on each: (1) is there enough evidence, (2) is the account free of contradictions, (3) was the rater placed to see it. Skip the formula; remember the three questions and you have it.

First, is there enough evidence? Call this saturation. Count the concrete events (supported pieces of evidence); three earns full marks, one earns a third. Why three? A single event might be luck. Only something repeated counts as real ability. Second, is the account free of contradictions? Call this coherence. Do the same person's experience, reasoning, and values fit together? If the "why I judged so" from a hard case clashes with another story, the score drops. Third, was the rater placed to see it? Call this observability, covered next.
(As a formula: C = weight x saturation + weight x coherence + weight x observability, where saturation is "count divided by 3, capped at 1.")
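
The three questions can be sketched in Python. The weights below (0.4, 0.3, 0.3) are placeholders chosen for illustration only, since the text does not state the actual weights, and the function and parameter names are likewise hypothetical. Only the saturation rule (count divided by 3, capped at 1) comes from the formula above.

```python
def saturation(event_count: int) -> float:
    """Evidence count divided by 3, capped at 1 (three events earn full marks)."""
    if event_count < 0:
        raise ValueError("event_count must be non-negative")
    return min(event_count / 3, 1.0)


def confidence(event_count: int, coherence: float, observability: float,
               weights: tuple[float, float, float] = (0.4, 0.3, 0.3)) -> float:
    """Weighted sum of saturation, coherence, and observability, each in [0, 1].

    The default weights are an assumption for illustration; they sum to 1
    so that C also stays between 0 and 1.
    """
    for name, value in (("coherence", coherence), ("observability", observability)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    w_sat, w_coh, w_obs = weights
    return w_sat * saturation(event_count) + w_coh * coherence + w_obs * observability


# A thick reading (five events) versus a thin one (one event),
# with the same coherence and observability in both cases.
thick = confidence(event_count=5, coherence=0.9, observability=0.8)
thin = confidence(event_count=1, coherence=0.9, observability=0.8)
print(round(thick, 2), round(thin, 2))
```

With identical coherence and observability, the only difference between the two readings is the number of events, and that alone pulls C down for the thin one.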

Observability o — no weight without sight

Observability o measures whether the rater could actually see that item. The source (Part 4.1) sets it as a product of two things: being well placed to see (positional fitness), and actually producing evidence. The product matters. Even a well-placed rater who cannot produce a single concrete piece of evidence drives the second number near zero, so the product o falls too.

Why a product? To refuse weight based on a title alone. In soccer, a referee in a good spot still cannot blow the whistle if they did not actually see the foul. The same holds here. A manager, for instance, is well placed to see how much trust a direct report carries (trust density). But if they cannot recount a single concrete scene where the person's judgment was pushed back or upheld, their o on that item is low. Conversely, a field owner who received a flag holds living evidence, namely how they revised their own first draft, on the narrow items of communication and getting people to act on their own. So o is high there.
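
A minimal sketch of the product, assuming both factors are expressed on a 0 to 1 scale. The function name, parameter names, and the example figures are hypothetical; the text only says that o is positional fitness multiplied by actual evidence production.

```python
def observability(positional_fitness: float, evidence_production: float) -> float:
    """Product of 'well placed to see' and 'actually produced evidence', each in [0, 1]."""
    for name, value in (("positional_fitness", positional_fitness),
                        ("evidence_production", evidence_production)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return positional_fitness * evidence_production


# Hypothetical figures: a well-placed manager who recounts no concrete scene,
# and a field owner who can show how they revised their own first draft.
manager_o = observability(positional_fitness=0.8, evidence_production=0.1)
field_owner_o = observability(positional_fitness=0.8, evidence_production=1.0)
print(round(manager_o, 2), round(field_owner_o, 2))
```

Both raters start from the same seat here; only the evidence factor separates them, which is the point of using a product rather than the position alone.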

Who sees what, by standpoint

Who can see which item is set by who was actually present, not by title. The source's staffing table gives a starting point for visibility. As with tasting food, whoever was in the kitchen sees the cook's skill best.

Rater standpoint	Items easy to see	Visibility o (guide)
Self	All items (self-report, subjective)	n/a (excluded from the outside view; used only for the self-gap)
Co-review peer	Seeing power (knowledge, noticing, sixth sense, intelligence)	0.8–0.9 (high)
Line manager	Moving power (communication, relations), trust density	0.6–0.8 (mid–high)
Field owner who received a flag	Communication, getting people to act on their own	0.7–0.9 (high, that item)
Cross-function stakeholder	Relationship building, trust density (across units)	0.5–0.7 (mid)

These figures are only a starting point. The real o appears only after multiplying by whether the rater actually produced evidence. Even for a rater in a high row, the real o falls below the tabled number when no evidence comes out. Sitting in a good seat does not prove you watched.

Weight w = visibility x certainty — how strong a voice is

Finally, multiply observability o (could they see it) by confidence C (is it certain). This is the weight w: how strong a voice is when opinions are combined. The more a rater could see (high o) and the thicker and more consistent their evidence (high C), the more strongly they bear on that item's verdict. Thin evidence, or an unsighted standpoint, is made light automatically. Why a product? Because if either side is weak, we want the whole to be weak: seeing without evidence means nothing, and evidence without sight cannot be trusted. In the next episode, corroboration sets the outside view (the other-level) at the level supported by a majority of this weight.
(Formula: w = o x C.)
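
The weight can be sketched as a property of each reading. The `Reading` class, its field names, and the three sample raters are hypothetical; only w = o x C is taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    rater: str
    level: int            # the L number this rater read, e.g. 3 for L3
    observability: float  # o, between 0 and 1
    confidence: float     # C, between 0 and 1

    @property
    def weight(self) -> float:
        """w = o x C: weakness on either side makes the whole weak."""
        return self.observability * self.confidence


# Illustrative values only.
readings = [
    Reading("peer", level=3, observability=0.9, confidence=0.9),         # saw it, thick evidence
    Reading("manager", level=3, observability=0.8, confidence=0.3),      # saw it, thin evidence
    Reading("stakeholder", level=4, observability=0.2, confidence=0.9),  # barely placed to see
]
for r in readings:
    print(f"{r.rater}: L{r.level} carries weight {r.weight:.2f}")
```

Note that the weight never changes the level a rater read. It only decides how loudly that reading speaks when the readings are combined.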

How to staff the raters

Once o and w are numbers, you can judge a panel before measuring. The source's rule is simple: each item must be covered by two or more people with visibility o of 0.7 or higher. Why two? If only one rater could see the item well and no second pair of eyes backs them up, the item was never really measured. Only with several judges can one person's oversight or mistake be caught. The table distinguishes a sound panel from a broken one.

Aspect	Sound panel	Broken panel
People at o >= 0.7 per item	Two or more (corroboration possible)	One or fewer (no corroboration)
How people are chosen	By whether they saw the behavior	By diversity of title or attribute
Can they produce evidence	Each brings concrete evidence	Position named, evidence absent
Handling of unsighted readings	Weight w shrinks, made light automatically	Thin readings enter the average
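
The staffing rule lends itself to a check that runs before any interview takes place. The sketch below assumes the panel is held as a mapping from rater to per-item o values; that data shape, the function name, and the sample panel are all hypothetical. The threshold of 0.7 and the minimum of two raters come from the rule above.

```python
from collections import defaultdict

MIN_OBSERVABILITY = 0.7  # threshold from the staffing rule
MIN_RATERS = 2           # two or more people per item


def under_covered_items(panel: dict[str, dict[str, float]],
                        items: list[str]) -> dict[str, list[str]]:
    """Return items with fewer than MIN_RATERS qualified raters, with who does qualify.

    `panel` maps rater name -> {item -> observability o}; a missing item counts as unseen.
    """
    qualified = defaultdict(list)
    for rater, o_by_item in panel.items():
        for item, o in o_by_item.items():
            if o >= MIN_OBSERVABILITY:
                qualified[item].append(rater)
    return {item: qualified[item] for item in items
            if len(qualified[item]) < MIN_RATERS}


# Hypothetical panel; item names and o values are illustrative only.
panel = {
    "peer_a": {"knowledge": 0.9, "communication": 0.4},
    "peer_b": {"knowledge": 0.8},
    "manager": {"communication": 0.7, "trust_density": 0.8},
}
items = ["knowledge", "communication", "trust_density"]
print(under_covered_items(panel, items))
# prints {'communication': ['manager'], 'trust_density': ['manager']}
```

In this example the panel can corroborate knowledge but not communication or trust density, so it would need another rater who actually saw those behaviors.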

What to do when C is low

Do not force a low-C reading to final. The source is plain: too few events, a contradictory account, or no standpoint to see it; any of these lowers C and leaves the reading provisional. The fix depends on the symptom. Too little evidence (low saturation): draw a few more events from the same person. A contradictory account (low coherence): check the clash with the person and use concrete facts to settle which was the real behavior. No one who can see it (low observability): add someone who can see that item to the panel. All three share one point: each thickens the evidence rather than raising the score. Evidence sets the score. Raising it without evidence is like rewriting a health-check number instead of measuring again, and it is not allowed.
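
The symptom-to-fix mapping can be written as a small lookup. The cut-off of 0.5 is an assumption for illustration, since the text gives no numeric threshold for "low", and the names used are hypothetical.

```python
LOW = 0.5  # hypothetical cut-off; the text does not give a numeric threshold

REMEDIES = {
    "saturation": "draw a few more events from the same person",
    "coherence": "check the clash with the person and settle it with concrete facts",
    "observability": "add someone who can see that item to the panel",
}


def next_steps(saturation: float, coherence: float, observability: float) -> list[str]:
    """List the remedy for each component of C that falls below the cut-off.

    An empty list means no component is flagged. Every remedy gathers more
    evidence; none of them edits the score directly.
    """
    components = {
        "saturation": saturation,
        "coherence": coherence,
        "observability": observability,
    }
    return [REMEDIES[name] for name, value in components.items() if value < LOW]


# One event only (saturation 1/3), consistent account, well-placed rater.
print(next_steps(saturation=1 / 3, coherence=0.9, observability=0.8))
# prints ['draw a few more events from the same person']
```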

Measurement Design ── Map of all 10 episodes

Vol. 2: Listening Through STAR ── Situation, Task, Action, Result, Thought ── Pick just one thing that actually happened in the past and ask about it in five parts: the setting (Situation), what was assigned (Task), what the person did (Action), what came of it (Result), and why they decided as they did (Thought). Spend more than half the time on the Action, write down what they did as verbs, check through the Result that it really happened, and draw out the root of the judgment through the Thought.
Vol. 3: Encoding to Two Axes ── Action Reveals Scope, Thought Reveals Abstraction ── Turning one "what they actually did" story heard in an interview into three readings — how widely they moved (scope sigma), what reasoning they used (abstraction alpha), and whether it really happened (grounding g) — worked through a concrete material-review example.
Vol. 4: The Six BEI Principles ── Axioms That Keep the Measurement Clean ── What a person actually did, told through a four-point way of asking, gets converted into three rulers: depth of thinking, reach of action, and whether a real episode backs it up. The person's reading is then the highest level that the episodes actually support. This installment explains, with everyday examples, the six interview manners that keep that conversion from getting muddied.
Vol. 5: Three Bands ── The Scales of Abstraction α, Scope σ, and Grounding g ── Before any level verdict, this issue sets the three rulers for measuring the behavior we heard: how high the reasoning goes, how far the action reached, and how firmly it is backed by fact. Measured in steps, not scores.
Vol. 6: How L Is Decided ── The Grounding Ceiling and Projection to the Diagonal ── Talk without backing does not raise the level. Take only the reach that real behavior confirms, even out the two measures, and read L.
Vol. 7: The Behaviors That Separate Levels ── Eight-Dimension Anchors and Boundaries ── Using a sample book of "what they actually did" (the anchor table), we match a person's account to the closest sample to decide the level (L1 to L4). All eight abilities are measured by the same method.
Vol. 8 (this episode): Confidence and Observability ── How Far to Trust a Reading ── An episode about putting a number on how sure a rating is. Confidence C comes from how much evidence there is, whether the story holds together, and whether the rater could see it; observability o comes from being well placed and actually producing evidence; their product, weight w, feeds the final tally.
Vol. 9: Multi-Party AI Dialogue ── Corroboration for Others' Level, Divergence for Calibration ── One pair of eyes cannot measure a person. The subject and several colleagues take the same structured interview (BEI); each vote is weighted by how well that person actually saw the scene, and only readings that other votes back up are bound into an outside view of the level. The gap from the subject's self-rating is kept in a separate column as "how accurately they see themselves," not as ability.
Vol. 10 (final): From Integrated Output to the Qualifying Line ── The Record and the Operating Procedure ── The closing piece of Series 3 on measurement design. In plain terms it explains how the per-person, per-item score sheet hands each number to the right checkpoint in the pass/fail decision, and walks through the seven steps for actually running the measurement.

In closing

Confidence C and observability o do not touch the L number. L stays the highest level the evidence actually shows; C and o are a separate measure deciding how much weight a reading may carry. Their product, w = visibility x certainty, becomes how strong a voice is in next episode's tally (third-party integration).

If, when staffing, each item is covered by two or more people at visibility 0.7 or higher, the formula makes thin evidence and unsighted readings light on its own. Hold certainty as a number. That is what stops one high rating from moving the whole result.

Key Points ── Three to take with you

Confidence C is three questions — (1) enough evidence (saturation, full at three), (2) no contradictions (coherence), (3) placed to see it (observability). If low, the reading stays provisional, not final.
Observability o is a product — well placed to see times actually producing evidence. Even a high standpoint loses o when no evidence comes out; a title alone earns no weight.
Staffing: two or more at o >= 0.7 per item. Each reading then enters the tally with weight w = visibility x certainty. Choose people by whether they saw the behavior, not by attribute diversity.

