Over nine episodes we traced the path: take the concrete behavior heard in the interview, translate it into three yardsticks (depth of thinking, width of view, and grounding in fact), read the highest rung the person actually reached, and bundle the readings of the person and several others, weighted. This final episode covers what pulls all of it onto a single sheet — the score sheet — and the procedure that runs it in the field. The sheet is not decoration. It is a hand-off slip that decides which number goes to which checkpoint in the pass/fail decision, the seam that places measurement and pass/fail on the same evidence.
The Score Sheet Is a Hand-Off Slip
For each item, the person's measurement result is pulled onto one page. We call it the score sheet. You build one for each of the eight items, bundle them into a single profile, and pass it to the pass/fail decision. A health check-up makes this easy to picture: blood pressure, blood sugar, and so on are lined up item by item, and at the end "needs re-test" or "no abnormality" is decided — that one sheet. What matters is that this page carries not only the conclusion of measurement but also how far that conclusion may be trusted. Why? Because without knowing how trustworthy a reading is, you cannot tell whether a low value is "genuinely low" or "just measured badly."
Here are the values on the sheet. At the center is the coordinate of two yardsticks (depth of thinking and width of view). Added to it: the "capability as seen by others," which bundles several third-party readings; the person's own self-rating and the gap between them; the lean toward strengths or weaknesses and its direction; the certainty of the reading (a confidence level); how little the raters disagree (an agreement level); a mark for finalize-or-hold; and the record of the testimony that grounds it. One promise is kept here: never mix "capability" and "the gap in self-perception" in the same field. Capability is capability, the gap is the gap, each in its own column. Mix them, and a confident person looks more capable than they are.
Which Checkpoint Each Value Goes To
The pass/fail decision (covered in Series 2) was a row of checkpoints. One excellent checkpoint cannot make up for another that falls short. Picture airport security: a flawless baggage scan does not let you through if you fail the boarding-pass check. Fail any one and you stop. Each value on the score sheet is built to match one checkpoint exactly. The table below is the whole hand-off at a glance.
| Score-sheet value | Plain meaning | Destination checkpoint | What it does there |
|---|---|---|---|
| Capability seen by others (L_other) | A measure of the person's capability, bundled from several third-party readings | Required-level floor | Judges, as a floor, whether each item reaches the required height. One item's shortfall cannot be filled by another. |
| Gap (Delta = self-rating − others' rating) | How accurate self-perception is — seeing oneself too high or too low | Self-perception check | Not added to or subtracted from capability. Only read separately as over- or under-claiming. |
| Lean (b) and its direction | The tilt toward strengths or weaknesses — theory-first or experience-bound | Direction-of-lean check | Passes whether the person is head-heavy (b>0) or experience-bound (b<0) — which way there is still room to grow. |
| Confidence (C) and agreement (Ag) | The certainty of the reading and how little the raters disagree | Measurement-holds check | Confirms whether the measurement itself stands. If low, judgment is held. |
| Coordinate (depth, width) | The raw values of the two yardsticks | Development prescription / record | Shows which yardstick is low and opens the entry to what to grow next. |
When this mapping holds, the way capability is decided (this series) and the way pass/fail follows from it (the decision) no longer diverge — both rest on the same evidence of behavior actually performed. Why does this matter? Because if the yardstick that measures the reviewer and the yardstick that turns the result into pass/fail were different objects, the evaluation could no longer explain why it came out as it did. The score sheet is the clasp that prevents that split.
Following One Item's Calculation From Start to Finish
The source gives a single-stroke procedure showing how one item's score sheet is assembled. Like a recipe, step by step the same ingredients become one dish. Traced in words, it runs like this.
- Score each person separately. The person and each third party take the same interview-style dialogue. From the behavior they describe, gather evidence and read the person's rung (the method seen in episodes 3 through 7).
- Bundle third parties by "corroboration." Each third party gets a weight of "how directly they saw it × how certain the reading is." The highest rung that a majority of that weight places "at or above" becomes the "capability seen by others." A lone high rating does not lift it if no other testimony backs it up. This backing-up is corroboration — confirming a claim through several testimonies.
- Compute the gap. Subtract others' rating from the self-rating and write it in a separate column. Do not touch the capability value at all.
- Measure disagreement. Compare third parties two at a time; the fraction of pairs whose rungs differ by one or less is the agreement level. A one-rung difference is allowed.
- Check that the measurement holds. If the total weight is too small, or agreement is too low, or fewer than two third parties saw it well (people in a position to observe closely), hold the judgment and send it to a meeting where raters align their eyes.
- Finalize the score sheet. If the conditions are met, tie the coordinate, others' rating, self-rating, gap, lean, confidence, and agreement into one page and return it.
One thing is worth remembering. The cut-off of "a majority" (the corroboration line, 0.5 in the source) amounts to taking a vote right at the middle. Raise this line and you demand stronger backing before granting a high rung — a cautious operation; lower it and it loosens. It is like deciding, in a sport judged by several referees, how many must agree for a hit to count. This line is a design choice about how much risk the organization will accept, not the mood of an evaluator on a given day.
The Operating Procedure ── From Preparation to Re-Measurement
Even with correct formulas, the readings get dirty if the field procedure collapses. The source gives seven plain steps, and the order itself carries meaning. Skip the first — "aligning eyes" — and every precise calculation downstream becomes a building on sand, because you would be measuring while everyone's ruler markings differ.
| Step | What to do | Line to keep |
|---|---|---|
| 1 Align eyes | Before measuring, all raters compare real material and real-behavior examples for each rung | Only when these match does the yardstick become a shared standard across nationality and department |
| 2 Choose raters | Pick person + third parties so each item has at least two who "saw it well" | Prefer "did they actually see the behavior" over variety of position |
| 3 Run the interview dialogue | Each person takes the dialogue individually; collect both a success case and a hard case | Spend 50–60% of the time digging into "what did you do" |
| 4 Record in their own words | Keep verbs, not adjectives; tie the three yardsticks' codes to this record | Not "was excellent" but "proposed X to Y and carried out Z" |
| 5 Compute and finalize | Produce each item's score sheet with the earlier formulas | Items where the measurement does not hold are not finalized; re-measure |
| 6 Pass/fail via the decision | Run the finalized sheet through the pass/fail decision | A low value is not failure but an entry to what to grow (no punishment) |
| 7 Re-measure periodically | Re-measure as regulations update or standards go stale | Competency rises and falls over time; do not freeze it at one measurement |
The fourth step, "record in their own words," is unglamorous but central. The three yardsticks' codes become checkable later only when tied to this verb record. When someone later asks "why is this item rung 3," the place to return is not the impression "was excellent" but the raw testimony of who, when, and what. Like a photo's focus, returning to a blurred impression lets you confirm nothing. That is why the testimony record is part of the score sheet.
The No-Punishment Principle ── How to Read a Low Value
Finally, a reminder of the "no punishment" named in step six. An item where "capability seen by others" is low is not proof the person has failed. The direction of the lean tells whether they are head-heavy or experience-bound, and the missing floor shows which behavior to add. The purpose of measurement is not to judge but to name which concrete behavior to build next. When a value comes back bad at a health check-up, the doctor does not say "you fail" but "cut the salt and exercise" — pointing to the next move. The same here. A person who rates themselves low (a negative gap) is not docked for "lack of confidence," by the same logic: the gap sits in a separate column as the accuracy of self-perception, not as capability. The moment the measuring side breaks that separation, the person being evaluated loses the incentive to speak honestly, and the evidence itself grows thin.
Measurement Design ── Map of all 10 episodes
- Vol. 2: Listening Through STAR ── Situation, Task, Action, Result, Thought ── Pick just one thing that actually happened in the past and ask about it in five parts: the setting (Situation), what was assigned (Task), what the person did (Action), what came of it (Result), and why they decided as they did (Thinking). Spend more than half the time on the Action, write down what they did as verbs, check through the Result that it really happened, and draw out the root of the judgment through the Thinking.
- Vol. 3: Encoding to Two Axes ── Action Reveals Scope, Thought Reveals Abstraction ── Turning one "what they actually did" story heard in an interview into three readings — how widely they moved (scope sigma), what reasoning they used (abstraction alpha), and whether it really happened (grounding g) — worked through a concrete material-review example.
- Vol. 4: The Six BEI Principles ── Axioms That Keep the Measurement Clean ── What a person actually did, told through a four-point way of asking, gets converted into three rulers: depth of thinking, reach of action, and whether a real episode backs it up. The person's reading is then the highest level that the episodes actually support. This installment explains, with everyday examples, the six interview manners that keep that conversion from getting muddied.
- Vol. 5: Three Bands ── The Scales of Abstraction α, Scope σ, and Grounding g ── Before any level verdict, this issue sets the three rulers for measuring the behavior we heard: how high the reasoning goes, how far the action reached, and how firmly it is backed by fact. Measured in steps, not scores.
- Vol. 6: How L Is Decided ── The Grounding Ceiling and Projection to the Diagonal ── Talk without backing does not raise the level. Take only the reach that real behavior confirms, even out the two measures, and read L.
- Vol. 7: The Behaviors That Separate Levels ── Eight-Dimension Anchors and Boundaries ── Using a sample book of "what they actually did" (the anchor table), we match a person's account to the closest sample to decide the level (L1 to L4). All eight abilities are measured by the same method.
- Vol. 8: Confidence and Observability ── How Far to Trust a Reading ── An episode about putting a number on how sure a rating is. Confidence C comes from how much evidence there is, whether the story holds together, and whether the rater could see it; observability o comes from being well placed and actually producing evidence; their product, weight w, feeds the final tally.
- Vol. 9: Multi-Party AI Dialogue ── Corroboration for Others' Level, Divergence for Calibration ── One pair of eyes cannot measure a person. The subject and several colleagues take the same structured interview (BEI); each vote is weighted by how well that person actually saw the scene, and only readings that other votes back up are bound into an outside view of the level. The gap from the subject's self-rating is kept in a separate column as "how accurately they see themselves," not as ability.
- Vol. 10 (this episode): From Integrated Output to the Qualifying Line ── The Record and the Operating Procedure ── The closing piece of Series 3 on measurement design. In plain terms it explains how the per-person, per-item score sheet hands each number to the right checkpoint in the pass/fail decision, and walks through the seven steps for actually running the measurement.
Across ten episodes there was only one point. Capability is not decided by self-report. Take the concrete behavior heard in the interview, translate it into three yardsticks — depth of thinking, width of view, grounding in fact — read it as the highest rung the grounding supports, and corroborate it with several eyes. This final episode's score sheet and procedure were the hand-off slip, and the carrying method itself, for passing that way of measuring to the pass/fail decision intact.
Keep measurement and pass/fail on the same evidence. Turn a low value from punishment into the naming of the next concrete behavior to do. As long as these two hold, rungs 1 through 4 are decided not as impression but as the degree of match with fact-grounded patterns of behavior. The rest is to keep re-measuring on a schedule and not let the shared standard rot.
- The score sheet is a hand-off slip. "Capability seen by others" goes to the required-level floor, the gap from self-rating to the self-perception check, the lean to the direction check, and confidence and agreement to the measurement-holds check, keeping measurement and pass/fail from diverging on the same evidence.
- Order carries meaning. Align eyes, choose raters, interview dialogue, record in their own words, compute and finalize, pass/fail decision, periodic re-measurement. Skip the first alignment and the downstream calculations become a building on sand.
- Read it without punishment. A low value is not failure but an entry to "what to grow next," shown by the missing floor and the direction of the lean. The gap from self-rating is never added to or subtracted from capability; it sits in a separate column as the accuracy of self-perception.
- Boyatzis, R. E. The Competent Manager: A Model for Effective Performance. Wiley, 1982. (Theoretical basis for BEI and behavioral evidence.)
- McClelland, D. C. Testing for Competence Rather Than for Intelligence. American Psychologist, 1973. (Origin of measuring competence through behavior.)
- Cohen, J. A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement, 1960. (Foundation for inter-rater agreement, Ag.)
- Spencer, L. M. & Spencer, S. M. Competence at Work: Models for Superior Performance. Wiley, 1993. (Competency dictionaries and level-by-level behavioral indicators.)
- Smith, P. C. & Kendall, L. M. Retranslation of Expectations: An Approach to the Construction of Unambiguous Anchors for Rating Scales. Journal of Applied Psychology, 1963. (Source of anchor alignment, the BARS lineage.)