Multi-Party AI Dialogue ── Corroboration for Others' Level, Divergence for Calibration

Two people watch the same person; one says "she's clearly senior level," the other "still mid-level at best." This happens all the time. They saw different scenes and stand in different places. Leave the measurement to one rater and that rater's blind spot becomes the result. This part is about binding several people's views into one number. The key is to corroborate, not to count heads; to take a weighted middle, not a plain average.

Why one rater cannot measure

The starting point is simple: "one person's view is always biased somewhere." Leave a health checkup to a single doctor and whatever that doctor missed stays missed. Work ability is the same. Knowledge and risk-detection are seen best by colleagues who reviewed alongside you; how you communicate and build trust is seen best by a manager or the field staff you gave feedback to. So no two witnesses carry equal value. For a given scene, a rater who never actually saw the behavior brings thin substance, however confident they sound.

That is why this design picks raters not by rank but by "which scene did they actually see." For each ability, it assigns at least two people who watched that scene closely. Variety of standing is not itself the goal. "Did they see the behavior" comes first, because no number of votes from people who did not see it changes the fact that they did not see it.

"Did they actually see it," and the weight of a vote

So for each rater we produce a number for how well they could see that ability. Call it observability o — in plain terms, "how clear the view was." The idea: were they positioned to see it, and can they actually recount the evidence (concrete events)? Multiply those two. So even a senior manager who cannot recount a single concrete event for that ability gets a low view score. Testimony, not rank, sets the weight.

The vote's weight is this "view" times the "certainty of the read." Certainty (confidence C), defined in earlier parts, comes from the amount of evidence, how well it fits together, and the clarity of view — how trustworthy that read is. As a formula, w = o × C: weight = view × certainty. Only when a person both saw it clearly (o) and gave coherent, sufficient evidence (C) does their vote become heavy. Like a photograph: only a shot that is in focus (o) and sharp across several frames (C) works as strong evidence.
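The weighting above can be sketched in a few lines of Python. This is only an illustration: it assumes that both ingredients of the view (being positioned to see, and being able to recount evidence) are scored on a 0 to 1 scale, and the `Rater` class, its field names, and the example numbers are hypothetical, not taken from the source.

```python
from dataclasses import dataclass


@dataclass
class Rater:
    name: str
    position: float    # assumed 0..1: how well placed they were to see the behavior
    evidence: float    # assumed 0..1: how well they can recount concrete events
    confidence: float  # certainty C of the read, defined in earlier parts


def observability(r: Rater) -> float:
    """View o = position x evidence."""
    return r.position * r.evidence


def vote_weight(r: Rater) -> float:
    """Weight w = o x C."""
    return observability(r) * r.confidence


# Hypothetical raters: a well-placed manager with no concrete events to recount,
# and a review partner who saw the work and can describe it.
manager = Rater("senior manager", position=0.9, evidence=0.1, confidence=0.8)
peer = Rater("review partner", position=0.8, evidence=0.9, confidence=0.8)

for r in (manager, peer):
    print(f"{r.name}: o={observability(r):.2f}, w={vote_weight(r):.3f}")
```

With these made-up numbers the manager's vote ends up far lighter than the peer's, which is the point of the rule: testimony, not rank, sets the weight.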

How to bind colleagues' views — corroboration, not average or vote

This is the most important point of the part. When combining several colleagues' ratings, we use neither an average nor a plain majority vote. We use corroboration — the way a reporter won't run a story on one source alone, but only once several people independently confirm it.

In words: for each level (say "senior or above"), compute the share of weight that comes from people placing the person at that level or higher. Then take the highest level whose share clears a preset corroboration line (θ). As a formula, for level ℓ the weighted share is φ(ℓ) = (sum of the weights of colleagues who rate the person at ℓ or higher) ÷ W, where W is the sum of all colleague weights, and L_other (the outside view of the level) is the highest ℓ whose φ(ℓ) clears θ. You can skip the formula. The point: only a level backed by enough of the weight gets through.
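In symbols, one way to write the rule just described is the following. The indicator notation and the index i over colleagues are introduced here for illustration, and reading "clears" as "greater than or equal to" is an assumption; a strict inequality would differ only when a share lands exactly on the line.

```latex
\varphi(\ell) = \frac{1}{W} \sum_{i} w_i \, \mathbf{1}\left[ L_i \ge \ell \right],
\qquad W = \sum_{i} w_i,
\qquad
L_{\text{other}} = \max \left\{ \ell : \varphi(\ell) \ge \theta \right\}
```

Here L_i is the level colleague i assigns and w_i = o_i × C_i is that colleague's weight.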

The corroboration line starts at 0.5, which makes the result the weighted middle value (a weighted median). The meaning is plain: one person rating high will not raise the level unless a majority of the weight backs that level. It rises only once corroborated. Why do this? An average lets one extreme high rating drag the whole upward, and a vote treats a non-witness's ballot the same as a witness's. Corroboration prevents both. Raise the line above 0.5 and you demand stronger backing for higher levels, which is a more cautious stance.
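A minimal Python sketch of corroboration follows. It assumes levels are integers (for example 1 to 4) and that "clears" means "greater than or equal to". The function name `corroborated_level` and the example votes are hypothetical.

```python
def corroborated_level(votes, theta=0.5):
    """Return the highest level whose weighted share of support is >= theta.

    votes: list of (level, weight) pairs, one per colleague.
    A vote supports every level at or below the level it names.
    Returns None when there is no weight at all.
    """
    total = sum(w for _, w in votes)
    if total <= 0:
        return None
    best = None
    for level in sorted({lv for lv, _ in votes}):
        share = sum(w for lv, w in votes if lv >= level) / total
        if share >= theta:
            best = level
    return best


# Hypothetical case: two heavy votes for level 2, lighter votes for 3 and 4.
votes = [(2, 0.6), (2, 0.5), (3, 0.3), (4, 0.2)]
plain_mean = sum(lv for lv, _ in votes) / len(votes)

print(corroborated_level(votes))  # level backed by at least half the weight
print(plain_mean)                 # unweighted average, pulled up by the high votes
```

In this example the unweighted average sits above level 2 because of the two high ratings, while the corroborated level stays at 2 because the higher levels carry less than half of the total weight. Passing a larger `theta` makes the rule more cautious.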

The gap from self-rating — calibration, not capability

The subject also takes the same AI interview and produces their own level (L_self). The difference between this and the combined colleague level (L_other) is the gap (Δ). The formula is simple: Δ = self-rating − others' rating. Positive means they see themselves high (over-claiming), negative means low (modesty), near zero means accurate self-perception.

One line must never be crossed: this gap is not "capability." It is "how accurately you see yourself." So the source says flatly: "the gap is never added to or subtracted from the capability score; keep it in a separate column and read it as the quality of self-perception." We do not punish the self-inflater by lowering their level, nor reward the modest by raising it. The others' rating stays the others' rating; the gap sits beside it. Why? In a health checkup, the patient's self-report ("I feel fine") and the test results are recorded separately as a matter of course. Mix the two and neither stays trustworthy. Keeping the three layers (coordinate, capability, self-perception) unmixed to the end is the backbone of this design.
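The separate-column rule can be sketched as below. The sketch assumes integer levels, so "near zero" is read as exactly zero. The function name `read_gap` and the dictionary keys are hypothetical.

```python
def read_gap(l_self: int, l_other: int) -> dict:
    """Record the gap next to the capability reading without altering it."""
    gap = l_self - l_other
    if gap > 0:
        reading = "over-claiming"
    elif gap < 0:
        reading = "modest"
    else:
        reading = "accurate"
    # Capability is the others' rating, untouched by the gap.
    return {"capability": l_other, "gap": gap, "self_perception": reading}


print(read_gap(4, 3))
print(read_gap(2, 3))
```

Whatever the subject claims, the `capability` field is always the others' rating; only the `gap` and `self_perception` fields move.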

Agreement among raters — is it even measurable

If colleagues' ratings scatter all over, that ability is not yet measurable. When several referees in a sport disagree sharply, the first thing to suspect is not the player but that the referees read the rules differently. The same idea applies here, so we check inter-rater agreement (Ag). The idea: "a one-step difference is forgiven." As a formula, take the share of rater pairs whose levels differ by at most one band (Ag = pairs within ±1 band ÷ all pairs). A split between "senior" and "near-senior" is within interpretive range; a split between "junior" and "senior" means people are reading the ruler differently.
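The agreement measure is a pair count, sketched here in Python. It assumes each colleague's reading is an integer band index, so "within one band" becomes an absolute difference of at most 1. The function name `agreement` is hypothetical.

```python
from itertools import combinations


def agreement(levels, tolerance=1):
    """Ag = share of rater pairs whose levels differ by at most `tolerance`.

    Returns None when there are fewer than two raters, since no pair exists.
    """
    pairs = list(combinations(levels, 2))
    if not pairs:
        return None
    close = sum(1 for a, b in pairs if abs(a - b) <= tolerance)
    return close / len(pairs)


print(agreement([3, 3, 4]))     # every pair within one band
print(agreement([1, 3, 3, 4]))  # the rater at level 1 disagrees with everyone
```

In the second call, three of the six pairs involve the rater at level 1 and all three differ by two bands or more, so agreement drops to one half.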

The condition for finalizing a judgment (validity condition G0) requires three things at once: the total weight above a floor, agreement above a floor, and at least two colleagues with "view ≥ 0.7." Miss any one and the judgment is paused. Then a calibration meeting aligns how people read the ruler, and the measurement is redone. In the source's words, "low agreement is a sign the ruler is read differently; fix the standard before judging the person." Do not mistake low agreement for a problem with the person being rated. It is a problem with the measurers — you cannot blame the player when the referees disagree.
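The three-part check can be sketched as follows. Only the "view ≥ 0.7" threshold and the minimum of two such colleagues come from the text. The floors for total weight and agreement (`min_weight`, `min_agreement`) are placeholder values chosen for illustration, and treating "above a floor" as "at least the floor" is also an assumption.

```python
def g0_failures(total_weight, agreement, views,
                min_weight=1.0,     # placeholder floor, not from the source
                min_agreement=0.7,  # placeholder floor, not from the source
                clear_view=0.7, min_clear_raters=2):
    """Return the list of failed validity conditions; an empty list means G0 holds."""
    failures = []
    if total_weight < min_weight:
        failures.append("total weight below floor")
    if agreement < min_agreement:
        failures.append("agreement below floor")
    clear = sum(1 for o in views if o >= clear_view)
    if clear < min_clear_raters:
        failures.append("fewer than two clear-view colleagues")
    return failures


print(g0_failures(1.85, 5 / 6, [0.9, 0.8, 0.7, 0.3]))  # all three conditions hold
print(g0_failures(1.85, 0.4, [0.9, 0.5]))              # two conditions fail
```

Returning the failed conditions rather than a bare yes or no makes it easier to see what a calibration meeting needs to address before the measurement is redone.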

Different questions for the subject and for colleagues

The AI rephrases the same interview skeleton to fit who it is talking to. What stays constant: it asks no "what if" hypotheticals, only real past events. The reason is that people answer hypotheticals with ideals, and the actual behavior never surfaces. To the subject: "tell me one time you actually did ...," drawing out the basis of judgment (how broad a principle they reasoned from = abstraction α) and how far the action reached (scope σ). To a colleague: "tell me one time you saw this person do ...," carving out what that individual actually did from the whole team's outcome. Shared prohibitions for both: no leading, no hypotheticals, no asking several things at once, no asking for opinions. And always bring an abstract word back to concrete behavior ("Specifically? What did you do next?").

The full computation for one ability

Trace the source's procedure in words, like following a recipe. First, interview the subject and each colleague, derive coordinates from the concrete events, and read off a level (L) for each. Next, sum the colleagues' weights and take, as the others' rating (L_other), the highest level whose weighted share clears the corroboration line. Subtract this from the subject's self-rating to get the gap (Δ). From the colleagues' levels, take the share of pairs that landed within one band to get agreement (Ag). If the validity conditions on weight, agreement, and number of clear-view raters fail, pause the judgment (needs calibration); if they hold, return a record bundling coordinate, others' rating, self-rating, gap, imbalance (b), and certainty (C). Only with this whole sequence does the level get fixed not as "a vague impression" but as "agreement with the evidence of actual behavior."
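The recipe above can be strung together in one short Python sketch. It is a simplification under stated assumptions: levels are integers, the floors `min_weight` and `min_agreement` are placeholder values, and the coordinates, imbalance (b), and per-record certainty (C) that the full record carries are left out because they come from earlier parts. The `Reading` class and `assess` function are hypothetical names. The validity check runs before the level is computed so that a zero total weight never reaches the division.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Reading:
    level: int  # level L read from the interview
    o: float    # observability (view)
    c: float    # confidence (certainty)


def assess(self_level, colleagues, theta=0.5, min_weight=1.0, min_agreement=0.7):
    weights = [r.o * r.c for r in colleagues]  # w = o x C
    total = sum(weights)
    pairs = list(combinations([r.level for r in colleagues], 2))
    ag = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs) if pairs else 0.0
    clear = sum(r.o >= 0.7 for r in colleagues)
    # Validity condition: pause the judgment if any of the three checks fails.
    if total < min_weight or ag < min_agreement or clear < 2:
        return {"status": "needs calibration", "agreement": ag, "total_weight": total}
    # Corroboration: highest level whose weighted share of support clears theta.
    l_other = max(
        lv for lv in {r.level for r in colleagues}
        if sum(w for r, w in zip(colleagues, weights) if r.level >= lv) / total >= theta
    )
    return {"status": "ok", "L_other": l_other, "L_self": self_level,
            "gap": self_level - l_other, "agreement": ag, "total_weight": total}


# Hypothetical colleagues: three clear-view raters near level 3, one distant level 4.
colleagues = [Reading(3, 0.9, 0.8), Reading(3, 0.8, 0.7),
              Reading(2, 0.7, 0.6), Reading(4, 0.3, 0.5)]
print(assess(4, colleagues))
print(assess(3, [Reading(1, 0.9, 0.9), Reading(4, 0.9, 0.9)]))
```

In the first call the others' rating settles at level 3 and the subject's self-rating of 4 shows up only as a gap of 1. In the second call the two raters are three bands apart, so the judgment is paused rather than averaged.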

Measurement Design ── Map of all 10 episodes

Vol. 2: Listening Through STAR ── Situation, Task, Action, Result, Thought ── Pick just one thing that actually happened in the past and ask about it in five parts: the setting (Situation), what was assigned (Task), what the person did (Action), what came of it (Result), and why they decided as they did (Thought). Spend more than half the time on the Action, write down what they did as verbs, check through the Result that it really happened, and draw out the root of the judgment through the Thought.
Vol. 3: Encoding to Two Axes ── Action Reveals Scope, Thought Reveals Abstraction ── Turns one "what they actually did" story heard in an interview into three readings: how widely they moved (scope σ), what reasoning they used (abstraction α), and whether it really happened (grounding g), worked through a concrete material-review example.
Vol. 4: The Six BEI Principles ── Axioms That Keep the Measurement Clean ── What a person actually did, told through a four-point way of asking, gets converted into three rulers: depth of thinking, reach of action, and whether a real episode backs it up. The person's reading is then the highest level that the episodes actually support. This installment explains, with everyday examples, the six interview manners that keep that conversion from getting muddied.
Vol. 5: Three Bands ── The Scales of Abstraction α, Scope σ, and Grounding g ── Before any level verdict, this issue sets the three rulers for measuring the behavior we heard: how high the reasoning goes, how far the action reached, and how firmly it is backed by fact. Measured in steps, not scores.
Vol. 6: How L Is Decided ── The Grounding Ceiling and Projection to the Diagonal ── Talk without backing does not raise the level. Take only the reach that real behavior confirms, even out the two measures, and read L.
Vol. 7: The Behaviors That Separate Levels ── Eight-Dimension Anchors and Boundaries ── Using a sample book of "what they actually did" (the anchor table), we match a person's account to the closest sample to decide the level (L1 to L4). All eight abilities are measured by the same method.
Vol. 8: Confidence and Observability ── How Far to Trust a Reading ── An episode about putting a number on how sure a rating is. Confidence C comes from how much evidence there is, whether the story holds together, and whether the rater could see it; observability o comes from being well placed and actually producing evidence; their product, weight w, feeds the final tally.
Vol. 9 (this episode): Multi-Party AI Dialogue ── Corroboration for Others' Level, Divergence for Calibration ── One pair of eyes cannot measure a person. The subject and several colleagues take the same structured interview (BEI); each vote is weighted by how well that person actually saw the scene, and only readings that other votes back up are bound into an outside view of the level. The gap from the subject's self-rating is kept in a separate column as "how accurately they see themselves," not as ability.
Vol. 10 (final): From Integrated Output to the Qualifying Line ── The Record and the Operating Procedure ── The closing piece of Series 3 on measurement design. In plain terms it explains how the per-person, per-item score sheet hands each number to the right checkpoint in the pass/fail decision, and walks through the seven steps for actually running the measurement.

In closing

What the multi-rater design changes is the protagonist of measurement: from "one excellent rater" to "corroborated multiple testimonies." One person's blind spot is filled by another's observation, the corroboration line stops a runaway single vote, and inter-rater agreement exposes the measurers' own lack of preparation. The level thereby approaches a value that does not hinge on who happened to watch.

And the gap from self-rating sits quietly in its own column. It is neither a bonus nor a penalty on ability, but a record of how accurately a person sees themselves. As long as capability (the others' rating) and self-perception (the gap) are never mixed, the measurement stays an entrance to development, not an instrument for judging people.

Key Points ── Three to take with you

The outside view of the level comes from corroboration: colleague ratings are bound not by average or vote; each is weighted by "did they see it (view) × certainty of the read," and a level counts only once enough of that weight backs it. The starting line is 0.5 (a weighted middle), so an uncorroborated vote cannot raise the level.
The gap from self-rating is calibration, not capability: gap = self-rating − others' rating measures "how accurately you see yourself"; it is never added to ability but kept in a separate column. The self-inflater is not punished, the modest not over-praised, just as a health checkup records self-report and test results separately.
When ratings scatter, pause the judgment: if inter-rater agreement (a one-band difference is forgiven) is low, or fewer than two clear-view colleagues are present, align how people read the ruler in a calibration meeting before judging anyone. The split is the measurers' problem, not the rated person's.

Sources & references

McClelland, D. C. Testing for Competence Rather Than for "Intelligence". American Psychologist, 1973. (The origin of measuring by behavioral evidence.)
Boyatzis, R. E. The Competent Manager: A Model for Effective Performance. Wiley, 1982. (Foundation of the Behavioral Event Interview.)
Cohen, J. A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement, 1960. (The classic inter-rater agreement κ; the conceptual root of Ag.)
Smith, P. C., & Kendall, L. M. Retranslation of Expectations: An Approach to the Construction of Unambiguous Anchors for Rating Scales. Journal of Applied Psychology, 1963. (Behavioral anchors and aligning interpretation.)
Spencer, L. M., & Spencer, S. M. Competence at Work: Models for Superior Performance. Wiley, 1993. (Practice of competency judgment via BEI.)
