Through Part 3 we saw how a four-point interview — Situation, Task, Action, Result, called STAR — gets translated into two rulers: depth of thinking and reach of action. But even a correct mechanism gives a muddy answer when the incoming material is muddy. It is like cooking: a perfect recipe is ruined by spoiled ingredients. Depending on how you ask, the same person can yield testimony that looks "quite ordinary" or "exceptionally able." The six principles introduced here are the foundation rules that prevent this muddying-by-interview. Ask only about what was actually done; ask only about past facts; pin the subject to the person; ask for actions, not adjectives; confirm repetition, not a one-off; do not lead the answer. Each looks obvious, yet in the field fewer than half are honored.

Why call them axioms

The source frames these six principles as "what keeps the measurement from getting muddy." In mathematics an axiom is a starting premise you accept before any proof begins. The reason to call these axioms is the same: every calculation rests on top of them.

Behind the work of turning ability into a score, each story you hear is quietly assumed to be "an action the person actually did in the past." A health check makes this clear. However precise the blood-test numbers are, if the collection tube is dirty the result means nothing. In the same way, the moment that assumption breaks, no amount of refined calculation saves the output — it becomes garbage. The six principles are the fences that physically enforce that assumption at the interview stage.

We take them in order. What matters is matching each principle to "which of the three rulers it protects." Muddying is not a vague "accuracy somehow drops." It takes a concrete form: one of the three rulers — depth of thinking (alpha), reach of action (sigma), backing by a real episode (g) — lands one step off from reality. Why keep this matching in mind? Because once you see which corner-cut throws off which ruler, you can no longer cut corners in the field.

| Principle | Forbidden way of asking | The ruler it protects |
| --- | --- | --- |
| Only what was actually done | "Can you / could you do it?" | Real-episode backing (g): do not mistake willingness for action |
| Only past facts | "What would you do if…" | Backing (g): a hypothetical has no episode behind it |
| Pin the subject to the person | Recording "we did it" as-is | Reach of action (sigma): do not claim others' work as the person's |
| Actions, not adjectives | Accepting "was excellent" as a conclusion | Depth of thinking (alpha): adjectives show no depth |
| Confirm repetition | Deciding ability from one event | Backing ceiling (g=2): do not read luck as ability |
| Do not lead | Letting the expected answer seep into the question | All rulers: the questioner's assumption mixes in |
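
The matching in the table can also be written down as a small lookup, which makes it easy to ask "if these principles were skipped, which rulers might be off?" The following is only a sketch: the key names and the function `rulers_at_risk` are illustrative assumptions, not names from the source.

```python
# Hypothetical encoding of the table: principle -> rulers it protects.
# Key names are illustrative; the source gives only the prose table.
PROTECTS = {
    "only_what_was_done": {"g"},
    "only_past_facts": {"g"},
    "pin_the_subject": {"sigma"},
    "actions_not_adjectives": {"alpha"},
    "confirm_repetition": {"g"},  # specifically the backing ceiling (g=2)
    "do_not_lead": {"alpha", "sigma", "g"},
}


def rulers_at_risk(violated):
    """Return the sorted rulers that may be off, given the violated principles."""
    at_risk = set()
    for principle in violated:
        at_risk |= PROTECTS[principle]
    return sorted(at_risk)


print(rulers_at_risk(["pin_the_subject", "only_past_facts"]))  # ['g', 'sigma']
```

Read this way, a skipped principle stops being a vague accuracy loss and becomes a named ruler to re-check.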

1. Only what was actually done — count only "did it"

The source's definition is plain: take as evidence only what was "actually done," not "can do / could do." Willingness and personality are not measured. What this protects is real-episode backing (g), that is, whether a genuine episode stands behind the claim. Backing becomes "present (1)" only when the when, the where, and the what are all concretely stated. A claim that exists only as words stays at "absent (0)."

Suppose someone says, "I'm sensitive to exaggeration risk." That is the adjective "sensitive," not an action. Backing stays at zero. So the interviewer does not score "sensitive" but pulls back: "Tell me one actual case where you caught it." Only when a concrete case appears does backing turn "present," and only then does evaluation of that event begin. Why be this strict? Because self-promotion in interviews inflates if left alone. Not adding willingness into ability is the first weir that stops a story from being padded.
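
The rule that backing turns "present" only when when, where, and what are all concrete can be sketched as a simple check. This is a minimal illustration, not the source's procedure: the `Claim` record, its field names, and the example episode are invented assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Claim:
    """Hypothetical record of one statement heard in the interview."""
    statement: str
    when: Optional[str] = None
    where: Optional[str] = None
    what_was_done: Optional[str] = None


def backing(claim: Claim) -> int:
    """Return 1 ("present") only if when, where, and what are all filled in."""
    fields = (claim.when, claim.where, claim.what_was_done)
    return 1 if all(f and f.strip() for f in fields) else 0


# An adjective with no episode behind it stays at 0.
vague = Claim("I'm sensitive to exaggeration risk")

# Invented example of the same claim after the interviewer pulls back.
concrete = Claim(
    "I'm sensitive to exaggeration risk",
    when="a recent material review",
    where="a draft with a comparison graph",
    what_was_done="flagged that the difference shown was not significant",
)

print(backing(vague), backing(concrete))  # 0 1
```

The check deliberately ignores how impressive the statement sounds; only the presence of a concrete episode moves the value.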

2. Only past facts — a hypothetical has no backing

"What would you do if you saw a graph with no significant difference?" At first glance this seems to measure the power to see through things. But the source forbids it: a hypothetical question can never be a source of backing. A future or "what if" action has no who-when-what event at all. A non-existent event cannot be evaluated.

The danger of the hypothetical is that the answerer slips into their "ideal self." Without any intent to lie, "what if" narration comes out a step higher than actual behavior. The classic "model-student answer" in interviews is exactly this. So the interviewer always pulls back to "one event that actually happened" — "Tell me one concrete recent case where you saw through a figure with no significant difference." Asking only in the past tense keeps the backing genuine.

3. Pin the subject to the person — turn "we" back into "you"

This is the most frequent muddying in this kind of interview. The answerer unconsciously says "we judged it this way." Their individual action dissolves into the team's outcome and disappears. The source orders this turned back into "you / that person." What it protects is the reach of action (sigma).

This ruler looks at how far a person's action carried: how far into unfamiliar fields or unprecedented problems their hand reached. But when the record says "we," a reach achieved by someone else on the team becomes the person's credit. In sports terms, it is like recording the team's goal as one player's goal. When a third party evaluates, the principle is applied in reverse: the interviewer uses a chain of questions to cut "the one move that person actually made" out of "the team's outcome." "Who first proposed that judgment?" "What did you yourself say in that moment?" An event from which no individual contribution can be cut out is not used as material for this ruler, because without that cut you cannot tell whose ability it is.
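
As a rough sketch of that last rule, events can be filtered so that only those with an isolated individual move feed the reach ruler. The `Event` record, its fields, and the sample events are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Event:
    """Hypothetical record of one episode from the interview."""
    team_outcome: str
    individual_move: Optional[str] = None  # what the person did, if it could be isolated


def usable_for_reach(events: List[Event]) -> List[Event]:
    """Keep only events where an individual contribution was cut out."""
    return [e for e in events if e.individual_move and e.individual_move.strip()]


events = [
    Event("we revised the figure layout"),  # still "we": not usable
    Event("we withdrew the claim", individual_move="first proposed withdrawing it"),
]

print([e.team_outcome for e in usable_for_reach(events)])  # ['we withdrew the claim']
```

Dropping an event here does not say the person contributed nothing; it only says the record cannot show whose ability it was.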

4. Actions, not adjectives — not "amazing" but "what did you do"

"Was excellent," "was careful" are impressions, not evidence. The source mandates recording the concrete action (the verb) that supports the conclusion, not the conclusion (the adjective). What it protects is depth of thinking (alpha).

This ruler reads what and how the person thought, on four steps: merely tracing the wording (shallow), connecting several conditions, building a line of reasoning from a principle or aim to reach a conclusion, or forming a brand-new principle itself (deep). This cannot be judged without reassembling "what was thought" in the language of action. Depth cannot be read from "excellent." It is like focusing a camera. "Excellent" is a blurry, out-of-focus photo with no visible outline. Only a chain of verbs — "stated that the provision's aim is to prevent misreading, then extended the reasoning to the layout of the figure" — brings it into focus and lets you read the depth as "reasoned from a principle." So the recording manner insists: "write it down word for word, in verbs, not adjectives." Allowing adjectives at the recording stage pads the later evaluation.

5. Confirm repetition — separate luck from ability

A one-time jackpot is not ability. The source requires checking, across several events, whether it was a one-off accident or a pattern that recurs. This connects directly to the rule of the "backing ceiling (g=2)." The top rank of backing, 2, is given only when "a counter-example (a case that did not work) was seen as well" or "reproduction was confirmed across several events." Only on reaching this does a high rank on that ruler become fixed.

Put in words: to claim a high rank, you need enough events backing judgment at that level. A single sharp catch (backing of 1), however impressive its content, is blocked by the ceiling, and the high rank is not fixed. A panel of referees makes this clear: one referee calling "nice play" does not settle the call; only when several referees give the same call does it become solid. So the interviewer collects both success cases and difficult cases, and checks whether judgment at the same depth shows up in a different scene too. "Is there another case where you used the same way of seeing through it?" Sharpness that cannot be reproduced is held in reserve, because one instance of sharpness may be luck.
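
The ceiling rule can be sketched as follows. The source says only "several events," so the threshold `min_repeats=2` is an assumption, as are the function names; treat this as one possible reading rather than the series' actual formula.

```python
def backing_level(n_events: int, has_counter_example: bool, min_repeats: int = 2) -> int:
    """Return backing g in {0, 1, 2}.

    n_events: concrete events showing judgment at the same depth.
    has_counter_example: a case that did not work was also examined.
    min_repeats: assumed meaning of "several events" (not given in the source).
    """
    if n_events == 0:
        return 0  # no episode at all
    if has_counter_example or n_events >= min_repeats:
        return 2  # reproduction or a counter-example confirmed
    return 1      # a single event: present, but capped


def high_rank_fixed(g: int) -> bool:
    """A high rank on a ruler is fixed only when backing reaches the top level."""
    return g == 2


print(high_rank_fixed(backing_level(1, has_counter_example=False)))  # False
print(high_rank_fixed(backing_level(2, has_counter_example=False)))  # True
```

A single sharp catch therefore yields g = 1 and stays in reserve until a second event or a counter-example is collected.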

6. Do not lead — do not mix in the questioner's assumption

The last principle muddies most quietly and most deeply. "Do not hint at the desirable answer; the moment the questioner's expectation mixes in, the measurement is muddied." What it protects is not one ruler but all three. Leading pulls depth, reach, and backing all at once toward the direction the evaluator wants to see.

Ask "people usually notice this — you noticed too, right?" and the answerer narrates as if they noticed. The trouble with leading is that neither side readily becomes aware of the muddying; both mean well. Here lies one advantage of having an AI carry the same interview skeleton — an interviewer that holds no expectation can structurally reduce unconscious leading. Even so, the prohibitions are stated explicitly. Do not use leading questions, hypothetical questions, questions that ask several things at once, or questions asking for opinions. When an abstract word appears, always return to concrete action. Apply the return — "specifically? what did you do next?" — every time, carefully rather than mechanically. Because the moment you skip this small effort, the model-student answer gets recorded straight.

What happens when the six break

Violations advance silently. When leading and hypotheticals combine, the answerer's "ideal image" is recorded straight as "exceptionally able." Neglect pinning the subject, and the team's reach turns into the person's credit. Skip the repetition check, and one stroke of luck stacks up as ability. Each is a small slip on its own. But stacked across eight evaluation items and multiple evaluators, it can pass a person through the qualifying line (covered in Series 2) who should not pass. That qualifying line is a strict gate — "fail any single item and you are out" — and the buildup of small slips wrongly breaks through it. The sturdiness of the measurement rests, before any refinement of the formula, on whether these six fences hold in the field.
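
A small sketch shows how one inflated step can break through a "fail any single item and you are out" gate. The item names, the required level, and the levels themselves are placeholders chosen for illustration; the actual qualifying line is defined elsewhere in the series.

```python
def passes_gate(levels, required):
    """Strict gate: every item must meet its required level."""
    return all(levels[item] >= required[item] for item in required)


items = [f"item_{i}" for i in range(1, 9)]  # eight placeholder evaluation items
required = {item: 2 for item in items}      # hypothetical required level

true_levels = {item: 2 for item in items}
true_levels["item_5"] = 1                   # genuinely one step short on one item

recorded = dict(true_levels)
recorded["item_5"] += 1                     # one-step inflation, e.g. from a leading question

print(passes_gate(true_levels, required))   # False
print(passes_gate(recorded, required))      # True
```

Because the gate is all-or-nothing, a single ruler landing one step high on a single item is enough to change the outcome.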

Measurement Design ── Map of all 10 episodes

  1. Vol. 2: Listening Through STAR ── Situation, Task, Action, Result, Thought ── Pick just one thing that actually happened in the past and ask about it in five parts: the setting (Situation), what was assigned (Task), what the person did (Action), what came of it (Result), and why they decided as they did (Thinking). Spend more than half the time on the Action, write down what they did as verbs, check through the Result that it really happened, and draw out the root of the judgment through the Thinking.
  2. Vol. 3: Encoding to Two Axes ── Action Reveals Scope, Thought Reveals Abstraction ── Turning one "what they actually did" story heard in an interview into three readings — how widely they moved (scope sigma), what reasoning they used (abstraction alpha), and whether it really happened (grounding g) — worked through a concrete material-review example.
  3. Vol. 4 (this episode): The Six BEI Principles ── Axioms That Keep the Measurement Clean ── What a person actually did, told through a four-point way of asking, gets converted into three rulers: depth of thinking, reach of action, and whether a real episode backs it up. The person's reading is then the highest level that the episodes actually support. This installment explains, with everyday examples, the six interview manners that keep that conversion from getting muddied.
  4. Vol. 5: Three Bands ── The Scales of Abstraction α, Scope σ, and Grounding g ── Before any level verdict, this issue sets the three rulers for measuring the behavior we heard: how high the reasoning goes, how far the action reached, and how firmly it is backed by fact. Measured in steps, not scores.
  5. Vol. 6: How L Is Decided ── The Grounding Ceiling and Projection to the Diagonal ── Talk without backing does not raise the level. Take only the reach that real behavior confirms, even out the two measures, and read L.
  6. Vol. 7: The Behaviors That Separate Levels ── Eight-Dimension Anchors and Boundaries ── Using a sample book of "what they actually did" (the anchor table), we match a person's account to the closest sample to decide the level (L1 to L4). All eight abilities are measured by the same method.
  7. Vol. 8: Confidence and Observability ── How Far to Trust a Reading ── An episode about putting a number on how sure a rating is. Confidence C comes from how much evidence there is, whether the story holds together, and whether the rater could see it; observability o comes from being well placed and actually producing evidence; their product, weight w, feeds the final tally.
  8. Vol. 9: Multi-Party AI Dialogue ── Corroboration for Others' Level, Divergence for Calibration ── One pair of eyes cannot measure a person. The subject and several colleagues take the same structured interview (BEI); each vote is weighted by how well that person actually saw the scene, and only readings that other votes back up are bound into an outside view of the level. The gap from the subject's self-rating is kept in a separate column as "how accurately they see themselves," not as ability.
  9. Vol. 10 (final): From Integrated Output to the Qualifying Line ── The Record and the Operating Procedure ── The closing piece of Series 3 on measurement design. In plain terms it explains how the per-person, per-item score sheet hands each number to the right checkpoint in the pass/fail decision, and walks through the seven steps for actually running the measurement.

In closing

All six are written less as "the right way to ask" than as "prohibitions on the wrong way to ask." Honoring them earns no bonus. Worse, breaking them does not merely cost points — it stops the measurement from holding at all. That is why they are axioms, the foundation rules. A broken formula announces itself: the numbers go wrong and you notice at once. But interview muddying advances quietly while the numbers still look clean. That is the most frightening part.

The next installment (Part 5) takes up the procedure for converting the backed evidence you have gathered into the three rulers — depth of thinking, reach of action, and real-episode backing. It explains the definition of each rank and how to tell neighboring ranks apart, using a concrete example list (the anchor table) for the power to see through risk. Only when the manner of asking (this part) and the manner of converting (the next) are both in place can a person's ability be read as "a reading supported by its backing."

Key Points ── Three to take with you
  1. Each principle is a fence guarding a specific ruler. "Only what was actually done" and "only past facts" guard real-episode backing; "pin the subject to the person" guards reach of action; "actions, not adjectives" guards depth of thinking; "confirm repetition" guards the backing ceiling; "do not lead" guards all of them. Muddying is not a vague accuracy drop but a specific ruler landing one step off.
  2. Hypotheticals, adjectives, and "we" cannot be used as evaluation material. A future "what if" has no event, so backing is zero; depth cannot be read from an adjective; leaving "we" intact claims another's work as the person's credit. So the interviewer always pulls back to past tense, verbs, and the individual move.
  3. Violations accumulate silently and wrongly pass the qualifying line. Small isolated slips, stacked across eight items and multiple evaluators, wrongly break through the strict "fail one item and you are out" gate. Sturdiness rests, before any refinement of the formula, on whether these six fences hold in the field.

Sources & references
  1. McClelland, D. C. Testing for Competence Rather Than for "Intelligence". American Psychologist, 1973. (Origin of measuring by behavioral indicators rather than aptitude tests)
  2. Boyatzis, R. E. The Competent Manager: A Model for Effective Performance. Wiley, 1982. (Systematization of the Behavioral Event Interview)
  3. Spencer, L. M. & Spencer, S. M. Competence at Work: Models for Superior Performance. Wiley, 1993. (BEI practical procedure and past-behavior discipline)
  4. Smith, P. C. & Kendall, L. M. Retranslation of Expectations: An Approach to the Construction of Unambiguous Anchors for Rating Scales. Journal of Applied Psychology, 1963. (BARS — anchors that curb subjectivity)
  5. Janz, T. Initial Comparisons of Patterned Behavior Description Interviews versus Unstructured Interviews. Journal of Applied Psychology, 1982. (Predictive validity of structured past-behavior interviews)