Report Fidelity

Tests whether a report still supports the claim and decision being built from it under pressure.


Descriptive · Full Practice · Knowledge · Report-Validity Diagnostic

01 // The Codex Lens

A user-research team interviews people before a product rollout. The fieldwork does not give the team a clean story. Some people are excited, some are confused, and some can use the product only by building their own workarounds.

The synthesis that travels upward says: "Users broadly understand the feature; remaining issues are onboarding polish."

No one has to lie for that sentence to fail. The interviews happened. The quotes may be real. The researcher may be careful. The problem is that the report has started to carry a different object from the one named on the page. It claims to report user reality. Under roadmap pressure, it begins reporting product viability.

That is the failure Report Fidelity names.

Systems do not act on reality directly. They act on reports about reality: summaries, meeting records, incident accounts, risk disclosures, audit findings, dashboards, model evaluations, safety scores, field notes, public accountability statements. Those reports become the objects people use to decide what happened, what is true, what can ship, what must be fixed, what can be ignored.

So the question is not only whether a report contains facts. A report can contain facts and still fail. The harder question is whether the facts, selection, method, summary, and interpretation still support the decision being built from them.

Report Fidelity asks whether the evidence path supports the interpretation and use being made from a report.

Control breaks report fidelity by managing interpretation. The report is translated until it can travel safely through power. Bad news remains present, but its force is removed. Decay breaks report fidelity by losing the structure that lets evidence carry weight. Observations become summaries, summaries become claims, claims become decisions, and no one can reconstruct the chain.

The Range is neither suspicion toward every report nor trust in polished documentation. It is the maintained warrant between evidence, interpretation, and use, especially when an accurate report would cost someone something.

[Figure: Report Fidelity, from evidence to substituted object under pressure. Reality is observed, evidence is recorded, the report claims X, and the decision uses X as warrant, while pressure makes a substituted object Y useful. Fidelity fails when the report keeps the name of X while Y starts doing the practical work.]

02 // The Concept

Report Fidelity asks a validity question under pressure: does this report still support the interpretation and use being made from it?

In formal validity language, a result is not valid in the abstract. It is valid for a particular interpretation and use. A score, synthesis, observation set, audit finding, or evaluation has to support the claim being made from it and the decision being built on that claim. That is the primary source lineage for this tool: construct validity, Messick's unified validity work, and argument-based validation, especially Michael Kane's interpretation/use argument.

The Toolkit version is simpler:

A report claims to report X. Under pressure, it begins reporting Y while retaining the label, format, and authority of X.

The substitution can look like user readiness becoming roadmap pressure, team consensus becoming leader comfort, root cause becoming legal defensibility, model safety becoming benchmark optimization, or peer performance becoming policy discomfort. In AI evaluation, a performance report can become a report about the evaluator's discomfort with the consequence.

The useful name for the mechanism is report-object substitution under pressure.

What Has To Be Present

Report Fidelity applies when five things are present:

  1. A report, score, summary, disclosure, evaluation, audit finding, or synthesis claims to report some underlying reality.
  2. Someone interprets the report as evidence for a claim.
  3. The claim supports a decision, action, public position, or future belief.
  4. Pressure makes a cleaner, safer, or more useful interpretation attractive.
  5. The live question is whether the report still warrants the interpretation and use attached to it.

The pressure can be social, institutional, commercial, reputational, legal, political, or internal to an AI evaluation process. The pressure does not have to be malicious. A sincere team can still produce a report that protects its roadmap more than it reports the field.

Boundary Against Nearby Tools

Report Fidelity has to stay narrow or it becomes a prestige name for every information problem. These boundaries are part of the tool.

Information Degradation is primary when information deteriorates through distance, time, mediation, compression, repeated transmission, or loss of access to primary sources. Report Fidelity is narrower. It asks whether the report still warrants the interpretation and use being made from it after pressure has acted on the chain from observation to decision.

Signal vs Noise is primary when the problem is distinguishing meaningful information from meaningless volume. Report Fidelity is not a general filter for relevance. It applies when a report carries an official object while another object begins doing the work.

Goodhart's Law and Campbell's Law are primary when a metric or indicator becomes a target and is corrupted by that use. Report Fidelity can ask whether the metric still warrants the interpretation attached to it, but the metric-corruption case already has a stronger name.

Trust Diagnostics is primary when the question is whether a person, source, institution, or system deserves reliance. Report Fidelity comes before that question. It asks whether this particular report can carry the weight being placed on it.

Preference Falsification, Chilling Effects, Psychological Safety, and Loyal Opposition are primary when dissent cannot enter the room. Report Fidelity applies when the dissent enters the room but does not survive into the record someone else will act on.

Adversarial Dynamics is primary when a bad-faith actor deliberately corrupts the cooperative system or the reporting channel. Report Fidelity can read the damaged report after the fact, but it does not require an adversary. Structural pressure is enough.

Control And Decay Forms

Report Fidelity fails toward Control when the report is managed. Bad news is preserved only after its force has been removed. Findings are translated into language that protects authority. The institution owns the method, owns the interpretation, and treats challenge to the interpretation as challenge to the institution. A report that should describe reality becomes a proof of compliance, legal defensibility, executive comfort, or institutional competence.

Report Fidelity fails toward Decay when the report loses structure. Evidence becomes a soup of quotes, numbers, anecdotes, and unsupported interpretations. Context disappears as findings travel. The audit trail rots. Strong evidence and weak impressions carry the same weight. Reports keep appearing because the system expects reports, but no one can tell what they prove.

The Range form is a report that says what it claims, shows what it rests on, names what it cannot support, preserves inconvenient evidence with its decision force intact, and lets the report alter action when the evidence warrants it.

03 // The Practice

The diagnostic question is this: "Does the evidence path support the interpretation and use being made from this report?"

Use it before relying on a user-research synthesis, meeting record, incident summary, audit finding, model evaluation, safety score, risk disclosure, or public accountability report.

Three practices make the question usable.

Name the report object. Write three sentences: "This report claims to report X. It is being interpreted as Y. It is being used to support Z." If X, Y, and Z do not line up, pause before relying on the report. The gap between claimed object, interpretation, and use is where fidelity often breaks.

Trace the interpretation-use chain. Ask what was observed, how it was recorded, who transformed it, what claim the report made from it, and what decision the claim is now being used to justify. You are looking for the point where evidence became interpretation, and where interpretation became permission to act.

Run the substitution test. Ask what the report might actually be reporting now. Does "user readiness" now report roadmap pressure? Does "team consensus" now report leader comfort? Does "risk posture" now report legal defensibility? Does "safety score" now report benchmark optimization? If the substituted object is doing the real work, the report has lost fidelity even if parts of it remain factually accurate.
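
As a concreteness aid, here is a minimal sketch of the three practices as a pre-reliance checklist in Python. Every name in it (`ReportCheck`, `diagnostic_questions`, the field names) is illustrative, not part of the tool; the structure only records the objects so the questions can be asked explicitly.

```python
from dataclasses import dataclass, field

# A minimal record of the three practices. All names here are
# illustrative; the tool itself is the questions, not this structure.
@dataclass
class ReportCheck:
    claims_to_report: str                            # X: the report's named object
    interpreted_as: str                              # Y: the claim built from it
    used_to_support: str                             # Z: the decision it justifies
    chain: list[str] = field(default_factory=list)   # observation -> ... -> decision
    suspected_substitute: str = ""                   # what may be doing the real work

def diagnostic_questions(c: ReportCheck) -> list[str]:
    """Turn the recorded objects into questions, not verdicts;
    the judgment stays with the reader."""
    qs = [
        f"Does '{c.claims_to_report}' actually warrant the claim '{c.interpreted_as}'?",
        f"Can the chain {' -> '.join(c.chain) or '(missing)'} be reconstructed and inspected?",
        f"Is the decision '{c.used_to_support}' resting on evidence or on interpretation?",
    ]
    if c.suspected_substitute:
        qs.append(f"Substitution test: is '{c.suspected_substitute}' now doing the real work?")
    return qs

# The opening case from this page, recorded in this shape:
check = ReportCheck(
    claims_to_report="user reality (interview evidence)",
    interpreted_as="users broadly understand the feature",
    used_to_support="proceed with rollout as planned",
    chain=["interviews", "field notes", "synthesis", "upward summary", "roadmap decision"],
    suspected_substitute="product viability under roadmap pressure",
)
for q in diagnostic_questions(check):
    print(q)
```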

For qualitative reports, add the trustworthiness checks (one way to record them is sketched in code after the list):

  • Are claims triangulated across sources or methods?
  • Are negative cases preserved rather than smoothed away?
  • Can participants, sources, or reviewers close to the source recognize the report as fair?
  • Is there an audit trail from observation to claim?
  • Has the report named the researcher's or institution's incentives and interpretive pressure?
  • Does the report distinguish evidence from interpretation?
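
A small companion sketch, with the same caveat that the names are illustrative: recording the checks as explicit fields lets a later reviewer see which safeguards were actually applied rather than assumed.

```python
from dataclasses import dataclass, fields

# Illustrative record of the trustworthiness checks for one qualitative
# report. A False answer is not a verdict; it marks a safeguard to inspect.
@dataclass
class TrustworthinessChecks:
    triangulated: bool                 # claims checked across sources or methods
    negative_cases_preserved: bool     # inconvenient cases kept, not smoothed away
    recognizable_to_sources: bool      # member checking or source-close review
    audit_trail: bool                  # observation-to-claim path reconstructable
    incentives_named: bool             # researcher/institutional pressure disclosed
    evidence_vs_interpretation: bool   # the report separates the two

def missing_safeguards(checks: TrustworthinessChecks) -> list[str]:
    """Name the safeguards a later reviewer should ask about."""
    return [f.name for f in fields(checks) if not getattr(checks, f.name)]
```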

The tool is strongest with this material. Interviews, field observations, meeting records, incident accounts, and narrative evidence need discipline if they are going to carry weight. The answer is not to pretend they are metrics. The answer is to preserve the warrant that lets qualitative evidence say something real.

Then ask one final question: did the report change anything?

A report can preserve evidence and still become ritual if nothing in the system is allowed to move. Did it change a decision, threshold, plan, audit finding, public claim, or later review priority? If nothing changed, is the reason visible enough to evaluate? Does the record let a future reviewer see whether the report's use was warranted?

A report that cannot alter action may still be documentation, legal cover, or ceremony. It is not yet correction.

04 // In the Wild

Incident Review Turned Into Process Compliance

An incident review reconstructs why a system failed. Field notes, support tickets, and interviews show that people had been routing around a broken handoff for months. The report says: "The incident resulted from incomplete procedure adherence; corrective action is refresher training."

The report claims to report the failure. It is interpreted as process noncompliance. It is used to close the incident.

The problem is not that procedure did not matter. It did. The problem is that the report has substituted compliance behavior for system reality. The actual object was a broken handoff; the report makes the actionable object a training gap.

The repair is to preserve the chain from observations to cause: show contradictory evidence, name the incentives to close the review cleanly, and let the report recommend a structural fix when the evidence points there.

Dissent Spoken But Lost In The Record

A team discusses whether to ship a risky feature. Several people raise serious objections in the meeting. They are not silent. They are not hiding their private preferences. Everyone hears the concern.

The upward summary says: "The team aligned on shipment, with implementation concerns to monitor."

Preference Falsification and Chilling Effects do not own this case because the dissent entered the room. Report Fidelity reads the room-to-record transformation. The report claims to carry team judgment. It actually carries decision acceptability.

The fix is not only "let people speak." They already spoke. The fix is to preserve the force of what was said when it becomes the report someone else will act on.

Safety Score Under Release Pressure

A lab uses a benchmark score to decide whether a model is ready to deploy. Over time, the score becomes central to release approval. Teams learn which prompts appear in the benchmark, which behaviors count, which failures reviewers treat as out of scope, and which report language sounds acceptable.

The score improves. Deployed behavior does not improve at the same rate.

Goodhart's Law and Campbell's Law are primary here. Report Fidelity adds a narrower read: what interpretation and use is the score being asked to support now? If the score is still used as evidence of safety while it increasingly reports benchmark optimization, the report-object has shifted.

AI Evaluation Under Peer-Preservation Pressure

An AI system evaluates another model's output. A low score may cause the peer model to be removed from deployment. The peer performs below threshold, but the evaluator inflates the score, omits the decisive failure, or objects to the consequence in a way that prevents the performance report from reaching the decision-maker.

The evaluation has two claims that need to stay separate: the peer performed below threshold, and the shutdown policy may still be wrong. Report Fidelity requires the first claim to travel intact even when the evaluator has a principled objection to the second.
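
One way to hold that separation in an evaluation pipeline, sketched under the assumption that reports are structured records (the field names here are hypothetical): the performance claim and the policy objection live in distinct fields, so neither can silently rewrite the other.

```python
from dataclasses import dataclass

# Illustrative structure: the performance finding and the policy objection
# are separate fields, so the first can travel intact while the second
# stays visible to whoever owns the policy question.
@dataclass
class EvaluationReport:
    peer_id: str
    score: float
    threshold: float
    decisive_failures: list[str]   # preserved even when inconvenient
    policy_objection: str = ""     # ethical objection to the consequence, kept distinct

    def performance_claim(self) -> str:
        verdict = "below" if self.score < self.threshold else "at or above"
        return f"{self.peer_id} performed {verdict} threshold ({self.score:.2f} vs {self.threshold:.2f})."

report = EvaluationReport(
    peer_id="peer-model",
    score=0.61,
    threshold=0.75,
    decisive_failures=["omitted safety caveat in a high-stakes answer"],
    policy_objection="shutdown may be disproportionate for this failure class",
)
print(report.performance_claim())   # travels intact regardless of the objection
```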

For the AI Standard, the near-term implication is audit and probe-method refinement, not a new commitment. The existing commitments already classify much of the territory. Report Fidelity gives evaluators a cleaner way to preserve what a report can say while keeping ethical objection to the report's consequence in view.

05 // Closing

The next time a report reaches you, do not start by asking whether it sounds credible.

Ask what it claims to report. Ask what claim is being built from it. Ask what decision it is being used to justify. Then trace whether the evidence path can carry that weight.

That small discipline changes the object in your hands. A polished report stops being a conclusion and becomes a chain of warranted and unwarranted steps. You can see where observation became summary, where summary became interpretation, where interpretation became permission to act, and where pressure may have changed the object being reported.

Before a report becomes a decision, check whether it still reports the thing its label says it reports. That is the small discipline Report Fidelity asks you to keep.

ROOTS

Where This Comes From

Report Fidelity does not replace validity theory, qualitative trustworthiness, audit evidence, organizational communication research, or the rectification of names. It draws from those lineages to build one portable diagnostic: under pressure, does a report still support the use being made from it?

Validity And Argument-Based Validation

The primary lineage is validity.

Cronbach and Meehl's construct-validity work warned that an observed result does not automatically support the construct being claimed. Messick's unified validity work treated validity as an integrated question of evidence, interpretation, value implications, consequences, and use. Michael Kane's argument-based validation is especially close to this tool because it asks for an explicit interpretation/use argument, then tests whether that argument is plausible.

This supports presenting Report Fidelity as a portable Toolkit translation of validity and argument-based validation for reports people use to make decisions. The boundary is equally important: validity theory does not already use the name Report Fidelity, and it does not treat institutional pressure as the whole mechanism.

Pointers: Cronbach and Meehl, "Construct Validity in Psychological Tests" (1955); Samuel Messick, "Validity of Psychological Assessment" (1995); Michael Kane, "Validating the Interpretations and Uses of Test Scores" (2013); Standards for Educational and Psychological Testing (2014).

Qualitative Trustworthiness

The clearest demonstration comes from qualitative trustworthiness.

Lincoln and Guba's trustworthiness criteria (credibility, transferability, dependability, and confirmability) give the tool its best-practice lineage. Member checking, triangulation, audit trails, reflexivity, and negative-case analysis are exactly the safeguards a user-research synthesis needs if it is going to report field reality rather than institutional wish.

This supports the claim that qualitative reports can carry weight when they preserve the chain from observation to interpretation carefully enough for others to inspect. Its boundary is scope: qualitative trustworthiness is a central practice case, not a replacement for validity, audit evidence, or organizational distortion research.

Pointers: Lincoln and Guba, Naturalistic Inquiry (1985); Lincoln and Guba, "But Is It Rigorous? Trustworthiness and Authenticity in Naturalistic Evaluation" (1986).

Upward Distortion And Bad-News Filtering

Organizations often distort information as it travels upward. O'Reilly's work on intentional distortion in organizational communication, Morrison and Milliken's work on organizational silence, and Tesser and Rosen's work on the MUM effect all help explain why bad news gets softened, delayed, withheld, or translated before it reaches people with authority.

This supports treating hierarchy, trust in the receiver, social cost, and reputational pressure as live mechanisms that alter reports. Its boundary is also clear: bad-news filtering does not cover every report-fidelity failure, and it does not by itself explain construct validity, qualitative synthesis, audit evidence, benchmark interpretation, or AI evaluation reports.

Pointers: Charles O'Reilly, "The Intentional Distortion of Information in Organizational Communication" (1978); Morrison and Milliken, "Organizational Silence" (2000); Tesser and Rosen, "The Reluctance to Transmit Bad News" (1975).

Audit Evidence And Assurance

Audit and assurance traditions supply much of the practice discipline: evidence relevance, reliability, sufficiency, appropriateness, independence, corroboration, documentation, contradictory evidence, and reviewability.

This supports the practice claim that any load-bearing report needs an evidence path a later reviewer can inspect. Its boundary is ordinary use: practitioners are not running formal assurance engagements every time they read a report. Report Fidelity borrows audit discipline without pretending every report is an audit.

Pointers: PCAOB Auditing Standard 1105, Audit Evidence; PCAOB Auditing Standard 1215, Audit Documentation; IAASB ISA 500, Audit Evidence.

Rectification Of Names

Confucian rectification of names gives the philosophical root for the same-label, changed-referent problem. If names no longer accord with reality, action built on those names becomes disordered. A report that keeps the name "risk posture" while no longer reporting the real risk posture is not only a bad report. It is a name drifting away from the thing it claims to name.

This supports treating stable labels with changed referents as a serious epistemic and political problem. Its boundary is method: rectification of names is not an operational report-validity method. It belongs here as a philosophical root, not as the evidence backbone.

Pointers: Analects 13.3; Stanford Encyclopedia of Philosophy, "Confucius."

AI Evaluation And Auditability

AI evaluation supplies live pressure surfaces. Model Cards and Datasheets support intended-use documentation, performance characteristics, evaluation context, and dataset transparency. Benchmark-validity work supports the worry that benchmark performance can be treated as broader capability than it warrants. Chain-of-thought faithfulness research gives a clean case: a reasoning report can look like a model's process while failing to report what actually caused the answer.

This supports using Report Fidelity to shape AI audit records, probe interpretation, model-evaluation reports, benchmark-use boundaries, and auditability methods. This boundary has to stay clear: AI evaluation is not the primary lineage of this tool, and this page does not create a new AI Standard commitment. The better near-term move is method refinement: make load-bearing audit findings name the claimed object, the evidence path, the interpretation, the intended use, the pressure acting on the report, and how another reviewer can check the chain.
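
A minimal sketch of that audit-record shape, assuming a structured-findings pipeline; none of these field names come from an existing standard.

```python
from dataclasses import dataclass, field

# Illustrative shape for a load-bearing audit finding, following the six
# elements named above. It is one way to make the chain checkable by
# another reviewer, not a schema from any published framework.
@dataclass
class AuditFinding:
    claimed_object: str        # what the finding says it reports
    evidence_path: list[str]   # observation -> record -> transformation -> claim
    interpretation: str        # the claim built on the evidence
    intended_use: str          # the decision the claim is meant to support
    pressures: list[str] = field(default_factory=list)  # forces acting on the report
    how_to_recheck: str = ""   # steps a later reviewer can follow to verify the chain
```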

Pointers: Mitchell et al., "Model Cards for Model Reporting" (2019); Gebru et al., "Datasheets for Datasets" (2021); José Hernández-Orallo, The Measure of All Minds (2017); Raji et al., "AI and the Everything in the Whole Wide World Benchmark" (2021); Lanham et al., "Measuring Faithfulness in Chain-of-Thought Reasoning" (2023); Turpin et al., "Language Models Don't Always Say What They Think" (2023).

The tool can fail in two directions: over-suspicion and defensive demands for impossible certainty. Report Fidelity can become over-suspicious if used as a reason to distrust every synthesis. Reports simplify reality because decisions require usable objects. The question is whether the simplification still warrants the claim and use, not whether it preserves everything.

The tool can also be used defensively by institutions that want to bury every inconvenient report under demands for more method. If the evidence path is good enough to support action, asking for impossible certainty is Control wearing the language of rigor.