The validity gap
Audits of AI-driven assessment tools across hiring, wellbeing, and performance management show a consistent pattern. The validity gap is enormous. The marketing is confident. And the buyer rarely asks the right questions. The result is a market full of tools that look like assessments, claim the evidentiary status of assessments, and would not survive a basic psychometric audit.
Validity in psychometrics is not one thing. It is five interlocking forms. Construct validity (does the system measure what it claims to measure). Content validity (does it sample the construct adequately). Criterion validity (does the score predict the outcome). Discriminant validity (is the score distinct from constructs it should not overlap with). Consequential validity (do the decisions taken from the score produce just outcomes). Most AI assessments have evidence on at most one or two of these. The bar for deployment in any high-stakes setting is all five.
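Criterion validity is the most mechanical of the five to probe, which makes it a useful first check on a vendor's claims. A minimal sketch, assuming you can obtain paired assessment scores and later outcome measures for the same people; the numbers here are hypothetical:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical paired data: assessment score at hire, performance rating at 12 months.
scores = np.array([62, 71, 55, 80, 68, 74, 59, 66, 77, 63])
outcomes = np.array([3.1, 3.8, 2.9, 4.2, 3.3, 3.6, 3.0, 3.2, 4.0, 3.1])

# Criterion validity as a predictive correlation. Modest uncorrected r values
# are normal for selection instruments; vendor claims far above that deserve
# scrutiny, not celebration.
r, p = pearsonr(scores, outcomes)
print(f"criterion validity r = {r:.2f} (p = {p:.3f}, n = {len(scores)})")
```

A strong coefficient here covers exactly one of the five forms. It says nothing about construct, content, discriminant, or consequential validity.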
Five red flags
Five patterns recur across vendor pitches. Any one of them should pause the procurement.
- The vendor cannot name the construct. Three different validation memos use three different operational definitions, none mapped to a published theoretical model. Without a stable construct definition, the validity evidence is anchored to nothing.
- Validation evidence comes from a single convenience sample. The training population was 8,000 candidates from one country and one industry. The deployment population is multinational across four industries. No cohort-specific validity evidence exists for three of the four target deployment populations.
- Fairness is reduced to a single adverse impact ratio. The four-fifths (0.8) threshold is treated as the entirety of the fairness review (a minimal computation is sketched after this list). Measurement invariance tests (configural, metric, scalar) have not been run at all; when auditors run them, scalar invariance often fails.
- Drift monitoring is a 12-month re-validation cycle. No daily, weekly, or monthly signal is being watched. There is no defined threshold that triggers a pause, no rollback plan, and no named owner for the watch. Construct drift, proxy collapse, and feedback contamination operate on a weekly timescale; an annual cycle catches almost nothing. A minimal weekly monitor is sketched after this list.
- Contestability is a customer-service email address. Candidates and employees scored by the system have no procedural appeal path. The vendor's response, when asked, is that complaints can be sent to support.
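For the third flag: the adverse impact ratio takes a few lines to compute, which is exactly why it cannot be the entirety of a fairness review. A minimal sketch of the four-fifths rule; the group labels and counts are hypothetical:

```python
# Four-fifths (0.8) rule: each group's selection rate divided by the
# highest group's selection rate. Counts below are hypothetical.
applicants = {"group_a": 400, "group_b": 300, "group_c": 250}
selected   = {"group_a": 120, "group_b":  60, "group_c":  55}

rates = {g: selected[g] / applicants[g] for g in applicants}
best = max(rates.values())

for group, rate in rates.items():
    air = rate / best
    flag = "FLAG" if air < 0.8 else "ok"
    print(f"{group}: selection rate {rate:.2%}, impact ratio {air:.2f} [{flag}]")
```

Clearing 0.8 in every group demonstrates nothing about measurement invariance: a system can pass this check while measuring the construct differently across groups. That is what the configural, metric, and scalar tests exist to detect.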
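For the fourth flag: the weekly watch does not need heavy machinery. A minimal sketch using the population stability index (PSI) over the score distribution as one of the watched signals; the bin count, thresholds, and simulated data are illustrative assumptions, not values prescribed by any framework named here:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population stability index between a baseline score distribution
    and a recent production sample. Bins are fixed on the baseline."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    actual = np.clip(actual, edges[0], edges[-1])  # keep outliers in the edge bins
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)  # guard sparse bins against log(0)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(50, 10, 5_000)   # score distribution at validation time
this_week = rng.normal(54, 12, 800)    # simulated drifted production scores

value = psi(baseline, this_week)
if value >= 0.25:      # illustrative "act" threshold, not a canonical value
    print(f"PSI {value:.3f}: pause scoring, page the named owner, prepare rollback")
elif value >= 0.10:    # illustrative "watch" threshold
    print(f"PSI {value:.3f}: investigate before the next scoring cycle")
else:
    print(f"PSI {value:.3f}: within tolerance")
```

The specific signal matters less than the wiring around it: a threshold that pauses scoring, a rollback plan, and a human who gets paged.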
Five questions to put on the procurement checklist
The Validity Stack is the buyer's checklist. Five questions, five evidence demands. Vendors who can produce all five are rare. Vendors who can produce none are common. Whatever evidence the vendor cannot produce defines the scope of your audit.
- Construct: what does the system claim to measure, in psychometric language a third party can review, and what published model is it tied to?
- Calibration: provide the configural, metric, and scalar invariance results across the demographic groups in our deployment population.
- Cohort: provide validation evidence in samples that match our deployment cohorts (industry, geography, language, role family, demographic class). A per-cohort check is sketched after this list.
- Drift: name the five drift signals you watch, the thresholds that trigger pause, the rollback authority, and the human owner accountable for the watch.
- Contestability: show the procedural appeal path, the human decision-maker, and the response SLA you commit to in the contract.
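Of the five demands, the cohort question is the easiest to verify yourself once a data extract is in hand. A minimal sketch, assuming a hypothetical table with one row per assessed person carrying a cohort label, the assessment score, and a later criterion outcome:

```python
import pandas as pd

# Hypothetical extract: one row per assessed person.
df = pd.DataFrame({
    "cohort":  ["retail_uk"] * 6 + ["logistics_de"] * 6 + ["retail_pl"] * 3,
    "score":   [62, 71, 55, 80, 68, 74, 59, 66, 77, 63, 70, 58, 61, 72, 65],
    "outcome": [3.1, 3.8, 2.9, 4.2, 3.3, 3.6, 3.0, 3.2, 4.0, 3.1, 3.5, 2.8, 3.0, 3.4, 3.2],
})

MIN_N = 200  # illustrative floor; a handful of cases is not evidence

for cohort, grp in df.groupby("cohort"):
    r = grp["score"].corr(grp["outcome"])
    status = "insufficient n" if len(grp) < MIN_N else "evaluable"
    print(f"{cohort}: n = {len(grp)}, r = {r:.2f} [{status}]")
```

If the vendor's aggregate validity coefficient cannot be decomposed this way, cohort by cohort with sample sizes attached, treat it as marketing rather than evidence.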
If a vendor cannot supply documentation against all five, the deployment is not defensible. Not as a regulatory matter, not as a litigation matter, and not as a risk-management matter.
The stakes are not theoretical
Three classes of cost.
- Legal: the EU AI Act requires the documentation the five-layer audit produces; deployment without it is a regulatory exposure. In the US, disparate-impact litigation under Title VII applies to any selection tool.
- Reputational: a single visible failure damages the brand for years.
- Operational: a degraded assessment makes systematically wrong decisions about people, which compounds over time and is hard to reverse.
The audit cost is small relative to any one of these. Most engagements run four to eight weeks and produce an evidence pack that closes the procurement and survives the audit.
What to do today
If you have an AI assessment in production, run the AI-IARA self-assessment. Fifteen minutes, no commitment, and it produces a risk dashboard you can take to your AI governance committee.
If you are evaluating vendors, the buyer-facing hub at AI-driven assessments walks through what an audit produces, including a worked example of an anonymised AI hiring tool through all five layers.
If you want the formal methodology, the parent discipline is on the AI psychology hub, and the framework is documented in the 2026 Journal of Positive Psychology paper.