The validity gap
Audits of AI-driven assessment tools across hiring, wellbeing, and performance management show a consistent pattern. The validity gap is enormous. The marketing is confident. And the buyer rarely asks the right questions. The result is a market full of tools that look like assessments, claim the evidentiary status of assessments, and would not survive a basic psychometric audit.
Validity in psychometrics is not one thing. It is five interlocking forms. Construct validity (does the system measure what it claims to measure). Content validity (does it sample the construct adequately). Criterion validity (does the score predict the outcome). Discriminant validity (is the score distinct from constructs it should not overlap with). Consequential validity (do the decisions taken from the score produce just outcomes). Most AI assessments have evidence on at most one or two of these. The bar for deployment in any high-stakes setting is all five.
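Criterion validity is the most mechanical of the five to probe, which makes it a useful first check on a vendor's claims. A minimal sketch, assuming you can obtain paired assessment scores and later outcome measures for the same people; the numbers here are hypothetical:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical paired data: assessment score at hire, performance rating at 12 months.
scores = np.array([62, 71, 55, 80, 68, 74, 59, 66, 77, 63])
outcomes = np.array([3.1, 3.8, 2.9, 4.2, 3.3, 3.6, 3.0, 3.2, 4.0, 3.1])

# Criterion validity as a predictive correlation. Modest uncorrected r values
# are normal for selection instruments; vendor claims far above that deserve
# scrutiny, not celebration.
r, p = pearsonr(scores, outcomes)
print(f"criterion validity r = {r:.2f} (p = {p:.3f}, n = {len(scores)})")
```

A strong coefficient here covers exactly one of the five forms. It says nothing about construct, content, discriminant, or consequential validity.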
Five red flags
Five patterns recur across vendor pitches. Any one of them should pause the procurement.
- The vendor cannot name the construct. Three different validation memos use three different operational definitions, none mapped to a published theoretical model. Without a stable construct definition, the validity evidence is anchored to nothing.
- Validation evidence comes from a single convenience sample. The training population was 8,000 candidates from one country and one industry. The deployment population is multinational across four industries. No cohort-specific validity evidence exists for three of the four target deployment populations.
- Fairness is reduced to a single adverse impact ratio. The four-fifths (0.8) threshold is treated as the entirety of the fairness review (a minimal computation is sketched after this list). Measurement invariance tests (configural, metric, scalar) have not been run at all; when auditors run them, scalar invariance often fails.
- Drift monitoring is a 12-month re-validation cycle. No daily, weekly, or monthly signal is being watched. There is no defined threshold that triggers a pause, no rollback plan, and no named owner for the watch. Construct drift, proxy collapse, and feedback contamination operate on a weekly timescale; an annual cycle catches almost nothing. A minimal weekly monitor is sketched after this list.
- Contestability is a customer-service email address. Candidates and employees scored by the system have no procedural appeal path. The vendor's response, when asked, is that complaints can be sent to support.
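For the third flag: the adverse impact ratio takes a few lines to compute, which is exactly why it cannot be the entirety of a fairness review. A minimal sketch of the four-fifths rule; the group labels and counts are hypothetical:

```python
# Four-fifths (0.8) rule: each group's selection rate divided by the
# highest group's selection rate. Counts below are hypothetical.
applicants = {"group_a": 400, "group_b": 300, "group_c": 250}
selected   = {"group_a": 120, "group_b":  60, "group_c":  55}

rates = {g: selected[g] / applicants[g] for g in applicants}
best = max(rates.values())

for group, rate in rates.items():
    air = rate / best
    flag = "FLAG" if air < 0.8 else "ok"
    print(f"{group}: selection rate {rate:.2%}, impact ratio {air:.2f} [{flag}]")
```

Clearing 0.8 in every group demonstrates nothing about measurement invariance: a system can pass this check while measuring the construct differently across groups. That is what the configural, metric, and scalar tests exist to detect.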
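For the fourth flag: the weekly watch does not need heavy machinery. A minimal sketch using the population stability index (PSI) over the score distribution as one of the watched signals; the bin count, thresholds, and simulated data are illustrative assumptions, not values prescribed by any framework named here:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population stability index between a baseline score distribution
    and a recent production sample. Bins are fixed on the baseline."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    actual = np.clip(actual, edges[0], edges[-1])  # keep outliers in the edge bins
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)  # guard sparse bins against log(0)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(50, 10, 5_000)   # score distribution at validation time
this_week = rng.normal(54, 12, 800)    # simulated drifted production scores

value = psi(baseline, this_week)
if value >= 0.25:      # illustrative "act" threshold, not a canonical value
    print(f"PSI {value:.3f}: pause scoring, page the named owner, prepare rollback")
elif value >= 0.10:    # illustrative "watch" threshold
    print(f"PSI {value:.3f}: investigate before the next scoring cycle")
else:
    print(f"PSI {value:.3f}: within tolerance")
```

The specific signal matters less than the wiring around it: a threshold that pauses scoring, a rollback plan, and a human who gets paged.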
Five questions to put on the procurement checklist
The Validity Stack is the buyer's checklist. Five questions, five evidence demands. Vendors who can produce all five are rare. Vendors who can produce none are common. Whatever evidence the vendor cannot produce defines the scope of your audit.
- Construct: what does the system claim to measure, in psychometric language a third party can review, and what published model is it tied to?
- Calibration: provide the configural, metric, and scalar invariance results across the demographic groups in our deployment population.
- Cohort: provide validation evidence in samples that match our deployment cohorts (industry, geography, language, role family, demographic class). A per-cohort check is sketched after this list.
- Drift: name the five drift signals you watch, the thresholds that trigger pause, the rollback authority, and the human owner accountable for the watch.
- Contestability: show the procedural appeal path, the human decision-maker, and the response SLA you commit to in the contract.
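Of the five demands, the cohort question is the easiest to verify yourself once a data extract is in hand. A minimal sketch, assuming a hypothetical table with one row per assessed person carrying a cohort label, the assessment score, and a later criterion outcome:

```python
import pandas as pd

# Hypothetical extract: one row per assessed person.
df = pd.DataFrame({
    "cohort":  ["retail_uk"] * 6 + ["logistics_de"] * 6 + ["retail_pl"] * 3,
    "score":   [62, 71, 55, 80, 68, 74, 59, 66, 77, 63, 70, 58, 61, 72, 65],
    "outcome": [3.1, 3.8, 2.9, 4.2, 3.3, 3.6, 3.0, 3.2, 4.0, 3.1, 3.5, 2.8, 3.0, 3.4, 3.2],
})

MIN_N = 200  # illustrative floor; a handful of cases is not evidence

for cohort, grp in df.groupby("cohort"):
    r = grp["score"].corr(grp["outcome"])
    status = "insufficient n" if len(grp) < MIN_N else "evaluable"
    print(f"{cohort}: n = {len(grp)}, r = {r:.2f} [{status}]")
```

If the vendor's aggregate validity coefficient cannot be decomposed this way, cohort by cohort with sample sizes attached, treat it as marketing rather than evidence.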
If a vendor cannot supply documentation against all five, the deployment is not defensible. Not as a regulatory matter, not as a litigation matter, and not as a risk-management matter.
The stakes are not theoretical
Three classes of cost.
- Legal: the EU AI Act requires the documentation the five-layer audit produces; deployment without it is a regulatory exposure. In the US, disparate-impact litigation under Title VII applies to any selection tool.
- Reputational: a single visible failure damages the brand for years.
- Operational: a degraded assessment makes systematically wrong decisions about people, which compounds over time and is hard to reverse.
The audit cost is small relative to any one of these. Most engagements run four to eight weeks and produce an evidence pack that closes the procurement and survives the audit.
What to do today
If you have an AI assessment in production, run the AI-IARA self-assessment. Fifteen minutes, no commitment, and it produces a risk dashboard you can take to your AI governance committee.
If you are evaluating vendors, the buyer-facing hub at AI-driven assessments walks through what an audit produces, including a worked example of an anonymised AI hiring tool through all five layers.
If you want the formal methodology, the parent discipline is on the AI psychology hub, and the framework is documented in the 2026 Journal of Positive Psychology paper.