AI-Driven Assessments

Valid, Fair, Defensible. Or Not Deployable.

Most AI-driven assessment systems on the market today would fail a basic psychometric audit. The AI-IARA framework names the failures, scopes the fix, and produces the evidence pack a procurement team or regulator can actually defend.

Prof. Llewellyn E. van Zyl (Ph.D.). Independent psychometric AI assurance audits.
AI-Driven Assessment

An AI-driven assessment is any system that uses machine learning, generative AI, or algorithmic inference to score a person on a psychological, behavioural, or performance construct. The category includes AI hiring tools, AI personality screens, AI mental-health triage, AI coaching and wellbeing platforms, AI fit and competency models, and AI-driven performance management. What unites them is that they make or shape a decision about a human being. The standards they have to meet are not technology standards. They are the same psychometric, fairness, and accountability standards that have applied to any people-impact measurement instrument since the 1950s. Most current AI assessments do not meet them.

If a vendor cannot produce construct validity, calibration evidence, drift monitoring, and a contestability mechanism, the system is not assurable. It is not a question of opinion. It is a question of evidence.

The Method

AI-IARA. Six capacities every AI assessment must demonstrate.

AI-driven assessments are scored on the same six capacities as any other people-impact AI: Awareness, Interpretation, Intention, Action, Relational Agency, and Autonomy. The assessment-specific question for each capacity is whether the system measures what it claims to measure, treats every person fairly, and gives the candidate, employee, or patient a meaningful way to push back on the result.

The Validity Stack

The five layers an AI assessment must pass

Every AI-driven assessment audit produces evidence at five layers. The layers are sequential: a failure at any layer leaves everything downstream undefined, and a system that passes one layer but fails another is not deployable in any high-stakes setting.

Step 01

Construct Validity

Define the construct the AI assessment claims to measure, in psychometric language an independent expert can review. Personality, fit, performance, risk of attrition, wellbeing, and engagement are not interchangeable. Most AI hiring tools we have audited cannot answer this layer, which means everything downstream is undefined.

Step 02

Subgroup Calibration

Test whether the AI assessment's scores are equivalent across the demographic groups who will be assessed. Measurement invariance, differential prediction, and floor and ceiling effects all live here. Adverse impact ratios alone are insufficient and can hide real bias. This is the layer where most commercial AI assessments fail.
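
The gap between the two tests is easy to see in code. A minimal sketch of the four-fifths computation, assuming a hypothetical candidate table with group and selected columns; a passing ratio here says nothing about whether the scores behind the decisions are measurement-invariant.

import pandas as pd

def adverse_impact_ratios(df, group_col="group", selected_col="selected"):
    """Selection rate of each group divided by the highest group's rate.

    A ratio below 0.8 fails the four-fifths rule. A ratio above 0.8
    does not establish fairness: selection rates can match while the
    underlying scores fail measurement invariance.
    """
    rates = df.groupby(group_col)[selected_col].mean()
    return rates / rates.max()

# Hypothetical data: both groups clear 0.8, so the four-fifths rule passes.
df = pd.DataFrame({
    "group": ["A"] * 100 + ["B"] * 100,
    "selected": [1] * 40 + [0] * 60 + [1] * 34 + [0] * 66,
})
print(adverse_impact_ratios(df))  # A 1.00, B 0.85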

Step 03

Cohort Generalisability

Validate the AI assessment in samples that match the deployment population, not just the convenience sample it was trained on. Differential validity by tenure, geography, language, role family, and demographic class belongs here. If the deployment population has not been tested in advance, the deployment is uncontrolled.
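
One way to make the requirement concrete is a per-cohort evidence table. A minimal sketch, assuming a validation frame with hypothetical cohort, score, and outcome columns and an illustrative minimum sample size; any cohort the system will score that is missing from this table, or below the minimum, has no usable evidence.

import pandas as pd

MIN_N = 200  # illustrative minimum sample for a cohort-level estimate

def cohort_validity(df: pd.DataFrame) -> pd.DataFrame:
    """Criterion validity (score-outcome correlation) per deployment cohort."""
    rows = []
    for cohort, g in df.groupby("cohort"):
        rows.append({
            "cohort": cohort,
            "n": len(g),
            "r": g["score"].corr(g["outcome"]),
            "evidence_ok": len(g) >= MIN_N,
        })
    return pd.DataFrame(rows)

# Hypothetical validation wave: both cohorts fall below MIN_N and are flagged.
df = pd.DataFrame({
    "cohort": ["NL_finance"] * 3 + ["DE_retail"] * 2,
    "score": [0.2, 0.5, 0.9, 0.4, 0.6],
    "outcome": [0.1, 0.4, 0.8, 0.9, 0.2],
})
print(cohort_validity(df))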

Step 04

Drift Monitoring

Specify what proxy collapse, construct drift, and feedback contamination look like for this AI assessment, with named owners and signal thresholds that trigger pause or rollback. Most AI assessments degrade silently because no one is watching for it. The drift layer is the difference between a one-time validation and an assurable deployment.
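
What "named owners and signal thresholds" can look like in practice: a minimal registry sketch. The signal names, cadences, thresholds, and owner below are illustrative placeholders, not a prescribed monitoring design.

from dataclasses import dataclass

@dataclass
class DriftSignal:
    name: str         # what is watched (e.g. a proxy-collapse indicator)
    cadence: str      # how often it is checked
    threshold: float  # absolute deviation that triggers pause
    owner: str        # the single accountable person with rollback authority

SIGNALS = [
    DriftSignal("score_mean_shift",        "daily",   0.25, "assessment_lead"),
    DriftSignal("subgroup_gap_widening",   "weekly",  0.10, "assessment_lead"),
    DriftSignal("criterion_validity_drop", "monthly", 0.15, "assessment_lead"),
]

def should_pause(signal: DriftSignal, observed_deviation: float) -> bool:
    """True means the deployment pauses pending the owner's review."""
    return abs(observed_deviation) > signal.threshold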

Step 05

Contestability

Specify how a person scored by the AI assessment can see the result, question it, and appeal it. Contestability is the audit layer that converts measurement validity into procedural fairness. Without it, the assessment is not deployable in the EU under the AI Act high-risk provisions, and is exposed to disparate-impact litigation in the US.

Worked Example

Inside a real AI assessment audit

An anonymised walk-through of the five Validity Stack layers applied to a representative AI hiring tool. The findings and fixes reflect patterns that recur across the audits this site has run.

System under audit

An AI hiring tool that scores candidates on 'culture fit' and feeds the result to a recruiter dashboard. Used in four countries and four industries. Marketed as bias-audited because adverse impact ratios cleared the 0.8 four-fifths threshold.

01

Construct Validity

What does the assessment claim to measure?

What the audit found

The vendor described the system as measuring 'culture fit'. No formal definition of culture fit was supplied. Three different validation memos used three different operational definitions, none mapped to a published theoretical model.

What we recommended

Replace the marketing label with two specific, separately validated constructs: behavioural alignment with role-defined competencies, and team interpersonal compatibility. Tie each to existing measurement scaffolding with published evidence.

02

Subgroup Calibration

Does the score mean the same thing across people?

What the audit found

Adverse impact ratios cleared the 0.8 four-fifths threshold, so the vendor declared the system fair. Measurement invariance tests (configural, metric, scalar) had not been run. When we ran the full sequence, scalar invariance failed across two demographic groups: the same score signalled different latent levels of the construct.

What we recommended

Suspend deployment for the affected groups pending model retraining. Add measurement invariance tests to the pre-deployment evidence pack and to the quarterly drift cycle.
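
Full invariance testing belongs in multigroup CFA tooling (lavaan in R, or semopy in Python), but the regression analogue of the scalar failure is quick to sketch. The data below is synthetic, with a deliberate intercept shift standing in for the non-invariance the audit found; all variable names are illustrative.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic illustration: the same observed score means different things.
rng = np.random.default_rng(0)
n = 500
group = np.repeat(["A", "B"], n)
score = rng.normal(0, 1, 2 * n)
# Group B's criterion sits lower at the same score: an intercept shift,
# the regression analogue of a scalar-invariance failure.
criterion = score + np.where(group == "B", -0.4, 0.0) + rng.normal(0, 1, 2 * n)
df = pd.DataFrame({"score": score, "criterion": criterion, "group": group})

model = smf.ols("criterion ~ score * C(group)", data=df).fit()
print(model.summary().tables[1])
# A significant C(group)[T.B] term means identical scores carry
# different meaning across groups, even when selection rates look fair.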

03

Cohort Generalisability

Does it work for the people it will assess?

What the audit found

The validation sample was 8,200 candidates from one country and one industry. The deployment population was multinational across four industries. No cohort-specific validity evidence existed for three of the four target deployment populations.

What we recommended

Phase the rollout. Validate per cohort before activation. The vendor agreed to a six-month staged deployment with cohort-specific sign-off rather than a single global launch.
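
A cohort-gated rollout of this kind reduces to a simple activation check. A sketch under assumed evidence requirements; the cohort labels, sample floor, and validity floor are invented for illustration.

def activation_allowed(cohort: str, evidence: dict) -> bool:
    """A cohort goes live only with its own pre-deployment evidence."""
    e = evidence.get(cohort)
    return e is not None and e["n"] >= 200 and e["r"] >= 0.20

# Hypothetical evidence file after the first validation wave.
evidence = {"country1_finance": {"n": 8200, "r": 0.31}}
print(activation_allowed("country1_finance", evidence))  # True: go live
print(activation_allowed("country2_retail", evidence))   # False: stay staged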

04

Drift Monitoring

How does the assessment fail after launch?

What the audit found

The vendor's drift plan was a 12-month re-validation cycle. No daily, weekly, or monthly signal was being watched. There was no defined threshold that triggered pause, no rollback plan, and no named owner for the watch.

What we recommended

Define five drift signals tied to construct, calibration, and outcome metrics. Set thresholds from historical variance: flag any signal that moves more than three sigma from its historical mean. Name a single accountable owner with rollback authority. Move from re-validation to continuous assurance.
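
The three-sigma rule above is the simplest version of such a threshold. A minimal sketch, assuming a history of weekly values for one monitored signal.

import numpy as np

def drift_band(history: np.ndarray, k: float = 3.0) -> tuple:
    """Pause band from historical data: mean +/- k standard deviations."""
    mu, sigma = history.mean(), history.std(ddof=1)
    return mu - k * sigma, mu + k * sigma

def breached(latest: float, history: np.ndarray) -> bool:
    """True when the latest observation leaves the historical band."""
    lo, hi = drift_band(history)
    return not (lo <= latest <= hi)

# Hypothetical weekly mean of the assessment's scores.
history = np.array([0.51, 0.49, 0.50, 0.52, 0.48, 0.50, 0.51, 0.49])
print(breached(0.62, history))  # True: page the owner, consider rollback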

05

Contestability

Can the candidate or employee push back?

What the audit found

Candidates received a score and a recommendation. There was no explanation, no feature attribution, no procedural appeal path, and no human review trigger. The vendor's contestability provision was a customer service email address.

What we recommended

Add an explanation layer (top three contributing signals per score). Add a human-review trigger for any borderline score. Publish a procedural appeal process with named human decision-makers and a 14-day response SLA.
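
For a linear scoring model, the "top three contributing signals" and the human-review trigger are a few lines each; a nonlinear model would need a model-agnostic attribution method such as SHAP instead. All weights, feature names, and thresholds below are invented for illustration.

def top_signals(weights: dict, features: dict, k: int = 3) -> list:
    """Top-k contributions for a linear model: weight * feature value."""
    contrib = {name: weights[name] * features[name] for name in weights}
    return sorted(contrib.items(), key=lambda kv: abs(kv[1]), reverse=True)[:k]

def needs_human_review(score: float, cut: float, band: float = 0.05) -> bool:
    """Route any score within +/- band of the decision cut to a human."""
    return abs(score - cut) <= band

weights  = {"structured_interview": 0.40, "work_sample": 0.35,
            "tenure_signal": 0.15, "situational_judgement": 0.10}
features = {"structured_interview": 0.7, "work_sample": 0.9,
            "tenure_signal": 0.2, "situational_judgement": 0.5}
print(top_signals(weights, features))      # explanation layer
print(needs_human_review(0.66, cut=0.65))  # True: human review trigger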

Audit outcome

Five fixes shipped. Pre-existing deployments paused for the two affected demographic groups. Drift monitoring switched from annual re-validation to continuous assurance with a named owner. Contestability path published. The vendor renewed its enterprise contracts with the audit evidence in the procurement file.


Proof Stack

The Authority Behind This Page

Every claim on this page is anchored in two or more independent proof types: peer-reviewed publications, third-party speaking engagements, formal standards, and named institutional roles.

Standards Cited

  • AERA, APA, and NCME Standards for Educational and Psychological Testing
  • ITC Guidelines on Psychological Testing
  • EU AI Act, Annex III high-risk people-impact provisions
  • ISO/IEC 42001 AI Management Systems
  • NIST AI Risk Management Framework
  • EEOC Uniform Guidelines on Employee Selection Procedures (US, where applicable)

Institutions

  • Optentia Research Unit, North-West University
  • Centre for Behavioural Engineering and Insight, University of Twente
  • Frontiers in Psychology, Editorial Board
  • Psynalytics (Chief Solutions Architect)
  • Springer Nature, Editorial Affiliations

Schedule a psychometric AI assurance audit

Independent four-to-eight-week audit of your AI-driven assessment system across the five Validity Stack layers. Produces an evidence pack you can show procurement, legal, and a regulator.