- evaluation is quite hard: you need …

## Classical Test Theory

- “just average each test” (think MUC, B³, etc.)
- ability estimation is test-dependent

**BAD**: each test may have a different difficulty, so raw averages are not comparable across tests.
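A toy illustration of the problem (numbers entirely made up): two models take tests of different difficulty, and the raw average makes the one on the harder test look worse.

```python
import numpy as np

# Hypothetical raw scores: model A took an easy test, model B a hard one.
scores_a = np.array([0.90, 0.85, 0.95])  # easy test: inflated scores
scores_b = np.array([0.60, 0.55, 0.65])  # hard test: deflated scores

# Classical test theory just averages raw scores...
mean_a, mean_b = np.mean(scores_a), np.mean(scores_b)
# ...so B looks weaker even if B is actually the stronger model:
# the average is confounded with test difficulty.
```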

## Item Response Theory (IRT)

- model item and test taker characteristics
- test-invariant ability estimation (subset invariant)
- adaptive testing
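The list above can be made concrete with a minimal Rasch (1PL) sketch using the \(\sigma(\theta - z)\) response model defined later in these notes; the simulated data and variable names are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
theta_true = 1.0                       # fixed but unknown ability
z = rng.normal(0.0, 1.5, size=200)     # calibrated item difficulties
y = rng.random(200) < sigmoid(theta_true - z)  # simulated responses

# MLE of theta on a grid: maximize the sum of Bernoulli log-likelihoods.
grid = np.linspace(-4, 4, 801)
p = sigmoid(grid - z[:, None])         # shape (items, grid points)
loglik = (y[:, None] * np.log(p) + (~y)[:, None] * np.log(1 - p)).sum(axis=0)
theta_hat = grid[np.argmax(loglik)]    # close to theta_true
```

Because difficulty is modeled explicitly, the estimate of \(\theta\) stays comparable no matter which subset of items was administered (the “test-invariant” property).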

### problem

- requires calibration first
- …which is quite costly

## Flash-HELM

A cheaper variant of HELM that prioritizes higher-ranked models: the better a model currently ranks, the more evaluation budget it receives, so strong models are evaluated more thoroughly than weak ones.
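Purely illustrative (this is not the actual Flash-HELM schedule): one way to realize a rank-dependent budget is to let the number of evaluation examples decay with rank.

```python
# Hypothetical rank-dependent budget: higher-ranked models (rank 0 = best)
# get exponentially more evaluation examples.
def budget(rank, total=10_000, n_models=8, decay=0.5):
    weights = [decay ** r for r in range(n_models)]
    return int(total * (decay ** rank) / sum(weights))

budgets = [budget(r) for r in range(8)]  # strictly decreasing in rank
```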

## Sang’s Method

We want to estimate \(\theta\) with a budget of \(K\) questions.

- test taker ability is fixed, but unknown: \(\theta \sim p(\theta)\)
- there’s a difficulty function \(z(q) \to Z \in \triangle\), assigning each question \(q \in Q\) a difficulty
- our response model, then, is \(p(y=1 | z; \theta) = \sigma(\theta - z)\)

Then, for every question we have, we ask what its Fisher information is.
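For this logistic response model the Fisher information has a standard closed form (a textbook derivation, not from the notes):

```latex
% Per-item log-likelihood for p = \sigma(\theta - z):
%   \ell(\theta) = y \log p + (1 - y) \log(1 - p)
% Since \partial p / \partial \theta = p(1 - p), the score is y - p, so
I(\theta; z) = \mathbb{E}\!\left[(y - p)^2\right] = p(1 - p),
\qquad \text{maximized at } p = \tfrac{1}{2}, \text{ i.e. } z = \theta .
```

So the most informative question is the one whose difficulty matches the current ability estimate.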

After every test result, you re-estimate \(\theta\) in the response model using MLE.
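The select-by-information, update-by-MLE loop can be sketched as follows (a minimal sketch; the pool, seed, and helper names are all hypothetical):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mle_theta(z_seen, y_seen, grid=np.linspace(-4, 4, 801)):
    # Grid MLE over the Bernoulli log-likelihood of the responses so far.
    p = sigmoid(grid - np.asarray(z_seen)[:, None])
    y = np.asarray(y_seen)[:, None]
    return grid[np.argmax((y * np.log(p) + (1 - y) * np.log(1 - p)).sum(axis=0))]

rng = np.random.default_rng(1)
theta_true, K = 0.5, 30
z_pool = list(rng.normal(0, 2, size=500))  # pre-calibrated difficulties
z_seen, y_seen, theta_hat = [], [], 0.0    # start from an initial guess

for _ in range(K):
    # Fisher information p(1-p) peaks where z is closest to theta_hat.
    i = int(np.argmin([abs(z - theta_hat) for z in z_pool]))
    z = z_pool.pop(i)
    y = int(rng.random() < sigmoid(theta_true - z))  # simulated answer
    z_seen.append(z); y_seen.append(y)
    theta_hat = mle_theta(z_seen, y_seen)            # MLE update per result
```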

### amortized calibration

- learn to predict the calibrated difficulty \(z\) directly from the question, so new questions need no per-item calibration
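One way amortized calibration could look (the feature setup and ridge regressor are assumptions, not from the notes): fit a regressor from question features to already-calibrated difficulties, then predict \(z\) for new questions at near-zero cost.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 8))                 # features of calibrated questions
w_true = rng.normal(size=8)
z = X @ w_true + 0.1 * rng.normal(size=300)   # their calibrated difficulties

# Ridge regression: w = (X^T X + lam * I)^{-1} X^T z
lam = 1e-2
w = np.linalg.solve(X.T @ X + lam * np.eye(8), X.T @ z)

X_new = rng.normal(size=(5, 8))               # new, uncalibrated questions
z_pred = X_new @ w                            # amortized difficulty estimates
```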

## advantages

- more reliable and efficient across empirical settings
- incorporates amortized (learned) calibration to reduce calibration costs
- introduces **conditional question generation** to generate questions of specific difficulties