Studeia Docs

Proficiency via IRT (2PL model, ENEM-style)

How Studeia computes proficiency via Item Response Theory (IRT): a 2PL model, calibration from responses, a 0–1000 ENEM-style scale and a CTT fallback. What it is (and isn't).

2026-06-22 7 min
Short answer

Studeia computes proficiency via Item Response Theory (IRT) using an ENEM-style 2PL model (discrimination + difficulty); it is not INEP's official 3PL. Calibration needs about 20 responses per item; below that, the platform falls back to Classical Test Theory (CTT). Proficiency (theta) is estimated via EAP and converted to a 0–1000 ENEM-style scale, which measures more fairly than a simple percentage and helps identify problematic items.

Studeia offers a proficiency calculation via Item Response Theory (IRT) for ENEM-style assessments. Here we explain what it is, how it works — and, honestly, what it is not.

Quick answer

  • 2PL IRT model (discrimination + difficulty), ENEM-style
  • NOT INEP's official 3PL (no guessing parameter)
  • Calibration needs ~20 responses/item; below that, CTT fallback
  • Proficiency (theta) via EAP, 0–1000 scale
  • Useful to measure fairly and spot problematic items

What IRT is

Item Response Theory models the probability of answering a question correctly as a function of the student's proficiency and the item's characteristics. In the 2PL model, each item has two parameters:

  • Discrimination (a): how well the item separates strong from weak students.
  • Difficulty (b): the proficiency level at which the chance of a correct answer is 50%.
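
As a minimal sketch, the two parameters above plug into the standard 2PL logistic curve. The function name `p_correct` and the sample values are illustrative assumptions, not Studeia's actual code.

```python
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """Standard 2PL probability of a correct answer at proficiency theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# When theta equals the difficulty b, the probability is 0.5 for any a.
print(p_correct(theta=0.5, a=1.2, b=0.5))  # 0.5

# A larger discrimination a makes the curve steeper around b.
print(p_correct(1.0, a=0.5, b=0.0) < p_correct(1.0, a=2.0, b=0.0))  # True
```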

Unlike a simple percentage, IRT weights items differently: a correct answer on a hard, discriminating question "counts more" than one on an easy, ambiguous question.
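
One textbook way to make "counts more" precise is the 2PL item information function. This is a general IRT result, shown here for illustration rather than as a description of Studeia's internals.

```latex
P_i(\theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}}, \qquad
I_i(\theta) = a_i^{2}\, P_i(\theta)\,\bigl(1 - P_i(\theta)\bigr)
```

The information peaks at \theta = b_i, where it equals a_i^2 / 4. Doubling an item's discrimination therefore quadruples its peak contribution, while the difficulty determines where on the proficiency scale the item is most informative.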

Honesty: 2PL, not INEP's 3PL

The official ENEM uses a 3PL model that includes a third parameter (guessing). Studeia implements 2PL — faithful to the ENEM spirit, but not INEP's official calculation. We use it to rank proficiency and review items, without promising to reproduce the official score.
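
For comparison, the textbook forms of the two models are shown below, where c is the guessing parameter. INEP's exact parametrization and scaling constants are not stated in this page, so treat this only as the general shape of the difference.

```latex
P_{\text{3PL}}(\theta) = c + (1 - c)\,\frac{1}{1 + e^{-a(\theta - b)}}, \qquad
P_{\text{2PL}}(\theta) = \frac{1}{1 + e^{-a(\theta - b)}} \quad (\text{the case } c = 0)
```

With c > 0, even a very low-proficiency student keeps a floor probability c of answering correctly; the 2PL model has no such floor.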

How it works

  1. Calibration: from responses, item discrimination and difficulty are estimated (via a CTT→logistic transformation).
  2. Minimum data: below ~20 responses per item, the platform uses CTT (percent correct) to avoid unstable estimates.
  3. Proficiency (theta): estimated via EAP (Expected A Posteriori) with a normal prior.
  4. Scale: theta converted to 0–1000, ENEM-style.
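
Steps 3 and 4 can be sketched as follows. This is an illustration under stated assumptions, not Studeia's implementation: the grid range, the standard normal prior N(0, 1), and especially the linear mapping (mean 500, standard deviation 100, clipped to 0–1000) are guesses, since the page does not give the conversion formula. All function names are hypothetical.

```python
import math

def p_correct(theta, a, b):
    """Standard 2PL probability of a correct answer."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def eap_theta(responses, items, n_points=81, lo=-4.0, hi=4.0):
    """EAP estimate of theta on a fixed grid with a N(0, 1) prior.

    responses: list of 0/1 answers; items: list of (a, b) pairs, same order.
    """
    grid = [lo + (hi - lo) * k / (n_points - 1) for k in range(n_points)]
    weights = []
    for theta in grid:
        log_w = -0.5 * theta * theta  # log of the normal prior, up to a constant
        for u, (a, b) in zip(responses, items):
            p = p_correct(theta, a, b)
            log_w += math.log(p if u == 1 else 1.0 - p)
        weights.append(math.exp(log_w))
    # Posterior mean: weighted average of the grid points.
    return sum(t * w for t, w in zip(grid, weights)) / sum(weights)

def to_scale(theta, mean=500.0, sd=100.0):
    """ASSUMPTION: linear map (mean 500, sd 100), clipped to 0-1000."""
    return max(0.0, min(1000.0, mean + sd * theta))

items = [(1.0, -1.0), (1.5, 0.0), (0.8, 1.0)]  # hypothetical (a, b) pairs
strong = eap_theta([1, 1, 1], items)
weak = eap_theta([0, 0, 0], items)
print(round(to_scale(strong)), round(to_scale(weak)))
```

Because of the prior, a student who answers every item correctly still gets a finite estimate, which is one practical reason to prefer EAP over plain maximum likelihood on short quizzes.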

What to use it for

Goal             | How IRT helps
Measure fairly   | Hard/discriminating items weigh differently
Compare students | Common 0–1000 scale
Review the exam  | Low-discrimination items are removal candidates
ENEM simulations | Proficiency in the exam's style
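
As a small sketch of the "review the exam" use, the snippet below flags items whose estimated discrimination is low. The function name, the data layout, and the 0.5 cut-off are hypothetical; the page does not state what threshold Studeia applies.

```python
def flag_for_review(item_params, min_a=0.5):
    """Return ids of items whose discrimination a is below min_a.

    item_params maps item id -> (a, b). min_a is a hypothetical cut-off.
    """
    return [item_id for item_id, (a, _b) in item_params.items() if a < min_a]

params = {"q1": (1.4, -0.3), "q2": (0.2, 0.1), "q3": (0.9, 1.1)}
print(flag_for_review(params))  # ['q2']
```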

Limitations (stated)

  • It's 2PL, not 3PL — it doesn't reproduce ENEM's official score.
  • It needs response volume to calibrate (else, CTT).
  • It doesn't replace pedagogical analysis — it's a measurement tool.

FAQ

Is it INEP's 3PL? No — 2PL ENEM-style, without a guessing parameter.

How many responses to calibrate? ~20 per item; below that, CTT fallback.

On what scale? Theta via EAP converted to 0–1000.

What's it for? Measuring fairly, comparing students and reviewing items.


See the Quiz Engine and the ENEM test-prep use case.

FAQ

Does Studeia use the same 3PL IRT as ENEM/INEP?

No. Studeia implements a 2PL IRT model (two parameters: discrimination and difficulty), ENEM-style, but it is not INEP's official 3PL model (which includes a guessing parameter). It's an honest 2PL IRT, useful for ranking proficiency and spotting problematic items, without passing itself off as ENEM's official calculation.

How many responses are needed to calibrate?

2PL calibration needs a minimum volume of responses per item to be reliable — around 20 responses. Below that threshold, the platform falls back to Classical Test Theory (CTT, based on percent correct), avoiding unstable estimates with little data.
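
The fallback logic can be sketched like this. The 20-response threshold comes from the text; everything else is an assumption. In particular, the page only says calibration uses "a CTT→logistic transformation", so the heuristic below (difficulty from the logit of percent correct, discrimination from the point-biserial correlation) is one plausible reading and not necessarily what Studeia does.

```python
import math
from statistics import mean, pstdev

MIN_RESPONSES = 20  # approximate threshold stated in the text

def calibrate_item(correct, totals):
    """correct: 0/1 per student for this item; totals: each student's total score."""
    n = len(correct)
    p = mean(correct) if n else 0.0
    if n < MIN_RESPONSES:
        # Too little data: report the CTT statistic (percent correct) only.
        return {"model": "ctt", "p_correct": p}
    # Clamp p away from 0 and 1 so the logit stays finite.
    p_c = min(max(p, 0.01), 0.99)
    b = math.log((1.0 - p_c) / p_c)  # lower percent correct -> larger b
    # Point-biserial correlation between item score and total score.
    sd_x, sd_y = pstdev(correct), pstdev(totals)
    if sd_x == 0 or sd_y == 0:
        r = 0.0
    else:
        mx, my = mean(correct), mean(totals)
        cov = sum((x - mx) * (y - my) for x, y in zip(correct, totals)) / n
        r = cov / (sd_x * sd_y)
    r = min(max(r, 0.05), 0.95)  # keep the transformation well defined
    a = r / math.sqrt(1.0 - r * r)  # ASSUMED mapping from correlation to slope
    return {"model": "2pl", "a": a, "b": b}

few = calibrate_item([1, 0, 1], [3, 1, 2])
many = calibrate_item([1] * 15 + [0] * 15, [8] * 15 + [3] * 15)
print(few["model"], many["model"])  # ctt 2pl
```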

How is proficiency estimated and on what scale?

Proficiency (theta) is estimated via EAP (Expected A Posteriori) with a normal prior, from the calibrated 2PL parameters, and converted to a 0–1000 ENEM-style scale. This lets you compare students on a common ruler, rather than just by raw number correct.

What is IRT useful for in practice?

To measure proficiency more fairly than a simple percentage (hard, discriminating questions weigh differently), rank students on a comparable scale, and identify problematic items (low discrimination) for review. It's useful in ENEM-style test-prep and simulations and in large-scale assessments.
