Statistics for the Entire Exam

Statistic | Significance | Acceptable Measures
----------|--------------|--------------------
Difficulty | How hard is the item? | Any measure, but extremely hard and extremely easy items should be evaluated for relevancy
Reliability | Is the test/item measuring in a consistent and systematic way? | > 0.85
Validity | The degree to which the test is measuring what it is supposed to measure | Used to determine content, construct and criterion-related measures
Point Biserial (Item) | Are the choices working well with one another? | Correct answer: positive; incorrect answers: negative
Point Biserial (can apply to test or item) | Does the test/item differentiate between examinees who understand the content and those who do not? | Any positive number
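To make the difficulty and point biserial rows concrete, here is a minimal sketch of how the two values might be computed for one item. The function names and the sample scores are hypothetical, the total score is assumed to include the item itself, and this is not presented as NCC's actual procedure.

```python
from statistics import mean, pstdev


def item_difficulty(item_scores):
    """Proportion of examinees who answered the item correctly (scores are 0 or 1)."""
    return mean(item_scores)


def point_biserial(item_scores, total_scores):
    """Point-biserial correlation between a 0/1 item and the total exam scores."""
    p = mean(item_scores)          # proportion correct
    q = 1 - p                      # proportion incorrect
    sd = pstdev(total_scores)      # population standard deviation of total scores
    if p in (0, 1) or sd == 0:
        raise ValueError("undefined when everyone scores the same")
    correct = [t for i, t in zip(item_scores, total_scores) if i == 1]
    incorrect = [t for i, t in zip(item_scores, total_scores) if i == 0]
    return (mean(correct) - mean(incorrect)) / sd * (p * q) ** 0.5


# Hypothetical data: eight examinees, one item, and their total scores.
item = [1, 1, 1, 0, 0, 1, 0, 1]
totals = [9, 8, 7, 4, 3, 8, 5, 6]

print(item_difficulty(item))                    # 0.625
print(round(point_biserial(item, totals), 2))   # positive: the item discriminates
```

A positive value means examinees who answered the item correctly tended to score higher on the exam overall, which is the behavior the table asks for.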


Of the many statistics generated for the entire exam, reliability is probably the most important.

NCC uses the alpha coefficient, which gives a numerical indication of how confident we can be that the measurements (examinee scores) are accurate.

How likely is it that an examinee would get the same score if tested again?

The higher the better!

  • 0.90 and above for core exams
  • 0.75 for subspecialty exams
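
As an illustration of how an alpha value could be obtained and compared with the targets above, the sketch below assumes "the alpha" refers to Cronbach's coefficient alpha. The function name, the TARGETS table and the response data are hypothetical.

```python
from statistics import pvariance

# Targets taken from the list above.
TARGETS = {"core": 0.90, "subspecialty": 0.75}


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of rows, one row of item scores per examinee."""
    k = len(responses[0])  # number of items
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item across examinees.
    item_vars = [pvariance([row[j] for row in responses]) for j in range(k)]
    # Variance of the examinees' total scores.
    total_var = pvariance([sum(row) for row in responses])
    if total_var == 0:
        raise ValueError("alpha is undefined when all total scores are equal")
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical data: five examinees, four items scored 0 or 1.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

alpha = cronbach_alpha(responses)
print(round(alpha, 2))                     # 0.8
print(alpha >= TARGETS["subspecialty"])    # True
print(alpha >= TARGETS["core"])            # False
```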

Exam length can affect reliability: The longer the test, the higher the reliability.
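
The effect of length on reliability can be estimated with the Spearman-Brown formula, which is a standard psychometric result rather than something stated in this document. The sketch below assumes any added items are comparable in quality to the existing ones. The function name and the example values are hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when test length is multiplied by length_factor.

    Assumes the added (or removed) items behave like the existing ones.
    """
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical example: an exam with reliability 0.75 is doubled in length.
print(round(spearman_brown(0.75, 2), 3))    # 0.857
# Halving the same exam lowers the predicted reliability.
print(round(spearman_brown(0.75, 0.5), 3))  # 0.6
```

Under these assumptions, doubling a 0.75 exam raises the predicted reliability to about 0.86, while halving it drops the prediction to 0.60.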