How we measure PEP conformance

Basilisk is scored by the official python/typing conformance suite — the same test suite and scoring tool the typing community uses to grade pyright, mypy, pyrefly, ty, and others. We run that tool unmodified, on the real basilisk binary, on every change.

Today that gives 40.4%: 59 of 146 test files passing, 955 required errors caught, with 285 false positives and 36 missed required errors left to clear. 3 of 21 categories pass at 100%. The target is 100%; we ratchet toward it.

What the conformance suite is

The Python typing specification defines how the type system is supposed to behave — generics, protocols, dataclasses, TypedDict, overloads, literals, and the rest. To stop the spec from being aspirational, the typing community maintains a conformance test suite alongside it in the python/typing repository.

It works like this:

  • Each spec chapter has one or more test files — ordinary Python modules that exercise a feature and mark, with # E comments, every line where a conforming type checker must report an error (and, with # E[tag] groups, where one of several related errors is acceptable).
  • A small scoring tool runs a type checker over those files and diffs its output against the annotations. A file passes only if the diff is empty: every required error is reported, and nothing is reported on a line the suite does not mark.
  • The maintainers run every checker through it and publish the results table, which is how figures like pyright's ~99% or pyrefly's ~86% are produced.
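To make the annotation-and-diff idea concrete, here is a minimal sketch. It is not the suite's code: it recognises only plain `# E` markers (tagged `# E[tag]` groups are deliberately ignored), and the names `required_error_lines` and `diff_lines` are made up for this illustration.

```python
import re

# Matches a plain "# E" marker, optionally followed by ": description".
# Tagged forms such as "# E[tag]" are NOT handled in this sketch.
PLAIN_E = re.compile(r"#\s*E(?::|\s|$)")


def required_error_lines(source: str) -> set[int]:
    """Return the 1-based line numbers that carry a plain '# E' marker."""
    return {
        lineno
        for lineno, line in enumerate(source.splitlines(), start=1)
        if PLAIN_E.search(line)
    }


def diff_lines(expected: set[int], reported: set[int]) -> tuple[set[int], set[int]]:
    """Split the mismatch into missed required errors and false positives."""
    missed = expected - reported
    unexpected = reported - expected
    return missed, unexpected


# A toy "test file": lines 1 and 3 must be flagged, line 2 must stay clean.
SAMPLE = '''\
x: int = "a"  # E: incompatible assignment
y: int = 1
z: str = 2  # E
'''

expected = required_error_lines(SAMPLE)
# Pretend the checker reported errors on lines 1 and 2.
missed, unexpected = diff_lines(expected, reported={1, 2})
print(sorted(expected), sorted(missed), sorted(unexpected))
```

In this toy run the checker misses the required error on line 3 and reports a false positive on line 2, so the file would fail on both counts.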

This is the suite we use, at the pinned commit 268d0c4e. Because the same tool and the same files grade everyone, the number is comparable across checkers and is not something we can tune in our favour.

How a file is scored

The entire algorithm is two functions in the suite's main.py: get_expected_errors (reads the # E annotations) and diff_expected_errors (diffs them against the checker's output). A file passes if and only if that diff is empty:

  • the suite's rule (upstream_main.py:185): "Fail" if errors_diff.strip() else "Pass"

We count every diagnostic the checker emits — errors and warnings, with no diagnostic codes excluded. That is the strictest reading of the suite and matches how the reference checker, pyright, is graded. One unexpected diagnostic (a false positive) fails the whole file, which is why our false-positive count matters as much as the pass count.
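A short sketch of how that rule turns per-file diffs into a headline score. Only the `"Fail" if errors_diff.strip() else "Pass"` expression comes from the suite; the file names and diff strings below are invented placeholders, not real suite output.

```python
def conclusion(errors_diff: str) -> str:
    # The expression quoted above from the suite.
    return "Fail" if errors_diff.strip() else "Pass"


# Hypothetical diffs for three files (placeholder text, not the suite's format).
diffs = {
    "clean.py": "",
    "whitespace_only.py": "  \n",
    "one_false_positive.py": "line 12: unexpected diagnostic",
}

results = {name: conclusion(diff) for name, diff in diffs.items()}
passing = sum(1 for verdict in results.values() if verdict == "Pass")
score = 100 * passing / len(results)

print(results)
print(f"{passing} of {len(results)} files passing ({score:.1f}%)")
```

A diff containing only whitespace still passes, while a single unexpected diagnostic is enough to fail the file.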

How we run it without forking it

The suite's main.py is a batch harness for the python/typing maintainers: it grades all the known checkers at once, pulls in TOML config/reporting dependencies, and writes a results matrix. It has no way to invoke our binary. So, exactly as the suite does for every checker (PyrightTypeChecker, MypyTypeChecker, …), we add a thin adapter and reuse the suite's own scoring rather than reimplementing it. Our score.py:

  1. Adapter — runs basilisk check --output json and shapes the result into the {line: [errors]} dict the suite's functions expect (the one thing the suite can't do for us).
  2. Calculator — imports get_expected_errors and diff_expected_errors from a committed, byte-identical copy of the suite's main.py and calls them unmodified (score.py:287 mirrors the suite's own call at upstream_main.py:175). It contains no scoring logic of its own.
  3. Gate — compares the result against coverage-thresholds.json and fails CI on any regression.
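The adapter and gate steps can be sketched as follows. This is an illustration under stated assumptions, not the contents of score.py: the JSON shape (an array of objects with `line` and `message` keys) and the threshold key `min_passing_files` are hypothetical, since neither the basilisk output schema nor the layout of coverage-thresholds.json is shown here.

```python
import json
from collections import defaultdict


def to_line_dict(raw_json: str) -> dict[int, list[str]]:
    """Group diagnostics by line number into a {line: [errors]} dict.

    ASSUMPTION: the checker prints a JSON array of objects with "line" and
    "message" keys. The real basilisk schema may differ.
    """
    by_line: dict[int, list[str]] = defaultdict(list)
    for diag in json.loads(raw_json):
        by_line[int(diag["line"])].append(str(diag["message"]))
    return dict(by_line)


def gate(passing_files: int, thresholds: dict[str, int]) -> None:
    """Ratchet check: exit with an error if the pass count falls below the floor.

    ASSUMPTION: "min_passing_files" is a hypothetical key name.
    """
    floor = thresholds["min_passing_files"]
    if passing_files < floor:
        raise SystemExit(f"regression: {passing_files} passing < floor {floor}")


sample = (
    '[{"line": 3, "message": "bad assignment"},'
    ' {"line": 3, "message": "unused"},'
    ' {"line": 9, "message": "bad call"}]'
)
shaped = to_line_dict(sample)
print(shaped)

# Equal to the floor: the gate lets the build through.
gate(passing_files=59, thresholds={"min_passing_files": 59})
```

Keeping the adapter a pure function from checker output to a dict means the scoring functions never need to know which checker produced the diagnostics.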

To keep the calculator trustworthy, the vendored copy is sha256-pinned. score.py re-hashes it on every run and refuses to score if it has drifted (score.py:99), and this website re-hashes it again at build time.
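A pin check of this kind can be sketched with the standard library's hashlib. The function names and the temporary stand-in file are illustrative; the real check lives in score.py and is not reproduced here.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex digest of the file's exact bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def require_pinned(path: Path, pinned_hex: str) -> None:
    """Raise if the file no longer matches the recorded digest."""
    actual = sha256_of(path)
    if actual != pinned_hex:
        raise RuntimeError(
            f"{path.name} drifted: expected {pinned_hex[:12]}..., got {actual[:12]}..."
        )


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the vendored calculator; the content is a placeholder.
    vendored = Path(tmp) / "upstream_main.py"
    vendored.write_bytes(b"print('calculator')\n")
    pin = sha256_of(vendored)

    require_pinned(vendored, pin)  # unchanged file: passes silently

    vendored.write_bytes(b"print('edited')\n")  # simulate drift
    try:
        require_pinned(vendored, pin)
        drift_detected = False
    except RuntimeError:
        drift_detected = True

print(drift_detected)
```

Hashing the raw bytes rather than decoded text means even a whitespace or line-ending change counts as drift.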

Keeping the official file untouched is the whole point: the adapter and gate live in a separate, auditable file, so the calculator stays byte-for-byte the suite's own.

A correction we made

Our score used to be measured by an in-repo script of our own, and it was wrong. That script excluded several diagnostic codes from scoring and did not count false positives, so it reported numbers that climbed all the way to 100%. It was an honest mistake, not a tuned result — but it was still incorrect.

We replaced it with the official calculator described above. With every diagnostic counted and nothing excluded, the honest number is 40.4%, down from the 100% the old script reported.

The checker did not get worse; the measurement got correct. 100% is the target we are working toward, not a claim about today.

The chart below is read straight from the git history of conformance/conformance_status.csv at build time: one point per commit that changed it, plotting the score that commit actually recorded.
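One way to collect those points is to ask git which commits touched the file and then read the file as it stood at each one. The sketch below uses standard `git log` and `git show` invocations; the helper names are hypothetical, the site's actual build code is not shown in this document, and the CSV itself is left unparsed because its columns are not described.

```python
import subprocess

CSV_PATH = "conformance/conformance_status.csv"


def parse_log(output: str) -> list[tuple[str, str]]:
    """Turn 'hash<TAB>date' lines into (hash, date) pairs, oldest first."""
    pairs = []
    for line in output.splitlines():
        if line.strip():
            commit, date = line.split("\t", 1)
            pairs.append((commit, date))
    return pairs[::-1]  # git log prints newest first


def commits_touching(path: str = CSV_PATH) -> list[tuple[str, str]]:
    """List the commits that changed `path` (must run inside the repository)."""
    out = subprocess.run(
        ["git", "log", "--format=%h%x09%ad", "--date=short", "--", path],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_log(out)


def file_at(commit: str, path: str = CSV_PATH) -> str:
    """Return the file's contents as recorded at `commit`."""
    return subprocess.run(
        ["git", "show", f"{commit}:{path}"],
        check=True, capture_output=True, text=True,
    ).stdout


# Parsing demo on placeholder log output (hashes and dates are made up).
demo = parse_log("bbbbbbb\t2000-01-02\naaaaaaa\t2000-01-01\n")
print(demo)
```

Reading each historical version with `git show` plots what a commit actually recorded, rather than a value recomputed later.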

Conformance score over time: from the earlier in-repo number to the official calculator

[Chart: score from 0% to 100% by commit date. Data points:]

  Date     Commit     Score    Passing    False positives   Measured by
  Apr 27   69038b92   83.4%    121/145    434               earlier in-repo harness
  Apr 27   f4d6e27c   91.1%    133/146    258               earlier in-repo harness
  Apr 27   d690f130   91.8%    134/146    177               earlier in-repo harness
  Apr 27   c4ecadf2   91.8%    134/146    174               earlier in-repo harness
  Apr 27   bc8ac5e1   92.5%    135/146    174               earlier in-repo harness
  May 30   0c790c93   92.5%    135/146    173               earlier in-repo harness
  May 30   a2341e76   92.5%    135/146    170               earlier in-repo harness
  Jun 3    bf832a07   93.2%    136/146    170               earlier in-repo harness
  Jun 3    75aa31c9   93.2%    136/146    126               earlier in-repo harness
  Jun 6    19a0ad54   93.8%    137/146    120               earlier in-repo harness
  Jun 12   a273d83d   98.6%    144/146    54                earlier in-repo harness
  Jun 19   f9e14551   98.6%    144/146    0                 earlier in-repo harness
  Jun 21   7bca6179   100%     146/146    0                 earlier in-repo harness
  Jun 23   214e9812   40.4%    59/146     285               official calculator

On Jun 21 the in-repo script reported 100%. The official calculator, first run on Jun 23, reports 40.4% — a correction, not a regression.

  • Earlier in-repo script (some codes excluded, false positives not counted)
  • Official python/typing calculator

Each point is a real commit to conformance/conformance_status.csv, recomputed every build.

Where each category stands today

Read live from conformance/conformance_status.csv at build time:

  Category        Passing    Score
  Aliases         3 / 7      42.9%
  Annotations     3 / 5      60%
  Callables       1 / 4      25%
  Classes         0 / 2      0%
  Constructors    1 / 6      16.7%
  Dataclasses     7 / 16     43.8%
  Directives      7 / 11     63.6%
  Enums           3 / 8      37.5%
  Exceptions      0 / 1      0%
  Generics        9 / 30     30%
  Historical      1 / 1      100%
  Literals        0 / 4      0%
  NamedTuples     4 / 4      100%
  Narrowing       0 / 2      0%
  Overloads       0 / 4      0%
  Protocols       6 / 13     46.2%
  Qualifiers      2 / 5      40%
  Special types   0 / 5      0%
  Tuples          0 / 3      0%
  TypedDicts      11 / 14    78.6%
  TypeForms       1 / 1      100%
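As a cross-check, the per-category figures above can be summed to recover the headline numbers. The snippet hard-codes the table's values instead of reading the CSV, whose exact column layout is not shown here.

```python
# (passing, total) per category, copied from the table above.
categories = {
    "Aliases": (3, 7), "Annotations": (3, 5), "Callables": (1, 4),
    "Classes": (0, 2), "Constructors": (1, 6), "Dataclasses": (7, 16),
    "Directives": (7, 11), "Enums": (3, 8), "Exceptions": (0, 1),
    "Generics": (9, 30), "Historical": (1, 1), "Literals": (0, 4),
    "NamedTuples": (4, 4), "Narrowing": (0, 2), "Overloads": (0, 4),
    "Protocols": (6, 13), "Qualifiers": (2, 5), "Special types": (0, 5),
    "Tuples": (0, 3), "TypedDicts": (11, 14), "TypeForms": (1, 1),
}

passing = sum(p for p, _ in categories.values())
total = sum(t for _, t in categories.values())
perfect = sorted(name for name, (p, t) in categories.items() if p == t)

print(f"{passing} of {total} files passing = {passing / total:.1%}")
print(f"{len(perfect)} of {len(categories)} categories at 100%: {', '.join(perfect)}")
```

The sums agree with the summary at the top of the page: 59 of 146 files (40.4%), with 3 of 21 categories fully passing.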

Reproduce it yourself

# Builds the binary, fetches the (git-ignored) fixtures, runs the official
# python/typing calculator against them, writes conformance_status.csv, and
# enforces the ratchet gate from coverage-thresholds.json.
make conformance

It all lives in two files: conformance/score.py (our adapter + gate) and conformance/upstream_main.py (the suite's calculator, committed and sha256-pinned). The full annotation rules are in the python/typing conformance README.