How we measure PEP conformance
Basilisk is scored by the official python/typing conformance suite — the same test suite and scoring tool the typing community uses to grade pyright, mypy, pyrefly, ty, and others. We run that tool unmodified, on the real basilisk binary, on every change.
Today that gives 40.4% — 59 of 146 test files passing, 955 required errors caught, with 285 false positives and 36 missed required errors left to clear. 3 of 21 categories pass at 100%. The target is 100%; we ratchet toward it.
Python typing spec ↗ Conformance suite & README ↗ Published results ↗ Our scorer — score.py ↗ Vendored calculator ↗
What the conformance suite is
The Python typing specification defines how the type system is supposed to behave — generics, protocols, dataclasses, TypedDict, overloads, literals, and the rest. To stop the spec from being aspirational, the typing community maintains a conformance test suite alongside it in the python/typing repository.
It works like this:
- Each spec chapter has one or more test files: ordinary Python modules that exercise a feature and mark, with # E comments, every line where a conforming type checker must report an error (and, with # E[tag] groups, where one of several related errors is acceptable).
- A small scoring tool runs a type checker over those files and diffs its output against the annotations. A file passes only if the diff is empty: every required error is reported, and nothing is reported on a line the suite does not mark.
- The maintainers run every checker through it and publish the results table, which is how figures like pyright's ~99% or pyrefly's ~86% are produced.
This is the suite we use, at the pinned commit 268d0c4e. Because the same tool and the same files grade everyone, the number is comparable across checkers and is not something we can tune in our favour.
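To make the annotation format concrete, here is a small illustrative sketch. The sample module is made up in the style of a suite file (it is not taken from the suite), and the regex and the helper name marked_lines are assumptions for illustration only. The suite's real parser is get_expected_errors in its main.py and handles more cases than this.

```python
import re

# A made-up module in the style of a conformance test file (not from the suite).
SAMPLE = '''\
def f(x: int) -> str:
    return x  # E: int is not assignable to str

a: int = 1
b: str = a  # E[assign]
c: str = 2  # E[assign]
'''

# Simplified marker pattern (an assumption): "# E", optionally followed by "[tag]".
MARKER = re.compile(r"#\s*E(?:\[(?P<tag>[^\]]+)\])?(?=[:\s]|$)")


def marked_lines(source: str) -> tuple[set[int], dict[str, set[int]]]:
    """Return (lines that must report an error, tagged groups of lines)."""
    required: set[int] = set()
    groups: dict[str, set[int]] = {}
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = MARKER.search(line)
        if match is None:
            continue
        tag = match.group("tag")
        if tag is None:
            required.add(lineno)
        else:
            groups.setdefault(tag, set()).add(lineno)
    return required, groups


required, groups = marked_lines(SAMPLE)
print(required)  # {2}
print(groups)    # {'assign': {5, 6}}
```

Line 2 carries a plain # E marker, so an error there is required. Lines 5 and 6 share the assign tag, so they form one group.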
How a file is scored
The entire algorithm is two functions in the suite's main.py — get_expected_errors (reads the # E annotations) and diff_expected_errors (diffs them against the checker's output). A file passes iff that diff is empty:
- the suite's rule (upstream_main.py:185): "Fail" if errors_diff.strip() else "Pass"
We count every diagnostic the checker emits — errors and warnings, with no diagnostic codes excluded. That is the strictest reading of the suite and matches how the reference checker, pyright, is graded. One unexpected diagnostic (a false positive) fails the whole file, which is why our false-positive count matters as much as the pass count.
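The pass/fail rule can be sketched roughly as follows. This is a simplified stand-in written for illustration, not the suite's diff_expected_errors: the function names simplified_diff and verdict, the message wording, and the decision to ignore optional markers are all assumptions. Only the final "Fail" if ... else "Pass" expression mirrors the rule quoted above.

```python
def simplified_diff(
    required: set[int],
    groups: dict[str, set[int]],
    reported: dict[int, list[str]],
) -> str:
    """Describe every mismatch between annotations and checker output."""
    reported_lines = set(reported)
    # Any line that carries a marker may legitimately receive a diagnostic.
    allowed = set(required).union(*groups.values())
    problems: list[str] = []

    # Missed required errors.
    for line in sorted(required - reported_lines):
        problems.append(f"line {line}: required error not reported")

    # Tagged groups: at least one line in the group must be reported.
    for tag, lines in sorted(groups.items()):
        if not lines & reported_lines:
            problems.append(f"group {tag}: no error on any of lines {sorted(lines)}")

    # False positives: diagnostics on lines the file does not mark.
    for line in sorted(reported_lines - allowed):
        problems.append(f"line {line}: unexpected error: {reported[line][0]}")

    return "\n".join(problems)


def verdict(errors_diff: str) -> str:
    # Same shape as the suite's rule: an empty diff is the only way to pass.
    return "Fail" if errors_diff.strip() else "Pass"


clean = {2: ["bad return type"], 5: ["bad assignment"]}
noisy = {**clean, 4: ["spurious warning"]}
print(verdict(simplified_diff({2}, {"assign": {5, 6}}, clean)))  # Pass
print(verdict(simplified_diff({2}, {"assign": {5, 6}}, noisy)))  # Fail
```

The second call fails on a single unexpected diagnostic even though every required error was caught, which is the behaviour that makes false positives so costly.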
How we run it without forking it
The suite's main.py is a batch harness for the python/typing maintainers: it grades all the known checkers at once, pulls in TOML config/reporting dependencies, and writes a results matrix. It has no way to invoke our binary. So, exactly as the suite does for every checker (PyrightTypeChecker, MypyTypeChecker, …), we add a thin adapter and reuse the suite's own scoring rather than reimplementing it. Our score.py:
- Adapter: runs basilisk check --output json and shapes the result into the {line: [errors]} dict the suite's functions expect (the one thing the suite can't do for us).
- Calculator: imports get_expected_errors and diff_expected_errors from a committed, byte-identical copy of the suite's main.py and calls them unmodified (score.py:287 mirrors the suite's own call at upstream_main.py:175). It contains no scoring logic of its own.
- Gate: compares the result against coverage-thresholds.json and fails CI on any regression.
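The adapter's reshaping step might look something like the sketch below. The JSON shape used here (a list of objects with line and message keys) is purely an assumption for illustration, as is the helper name to_line_dict. The real output format of basilisk check --output json and the real code in score.py may differ.

```python
import json
from collections import defaultdict


def to_line_dict(raw_json: str) -> dict[int, list[str]]:
    """Group diagnostics by line number into a {line: [errors]} dict."""
    by_line: defaultdict[int, list[str]] = defaultdict(list)
    for diagnostic in json.loads(raw_json):
        # "line" and "message" are hypothetical field names.
        by_line[int(diagnostic["line"])].append(diagnostic["message"])
    return dict(by_line)


# Hypothetical checker output for one test file.
sample_output = """
[
  {"line": 12, "message": "incompatible assignment"},
  {"line": 12, "message": "unknown attribute"},
  {"line": 30, "message": "missing argument"}
]
"""
print(to_line_dict(sample_output))
# {12: ['incompatible assignment', 'unknown attribute'], 30: ['missing argument']}
```

Keeping this step free of any filtering is what lets every diagnostic, warning or error, reach the calculator.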
To keep the calculator trustworthy, the vendored copy is sha256-pinned. score.py re-hashes it on every run and refuses to score if it has drifted (score.py:99), and this website re-hashes it again at build time.
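A pin check of this kind can be sketched in a few lines with the standard library. The function names sha256_of and verify_pin and the error wording are assumptions. The demo runs on a throwaway file rather than the real vendored copy, so the digest it prints is not the project's actual pin.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex digest of the file's exact bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def verify_pin(path: Path, pinned: str) -> None:
    """Raise if the file no longer matches its pinned digest."""
    actual = sha256_of(path)
    if actual != pinned:
        raise RuntimeError(f"{path.name} drifted: expected {pinned}, got {actual}")


# Demo on a temporary stand-in for the vendored calculator.
with tempfile.TemporaryDirectory() as tmp:
    vendored = Path(tmp) / "upstream_main.py"
    vendored.write_bytes(b"print('calculator')\n")
    pin = sha256_of(vendored)

    verify_pin(vendored, pin)  # unchanged file: no exception

    vendored.write_bytes(b"print('tampered')\n")
    try:
        verify_pin(vendored, pin)
    except RuntimeError as exc:
        print("refusing to score:", exc)
```

Hashing the raw bytes, rather than parsed or normalised source, means even a whitespace change counts as drift.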
Keeping the official file untouched is the whole point: the adapter and gate live in a separate, auditable file, so the calculator stays byte-for-byte the suite's own.
A correction we made
Our score used to be measured by an in-repo script of our own, and it was wrong. That script excluded several diagnostic codes from scoring and did not count false positives, so it reported numbers that climbed all the way to 100%. It was an honest mistake, not a tuned result — but it was still incorrect.
We replaced it with the official calculator described above. With every diagnostic counted and nothing excluded, the honest number is 40.4%.
The score-history chart is read straight from the git history of conformance/conformance_status.csv at build time: one point per commit that changed it, plotting the score that commit actually recorded.
On Jun 21 the in-repo script reported 100%. The official calculator, first run on Jun 23, reports 40.4%: a correction, not a regression.
The chart has two series:
- Earlier in-repo script (some codes excluded, false positives not counted)
- Official python/typing calculator
Each dot is a real commit to conformance/conformance_status.csv, recomputed every build, and carries its date, commit, score, and false-positive count.
Where each category stands today
Read live from conformance/conformance_status.csv at build time:
| Category | Passing | Score |
|---|---|---|
| Aliases | 3 / 7 | 42.9% |
| Annotations | 3 / 5 | 60% |
| Callables | 1 / 4 | 25% |
| Classes | 0 / 2 | 0% |
| Constructors | 1 / 6 | 16.7% |
| Dataclasses | 7 / 16 | 43.8% |
| Directives | 7 / 11 | 63.6% |
| Enums | 3 / 8 | 37.5% |
| Exceptions | 0 / 1 | 0% |
| Generics | 9 / 30 | 30% |
| Historical | 1 / 1 | 100% |
| Literals | 0 / 4 | 0% |
| NamedTuples | 4 / 4 | 100% |
| Narrowing | 0 / 2 | 0% |
| Overloads | 0 / 4 | 0% |
| Protocols | 6 / 13 | 46.2% |
| Qualifiers | 2 / 5 | 40% |
| Special types | 0 / 5 | 0% |
| Tuples | 0 / 3 | 0% |
| TypedDicts | 11 / 14 | 78.6% |
| TypeForms | 1 / 1 | 100% |
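As a quick cross-check, the headline figures can be recomputed from the table above. The sketch copies the table's numbers by hand into a dict. The name CATEGORIES is just for illustration, and nothing here reads the real CSV.

```python
# (passing files, total files) per category, copied from the table above.
CATEGORIES = {
    "Aliases": (3, 7),
    "Annotations": (3, 5),
    "Callables": (1, 4),
    "Classes": (0, 2),
    "Constructors": (1, 6),
    "Dataclasses": (7, 16),
    "Directives": (7, 11),
    "Enums": (3, 8),
    "Exceptions": (0, 1),
    "Generics": (9, 30),
    "Historical": (1, 1),
    "Literals": (0, 4),
    "NamedTuples": (4, 4),
    "Narrowing": (0, 2),
    "Overloads": (0, 4),
    "Protocols": (6, 13),
    "Qualifiers": (2, 5),
    "Special types": (0, 5),
    "Tuples": (0, 3),
    "TypedDicts": (11, 14),
    "TypeForms": (1, 1),
}

passing = sum(p for p, _ in CATEGORIES.values())
total = sum(t for _, t in CATEGORIES.values())
perfect = [name for name, (p, t) in CATEGORIES.items() if p == t]

print(f"{passing} / {total} = {passing / total:.1%}")  # 59 / 146 = 40.4%
print(f"{len(perfect)} of {len(CATEGORIES)} categories at 100%: {perfect}")
```

The totals agree with the summary at the top of the page: 59 of 146 files, and 3 of 21 categories at 100%.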
Reproduce it yourself
    # Builds the binary, fetches the (git-ignored) fixtures, runs the official
    # python/typing calculator against them, writes conformance_status.csv, and
    # enforces the ratchet gate from coverage-thresholds.json.
    make conformance
It all lives in two files: conformance/score.py (our adapter + gate) and conformance/upstream_main.py (the suite's calculator, committed and sha256-pinned). The full annotation rules are in the python/typing conformance README.
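For readers curious what a ratchet gate amounts to, here is a minimal sketch. The key names (min_passing_files, max_false_positives, passing_files, false_positives) and the function name regressions are hypothetical. The actual schema of coverage-thresholds.json and the gate logic in score.py are not shown on this page and may differ.

```python
import json


def regressions(thresholds: dict, current: dict) -> list[str]:
    """List every way the current run is worse than the recorded floor."""
    problems: list[str] = []
    if current["passing_files"] < thresholds["min_passing_files"]:
        problems.append(
            f"passing files fell to {current['passing_files']} "
            f"(floor {thresholds['min_passing_files']})"
        )
    if current["false_positives"] > thresholds["max_false_positives"]:
        problems.append(
            f"false positives rose to {current['false_positives']} "
            f"(ceiling {thresholds['max_false_positives']})"
        )
    return problems


# Hypothetical thresholds file, seeded with today's figures from this page.
thresholds = json.loads('{"min_passing_files": 59, "max_false_positives": 285}')

improved = {"passing_files": 61, "false_positives": 270}
worse = {"passing_files": 58, "false_positives": 285}

for run in (improved, worse):
    problems = regressions(thresholds, run)
    exit_code = 1 if problems else 0  # a nonzero exit is what fails CI
    print(exit_code, problems)
```

A ratchet only ever tightens: after an improved run, the thresholds would be raised to the new figures so that later changes cannot slip back.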