1. The teacher runs only during field extraction, and that step is offline. Once the fields are saved, the transformer is no longer needed: the student training and student-only inference scripts do not load the teacher at all (see the sketch after this list). Compression refers to the field representation and the student head, not to the extraction pass.
2. The HellaSwag file is a placeholder, not a required part of the method. It's included so the structure mirrors the paper’s tasks, and it points to the description in the text. The core experiments (RTE, SST-2, CIFAR-10 intention probe, etc.) all have complete working code paths.
3. The AN1 head is intentionally simple. Linear probes are the baseline way to test whether compressed intermediate representations preserve structure. The key result is how much task-relevant geometry survives in a low-rank field. The novelty is in the compression behavior, not in inventing a new classifier architecture.
4. The student model exists and is trained independently of the teacher. This is what produces the classification results in the paper. The student doesn't call the teacher during inference, which is exactly the point.
5. DistilBERT’s SST-2 score isn’t the relevant comparison. The experiment isn’t “beat a small transformer.” It’s “how far can a 256-dimensional compressed field distilled from a frozen 70B model get on a downstream task?” The result speaks to representational compression, not leaderboard performance.
6. The 2 tok/s number is for the specific configuration used in the economic section. Different hardware, precision modes, and serving stacks vary by an order of magnitude. The point was to illustrate cost scaling, not claim a universal throughput ceiling.
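Roughly, the intended two-stage pipeline looks like this. This is a minimal sketch, not the repo's actual API: the function names, file path, and student architecture are placeholders, and only the 256-dim field size comes from the paper.

```python
import numpy as np
import torch
import torch.nn as nn

# --- Stage 1 (offline, run once): teacher extracts compressed fields ---
# In the real repo this step loads the frozen teacher; here it is a stub.
def extract_fields(texts):
    # placeholder for projecting teacher activations down to 256-dim fields
    return np.random.randn(len(texts), 256).astype(np.float32)

texts, labels = ["great movie", "terrible plot"], [1, 0]
np.save("fields.npy", extract_fields(texts))  # teacher is not needed after this point

# --- Stage 2 (student training + inference): no teacher in memory ---
fields = torch.from_numpy(np.load("fields.npy"))
y = torch.tensor(labels)

student = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 2))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(student(fields), y)
    loss.backward()
    opt.step()

print(student(fields).argmax(dim=1))  # inference runs on saved fields only
```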
If there’s a specific part of the implementation you believe contradicts the paper, feel free to point to the line and we can discuss that human to human. The repo is small by design, so everything is easy to check directly without relying on LLM summaries.
I asked both Claude Code|Opus 4.5 and Codex|GPT 5.1 Codex Max (funny to ask LLMs, I know) to check the an1-core repo. I don't think they'd hallucinate on something like this (the code is quite small), but I do not claim expertise.
In short, both of them are saying that:
- The repo always runs the full teacher model to extract activations and uses them - see https://github.com/Anima-Core/an1-core/blob/main/an1_core/fi...
- There are weird stub files, e.g. the HellaSwag repro doesn't actually contain reproduction code https://github.com/Anima-Core/an1-core/blob/main/experiments... "For full HellaSwag reproduction, see the paper" (why include the file at all then?)
- The actual "AN1 head" is just linear probing (freeze a pretrained model, train a classifier on its features). The full flow (as reported by CC) is "Text → [Full Transformer] → activations → [Tiny Head] → prediction" (a minimal version of that flow is sketched below)
Basically, there's no code to train a real "student" model that would run without the teacher.
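For context, that flow is just a standard linear probe. A minimal sketch, with DistilBERT standing in for the frozen teacher (not what an1-core actually loads):

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
teacher = AutoModel.from_pretrained("distilbert-base-uncased").eval()  # frozen
head = nn.Linear(teacher.config.hidden_size, 2)                        # the "tiny head"

def predict(text):
    with torch.no_grad():                     # teacher is frozen, but it still
        enc = tok(text, return_tensors="pt")  # has to run on every input
        hidden = teacher(**enc).last_hidden_state[:, 0]  # first-token activations
    return head(hidden).argmax(dim=-1)        # only the head is trained

print(predict("a surprisingly good film"))
```

The point being: if the teacher has to run on every input at inference time, there is no teacher-free student in this flow.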
===
The repo/paper say that there's a mythical "commercial version" that has all the goodies:
(repo)
> This reference implementation (an1-core) does not include the FPU, AN4, or other proprietary optimization components covered by these patents. It provides only the core scientific demonstration of the meaning fields phenomenon.
(paper)
> Production deployment: Optimized implementations (AN1-Turbo) with learned layer selection, adaptive loss scheduling, and CUDA-accelerated inference available under commercial license.
But right now we only have the code in the repo.
===
In the paper they show that the student model (30M params) gets ~82% on SST-2 (labels-only). But what they don't show is that DistilBERT (a >5-year-old model) already achieves 91% on the same dataset with only 66M params.
Another weird tidbit from the paper - in the section where they show the economic impact, they claim that LLaMA 70B runs at 2 tok/s at batch size=1 on an H200. In reality that number is at least an order of magnitude bigger even without quantization, more like 20-40 tok/s. With quantization it can easily be above 100 tok/s.
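For scale, here's why that throughput number carries the economic argument. The hourly rate below is an assumed rental price, purely for illustration:

```python
# Back-of-envelope: GPU cost per million generated tokens at different throughputs.
GPU_COST_PER_HOUR = 3.50  # assumed H200 rental price, USD

def usd_per_million_tokens(tokens_per_second):
    seconds = 1_000_000 / tokens_per_second
    return GPU_COST_PER_HOUR * seconds / 3600

for tps in (2, 20, 40, 100):
    print(f"{tps:>3} tok/s -> ${usd_per_million_tokens(tps):,.2f} per 1M tokens")
# 2 tok/s  -> ~$486 per 1M tokens
# 40 tok/s -> ~$24 per 1M tokens, i.e. a ~20x smaller cost baseline
```

So the choice of 2 tok/s versus 20-40 tok/s changes the paper's cost comparison by roughly an order of magnitude before any quantization is even considered.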