tl;dr: unlike most encoder-only models, the base ModernBERT was trained with code in mind (so presumably also on JSON/YAML objects), and it ships with a custom tokenizer to support that. That's why I mention that indentation matters: different indentation levels map to different single tokens.
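A quick way to sanity-check the indentation claim is to tokenize runs of leading spaces directly. This is a sketch, not a confirmed result: it assumes the `transformers` library is installed, network access to the Hub, and that `answerdotai/ModernBERT-base` is the checkpoint in question.

```python
# Hedged sketch: inspect how ModernBERT's tokenizer handles indentation runs.
# Assumes `pip install transformers` and access to the Hugging Face Hub.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")

for n_spaces in (2, 4, 6, 8):
    indent = " " * n_spaces
    ids = tok.encode(indent, add_special_tokens=False)
    # If indentation levels really get dedicated tokens, len(ids) should
    # stay small (often 1) even as n_spaces grows.
    print(f"{n_spaces} spaces -> {len(ids)} token(s): {tok.convert_ids_to_tokens(ids)}")
```

If the token count stays flat while the space count grows, that supports the "one token per indentation level" reading; if it grows linearly, the claim doesn't hold for this tokenizer.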
This is mostly theoretical and would require a deeper dive to confirm.