Preferences

we kept all 1800+ (script/language) pairs, not only the quality filtered ones. the question if a mix of quality filtered and not languages impacts the mixing is still an open question. preliminary research (Section 4.2.7 of https://arxiv.org/abs/2502.10361 ) indicates that quality filtering can mitigate the curse of multilinguality to some degree, so facilitate cross-lingual generalization, but it has to be seen how strong this effect is on larger scale

This item has no comments currently.