CIFAR-10 is an image classification dataset (32x32 pixel images).
Llama 3.3 70B is a text-only, non-multimodal language model. Just look up the Hugging Face page that your own repo points to.
> The Llama 3.3 instruction tuned text only model...
I might be wrong, but I'm pretty sure a text model is going to be no better than chance at classifying images.
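To be concrete about what "chance" means here: CIFAR-10 has 10 balanced classes, so a model with no access to the pixels should land around 10% accuracy. A minimal sketch (a pure Python simulation, not anything from the repo; the sample count just mirrors the 10,000-image test split):

    # Random-guess baseline on a CIFAR-10-sized, balanced 10-class test set.
    import random

    random.seed(0)
    NUM_CLASSES = 10       # CIFAR-10 classes
    NUM_SAMPLES = 10_000   # size of the CIFAR-10 test split

    labels = [random.randrange(NUM_CLASSES) for _ in range(NUM_SAMPLES)]
    guesses = [random.randrange(NUM_CLASSES) for _ in range(NUM_SAMPLES)]

    accuracy = sum(g == y for g, y in zip(guesses, labels)) / NUM_SAMPLES
    print(f"random-guess accuracy: {accuracy:.1%}")  # ~10%

So anything meaningfully above ~10% would need the model to actually see the images, which a text-only model can't.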
Another comment pointed out that your test suite cheats slightly on HellaSwag. It wouldn't be surprising if Grok set up the project so it could cheat on the other benchmarks, too.
https://www.hackerneue.com/item?id=46215166
> The repo contains the full pipelines, configuration files, and benchmark scripts, and those show the precise datasets, metrics, and evaluation flows.
There's nothing there, really.
I'm sorry that Grok/Ani lied to you (I blame Elon), but this just doesn't hold up.