I'd suggest offering at least one free query to allow users to evaluate the service.
Our fast model, Phind Instant, is completely free.
Maybe OP was referring to Phind-405B (the model from the article). I certainly wonder how good the 405B model really is.
It's just an innovated (enshittified) version of Facebook's free 405B model.
Why not let us try the new model for free, like the 5 uses available for the 70B model? Seems like a no-brainer to hook new users if what you're selling is worth it, eh?
> The model, based on Meta Llama 3.1 8B, runs on a Phind-customized NVIDIA TensorRT-LLM inference server that offers extremely fast speeds on H100 GPUs. We start by running the model in FP8, and also enable flash decoding and fused CUDA kernels for MLP.
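For anyone unfamiliar with what "running the model in FP8" means in practice, here's a minimal sketch of per-tensor FP8 (e4m3) quantization in plain PyTorch. This is purely illustrative (made-up sizes, no TensorRT-LLM involved), not Phind's actual serving code:

```python
# Illustrative sketch: per-tensor FP8 (e4m3) quantization of a weight matrix,
# roughly the kind of conversion applied when serving a model in FP8.
# Not Phind's or TensorRT-LLM's actual code; it just shows the idea.
import torch

E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def quantize_fp8(weight: torch.Tensor):
    """Return an FP8 tensor plus the per-tensor scale needed to dequantize."""
    scale = weight.abs().max().clamp(min=1e-12) / E4M3_MAX
    q = (weight / scale).to(torch.float8_e4m3fn)  # requires PyTorch >= 2.1
    return q, scale

def dequantize_fp8(q: torch.Tensor, scale: torch.Tensor):
    return q.to(torch.float32) * scale

if __name__ == "__main__":
    w = torch.randn(4096, 4096)
    q, s = quantize_fp8(w)
    err = (dequantize_fp8(q, s) - w).abs().mean().item()
    print(f"mean abs quantization error: {err:.6f}")
```

The win is that weights and activations occupy half the memory of FP16 and H100 tensor cores run FP8 matmuls at roughly twice the throughput, which is why FP8 is a common first step for fast inference.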
as far as i know you are running your own GPUs - what do you do in overload? have a queue system? what do you do in underload? just eat the costs? is there a "serverless" system here that makes sense/is anyone working on one?
We run the nodes "hot" and close to overload for peak throughput. That's why NVIDIA's XQA innovation was so interesting: it allows much higher throughput for a given latency budget: https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source....
Serverless would make more sense if we had a significant underutilization problem.
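To make the "run hot within a latency budget" idea concrete, here's a hypothetical sketch of an admission-controlled batching queue: keep the GPU saturated, and shed requests once the estimated queueing delay would blow the budget. All names and numbers are assumptions for illustration, not Phind's actual serving stack:

```python
# Hypothetical sketch of "running hot": batch requests for the GPU and
# reject new work once the estimated wait exceeds the latency budget.
import asyncio

LATENCY_BUDGET_S = 2.0    # assumed end-to-end budget per request
EST_BATCH_TIME_S = 0.25   # assumed time to process one batch on the GPU
MAX_BATCH_SIZE = 32
queue: asyncio.Queue = asyncio.Queue()

async def submit(prompt: str):
    """Admit a request only if the current queue depth still fits the budget."""
    est_wait = (queue.qsize() / MAX_BATCH_SIZE) * EST_BATCH_TIME_S
    if est_wait > LATENCY_BUDGET_S:
        raise RuntimeError("overloaded: shedding request")  # e.g. return HTTP 429
    fut = asyncio.get_running_loop().create_future()
    await queue.put((prompt, fut))
    return await fut

async def batch_worker():
    """Drain the queue in batches, keeping the accelerator as busy as possible."""
    while True:
        batch = [await queue.get()]
        while len(batch) < MAX_BATCH_SIZE and not queue.empty():
            batch.append(queue.get_nowait())
        await asyncio.sleep(EST_BATCH_TIME_S)  # stand-in for the real forward pass
        for prompt, fut in batch:
            fut.set_result(f"completion for: {prompt!r}")

async def main():
    asyncio.create_task(batch_worker())
    results = await asyncio.gather(*(submit(f"q{i}") for i in range(10)))
    print(results[0])

if __name__ == "__main__":
    asyncio.run(main())
```

The trade-off is exactly what the comment describes: sizing capacity so the queue rarely sheds load means utilization stays high, and serverless-style elasticity only starts to pay off when the fleet would otherwise sit idle.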