The Versal AI Edge SOMs are mildly overpriced. The boards are worth it, but in the embedded space Nvidia offers the cheapest solutions, so an FPGA-based application will always need to justify the additional cost for slightly worse performance by arguing that it has latency requirements that a GPU cannot help with.
GPUs tend to perform worse when you have small batches and frequent kernel launches. This is especially annoying in cases where a simple kernel-wide synchronization barrier would solve your problem, but CUDA expects you not to synchronize like that within a kernel; you're supposed to launch a sequence of kernels one after the other. That's not a good solution if a for loop over n iterations turns into n kernel launches.
kcb
CUDA offers grid-wide cooperative groups, which can synchronize pretty efficiently. And there are also CUDA Graphs if you know ahead of time which kernels you're launching.
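For illustration, here is a rough sketch of the cooperative-groups approach: the iteration loop lives inside one kernel and `grid.sync()` acts as the barrier between iterations, instead of n separate launches. The kernel `smooth_iters`, the helper `run_smoothing`, and the toy smoothing step are made up for this example; it assumes a device that reports `cudaDevAttrCooperativeLaunch`, and older toolkits may additionally need `-rdc=true`.

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

namespace cg = cooperative_groups;

// Hypothetical kernel: runs n_iters smoothing steps in a single launch.
// Each step reads buffer a and writes buffer b, then the roles swap.
__global__ void smooth_iters(float* a, float* b, int n, int n_iters) {
    cg::grid_group grid = cg::this_grid();
    for (int it = 0; it < n_iters; ++it) {
        // Grid-stride loop so a grid smaller than n still covers every element.
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x) {
            float left  = a[i > 0 ? i - 1 : i];
            float right = a[i < n - 1 ? i + 1 : i];
            b[i] = 0.5f * (left + right);
        }
        grid.sync();                  // grid-wide barrier between iterations
        float* t = a; a = b; b = t;   // per-thread pointer swap (ping-pong)
    }
}

// Hypothetical helper: returns element 0 of the result, or -1.0f on failure.
float run_smoothing(int n, int n_iters) {
    int dev = 0, coop = 0;
    cudaDeviceGetAttribute(&coop, cudaDevAttrCooperativeLaunch, dev);
    if (!coop) return -1.0f;          // device cannot do cooperative launches

    const int threads = 256;
    int blocks_per_sm = 0, sm_count = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, smooth_iters,
                                                  threads, 0);
    cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, dev);
    // All blocks must be resident at once, so cap the grid at what fits.
    int blocks = blocks_per_sm * sm_count;
    if (blocks < 1) return -1.0f;

    std::vector<float> host(n, 1.0f);
    float *a = nullptr, *b = nullptr;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMemcpy(a, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    void* args[] = {&a, &b, &n, &n_iters};
    cudaError_t err = cudaLaunchCooperativeKernel(
        (void*)smooth_iters, dim3(blocks), dim3(threads), args, 0, 0);
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    // After an even number of swaps the result is in a, otherwise in b.
    float* result = (n_iters % 2 == 0) ? a : b;
    float out = -1.0f;
    if (err == cudaSuccess)
        cudaMemcpy(&out, result, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(a);
    cudaFree(b);
    return out;
}

int main() {
    printf("first element after smoothing: %f\n", run_smoothing(1 << 20, 100));
    return 0;
}
```

The catch is that the whole grid has to be resident on the device at once, which is why the sketch sizes the grid from the occupancy query rather than from n.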
agustamir
> FPGA based application will always need to justify the additional cost for slightly worse performance
Do you mean the FPGA has slightly worse performance? Care to elaborate?