I've written an open-source driver for the decoding side of the nvjpg module found in the Tegra X1 (i.e. an earlier hardware revision than the one in the A100).
I ran some quick benchmarks against libjpeg-turbo, if that gives you an idea. I expect encoding performance would be similar.
Probably quite a bit, but I don't know for sure. The typical use case is to load thousands of JPEGs at once, so you get good throughput despite the copy overhead. You can see the benchmark against libjpeg-turbo here: https://developer.nvidia.com/blog/leveraging-hardware-jpeg-d...
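To make the latency vs. throughput point concrete, here is a toy model. It assumes each batch pays a fixed cost (setup, copies, synchronization) plus a constant cost per image. The function name and every number in it are made-up placeholders, not measurements of nvjpg or libjpeg-turbo.

```python
def batch_cost(n_images, fixed_overhead_ms, per_image_ms):
    """Toy model: return (total latency in ms, throughput in images/s) for one batch."""
    # Assumption: one fixed cost per batch, plus a linear cost per image.
    total_ms = fixed_overhead_ms + n_images * per_image_ms
    throughput = n_images / (total_ms / 1000.0)
    return total_ms, throughput


if __name__ == "__main__":
    # Placeholder numbers chosen only to show the shape of the curve.
    for n in (1, 10, 1000):
        total_ms, images_per_s = batch_cost(n, fixed_overhead_ms=5.0, per_image_ms=0.5)
        print(f"batch={n:5d}  latency={total_ms:8.1f} ms  throughput={images_per_s:8.1f} img/s")
```

Under this model a single image pays the whole fixed cost by itself, so its latency is poor, while a large batch amortizes that cost and throughput approaches 1 / per_image_ms.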
How does this impact the overall latency of encoding a single image?