In the common case where the processor dispatching those kernel calls is much faster than the kernel calls themselves, you're not likely to see a meaningful increase in throughput.
What you need to do first is get really optimized kernels (since that makes the dispatching relatively more expensive) and THEN this becomes worth doing. People who are really good a writing optimized GPU kernels are just not that easy to get a hold of right now.
What you need to do first is get really optimized kernels (since that makes the dispatching relatively more expensive) and THEN this becomes worth doing. People who are really good a writing optimized GPU kernels are just not that easy to get a hold of right now.