Definitely a better high-level metric than nvidia-smi, and probably fine if you just want a very coarse idea of whether or not you are using the GPUs reasonably at all.
But when you get to the point where you care about a few percentage points of utilisation, it's just not reliable enough, as many things can push energy consumption either way. E.g. I had a case where the GPU cluster we were using wasn't being cooled well enough, so you would gradually see power draw getting lower and lower as the GPUs throttled themselves to avoid overheating.
You can also find cases where energy consumption is high but MFU/HFU isn't, like memory-intensive workloads.
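To make the gap concrete, here's a rough sketch of how you'd compute MFU from measured throughput and compare it to a power-based number. It assumes the usual ~6 FLOPs per parameter per token approximation for training (which ignores attention FLOPs), and all the names and numbers below are made up for illustration, apart from 312 TFLOPS being the A100's dense BF16 peak.

```python
def mfu(tokens_per_second, n_params, peak_flops_per_gpu, n_gpus=1):
    """Model FLOPs utilisation: achieved model FLOPs/s over peak hardware FLOPs/s.

    Uses the ~6 * n_params FLOPs-per-token training approximation
    (forward + backward), which ignores attention FLOPs.
    """
    achieved_flops_per_second = 6.0 * n_params * tokens_per_second
    return achieved_flops_per_second / (peak_flops_per_gpu * n_gpus)


def power_fraction(power_draw_watts, power_limit_watts):
    """Power draw as a fraction of the power limit (the coarse proxy)."""
    return power_draw_watts / power_limit_watts


# Hypothetical run: 7B-parameter model on 8 GPUs at 312 TFLOPS peak each,
# processing 25,000 tokens/s while each GPU draws 380 W of a 400 W limit.
example_mfu = mfu(tokens_per_second=25_000, n_params=7e9,
                  peak_flops_per_gpu=312e12, n_gpus=8)
example_power = power_fraction(power_draw_watts=380, power_limit_watts=400)

print(f"MFU: {example_mfu:.1%}, power: {example_power:.1%}")
```

With those made-up numbers the GPUs look ~95% busy by power but sit at ~42% MFU, which is the kind of mismatch you'd expect from a memory-bound workload.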