The strange thing is, I did have /MT:32 on (I added that in a comment at the bottom of the page because I had to go to bed). I like to stick with defaults, but I'm not that inept. /J (unbuffered I/O) probably shouldn't matter for my use case, because 125 MB/s just isn't that much in the grand scheme of things.
Performing parallel copies is probably the big win with less than 10 Gb/s of network bandwidth. It lets SMB Multichannel spread the transfer across multiple connections, which hides some of the slowness you can get with a single TCP connection.
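To make the parallel-copy idea concrete, here is a rough Python sketch of copying a tree with a pool of worker threads. It is not how robocopy implements /MT, just an illustration of the same idea; the function name `copy_tree_parallel` and its parameters are made up for this example, and `workers=32` only mirrors the /MT:32 setting mentioned above.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def copy_tree_parallel(src, dst, workers=32):
    """Copy every file under src to dst using a pool of worker threads.

    Hypothetical helper for illustration only; workers=32 mirrors /MT:32.
    """
    src, dst = Path(src), Path(dst)
    files = [p for p in src.rglob("*") if p.is_file()]

    # Create the destination directories up front so the workers
    # only have to copy file data.
    for p in files:
        (dst / p.relative_to(src)).parent.mkdir(parents=True, exist_ok=True)

    def copy_one(p):
        target = dst / p.relative_to(src)
        shutil.copy2(p, target)  # copies contents plus timestamps/metadata
        return target

    # Several copies are in flight at once, so one slow transfer
    # does not stall the rest.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(copy_one, files))
```

Calling `copy_tree_parallel(source_dir, dest_dir)` returns the list of destination paths. Whether the extra threads actually help depends on the share, the disks, and the file sizes, so treat the worker count as something to measure rather than a recommendation.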
When doing more than 1-2 GB/s of I/O, the page cache can start to slow things down. That’s when unbuffered (direct) I/O starts to show a lot of benefit.