Profilers also say nothing about queueing, and can very much mislead you if latency specifically is what you care about.
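To make the queueing point concrete, here's a minimal sketch (toy single-server queue, all numbers invented): the per-request CPU time a sampling profiler would attribute stays flat, while the latency the caller sees is dominated by time spent waiting for the server to come free.

    import random

    random.seed(0)

    # Toy single-server queue (numbers invented): requests arrive at random,
    # each needs ~5 ms of actual CPU work, server runs at ~90% utilisation.
    ARRIVAL_RATE = 0.18   # requests per ms
    SERVICE_MS = 5.0
    N = 50_000

    now = 0.0
    server_free_at = 0.0
    service_total = 0.0
    latency_total = 0.0

    for _ in range(N):
        now += random.expovariate(ARRIVAL_RATE)   # next arrival time
        start = max(now, server_free_at)          # wait if the server is busy
        server_free_at = start + SERVICE_MS
        service_total += SERVICE_MS               # what a CPU profile attributes
        latency_total += server_free_at - now     # what the caller observes

    print(f"mean on-CPU time per request: {service_total / N:5.1f} ms")
    print(f"mean latency per request:     {latency_total / N:5.1f} ms")
    # The gap between the two is pure queueing delay: every profiler sample
    # lands in the ~5 ms of real work, and the time spent waiting behind
    # other requests never shows up in the profile even though it dominates
    # the latency here.

In a real service the wait might be a thread pool, a connection pool, or a lock rather than a literal queue, but the profile is equally silent about all of them.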
If your "slowness" is driven by a single function (or made of truly uncorrelated events), you can accurately measure your tails with a profile. If not, a trace will give you meaningfully more information.
Your slowness is always a function of the underlying building blocks, their latency distributions, and where the bottleneck sits. And sure, two 90th percentiles can combine into a 99th percentile. But a profiler won't magically convey what sequence of operations a request performs under the hood.
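To put numbers on the "two p90s" point: if a request touches two independent stages, both being at or above their own p90 happens on roughly 1% of requests, so the request-level p99 is at least as bad as two stage p90s combined. A minimal simulation sketch (the stage latency distribution is made up purely for illustration):

    import random

    random.seed(0)

    # Hypothetical stage latency: exponential with ~10 ms mean (made-up numbers,
    # purely to show how per-stage percentiles combine at the request level).
    def stage_latency():
        return random.expovariate(1 / 10.0)

    N = 100_000
    stage_samples = [stage_latency() for _ in range(N)]
    request_samples = [stage_latency() + stage_latency() for _ in range(N)]

    def percentile(samples, p):
        return sorted(samples)[int(len(samples) * p) - 1]

    p90_stage = percentile(stage_samples, 0.90)
    p99_request = percentile(request_samples, 0.99)

    print(f"stage p90:       {p90_stage:6.1f} ms")
    print(f"two stage p90s:  {2 * p90_stage:6.1f} ms")
    print(f"request p99:     {p99_request:6.1f} ms")
    # With independent stages, both being at or above their own p90 happens on
    # ~1% of requests, so the request p99 is at least as bad as two stage p90s
    # combined -- even though neither stage looks extreme profiled on its own.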
I agree that visibility into requests via tracing can help you zoom in on the problem. But so can metrics on the underlying systems: if there's a queue in your system, for example, you can look at that queue's performance.
I'll admit that most of my experience is in tuning for maximal throughput rather than for a given percentile; as a rule of thumb, high-throughput systems tend to show a much flatter latency distribution at a given workload. I also tend to think in terms of a "budget" for the various parts of the system to reach my desired performance characteristics. That's a luxury you don't have on "legacy" systems where you need to troubleshoot some behaviour, and where tracing also gives you a "cut" through the stack that shows what's going on.
Have a look at KUtrace [https://github.com/dicksites/KUtrace]. It does.
Random nit (obviously not having read the book): I would say p99 behaviour can be captured by profiling; it's just going to be the p99 of the profiles. E.g. if you sample 10,000 stack traces from your executable, 1% of them will fall in that p99, more or less by definition. Tracing through requests is useful, but based on my experience I wouldn't make as strong a statement as saying it's the only way of understanding p99 performance.
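For what it's worth, a time-proportional sampler does surface tail-heavy code when one path dominates, since samples follow wall-clock time rather than request counts. A rough sketch with invented numbers:

    import random

    random.seed(1)

    # Toy workload (numbers invented): 99% of requests take a fast path (~1 ms),
    # 1% take a slow path (~200 ms). A sampling profiler collects stacks in
    # proportion to wall-clock time spent, not in proportion to request counts.
    N = 100_000
    requests = [("slow_path", 200.0) if random.random() < 0.01 else ("fast_path", 1.0)
                for _ in range(N)]

    total_time = sum(t for _, t in requests)
    time_by_path = {}
    for path, t in requests:
        time_by_path[path] = time_by_path.get(path, 0.0) + t

    for path, t in sorted(time_by_path.items()):
        # Share of wall-clock time ~= share of samples a time-based sampler takes.
        print(f"{path}: {100 * t / total_time:4.1f}% of expected samples")
    # The slow path is only 1% of requests but roughly two thirds of the time,
    # so a tail driven by one code path is hard to miss in a flat profile --
    # though aggregating by path also discards which request each sample came from.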