Profilers also say nothing about queueing, and can very much mislead you if latency specifically is what you care about.
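To make the queueing point concrete, here's a minimal sketch (toy single-server queue, all numbers invented): the per-request CPU time a sampling profiler would attribute stays flat, while the latency the caller sees is dominated by time spent waiting for the server to come free.

    import random

    random.seed(0)

    # Toy single-server queue (numbers invented): requests arrive at random,
    # each needs ~5 ms of actual CPU work, server runs at ~90% utilisation.
    ARRIVAL_RATE = 0.18   # requests per ms
    SERVICE_MS = 5.0
    N = 50_000

    now = 0.0
    server_free_at = 0.0
    service_total = 0.0
    latency_total = 0.0

    for _ in range(N):
        now += random.expovariate(ARRIVAL_RATE)   # next arrival time
        start = max(now, server_free_at)          # wait if the server is busy
        server_free_at = start + SERVICE_MS
        service_total += SERVICE_MS               # what a CPU profile attributes
        latency_total += server_free_at - now     # what the caller observes

    print(f"mean on-CPU time per request: {service_total / N:5.1f} ms")
    print(f"mean latency per request:     {latency_total / N:5.1f} ms")
    # The gap between the two is pure queueing delay: every profiler sample
    # lands in the ~5 ms of real work, and the time spent waiting behind
    # other requests never shows up in the profile even though it dominates
    # the latency here.

In a real service the wait might be a thread pool, a connection pool, or a lock rather than a literal queue, but the profile is equally silent about all of them.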
If your "slowness" is driven by a single function (or made of truly uncorrelated events), you can accurately measure your tails with a profile. If not, a trace will give you meaningfully more information.
Your slowness is always a function of the underlying building blocks, their latency distributions, and where the bottleneck sits. And sure, two 90th percentiles can combine into a 99th percentile. But a profiler won't magically convey what sequence of operations a request performs under the hood.
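To put numbers on the "two p90s" point: if a request touches two independent stages, both being at or above their own p90 happens on roughly 1% of requests, so the request-level p99 is at least as bad as two stage p90s combined. A minimal simulation sketch (the stage latency distribution is made up purely for illustration):

    import random

    random.seed(0)

    # Hypothetical stage latency: exponential with ~10 ms mean (made-up numbers,
    # purely to show how per-stage percentiles combine at the request level).
    def stage_latency():
        return random.expovariate(1 / 10.0)

    N = 100_000
    stage_samples = [stage_latency() for _ in range(N)]
    request_samples = [stage_latency() + stage_latency() for _ in range(N)]

    def percentile(samples, p):
        return sorted(samples)[int(len(samples) * p) - 1]

    p90_stage = percentile(stage_samples, 0.90)
    p99_request = percentile(request_samples, 0.99)

    print(f"stage p90:       {p90_stage:6.1f} ms")
    print(f"two stage p90s:  {2 * p90_stage:6.1f} ms")
    print(f"request p99:     {p99_request:6.1f} ms")
    # With independent stages, both being at or above their own p90 happens on
    # ~1% of requests, so the request p99 is at least as bad as two stage p90s
    # combined -- even though neither stage looks extreme profiled on its own.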
I agree that visibility into requests via tracing can help you zoom in on the problem. But so can metrics on the underlying systems: if there's a queue in your system, for example, you can look at that queue's performance.
I'll admit that most of my experience is in tuning for maximal throughput rather than for a given percentile; as a rule of thumb, high-throughput systems tend to show a much flatter latency distribution at a given workload. I also tend to think in terms of a "budget" for the various parts of the system to reach my desired performance characteristics. That's a luxury you don't have on "legacy" systems where you need to troubleshoot some behaviour, and where tracing also gives you a "cut" through the stack that shows what's going on.
Have a look at KUtrace [https://github.com/dicksites/KUtrace]. It does.
Random nit (obviously not having read the book): I would say p99 behaviour can be captured by profiling; it's just going to be the p99 of the profiles. E.g. if you sample 10,000 stack traces from your executable, 1% of them will fall in that p99, more or less by definition. Tracing through requests is useful, but based on my experience I wouldn't make as strong a statement as saying it's the only way of understanding p99 performance.
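For what it's worth, a time-proportional sampler does surface tail-heavy code when one path dominates, since samples follow wall-clock time rather than request counts. A rough sketch with invented numbers:

    import random

    random.seed(1)

    # Toy workload (numbers invented): 99% of requests take a fast path (~1 ms),
    # 1% take a slow path (~200 ms). A sampling profiler collects stacks in
    # proportion to wall-clock time spent, not in proportion to request counts.
    N = 100_000
    requests = [("slow_path", 200.0) if random.random() < 0.01 else ("fast_path", 1.0)
                for _ in range(N)]

    total_time = sum(t for _, t in requests)
    time_by_path = {}
    for path, t in requests:
        time_by_path[path] = time_by_path.get(path, 0.0) + t

    for path, t in sorted(time_by_path.items()):
        # Share of wall-clock time ~= share of samples a time-based sampler takes.
        print(f"{path}: {100 * t / total_time:4.1f}% of expected samples")
    # The slow path is only 1% of requests but roughly two thirds of the time,
    # so a tail driven by one code path is hard to miss in a flat profile --
    # though aggregating by path also discards which request each sample came from.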