You will indeed need higher precision if you need a very large number of digits (more than what double precision gives you).
However, even with high precision, the algorithms you use to compute your moments matter: you might not get as many digits as you hope for. Variance, for example, is famously numerically unstable when computed naively (it can even come out negative). A stable summation algorithm plugged into a variance computation (even a carefully written one) can sometimes give you six additional digits of precision that doubling the working precision would not. It depends on how much cancellation your operations produce: if it is severe enough, doubling the precision is not enough to recover, while a compensated summation does fine.
I have seen this play out once in a clustering algorithm that would only work on some data when using a compensated summation to compute the variance.
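To make that failure mode concrete, here is a minimal Python sketch (my own illustration, not the clustering code mentioned above) contrasting the naive one-pass formula with a two-pass computation whose inner sums are Kahan-compensated:

    import random

    def naive_variance(xs):
        # Textbook one-pass formula E[x^2] - E[x]^2: the two terms are
        # nearly equal, so most significant digits cancel and the result
        # can even come out negative.
        n = len(xs)
        return sum(x * x for x in xs) / n - (sum(xs) / n) ** 2

    def kahan_sum(xs):
        # Kahan compensated summation: c carries the low-order bits lost
        # each time a small term is added to the large running total.
        total, c = 0.0, 0.0
        for x in xs:
            y = x - c
            t = total + y
            c = (t - total) - y
            total = t
        return total

    def two_pass_variance(xs):
        # Shift by the mean first, then average the squared deviations,
        # using the compensated accumulator for both sums.
        n = len(xs)
        mean = kahan_sum(xs) / n
        return kahan_sum([(x - mean) ** 2 for x in xs]) / n

    random.seed(0)
    # Large offset, unit spread: the true variance is close to 1.0.
    data = [1e9 + random.gauss(0.0, 1.0) for _ in range(100_000)]
    print(naive_variance(data))     # typically far off, possibly negative
    print(two_pass_variance(data))  # close to 1.0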
You are also right that if your problem genuinely needs more than double precision, quad precision alone is unlikely to be enough. The numerical inverse Laplace transform is an example: in some cases you need thousands of digits. The cases where double precision is not enough but quad precision is are probably rare. Still, it's worth a try.
It's interesting that 64 bits are often necessary (as opposed to 32) but 128 bits so rarely are.
1. What are the implications when parallelization is considered? I.e., are those accurate sum algorithms parallelizable, and how do they compare to a traditional parallelized sum? 2. What performance penalty should I be expecting?
2. It varies wildly depending on the algorithm you use, its implementation, and the operations you perform outside the summation (if you are summing the results of a loop that does non-trivial work, that work can easily overlap with the summation overhead, making it essentially free). The only rule of thumb I can give is that runtime cost increases with the accuracy you ask for. My recommendation would be to switch from a naive sum to an exact accumulator and observe the difference in output value (telling you how impactful numerical errors in that sum were) and in runtime (telling you how expensive the switch is for you), then dial it down if needed.
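As a concrete version of that experiment (assuming Python here; math.fsum plays the role of the exact accumulator, and the libraries linked below offer equivalents in other languages):

    import math
    import random

    random.seed(0)
    # A sum with heavy cancellation: large terms of mixed sign that cancel
    # exactly, plus many small terms whose total is what we actually want.
    big = [1e16, -1e16] * 1_000
    small = [random.uniform(-1.0, 1.0) for _ in range(100_000)]
    values = big + small
    random.shuffle(values)

    naive = sum(values)        # ordinary left-to-right summation
    exact = math.fsum(values)  # exact accumulator: tracks the rounding
                               # errors instead of discarding them
    print(naive, exact)
    # The gap between the two values tells you how much the naive sum was
    # losing; timing both (e.g. with timeit) tells you what the switch costs.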
In general, a well-placed compensated (or exact) sum/dot product algorithm will solve most of your problems at a low cost. Most people are familiar with Kahan summation (and related tricks like sorting the numbers before summing), but they do not know that it does not help much (often less than switching to 128-bit floats would) and that the state of the art has moved quite a bit since. Nowadays, you can ensure that your sum/BLAS operation produces a result within one ulp of the exact result!
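To give a flavour of what those newer algorithms are built from, here is a small Python sketch (my own illustration of the ideas covered in the references below, not code taken from them) of the twoSum error-free transformation and a cascaded compensated sum in the style of Ogita, Rump and Oishi:

    def two_sum(a, b):
        # Error-free transformation (Knuth): s is the rounded sum and err
        # is the exact rounding error, so s + err == a + b exactly.
        s = a + b
        bp = s - a
        err = (a - (s - bp)) + (b - bp)
        return s, err

    def sum2(xs):
        # Keep the running total and, separately, every rounding error it
        # discards; the result behaves roughly as if the sum had been done
        # in twice the working precision. Iterating the idea (SumK, AccSum,
        # ...) is how you reach results within one ulp of the exact sum.
        total, err_total = 0.0, 0.0
        for x in xs:
            total, err = two_sum(total, x)
            err_total += err
        return total + err_total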
I highly recommend perusing the Handbook of Floating-Point Arithmetic (the Accurate Algorithm online page also seems fine [0]) to get an idea of what can be done. Implementations of these ideas can be found in most languages used for numerical computation (a quick Google search gave me options for C++/Python [1], Rust [2], etc.).
[0]: https://accurate-algorithms.readthedocs.io/en/latest/index.h...
[1]: https://github.com/yafshar/xsum
[2]: https://github.com/bsteinb/accurate