ddavis parent
OpenMP is great. I’ve done something similar to your second case (thread local objects that are filled in parallel and later combined). In the case of “OpenMP off” (pragmas ignored), is it possible to avoid the overhead of the thread local object essentially getting copied into the final object (since no OpenMP means only a single thread local object)? I avoided this by implementing a separate code path, but I’m just wondering if there are any tricks I missed that would allow still a single code path
Give one of the threads (thread ID 0, for instance) special privileges. Its list is the one everything else is appended to, then there's only concatenation or copying if you have more than one thread.
Or, pre-allocate the memory and let each thread write to its own subset of the final collection and avoid the combine step entirely. This works regardless of the number of threads you use so long as you know the maximum amount of memory you might need to allocate. If it has no calculable upper bound, you will need to use other techniques.