Frank Ch. Eigler mentions that elfutils has a more modern unwinding
library.
Could that perhaps solve your performance issues with libunwind?
I don't think so. The problem is two-fold.
First, we have to capture enough of the stack to do offline unwinding. I think the default
many people use here is about 8 KB of stack. While the instruction pointer array might fit
in a couple of cachelines, you now have an additional few pages to copy as well. And you
probably want those pages aligned in your capture format. So now you need to interleave
multiple types of data frames while padding for alignment.
Now do that a few thousand times a second.
The overhead here can be so great that it obscures what you're trying to find.
Furthermore, there's a good chance that you'll cause CPU packages to spin up to a
higher frequency, thus hiding the exact performance issues you want to find or reduce.
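
To put rough numbers on the capture cost, here is a back-of-the-envelope sketch in Python.
The sample rate, the 16-entry instruction pointer array, and the 4 KB page size are
assumptions picked for illustration, not measurements from Sysprof or any other profiler.

```python
# Back-of-the-envelope estimate; every constant here is an illustrative assumption.
PAGE_SIZE = 4096          # assumed page size in bytes
STACK_BYTES = 8 * 1024    # the ~8 KB stack copy mentioned above
IP_ARRAY_BYTES = 16 * 8   # assumed: 16 frames of 8-byte instruction pointers
SAMPLES_PER_SEC = 4000    # assumed: "a few thousand times a second"


def align_up(n: int, alignment: int) -> int:
    """Round n up to the next multiple of alignment."""
    return (n + alignment - 1) // alignment * alignment


def bytes_per_sample(header_bytes: int, stack_bytes: int, alignment: int) -> int:
    """Header padded so the stack copy starts aligned, plus the padded stack copy."""
    return align_up(header_bytes, alignment) + align_up(stack_bytes, alignment)


# Frame-pointer style capture: only the instruction pointer array per sample.
ip_only_rate = IP_ARRAY_BYTES * SAMPLES_PER_SEC

# Stack-copy capture: the same array, then a page-aligned copy of the stack.
stack_copy_rate = bytes_per_sample(IP_ARRAY_BYTES, STACK_BYTES, PAGE_SIZE) * SAMPLES_PER_SEC

print(f"IP arrays only:    {ip_only_rate / 1e6:.2f} MB/s")
print(f"with stack copies: {stack_copy_rate / 1e6:.2f} MB/s")
```

Under these assumptions the capture stream grows from roughly half a megabyte per second to
roughly 49 MB per second, which is the kind of jump that turns a few MB of recording into a
few GB.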
Second, say you've done the work and captured the stacks (turning what was a few MB of
recording into a few GB). Now you need to decode them. We keep many lookaside maps and
interval trees in Sysprof to keep this overhead low, but now you have to
reference .eh_frame/DWARF data. This is the slowest part of the whole process. What currently
takes a second or two could easily take you 10 minutes.
Now I understand not everyone has ADHD like me, but I won't even remember what I was doing
10 minutes later.
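
For readers unfamiliar with the lookaside maps mentioned above, the idea can be sketched as a
sorted list of address ranges searched by bisection. This is a minimal illustration, not
Sysprof's implementation; the `Mapping` and `AddressMap` names and the sample ranges are
hypothetical.

```python
import bisect
from typing import Iterable, NamedTuple, Optional


class Mapping(NamedTuple):
    start: int   # first address covered by the range
    end: int     # one past the last address covered
    name: str    # hypothetical label, e.g. the mapped file


class AddressMap:
    """Sorted, non-overlapping address ranges with O(log n) lookup."""

    def __init__(self, mappings: Iterable[Mapping]) -> None:
        self._mappings = sorted(mappings)
        self._starts = [m.start for m in self._mappings]

    def lookup(self, address: int) -> Optional[Mapping]:
        # Find the last range whose start is <= address, then check its end.
        i = bisect.bisect_right(self._starts, address) - 1
        if i >= 0 and address < self._mappings[i].end:
            return self._mappings[i]
        return None


# Hypothetical ranges, only for demonstration.
maps = AddressMap([
    Mapping(0x4000, 0x5000, "app"),
    Mapping(0x1000, 0x2000, "libexample.so"),
])

hit = maps.lookup(0x1800)    # falls inside libexample.so
miss = maps.lookup(0x3000)   # falls in the gap between the two ranges
```

Resolving an instruction pointer this way is a handful of comparisons. Unwinding from a
copied stack additionally has to locate and interpret the unwind information for every
frame, which is where the decode time goes.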