Why does the unwinding need to happen in the kernel? The kernel can
already asynchronously invoke userspace code in the form of signal
handlers. Is the problem that it is necessary to collect profiling
information in the middle of a system call, where another syscall
would see inconsistent (and potentially exploitable) kernel state?
One of the primary values of system-wide profiling is being able to see how a particular
library call might have caused undesirable code paths due to a syscall, and where/what was
reached (given high enough sampling rate).
Does that need to happen in kernel space? I don't know, perse, other than perf needs
to be able to do that work as it is what gives us the array of instruction pointers back.
There was some chatter a number of years ago in perf about how to handle ORC from
user-space, and if I'm summing this up correctly, it was basically..
- When sampling in PERF_CONTEXT_KERNEL, stop unwinding at the syscall boundary
- Append stacktrace samples to perf buffer ring
- Upon rescheduling, backtrace a single time into user-space, and expect the consumer to
know that N previous samples with matching task-id all have the user-space backtrace.
That's a pretty significant behavior change, and all tools would need surgery to
support it. I have no idea if that is paletable to either side of the debate, but it was
the one possible direction I saw.
It does have a number of pros, in that you can save a lot of unwinding time on
syscall-heavy workloads by doing user-space unwinding once, and from scheduler task queues
(so you can take faults), and can avoid the NMI context being the cost-center for
accounting. But the cons are significant in that the behavior change is expansive, effects
all tooling, and will require ORC data across the platform.
Ouch. That is a serious problem for a number of reasons, not least
of which is security. Having the kernel parse even more complex
untrusted input in C is a horrible idea.
It might seem that way by the description I gave, but we're just talking an array of
intptr_t or similar. There is no dereferencing or state machines like you have with DWARF.
Runtime resolution is also essentially bsearch() on interval arrays. I really don't
think it's the sort of thing that requires Rust.
As for eBPF, we'd still probably be in NMI context with this route, and would fail if
we had to page in ORC tables. So that means we'd either have to take a per-task memory
overhead to maintain the mutated form (probably unreasonable) or find a way for that to be
done from the task's space when returning from the syscall.
Christian, would this be sufficient for your needs?
I don't think so without significant work. The best case I see here is for perf to
support user-space unwinding within the task, be it ORC or DWARF, and that unwinder not
have to necessarily be in-tree with the kernel because we know they won't accept a
DWARF unwinder again.
It does pose some questions on what would happen with carefully crafted DWARF data.