Hi,
On June 16, 2022 8:53:59 PM UTC, Ben Cotton <bcotton(a)redhat.com> wrote:
https://fedoraproject.org/wiki/Changes/fno-omit-frame-pointer
This document represents a proposed Change. As part of the Changes
process, proposals are publicly announced in order to receive
community feedback. This proposal will only be implemented if approved
by the Fedora Engineering Steering Committee.
== Summary ==
Fedora will add -fno-omit-frame-pointer to the default C/C++
compilation flags, which will improve the effectiveness of profiling
and debugging tools.
== Owner ==
* Name: [[User:daandemeyer| Daan De Meyer]], [[User:Dcavalca| Davide
Cavalca]], [[ Andrii Nakryiko]]
* Email: daandemeyer(a)fb.com, dcavalca(a)fb.com, andriin(a)fb.com
== Detailed Description ==
Credits to Mirek Klimos, whose internal note on stacktrace unwinding
formed the basis for this change proposal (myreggg(a)gmail.com).
Any performance or efficiency work relies on accurate profiling data.
Sampling profilers probe the target program's call stack at regular
intervals and store the stack traces. If we collect enough of them, we
can closely approximate the real cost of a library or function with
minimal runtime overhead.
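As a rough illustration of the paragraph above, the following Python sketch turns a handful of stack samples into per-function cost estimates. The sample data and the names (`estimate_costs`, `parse`, `read_token`, `render`) are made up for this example; a real profiler such as perf collects its samples from the running process.

```python
from collections import Counter


def estimate_costs(samples):
    """Return {function: fraction of samples with that function on the stack}.

    Each sample is a list of function names, outermost caller first.
    """
    if not samples:
        return {}
    inclusive = Counter()
    for stack in samples:
        # Count each function once per sample, even if it recurses.
        for func in set(stack):
            inclusive[func] += 1
    total = len(samples)
    return {func: count / total for func, count in inclusive.items()}


# Hypothetical samples, as if taken at regular intervals.
samples = [
    ["_start", "main", "parse", "read_token"],
    ["_start", "main", "parse"],
    ["_start", "main", "render"],
    ["_start", "main", "parse", "read_token"],
]
costs = estimate_costs(samples)
for func, share in sorted(costs.items(), key=lambda item: -item[1]):
    print(f"{func:12s} {share:.0%}")
```

The more samples are collected, the closer these fractions get to the real share of time spent in (or below) each function.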
A stack trace captures what is running on a thread. It should start with
clone - if the thread was created via the clone syscall - or with _start -
if it's the main thread of the process. The last function in the stack
trace is the code that the CPU is currently executing. If a stack starts with
[unknown] or any other symbol, it is incomplete.
=== Unwinding ===
How does the profiler get the list of function names? There are two parts to it:
# Unwinding the stack - getting a list of virtual addresses pointing
to the executable code
# Symbolization - translating virtual addresses into human-readable
information, like function name, inlined functions at the address, or
file name and line number.
Unwinding is what we're interested in for the purpose of this
proposal. The important things are:
* Data on the stack is split into frames, each frame belonging to one function.
* Right before each function call, the return address is put on the
stack. This is the instruction address in the caller to which we will
eventually return — and that's what we care about.
* One register, called the "frame pointer" or "base pointer" register
(RBP), is traditionally used to point to the beginning of the current
frame. Every function should back up RBP onto the stack and set it
properly at the very beginning.
The “frame pointer” part is achieved by adding push %rbp, mov
%rsp,%rbp to the beginning of every function and by adding pop %rbp
before returning. Using this knowledge, stack unwinding boils down to
traversing a linked list:
https://i.imgur.com/P6pFdPD.png
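The linked-list traversal can be sketched in Python by modelling stack memory as a dictionary. This is only a simulation under stated assumptions: the addresses are invented, `unwind` is a hypothetical helper, and the chain is assumed to end with a saved RBP of 0 in the outermost frame.

```python
WORD = 8  # bytes per stack slot on x86_64


def unwind(memory, rbp, max_depth=128):
    """Walk the saved-RBP chain and return the return addresses found.

    memory maps an address to the 8-byte value stored at that address.
    """
    return_addresses = []
    while rbp != 0 and len(return_addresses) < max_depth:
        # The return address sits one slot above the saved RBP.
        return_addresses.append(memory[rbp + WORD])
        # The saved RBP is the caller's frame pointer: the next list node.
        rbp = memory[rbp]
    return return_addresses


# Hypothetical stack with three frames, innermost first.
memory = {
    0x7000: 0x7020, 0x7008: 0x401136,  # innermost frame: saved RBP, return address
    0x7020: 0x7040, 0x7028: 0x401182,  # its caller
    0x7040: 0x0,    0x7048: 0x4011C0,  # outermost frame, chain ends here
}
trace = unwind(memory, 0x7000)
print([hex(address) for address in trace])
```

The `max_depth` guard is there because a corrupted or missing frame pointer can turn the chain into a loop or lead it into unrelated memory.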
Since you specifically use x86_64 assembly as an example here: have you looked at the
impact this will have on other architectures like ARM or RISC-V?
Cheers,
Dan
>
>=== Where’s the catch? ===
>
>The frame pointer register is not necessary to run a compiled binary.
>It makes it easy to unwind the stack, and some debugging tools rely on
>frame pointers, but the compiler knows how much data it put on the
>stack, so it can generate code that doesn't need the RBP. Not using
>the frame pointer register can make a program more efficient:
>
>* We don’t need to back up the value of the register onto the stack,
>which saves 3 instructions per function.
>* We can treat the RBP as a general-purpose register and use it for
>something else.
>
>Whether the compiler sets the frame pointer or not is controlled by the
>-fomit-frame-pointer flag and the default is "omit", meaning we can’t
>use this method of stack unwinding by default.
>
>To make it possible to rely on the frame pointer being available,
>we'll add -fno-omit-frame-pointer to the default C/C++ compilation
>flags. This will instruct the compiler to make sure the frame pointer
>is always available. This will in turn allow profiling tools to
>provide accurate performance data which can drive performance
>improvements in core libraries and executables.
>
>== Feedback ==
>
>=== Potential performance impact ===
>
>* Meta builds all its libraries and executables with
>-fno-omit-frame-pointer by default. Internal benchmarks did not show
>significant impact on performance when omitting the frame pointer for
>two of our most performance intensive applications.
>* Firefox recently landed a change to preserve the frame pointer in
>all jitted code
>(https://bugzilla.mozilla.org/show_bug.cgi?id=1426134). No significant
>decrease in performance was observed.
>* Kernel 4.8 frame pointer benchmarks by SUSE showed 5%-10%
>regressions in some benchmarks
>(https://lore.kernel.org/all/20170602104048.jkkzssljsompjdwy@suse.de/T/#u)
>
>Should individual libraries or executables notice a significant
>performance degradation caused by including the frame pointer
>everywhere, these packages can opt-out on an individual basis as
>described in
>https://docs.fedoraproject.org/en-US/packaging-guidelines/#_compiler_flags.
>
>=== Alternatives to frame pointers ===
>
>There are a few alternative ways to unwind stacks instead of using the
>frame pointer:
>
>* [https://dwarfstd.org DWARF] data - The compiler can emit extra
>information that allows us to find the beginning of the frame without
>the frame pointer, which means we can walk the stack exactly as
>before. The problem is that we need to unwind the stack in kernel
>space which isn't implemented in the kernel. Given that the kernel
>implemented its own format (ORC) instead of using DWARF, it's
>unlikely that we'll see a DWARF unwinder in the kernel any time soon.
>The perf tool allows you to use the DWARF data with
>--call-graph=dwarf, but this means that it copies the full stack on
>every event and unwinds in user space. This has very high overhead.
>* [https://www.kernel.org/doc/html/v5.3/x86/orc-unwinder.html ORC]
>(undwarf) - problems with unwinding in the kernel led to the creation of
>another format with the same purpose as DWARF, just much simpler. This
>can only be used to unwind kernel stack traces; it doesn't help us
>with userspace stacks. More information on ORC can be found
>[https://lwn.net/Articles/728339 here].
>* [https://lwn.net/Articles/680985 LBR] - New Intel CPUs have a
>feature that gives you source and target addresses for the last 16 (or
>32, in newer CPUs) branches with no overhead. It can be configured to
>record only function calls and to be used as a stack, which means it
>can be used to get the stack trace. Sadly, you only get the last X
>calls, and not the full stack trace, so the data can be very
>incomplete. On top of that, many Fedora users might still be using
>CPUs without LBR support which means we wouldn't be able to assume
>working profilers on a Fedora system by default.
>
>To summarize, if we want complete stacks with reasonably low overhead
>(which we do, since there's no other way to get accurate profiling data
>from running services), frame pointers are currently the best option.
>
>== Benefit to Fedora ==
>
>Implementing this change will provide profiling tools with easy access
>to stacktraces of installed libraries and executables which will lead
>to more accurate profiling data in general. This in turn can be used
>to implement optimizations to core libraries and executables which
>will improve the overall performance of Fedora itself and the wider
>Linux ecosystem.
>
>Various debugging tools can also make use of the frame pointer to
>access the current stacktrace, although tools like gdb can already do
>this to some degree via embedded DWARF debugging info.
>
>== Scope ==
>* Proposal owners: Put up a PR to change the rpm macros to build
>packages with -fno-omit-frame-pointer by default.
>
>* Other developers: Review and merge the PR implementing the Change.
>
>* Release engineering: [https://pagure.io/releng/issues #Releng issue
>number]. A mass rebuild is required.
>
>* Policies and guidelines: N/A (not needed for this Change)
>
>* Trademark approval: N/A (not needed for this Change)
>
>* Alignment with Objectives: N/A
>
>== Upgrade/compatibility impact ==
>
>This should not impact upgrades in any way.
>
>== How To Test ==
>
># Build the package with the updated rpm macros
># Profile the binary with `perf record -g <binary>`
># Inspect the perf data with `perf report -g 'graph,0.5,caller'`
># When expanding hot functions in the perf report, perf should show
>the full call graph of the hot function (at least for all functions
>that are part of the binary compiled with -fno-omit-frame-pointer)
>
>== User Experience ==
>
>Fedora users will be more likely to have a streamlined experience when
>trying to debug/profile system executables/libraries. Tools such as
>perf will work out of the box instead of requiring users to provide
>extra options (e.g. --call-graph=dwarf/LBR) or requiring users to
>recompile all relevant packages with -fno-omit-frame-pointer.
>
>== Dependencies ==
>
>The rpm macros for Fedora need to be adjusted to include
>-fno-omit-frame-pointer in the default C/C++ compilation flags.
>
>== Contingency Plan ==
>
>* Contingency mechanism: The new version can be released without every
>package being rebuilt with -fno-omit-frame-pointer. Profiling will only
>work perfectly once all packages have been rebuilt but there will be
>no regression in behavior if not all packages have been rebuilt by the
>time of the release. If the Change is found to introduce unacceptable
>regressions, the PR implementing it can be reverted and affected
>packages can be rebuilt.
>* Contingency deadline: Final freeze
>* Blocks release? No
>
>== Documentation ==
>
>* Original proposal for in-kernel DWARF unwinder (rejected):
>https://lkml.org/lkml/2017/5/5/571
>
>== Release Notes ==
>
>Packages are now compiled with frame pointers included by default.
>This will enable a variety of profiling and debugging tools to show
>more information out of the box.
>
>