Compile with -fno-omit-frame-pointer on x86_64?

Wed Nov 3 19:20:59 UTC 2010

On Wed, 2010-11-03 at 19:58 +0100, Jakub Jelinek wrote:
> On Wed, Nov 03, 2010 at 02:48:12PM -0400, Owen Taylor wrote:
> > Lack of decent profiling is a major problem for making our operating
> > system fast. By far the most effective form of profiling is sampling
> > profiling with callgraph information.
> > 
> > Soeren's comment from March:
> > 
> >  http://lwn.net/Articles/380582/
> > 
> > Basically summarizes the situation, and as far as I know nothing has
> > changed ... with default compilation options, getting callgraph
> > profiling on x86_64 really requires a DWARF unwinder in the kernel.
> > Which seems unlikely to happen.
> 
> But that's the right thing to do.
> 
> > As a developer, your options for profiling are:
> > 
> >  - Recompile everything you care about profiling 
> >    with -fno-omit-frame-pointer instead of using system packages.
> 
> Instead of this, which really is a big performance penalty. 

Do you have a sense of how big "big" is here? I know in compiler terms,
1% is big, but we're nowhere close to wringing the last 1% out of
overall Fedora performance. If you create a sufficiently
complex system, there's lots of "stupid" stuff going on. And you can't
find the stupid stuff without appropriate tools.
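
To make concrete what -fno-omit-frame-pointer buys a sampling profiler,
here is a rough sketch in Python of the walk a frame-pointer-based
unwinder does. It is a simulation over a dictionary standing in for
stack memory, not kernel code; the function name, the addresses, and
the example stack are all made up for illustration. The only real fact
it relies on is the x86_64 layout when frame pointers are kept: the
saved caller frame pointer sits at [rbp] and the return address at
[rbp+8].

```python
WORD = 8  # bytes per stack slot on x86_64


def walk_frame_pointers(memory, fp, max_depth=64):
    """Follow the saved-frame-pointer chain and collect return addresses.

    memory: dict mapping address -> 8-byte value (a stand-in for the stack)
    fp:     frame pointer of the innermost frame (the sampled %rbp)
    """
    return_addresses = []
    for _ in range(max_depth):
        # Stop on a null or unreadable frame instead of faulting.
        if fp == 0 or fp not in memory or fp + WORD not in memory:
            break
        return_addresses.append(memory[fp + WORD])
        next_fp = memory[fp]
        # The stack grows down, so a caller's frame must be at a higher
        # address; anything else means the chain is finished or corrupt.
        if next_fp <= fp:
            break
        fp = next_fp
    return return_addresses


# Hypothetical three-frame stack: leaf -> middle -> outermost.
example_stack = {
    0x7000: 0x7020, 0x7008: 0x401234,
    0x7020: 0x7040, 0x7028: 0x401500,
    0x7040: 0x0,    0x7048: 0x401900,
}

callchain = walk_frame_pointers(example_stack, 0x7000)
print([hex(addr) for addr in callchain])
```

The whole walk is two memory reads per frame and needs no debuginfo,
which is why it is cheap enough to do at sample time. A DWARF unwinder
has to look up and interpret unwind tables for each frame instead.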

> Even i?86 is
> changing in GCC 4.6 to not do -fno-omit-frame-pointer by default.
> The unwind info recent GCCs provide is correct even in epilogues and can be
> relied upon.  There are several lightweight unwinders that can be easily
> adapted for kernel purposes.  Just talk to the systemtap folks.

It seems like if it were that easy, it would have happened and we'd have
a solution in the upstream kernel...

(One thing that definitely makes things tricky is paging in debuginfo. I
think I saw a discussion somewhere saying that systemtap was preemptively
paging in all debuginfo for traced modules. That's tricky in systemwide
profiling situations, but maybe you could have something where you do
one run, load the debuginfo for everything that was hit in the first
run, then do a second run.)
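
The two-run idea could look roughly like the following Python sketch.
Everything here is hypothetical: the module names, the address ranges,
and the function names are invented for illustration, and this is not
how systemtap or any existing profiler works. The first run only
records raw sampled addresses; from those we work out which modules
were hit, and only those would have their debuginfo loaded before the
second run.

```python
def modules_hit(samples, module_ranges):
    """First run: map raw sampled addresses to the modules containing them."""
    hit = set()
    for addr in samples:
        for name, (start, end) in module_ranges.items():
            if start <= addr < end:
                hit.add(name)
                break
    return hit


def plan_second_run(samples, module_ranges):
    """Return the sorted list of modules whose debuginfo to preload."""
    return sorted(modules_hit(samples, module_ranges))


# Hypothetical address map and first-run samples.
module_ranges = {
    "libgtk": (0x1000, 0x2000),
    "libcairo": (0x2000, 0x3000),
    "libunused": (0x3000, 0x4000),
}
first_run_samples = [0x1010, 0x2ABC, 0x1F00, 0x9999]

to_preload = plan_second_run(first_run_samples, module_ranges)
print(to_preload)
```

Samples that fall outside every known module are simply ignored, and a
module that was never hit costs nothing in the second run.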

> There is always callgrind if you don't want to recompile anything and
> need to profile something even when kernel doesn't support it.

callgrind is reasonable if you have a single program that is slow and
where the slowness is pretty much straight-up CPU.

But we're seldom trying to profile "a program" - we are trying to
profile system situations that involve several programs and the kernel.

And programs are frequently not straight-up bound on things that
valgrind can easily model. For example, if our program is reading from
uncached graphics memory somewhere, that won't show up at all in
callgrind - to callgrind, it's just memory reads. But it may dominate a
more accurate sampled profile.

Plus the performance hit of callgrind makes it not very useful for
profiling real-time interactive user interfaces.

- Owen