Richard W.M. Jones wrote:
> The problem is you're confusing general gains and gains in
> specific scenarios.

But the thing is that a gain in some specific scenario is a lot less useful
than a general gain. And the latter is usually achieved not through profiling,
but through improvements in toolchain optimizations. -fomit-frame-pointer
was one such improvement that you have now successfully destroyed for all
Fedora users.

> Perf + flamegraphs are such a useful tool that we managed to double
> performance (ie. ~ 100% gain) in one particular network server case
> that we investigated a few years ago. This was by spotting that the
> kernel was writing to an MSR (hardware register) which was really
> slow, and as it wasn't necessary we just got rid of it.
>
> For that one use case - an incredible performance gain! Does this
> mean everyone sees their machines double in speed? Of course not.

And that is why that improvement is much less impressive than it sounds at
first. Chances are it helps only a handful of users, in a handful of
situations, and even for those users, the overall improvement is not going to
be 100% because they will also be using software other than what you profiled
and optimized.
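
To put rough numbers on this, here is a minimal sketch using Amdahl's law.
The function name and the time shares (100%, 50%, 10%) are hypothetical
values picked for illustration, not measurements; only the 2x local speedup
comes from the example quoted above.

```python
def overall_speedup(fraction, local_speedup):
    """Amdahl's law: overall speedup when only `fraction` of the total
    runtime benefits from a `local_speedup`-fold improvement."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


# Hypothetical shares of a user's total time spent in the optimized server.
for fraction in (1.0, 0.5, 0.1):
    gain = (overall_speedup(fraction, 2.0) - 1.0) * 100.0
    print(f"{fraction:.0%} of time in optimized code: {gain:.1f}% overall gain")
```

Under these assumptions, the 100% gain only materializes for a user who runs
nothing but the optimized code path; at a 10% share, the overall gain drops
to about 5%.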

> Will we be able to say that "Fedora got N% faster" in two years?
> Not at all - it depends entirely what you use Fedora for.

This makes the claims made by the change proponents entirely unrealistic and
impossible to ever verify. We are hitting the end users with an overall
performance penalty in exchange for potential performance improvements that
are impossible not only to predict, but even to quantify after the fact. In
other words, the claim that the latter will more than compensate for the
former is completely unsubstantiated.

> The overhead is also a real thing. There's a few percent overhead
> everywhere for enabling frame pointers because every stack frame entry
> and exit involves a couple of extra instructions.

Exactly.

> Anyway I'd really urge you to play with these tools before judging
> this proposal:
> https://www.brendangregg.com/flamegraphs.html

KCachegrind, using Valgrind with the Callgrind or Cachegrind tool, gives me
more information than that even without frame pointers, and it is actually
reliable because it dynamically instruments the code and traces every single
instruction instead of just taking random samples and hoping it did not miss
anything important. It is also much more reproducible because it uses a
mathematical model for the CPU cycles instead of a wall-clock time sample
that depends not only on your particular CPU, but also on things such as
background tasks, thermal throttling, etc. Yes, it is slower (by up to a
factor of ~50), but only for the developer doing the profiling, and as
explained above, the reported cycle counts do not depend on the wall-clock
time anyway.
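
The difference between counting everything and sampling can be illustrated
with a toy model. This is only a sketch under made-up assumptions: the
workload, the function names, and the sample count are hypothetical, and the
code does not reproduce how Valgrind or perf work internally.

```python
import random

# Hypothetical workload: (function name, instructions executed per run).
WORKLOAD = [("parse", 700_000), ("compute", 299_000), ("rare_path", 1_000)]


def exact_profile():
    """Instrumentation-style profile: every instruction is counted."""
    return dict(WORKLOAD)


def sampled_profile(num_samples, seed):
    """Sampling-style profile: look at only num_samples random instants."""
    rng = random.Random(seed)
    names = [name for name, _ in WORKLOAD]
    weights = [cost for _, cost in WORKLOAD]
    counts = {name: 0 for name in names}
    # Each sample lands in a function with probability proportional to its cost.
    for name in rng.choices(names, weights=weights, k=num_samples):
        counts[name] += 1
    return counts


print("exact:  ", exact_profile())
for seed in (1, 2, 3):
    print("sampled:", sampled_profile(100, seed))
```

The exact profile is identical on every run and always includes rare_path,
whereas the sampled profile varies with the seed and, at 100 samples, will
often not see rare_path at all.
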
Kevin Kofler