Abrt (was Re: Most buggy packages)

Jiri Moskovcak jmoskovc at redhat.com
Wed Feb 20 08:40:34 UTC 2013


Thank you Dave!
That's exactly the kind of ideas I was looking for.
Just a short summary of what we can do on the server (now) to get this 
brainstorm going:

- it has all the rpm debuginfo packages, so getting the symbol names or 
source lines is not a problem (actually we do that even now); a rough 
sketch is below
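
A minimal sketch of the idea, with binutils' addr2line standing in for 
our real debuginfo handling (which is more involved):

    import subprocess

    # Illustrative only: resolve one address against a binary that
    # has debug info; -f prints the function name, -e picks the file.
    def resolve(binary, address):
        out = subprocess.check_output(
            ["addr2line", "-f", "-e", binary, address], text=True)
        function, location = out.splitlines()[:2]
        return function, location  # e.g. ("main", "/usr/src/foo.c:42")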

- it can extract a backtrace from userspace coredumps (sketch below)
   - and Fedora users are sending them...
- it can extract a backtrace from kernel coredumps
   - though we've actually never seen a Fedora user send a kernel core
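
For the userspace case, conceptually something like this (the server 
uses its own retrace machinery; this is just an illustration):

    import subprocess

    # Ask gdb for a full backtrace from an executable plus its core.
    def backtrace(executable, corefile):
        return subprocess.check_output(
            ["gdb", "--batch", "-ex", "thread apply all backtrace",
             executable, corefile], text=True)
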
- it's not a problem to run some custom scripts during the analysis

- so far it takes the component->owner mapping from the Fedora pkgdb; 
the bigger plan is to be more distro-agnostic, so we're not against 
using other data sources for the component->owner mapping (sketch below)
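
The lookup itself is trivial once the data source is pluggable; here 
with a hypothetical owners.json dump standing in for a real pkgdb 
interface:

    import json

    # "owners.json" is a made-up dump format, not a real pkgdb API.
    def load_owner_map(path="owners.json"):
        with open(path) as f:
            return json.load(f)  # e.g. {"kernel": "kernel-maint"}

    def owner_of(component, owner_map):
        return owner_map.get(component, "unassigned")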

- we have all the backtraces from all the crashes processed by the 
server, so we can do a lot of data mining (deduplication, finding the 
common component in different crashes, ...); a naive sketch is below
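
E.g. duplicate detection can start as simply as comparing the sets of 
function names in two backtraces (our real clustering is smarter, so 
treat this as a toy):

    # Naive sketch: Jaccard similarity over function names.
    def similarity(frames_a, frames_b):
        a, b = set(frames_a), set(frames_b)
        return len(a & b) / len(a | b) if a | b else 0.0

    # similarity(["kfree", "foo"], ["kfree", "bar"]) -> 0.33...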

To wrap it up: all of the ideas below are doable, but not without 
your help (so get ready for some emails from us ;)). Almost every 
package needs some special handling and we can't know them all, so it's 
up to maintainers and developers to let us know what kind of information 
they need and how to get it. I can't promise it will be implemented 
overnight, but if you shout loud enough...

One thing we're struggling with now is the normalization of stack 
traces, i.e. deciding which functions are important and which are not. 
E.g. for the kernel there are stack traces full of warn_* functions 
where only a few frames differ, and our logic flags these as duplicates 
because the traces are so similar. We're working on this, but it's a 
very slow process, because making such decisions requires knowledge of 
the specific program, and we would appreciate any help with this matter.
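
In code terms, normalization boils down to a per-program list of frames 
that carry no identity; the list below is illustrative, and building 
the real one is exactly where we need domain knowledge:

    import fnmatch

    # Illustrative noise list; the real one has to come from people
    # who know the program in question.
    NOISE = ["warn_*", "dump_stack", "panic", "__stack_chk_fail"]

    def normalize(frames):
        return [f for f in frames
                if not any(fnmatch.fnmatch(f, pat) for pat in NOISE)]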

Regards,
Jirka

On 02/20/2013 02:09 AM, Dave Jones wrote:
> On Tue, Feb 19, 2013 at 10:10:38PM +0100, Jiri Moskovcak wrote:
>
>   > >>So if you want to hack this into a tool for use on kernel bugs, go for
>   > >>it.
>   > >...and please integrate with abrt! Let's have it all working together :)
>   >
>   > - I am all for it, the abrt server is exactly the place where these
>   > kind of things should be
>
> What I have in mind is the cases where some human interaction is still necessary.
>
> Adding heuristics on the server side for certain cases would help us, but
> there are still a bunch of common operations we do that require a human
> to make a judgment call before we make a change.
>
> But, pursuing the server-side solution, here are some things that we'd find useful
> that *could* be automated.
>
> - Unlike most packages, we have individual maintainers for subcomponents
>    (this is where our bugzilla implementation sucks, because we can't file
>     by subcomponent).  So when we get bugs against certain drivers,
>     or filesystems etc, we reassign to those developers who signed up to work
>     on those.
>    This probably accounts for a significant percentage of our interactions with
>    bugzilla.  I'm not sure what kind of heuristics you'd need to add to automate
>    assigning to the right person.  Maybe you can pull the symbol from the IP,
>    translate that to a filename, and have a database of wildcards so you can do
>    things like..
>     drivers/net/wireless/* -> linville@
>     fs/btrfs/* -> zab@
>     etc..
>
>    Because it's not always easy from a report to tell what component is responsible,
>    sometimes parsing the Summary is necessary, which is the sort of thing
>    I meant by 'needs human to make a judgment call'.  But if we can automate
>    the majority of the cases, it would still help a lot. A rough sketch
>    of the wildcard lookup follows.
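>
>    (The two owners above are the only real entries; the rest is just
>    fnmatch plumbing.)
>
>        import fnmatch
>
>        # Wildcard table as above; extend per subsystem.
>        OWNERS = [
>            ("drivers/net/wireless/*", "linville@"),
>            ("fs/btrfs/*", "zab@"),
>        ]
>
>        def assign(path):
>            for pattern, owner in OWNERS:
>                if fnmatch.fnmatch(path, pattern):
>                    return owner
>            return None  # no match: needs a human judgment call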
>
> - Similar to the previous one, but all graphics bugs get reassigned by us
>    immediately to xorg-x11-drv-* because those guys deal with both the X and
>    kernel modesetting/dri code. So any trace with 'i915', 'radeon' etc
>    can probably be auto-reassigned.
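>
>    A sketch (the driver list is illustrative; 'nouveau' is my guess
>    at a third entry):
>
>        # Any trace mentioning a known graphics driver gets bounced
>        # straight to the xorg-x11-drv-* folks.
>        GFX_DRIVERS = ("i915", "radeon", "nouveau")
>
>        def is_graphics_bug(trace):
>            return any(drv in trace for drv in GFX_DRIVERS)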
>
> - When we get 'general protection fault' bugs, it's useful to run the Code:
>    line of the oops through scripts/decodecode (from a kernel tree).
>    This disassembly will allow us to see what instruction caused the GPF.
>    (Note: *just* general protection faults, not every trace.  Also, we
>     only really need the faulting instruction, not the whole disassembly).
>    Bonus points if it can suck the relevant data out of the debuginfo rpms
>    to map the code line to C code.
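>
>    Something like this, assuming decodecode still tags the faulting
>    line with '<-- trapping instruction' (the kernel tree path here is
>    hypothetical):
>
>        import subprocess
>
>        # Feed the oops text (Code: line included) to decodecode and
>        # keep only the faulting instruction.
>        def faulting_instruction(oops_text, tree="/usr/src/linux"):
>            out = subprocess.run(
>                [tree + "/scripts/decodecode"], input=oops_text,
>                capture_output=True, text=True).stdout
>            for line in out.splitlines():
>                if "trapping instruction" in line:
>                    return line
>            return None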
>
> - Extrapolating from the above, when we see certain register values in those
>    bugs, they usually hint at the cause of a bug. For example 0x6b6b6b6b is
>    SLAB_POISON, and usually means we tried to use memory after it was freed.
>    Adding a comment to point this out speeds up analysis.
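>
>    Sketch (0x6b6b6b6b is the value named above; 0x5a5a5a5a, the
>    'allocated but uninitialized' slab poison, is added as an example):
>
>        # Map well-known poison patterns in a register dump to hints.
>        POISON = {
>            "6b6b6b6b": "slab poison: likely use-after-free",
>            "5a5a5a5a": "slab poison: uninitialized slab memory",
>        }
>
>        def annotate_registers(regdump):
>            dump = regdump.lower()
>            return [hint for pattern, hint in POISON.items()
>                    if pattern in dump]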
>
> - Getting trickier..  We see a *lot* of flaky hardware, where we tried to
>    dereference an address which had a single bit flip in memory.
>    If the server side had some smarts so it knew what 'good' addresses looked like,
>    it could detect the single bit-flip case, and guiding the user to run
>    memtest86 would save us a round-trip.
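>
>    The check itself is cheap (a sketch; where the 'good' addresses
>    come from is the hard part):
>
>        # A faulting address exactly one bit away from a known-good
>        # address smells like flaky RAM -> suggest memtest86.
>        def single_bit_flip(bad_addr, good_addrs):
>            return any(bin(bad_addr ^ good).count("1") == 1
>                       for good in good_addrs)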
>
> That's all I have right now, but there are probably a bunch of other
> common operations we do which could be automated.
>
> 	Dave
>


