Static Analysis: some UI ideas

Wed Feb 6 10:11:14 UTC 2013

On Tuesday, February 05, 2013 11:57:51 David Malcolm wrote:
> Interesting.  I had thought that git purely stored compressed files and
> eschewed the idea of deltas, but it does indeed store deltas sometimes:
> http://git-scm.com/book/ch9-4.html

The delta compression is used e.g. anytime you transfer the objects to/from
a remote over a smart enough protocol.

> My plan had been to store the files either in a big lookaside directory
> structure, or directly as (compressed) blobs in a database,
> named/indexed by SHA-1 sum.  But that approach might yield significant
> savings, provided it can do name-matching.
> 
> Then again, premature optimization is the root of all bugs :)

It is not important now because you have not yet reached the point where 
scalability matters.  At this point, you are more limited by the supported 
features (language features, build system compatibility, bugs in analyzers, 
etc.).  Hence, I agree that space requirements is not something that needs
to be solved now.

Btw. using git instead of reinventing git is not an optimization in my point 
of view.  It is a design decision.

> I was a little sloppy in my wording when I said "mentioned in a
> warning", sorry.  This *does* include the traces, so for the case you
> describe it *would* capture the source file containing the definition of
> foo(), so that we would be able to render that information in a
> presentation of the data.  (though all of my current trace
> visualizations only handle one function at a time; see e.g.:
> http://gcc-python-plugin.readthedocs.org/en/latest/_images/sample-html-error
> -report.png (screenshot of html)
> and:
> http://fedorapeople.org/~dmalcolm/gcc-python-plugin/2012-03-19/example/examp
> le.html
> http://fedorapeople.org/~dmalcolm/gcc-python-plugin/2012-03-26/pylibmc/exam
> ple.html
> 
> so we'd need some kind of new visualization for interprocedural paths.
> 
> [specifically, the source extraction is implemented by
> https://github.com/fedora-static-analysis/mock-with-analysis/blob/master/moc
> k-with-analysis#L172 in extract_results_from_chroot() where I pass a
> SourceExtractor visitor over the analysis file; it visits all files
> mentioned in the analysis file]

So it grabs any file explicitly mentioned by the analyzer as I understand it.  
Still the user sometimes needs to see a wider context to reason about the 
validity of the report.  Do you actually have any analyzer that return traces 
going through multiple modules?

> > 'fedpkg foo' is probably overkill at this point.  My concern was rather
> > that we should not so much rely on the web server/browser approach in the
> > first place.  I would like to have most of the equipment working just
> > from terminal without any server or browser.  Any server solution can be
> > then easily built on top of it.
> 
> Right, and that's what I've been doing.

Thank you for doing it this way.

> > > Currently it's matching on 4 things:
> > > * by name of test tool (e.g. "clang-analyzer")
> > 
> > + the class of defect?  e.g. "useClosedFile" in your example above...
> 
> Good point (the "testid" in firehose terminology).  Added in:
> https://github.com/fedora-static-analysis/mock-with-analysis/commit/25b12996
> cbc66477241c94d6f6a7123007743f6c
> 
> though this doesn't appear to affect the comparisons with the
> python-ethtool test data I'm currently working with.

If you require the text messages to match exactly, then the "testid" matches 
trivially in most cases.  Problems may arise when comparing results obtained 
by different versions of a tool.

> > > * by filename of C file within the tarball (so e.g.
> > > '/builddir/build/BUILD/python-ethtool-0.7/python-ethtool/etherinfo.c'
> > > becomes 'python-ethtool/etherinfo.c', allowing different versions to be
> > > compared)
> > 
> > With some part of the path or just base name?
> 
> the "full" path within the tarball, so that, say:
> '/builddir/build/BUILD/python-ethtool-0.6/python-ethtool/etherinfo.c'
> '/builddir/build/BUILD/python-ethtool-0.7/python-ethtool/etherinfo.c'
> 
> both are treated as 'python-ethtool/etherinfo.c'
> 
> (see get_internal_filename() within:
> https://github.com/fedora-static-analysis/mock-with-analysis/blob/master/rep
> orts/reports.py#L10 )

This way it was implemented in some very first version of csdiff, but it
did not work in many cases.  Please try to diff e.g. the results from two 
different versions of openldap.

> > > * text of message
> > 
> > The messages cannot be checked for exact match in certain cases.  Have a
> > look at the rules we use in csdiff for the text messages:
> > 
> > http://git.fedorahosted.org/cgit/codescan-diff.git/plain/csfilter.cc
> 
> Thanks for the link.   My first thought looking at that code was
> "Whoah!" - you appear to be doing some fairly complicated text parsing
> in C++." Have you considered using Python for that?  (in my experience
> the time taken to run the tool dwarfs that of parsing the results, so
> using an interpreted language shouldn't impose a noticeable speed
> penalty).

We both use a regular expression language for string substitutions.  it does 
not really matter whether the rules are written in a .cc or .py file.  I will 
not rewrite a working tool to a different language, but it should be easy to 
expose more its functionality through the pycsdiff module.

> Those rules appear to be very much tool-specific, for the analysis tool
> that you're parsing the results of.

The rules for substitution of messages are tool-specific because the messages 
are tool-specific.  Do you have any idea about how to make this more generic?

> It would seem to makes sense to have two concepts of message for a
> warning: the raw message (as emitted by the tool), and a cleaned version
> of the message (for use in comparing for duplicates).  I wonder where
> the cleaning should happen: on parsing, or at a later time.  The latter
> seems better: the parser reads the raw message, which is treated as
> canonical, and we can have a set of rules for cleaning the message
> (which will need to grow and be tweaked as we run into difficult
> messages).
> 
> I did some message cleaning of my own in:
> https://github.com/fedora-static-analysis/mock-with-analysis/blob/master/rep
> orts/make-comparative-report.py#L28 since some of cpychecker's messages
> embed too much context. (my bad, as author of cpychecker)

We never "clean" the messages in our work-flow.  They are always kept in their 
original form.  The substitution are strictly internal to the diff tool and 
never ever visible from outside (unless you are debugging the tool).  The set 
of rules is something we need to change from time to time and this approach 
ensures that an update of csdiff will pull the latest set of rules and use 
them for matching.

Kamil