Static Analysis SIG ? (was Re: Dealing with static code analysis in Fedora)

Fri Dec 14 18:15:07 UTC 2012

On Thu, 2012-12-13 at 21:45 +0200, Alek Paunov wrote:
> On 11.12.2012 23:52, David Malcolm wrote:
> > We'd be able to run all of the code in Fedora through static analysis
> > tools, and slurp the results into the database
> 
> Dave, I really do not know what to say first :-). The subject is so 
> important and there are so many aspects and application fields - IMHO, 
> the topic is the most important one in the devel list lately (and is in 
> _direct_ relation with the all other _hot_ topics - ABI stability, 
> upgradeability, collections, reliable/automated migrations, packagers 
> productivity, rawhide, etc.)

I very much agree, of course :)

We rely far too much on manual tasks when we build Fedora, and there are
plenty that we could be automating.  Within the domain of making Fedora
more stable, static code analysis should be the partner of the AutoQA
effort: the former can detect bugs at compile-time, the latter at
run-time (hopefully before it reaches any systems that people care
about).  The two approaches are of course complementary: programmers
tend to overemphasize the value of static code analysis - but both are
valuable.

> I hope this thread will be long and fruitful discussion with the final 
> effect to change Fedora to something that all motivated devs in the list 
> expect it to become. Just few preliminary questions about your insights 
> in the future:
> 
> 1) What about dumping the GCC structs to the DB during the OS/Repos 
> processing from the same beginning (means something more powerful than 
> dxr.mozilla.org, and possibility to engage various static analysis 
> people to the project, like Masaryk University team as Michal reported, 
> without the locking to concrete compiler technology/encoding)

Yes - we could use that to build a great source cross-referencer for all
of Fedora.  I can think of plenty of uses for this (e.g. "upstream added
a new parameter to this library function; how much is going to break?").

FWIW, I'm trying to focus my efforts on bug detection, though, and I'm
not sure how far one can get with a database of the IR of every
function.   I've been looking at using LTO to do interprocedural
analysis across source files via my gcc plugin, and I have that working
(not yet released), so I can run analyses on whole libraries at
link-time, from within GCC.  One drawback of that is that (IIRC) GCC's
LTO representation within the .o files is specific to the precise GCC
version, and IIRC there aren't any guarantees about forward or backward
compatibility, compared to say, LLVM bitcode.  It might be possible to
patch GCC to use GIMPLE textual dumps (compressed?) as an intermediate
format, storing that within the .o files in a similar way to the LTO
implementation.  But I'm speculating here.

> 2) Clang world enrolled the (suspicious) term "Compilation database" as 
> the safe sequence and arguments of the compiler invocations for a 
> package build. What is your opinion for abstracting build systems to the 
> DB in the same way in Fedora (based on the GCC plugin)?

I hadn't heard of that; presumably you're referring to:
  http://clang.llvm.org/docs/JSONCompilationDatabase.html
right?

That sounds reminiscent of the "fake-make" program that Steve Grubb
wrote and mentioned elsewhere in this thread:
http://lists.fedoraproject.org/pipermail/devel/2012-December/175259.html
http://people.redhat.com/sgrubb/swa/cwe/index.html

As for both (2) and the SA part of (1), they seem to me to be coming at
this from a slightly different direction to the one I'm interested in:
they're approaching the problem of "I have a static analysis tool that
finds bugs in, say C++ code, how do I get it to run on all of Fedora
without having to deal with the bazillion different build
invocations, Makefiles, autoconf, cmake, custom scripts etc across all
of the packages".

As described in my other mail, I think it's possible to reliably run a
static analysis tool whilst rebuilding an srpm by hacking up mock to
inject the analysis payload, and then harvesting report files from the
chroot, as described in:
http://lists.fedoraproject.org/pipermail/devel/2012-December/175258.html
[search for "nasty nasty hack... but it works" within that post :)].

There's still the messy business of actually doing the rebuilds, of
course (I believe we can set up some guest VMs within Fedora's
infrastructure for big workloads - massively handwaving here, of
course).

The problem I'm most interested in is "I've run my tool on lots of
packages and it generated lots of results.  What do we do with all of
these results?" - I want to come up with a better answer than "file lots
of bugzillas", since that approach would suck: what happens next time
you want to run the tool?  (newer version of tool, or newer version of
srpm).

Note that we're already running an analysis tool on all of the C/C++
code in Fedora: the compiler.  How good is everyone at reading all of
the compiler warnings from their builds?  (or does everyone use
-Werror?)  The system I'm envisaging could also be used to slurp in the
compiler warnings from the regular koji build logs, so that it's easy to
detect when a new compiler warning appears that wasn't present in older
builds of the package - useful for both package maintainers (when the
package changes) and for compiler maintainers (when we update gcc).
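
To make "a new compiler warning appears" concrete, here is a minimal
sketch of one way to compare warnings between two builds: reduce each
warning line to a key that ignores line and column numbers, so that a
warning which merely moved does not look new.  This is an illustration
only, not part of the system described above.  It assumes GCC's usual
"file:line:col: warning: message [-Wflag]" layout, and warning_key() is
a hypothetical helper name.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: reduce a GCC-style diagnostic line such as
     foo.c:12:5: warning: unused variable 'x' [-Wunused-variable]
   to the key "foo.c: unused variable 'x' [-Wunused-variable]".
   Returns 1 and fills `key` for warning lines, 0 for anything else
   (or if `key` is too small to hold the result). */
int warning_key(const char *line, char *key, size_t keysz)
{
    static const char marker[] = ": warning: ";
    const char *m = strstr(line, marker);
    if (m == NULL)
        return 0;

    /* The location is everything before the marker: file[:line[:col]]. */
    size_t loc_len = (size_t)(m - line);

    /* Drop up to two trailing ":<digits>" groups (column, then line). */
    for (int i = 0; i < 2; i++) {
        size_t end = loc_len;
        while (end > 0 && isdigit((unsigned char) line[end - 1]))
            end--;
        if (end == loc_len || end == 0 || line[end - 1] != ':')
            break;
        loc_len = end - 1;
    }

    /* The message runs from just after the marker to the end of line. */
    const char *msg = m + sizeof marker - 1;
    size_t msg_len = strcspn(msg, "\r\n");
    int n = snprintf(key, keysz, "%.*s: %.*s",
                     (int) loc_len, line, (int) msg_len, msg);
    return n >= 0 && (size_t) n < keysz;
}
```

Collecting the keys from an older build log and from a newer one, any
key present only in the newer set is a candidate "new warning".  Message
text that itself embeds line numbers would defeat this simple
normalization.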

Aside, there is the problem of cross-architecture analysis: I think a
serious analysis of problems in C/C++ code is going to have to be run on
every architecture we care about.  Paraphrasing an example taken from
Axel Simon's "Value-Range Analysis of C Programs" (great book IMHO), can
you spot the bug in the following C code:

int counter[256];

void count_character_frequency(char *str)
{
    while (*str) {
        counter[(int) *str]++;
        str++;
    }
}

Modulo any typos, the issue is that the code works fine on platforms
where "char" is unsigned, but on platforms where "char" is signed, the
increment of the "counter" array when (-128 <= *str < 0) is going to
access the memory region before the start of the array, hitting some
arbitrary other thing in RAM ("fly nasal demons, fly!").
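
For comparison, a minimal sketch of one possible fix (not taken from the
book): convert each character through "unsigned char" before using it
as an index, so the index is never negative whatever the signedness of
plain "char".  The names counter_fixed and
count_character_frequency_fixed are hypothetical, chosen only to
distinguish this from the buggy version above.

```c
#include <limits.h>

/* One slot per possible unsigned char value (256 when CHAR_BIT == 8). */
static unsigned long counter_fixed[UCHAR_MAX + 1];

/* Hypothetical corrected variant of count_character_frequency(). */
void count_character_frequency_fixed(const char *str)
{
    while (*str) {
        /* Converting to unsigned char keeps the index in 0..UCHAR_MAX,
           whether plain char is signed or unsigned on this platform. */
        counter_fixed[(unsigned char) *str]++;
        str++;
    }
}
```

With this version a byte such as 0xE9 is counted in slot 233 on every
architecture, rather than indexing slot -23 where "char" is signed.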

> 3) As I said already, IMHO, this thread is the most practically 
> important topic in Fedora. What about SIG/Team? I think base of 8-10 
> high experienced part-time contributors will be enough for your spec and 
> 1)-like enhancements.

I really like your idea of a Static Analysis SIG (hence the retitling of
this thread!).

It turns out that there is already a Formal Methods SIG within Fedora:
https://fedoraproject.org/wiki/FormalMethods  (hi!)

and there's clearly an overlap there with the kind of static analysis I
have in mind.  For example, packaging the APRON library:
http://apron.cri.ensmp.fr/library/
(which is a library of abstract domains: think value-range analysis on
steroids), which would be useful to have available from a GCC plugin.

The focus of the Formal Methods SIG appears to be on formal mathematical
proofs of program properties.   Given the lack of soundness and
completeness of my existing cpychecker work, I feel like "Informal
Methods" would be more apt for cpychecker :) - it generates useful
results, but it ignores interprocedural control flow, and it can
miss bugs due to the way I don't handle loops in all generality.  It's
also not clear how alive the Formal Methods SIG is  (I've emailed the
two listed co-leads of that SIG).

I've gone ahead and created a page on the wiki for a static analysis
SIG:
https://fedoraproject.org/wiki/StaticAnalysis
It's very much just a tentative placeholder for now (but hey, it's a
wiki!  feel free to add yourself to the "members" list).

> 
> Kind Regards,
> Alek
> 
> P.S. Fedora infrastructure resources are mandatory for the final Fedora 
> repos cooking, but I think that the community is able to provide less 
> secure, but much more in volume resources for the analysis workers 
> (Fedora can just supply small enslaving script for the dedicated VM)

(nods; this sounds eminently solvable)

Thanks for your ideas!
Dave