RFC: "firehose" : an interchange format for static code analysis results

Paul Tagliamonte paultag at debian.org
Tue Feb 5 18:33:16 UTC 2013


On Tue, Feb 05, 2013 at 01:15:58PM -0500, David Malcolm wrote:
> On Wed, 2013-01-30 at 21:46 -0500, Paul Tagliamonte wrote:
> > On Wed, Jan 30, 2013 at 09:15:59PM -0500, David Malcolm wrote:
> > > On Wed, 2013-01-30 at 19:20 -0500, Paul Tagliamonte wrote:
> > > > On Wed, Jan 30, 2013 at 6:29 PM, David Malcolm <dmalcolm at redhat.com> wrote:
> > > > > Short version: there's a new branch of gcc-python-plugin named
> > > > > "firehose", which adds a new external dependency on a new "firehose"
> > > > > package for use by the "libcpychecker" functionality.  I hope to merge
> > > > > this into master once the churn-rate within the firehose API subsides.
> > > > >
> > > > > Long version:
> > > > > I've been working on running various static analysis tools on a large
> > > > > subset of the packages in Fedora, trying to coerce the results into a
> > > > > consistent output format, so that I can build a tracking tool (e.g.
> > > > > "what new warnings were caused due to this commit?")
> > > > >
> > > > > I'm calling my format "firehose" (since reading reports from some code
> > > > > analysis tools can feel like "drinking from a firehose").
> > > > >
> > > > > It can be seen at:
> > > > > https://github.com/fedora-static-analysis/firehose
> > > > > (Free Software, GPLv3 or later)
> > > > >
> > > > > You can see some examples here:
> > > > > https://github.com/fedora-static-analysis/firehose/tree/master/examples
> > > > >
> > > > > It's XML so that it can be easily validated: there's a RELAX-NG schema
> > > > > here:
> > > > > https://github.com/fedora-static-analysis/firehose/blob/master/firehose.rng
> > > > >
> > > > > Essentially a code issue is a message, with additional metadata (such as
> > > > > file/line/column, optional CWE identifier), and optionally a trace of
> > > > > execution to reach the error (so that an analysis tool can identify e.g.
> > > > > that a memory leak happens on a particular error-handling path after
> > > > > once through a loop, or whatever, potentially including a view of the
> > > > > changing variables in the code).
> > > > >
> > > > > The format's not set in stone yet (hence this RFC) - anything I've
> > > > > missed?
> > > > >
> > > > > I have parsers:
> > > > >  * for GCC warnings (textual parsing, assuming LANG=C)
> > > > >  * for clang-analyzer (the --plist format)
> > > > >  * for cppcheck (its XMLv2 format).
> > > > > There's also a Python API and extensive unit tests (though the API is
> > > > > not set in stone yet either).
> > > > >
> > > > > I've created a "firehose" branch of gcc-python-plugin in which the
> > > > > cpychecker uses the firehose API as its internal representation of
> > > > > errors, using that to emit gcc warnings and generate HTML trace reports
> > > > > (I plan to refactor the error trace visualization code from out of my
> > > > > gcc-python-plugin and into the firehose thing, so that other projects
> > > > > can use it).  It can thus "natively" emit XML.  My plan is for this to
> > > > > replace the JSON experiments from
> > > > > https://lists.fedorahosted.org/pipermail/gcc-python-plugin/2012-March/000225.html
> > > > >
> > > > > My plan is to merge this into master, adding the firehose dependency
> > > > > (for the cpychecker parts of the source tree, at least), though I can't
> > > > > do this until firehose achieves some level of API stability... and an
> > > > > official tarball release :)
> > > > >
> > > > > Hope this is of interest - would anyone here be interested in using this in
> > > > > their own analysis scripts?
> > > > 
> > > > So, I've been writing a few small apps to help with Debian package
> > > > auditing, and came up with my own (JSON-based) format.
> > > 
> > > Hi!   I was hoping for someone from Debian to take an interest.  In
> > 
> > Ah! Cool! :)
> > 
> > Happy to be a voice from Debian, but I'll need to get more QA folks
> > before this becomes any sort of official anything :)
> > 
> > That being said, I'd be willing to help drive an effort to add this
> > format to emit static testing data. I think it's a super worthwhile
> > goal, and tools that work on both RH (and friends) and Debian (and
> > friends) can only be a good thing.
> > 
> > > particular, the <sut> element ("software-under-test") currently is
> > > specified as having a choice between a <source-rpm> child or... nothing
> > > else.   What would be appropriate metadata for representing the results
> > > of running an analysis tool on a Debian package? (i.e. for identifying
> > > the package under test).   Similar considerations apply for e.g. running
> > > a tool on a working copy (aka checkout) from some SCM.  See
> > > http://lists.fedoraproject.org/pipermail/devel/2013-January/176715.html
> > > for some ideas on that.
> > 
> > So, our bug tracker and friends usually sit pretty happy with a simple
> > string.
> > 
> > However, since we've got something already, let's flesh something out
> > here:
> > 
> >    <package type="rpm" name="libxml2" version="2.9.0" release="1.fc17">
> It's currently:
> <sut>
>    <source-rpm name="python-ethtool" version="0.7" release="4.fc19"
> build-arch="x86_64"/>
> </sut>

Ah! Changes! :)

> 
> though I'm a little unhappy that build-arch is there; that strikes me as
> more a "configuration" thing, rather than an aspect of the build itself.

Quite. Also, which arch is that -- the CPU arch, distro build arch or
the target arch?

For instance, let's go with this (insane!) situation:

I'm on an amd64 machine (64 bit x86), with a 32 bit kernel / operating
system, building binaries for the LIPA arch. What goes there? :)

(for a slightly less insane example, try putting armv6, v7 and v6 + hard
float ABI or something)

> 
> > Could look like:
> > 
> >    
> >    <package type="dsc" name="fluxbox" version="1.3.3" release="1">
> >    <package type="deb" name="fluxbox" version="1.3.3" release="1">
> > 
> > where version is our upstream version, release is the local version (if
> > it exists -- some packages, such as native packages, may not have a local
> > version, and that'd have to be omitted or blank; will that break the schema?)
> 
> Would something like this make more sense:
> <sut>
>    <debian-package name="fluxbox" version="1.3.3" release="1"/>
> </sut>

That looks great, so long as release= is optional :)

We could use debian-package for the binary result, and debian-source
for the source package? How does that sound?

> 
> 
> > The only trouble is "type". The .dsc is a "control" file (e.g. a RFC822
> > flat-file that has some control bits in there), and not a package "type"
> > like a .deb is.

> So the .dsc refers to a source build, and the .deb is the built
> artefact?  (sorry, I've never run a debian-based OS, so am a little hazy
> on the details).

Yep, that's basically right.

> 
> Looking at your debuild.me tool, e.g.:
> http://debuild.me/package/50c5798acae33a44b0000000
> 
> I see Package: "fbautostart/2.718281828-1".  Is that what you would want
> to encode here?

That's how package / version strings are usually displayed with Debian
tools -- don't worry about that, that's just a display thing. That's
basically:

<sut>
   <debian-package name="fbautostart" version="2.718281828" release="1"/>
</sut>
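
To make that mapping concrete, here's a rough Python sketch of going from
the display string to the <sut> element. It is only an illustration, not
code from debuild.me or the firehose API: the function names are made up,
it assumes the revision is whatever follows the last hyphen, and it
ignores epochs entirely. A native package (no hyphen) simply gets no
release= attribute, which is why that attribute needs to be optional.

```python
import xml.etree.ElementTree as ET


def split_display_string(display):
    """Split 'name/version-release' into (name, version, release).

    Hypothetical helper: release is None when there is no hyphen
    (e.g. a native package). Epochs are not handled.
    """
    name, _, full_version = display.partition("/")
    if "-" in full_version:
        # The revision is taken to be everything after the last hyphen.
        version, release = full_version.rsplit("-", 1)
    else:
        version, release = full_version, None
    return name, version, release


def make_sut(display):
    """Build a <sut> element holding one <debian-package> child."""
    name, version, release = split_display_string(display)
    attrs = {"name": name, "version": version}
    if release is not None:
        # Only emit release= when there is a local version.
        attrs["release"] = release
    sut = ET.Element("sut")
    ET.SubElement(sut, "debian-package", attrs)
    return sut


if __name__ == "__main__":
    print(ET.tostring(make_sut("fbautostart/2.718281828-1"), encoding="unicode"))
    # "examplepkg" is a made-up native package with no local version.
    print(ET.tostring(make_sut("examplepkg/1.0"), encoding="unicode"))
```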

> 
> > We could do something like "debian-binary" and "debian-source", but then
> > we start to get a skitch ugly.
> > 
> > If we're looking for three-char names, I'd be super happy to see "dsc"
> > and "deb", even if "dsc" isn't strictly right, and I'd imagine others
> > would be fine with it, too.
> I'm not sure where the "three-char names" thing came from.

Ah. Check up above -- I think you had type= in an older schema or
something:

>    <package type="rpm" name="libxml2" version="2.9.0" release="1.fc17">

I figured you'd want something to fit in the type="" param. No worries,
I'd be hugely +1 on dropping that constraint.

> 
> > 
> > > 
> > > > Clearly, I'd much rather "outsource" that design to something that
> > > > will have compatible tools, adoption.
> > > > 
> > > > I like what you've done, despite my usual blind-XML-hate, so I'll see
> > > > about writing a few test-linters, see how it comes out.
> > > Thanks.  FWIW I originally was thinking JSON, but given that the idea is
> > > that there are multiple sources of data, it's useful to have a schema to
> > > verify the format against, and I'm a huge fan of RELAX-NG for doing
> > > that.
> > 
> > Aye!
> > 
> > > 
> > > Given that I'm still hacking on the format, should we create a mailing
> > > list for it?
> > 
> > I'd join it. I'm also from Boston, so if you want to find a time (off
> > list?) to hack, I'd be up to flesh some of this out in person and get
> > some feedback on the result :)
> 
> I went ahead and created:
>   https://admin.fedoraproject.org/mailman/listinfo/firehose-devel
> Hope it's OK that it's in the "fedora" namespace again.

+subscribed and CC'd -- let's move over there :)

> 
> 
> > > > The only major incompatibility between our approaches is that I
> > > > preferred to sort issues based on Severity (as most tools do have some
> > > > concept of severity), is there any way we could get something
> > > > standardized in?
> > > 
> > > What kinds of value would it have?  What examples of input are you
> > > thinking of?  Would we try to coerce everything into one set of
> > > severities, or would each tool have its own hierarchy?
> > 
> > Yeah. I mean, they're pretty arbitrary :)
> > 
> > For instance, I'd like to encode `lintian' (our package static checker)
> > output in a lossless way (and perhaps encode *more* information). We
> > have a few severities -- eXperimental, Information, Pedantic, Warning
> > and Error.
> > 
> > It's nice to be able to filter out pedantic warnings when doing large
> > runs :)
> Lossless encoding suggests that we have an optional "severity" string
> that's a freeform string, with meaning tied to the specific tool, and
> that we shouldn't try to munge them together.
> 
> So you could have
>   <analysis>
>    <metadata>
>       <generator name="lintian"/>
>    </metadata>
>    <results>
>      <issue severity="Pedantic"/>
>      <!-- etc -->
>    </results>
>   </analysis>
> 

Perfect!
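
For what it's worth, here's a minimal sketch of the sort of filtering I
have in mind, assuming severity stays a freeform, optional attribute on
<issue> as in your example. The sample document is stripped down (real
issues would carry a message, location and so on), the filter_issues
name is made up, and it uses plain ElementTree rather than the firehose
Python API, which I haven't looked at closely yet.

```python
import xml.etree.ElementTree as ET

# Stripped-down sample modelled on the example quoted above; the second
# and third issues are added here purely for illustration.
SAMPLE = """\
<analysis>
  <metadata>
    <generator name="lintian"/>
  </metadata>
  <results>
    <issue severity="Pedantic"/>
    <issue severity="Error"/>
    <issue/>
  </results>
</analysis>
"""


def filter_issues(xml_text, skip):
    """Return (generator name, issues whose severity is not in skip).

    Severities are compared as opaque strings, since their meaning is
    tied to the generating tool. Issues without a severity are kept.
    """
    root = ET.fromstring(xml_text)
    generator = root.find("metadata/generator").get("name")
    kept = [i for i in root.iter("issue") if i.get("severity") not in skip]
    return generator, kept


if __name__ == "__main__":
    generator, kept = filter_issues(SAMPLE, skip={"Pedantic"})
    for issue in kept:
        print(generator, issue.get("severity"))
```

Since the strings only mean something per tool, the skip set would have
to be chosen per generator (lintian's "Pedantic" says nothing about what
cppcheck calls "style").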

> 
> > > (I see that cppcheck's XML v2 has a severity="error" in each report that
> > > I've looked at)
> > 
> > So, yeah, that may be true, but take a look at the --enable flags --
> > cppcheck (one of the tools I hacked with) knows about `style',
> > `performance', `portability', `information', `unusedFunction' and
> > `missingInclude'.
> > 
> > I'd really love to be able to filter out stuff like performance,
> > unusedFunction, and style when doing archive-wide runs :)
> > 
> > > 
> > > I'm thinking it would have to be optional - for example, GCC warnings
> > > don't have a concept of severity.
> > 
> > Quite! That'd be great. There are tons of tools where it'd be silly
> > indeed!
> > 
> > > 
> > > From a prioritization standpoint, I'm much more interested in where
> > > source code is deployed: for example, a normally low-severity warning
> > > that's in a setuid binary is probably worth more attention than, say, a
> > > buffer-overflow bug in a parser-generator that's only used at build-time
> > > and never actually makes it into a package payload (and is thus never
> > > subjected to hostile data).  (my rationale here is that setuid binaries
> > > require extreme caution, so that any warnings found there are a
> > > suggestion of sloppiness, and thus may signal trouble).  Though arguably
> > > severity != priority (hope this para makes sense).
> > 
> > Extremely +1 on that. Amazing idea.
> 
> Cheers
> Dave
> 
> _______________________________________________
> gcc-python-plugin mailing list
> gcc-python-plugin at lists.fedorahosted.org
> https://lists.fedorahosted.org/mailman/listinfo/gcc-python-plugin

Cheers,
  Paul

-- 
 .''`.  Paul Tagliamonte <paultag at debian.org>
: :'  : Proud Debian Developer
`. `'`  4096R / 8F04 9AD8 2C92 066C 7352  D28A 7B58 5B30 807C 2A87
 `-     http://people.debian.org/~paultag