Re: Dealing with static code analysis in Fedora

Michal Toman <mtoman@xxxxxxxxxx> · Wed, 12 Dec 2012 16:46:43 +0100

Hi Dave!

We started a similar project in ABRT about a year ago. The original
purpose was to automatically determine some crash characteristics
(security impact, unchecked user input etc.). The overall idea is to
rebuild the given package with a compiler plugin and walk the AST
based on the coredump's stack.

We have built an infrastructure on top of ABRT Server[1]. We chose it
because its backend synchronizes with several Fedora services (koji,
pkgdb, bugzilla) and stores all the data in one database, which makes
fast cross-service queries possible. The "storage" (that's what we
call it) is more complex: it uses a DB for relational data and saves
large files (like RPMs) on the FS.

The storage contains (almost) all packages built in koji. We tried to
use mock & yum for a long time, but we were not able to make them work
correctly with such a complex repo. That's why we created our own
dependency solver based on libsolv (using the DB as the metadata
source) and use rpm directly to install the resulting set. This
resulted in a fast chroot installer that is able to interrupt the
build process in many phases.

Some guys working on static analysis at Masaryk University, Brno, CZ 
showed interest in the project. They are working on various clang 
plugins for static analysis, but have no real input data. We agreed to 
implement a service rebuilding Fedora packages with LLVM/Clang and 
providing the output (LLVM bitcode files). They will provide the 
analysis tools.

This turned out to be a non-trivial problem, because a simple CC=clang
is not enough :). It is implemented, however, and I've personally
rebuilt Fedora a few times, leaving me with some 60% of packages
successfully processed and a few ideas for improvement. The service has
not yet been deployed in Fedora infrastructure, because we 1) don't
consider it ready and 2) lack hardware capacity.

This is more or less the current state. We are open to any discussion, 
new ideas, use cases or patches :).

Sources: http://git.fedorahosted.org/cgit/faf.git/
RFE: https://fedorahosted.org/abrt/newticket?component=faf

Michal & ABRT

[1] http://abrt.fedoraproject.org

On 2012-12-11 22:52, David Malcolm wrote:
A while back I ran my static checker on all of the Python extension
modules in Fedora 17:
   http://fedoraproject.org/wiki/Features/StaticAnalysisOfPythonRefcounts

I wrote various scripts to build the packages in a mock environment that
injects my checker into gcc, then wrote various scripts to triage the
results.  I then filed bugs by hand for the most important results,
writing some more scripts along the way to make the process easier.

This led to some valuable bug fixes, but the mechanism for running the
analysis was very ad hoc and doesn't scale.

In particular, we don't yet have an automated way of rerunning the
tests, whilst using the old results as a baseline.  For example it would
be most useful if only new problems could be reported, and if the system
(whatever it is) remembered when a report has been marked as a true bug
or as a false positive.  Similarly, there's no automated way of saying
"this particular test is bogus; ignore it for now".

I'm wondering if there's a Free Software system for doing this kind of
thing, and if not, I'm thinking of building it.

What I have in mind is a web app backed by a database (perhaps
"checker.fedoraproject.org" ?)

We'd be able to run all of the code in Fedora through static analysis
tools, and slurp the results into the database: primarily my
"cpychecker" work, but we could also run the clang analyzer etc.  I've
also been working on another as-yet-unreleased static analysis tool for
which I'd want a db for the results.  What I have working is a way to
inject an analysis payload into gcc within a mock build, which dumps
JSON report files into the chroot without disturbing the "real" build.
The idea is then to gather up the JSON files and insert the report data
into the db, tagging it with version information.

There are two dimensions to the version information:
  (A) the version of the software under analysis
          (name-version-release.arch)
  (B) the version of the tool doing the analysis
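
A minimal sketch of the gather-and-insert step, tagging each report with
both (A) and (B). Everything here is an assumption made for
illustration: the function name load_reports, the table layout, the
use of SQLite, and storing each JSON report verbatim. The message does
not specify the report format or the database.

```python
import json
import sqlite3
from pathlib import Path


def load_reports(db, results_dir, nvra, tool, tool_version):
    """Insert every *.json file under results_dir into the report table.

    Hypothetical schema: each row carries (A) the name-version-release.arch
    of the software under analysis and (B) the analysis tool and its version.
    """
    db.execute(
        """CREATE TABLE IF NOT EXISTS report (
               id INTEGER PRIMARY KEY,
               nvra TEXT NOT NULL,
               tool TEXT NOT NULL,
               tool_version TEXT NOT NULL,
               source_path TEXT NOT NULL,
               payload TEXT NOT NULL
           )"""
    )
    count = 0
    for path in sorted(Path(results_dir).rglob("*.json")):
        # Parse first so that a malformed file fails loudly here.
        payload = json.loads(path.read_text(encoding="utf-8"))
        db.execute(
            "INSERT INTO report (nvra, tool, tool_version, source_path, payload)"
            " VALUES (?, ?, ?, ?, ?)",
            (
                nvra,
                tool,
                tool_version,
                str(path.relative_to(results_dir)),
                json.dumps(payload, sort_keys=True),
            ),
        )
        count += 1
    db.commit()
    return count
```

It would be called once per build, for example
load_reports(sqlite3.connect("checker.db"), chroot_results_dir,
nvra, "cpychecker", tool_version), where the database file name and
the directory are placeholders.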

We could use (B) within the system to handle the release cycle of a
static analysis tool.  Initially, any such analysis tools would be
regarded as "experimental", and package maintainers could happily ignore
the results of such a tool.  The maintainer of an analysis tool could
work on bug fixes and heuristics to get the signal:noise ratio of the
tool up to an acceptable level, and then the status of the analysis tool
could be upgraded to an "alpha" level or beyond.
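
As a rough sketch of how (B) could drive this, assuming each triaged
report carries a marking string and each tool version carries a status.
The "beta" and "stable" levels, the marking names, and both function
names are hypothetical; only "experimental" and "alpha" appear above.

```python
from collections import Counter

# Hypothetical status ladder; only "experimental" and "alpha" come from
# the proposal, the later levels are placeholders.
STATUS_ORDER = ["experimental", "alpha", "beta", "stable"]


def signal_noise(markings):
    """Return (true bugs, false positives) counted over triaged reports.

    Markings other than "true_bug" and "false_positive" are ignored.
    """
    counts = Counter(markings)
    return counts["true_bug"], counts["false_positive"]


def visible_to_maintainers(tool_status, minimum="alpha"):
    """Decide whether reports from a tool at this status should be shown."""
    return STATUS_ORDER.index(tool_status) >= STATUS_ORDER.index(minimum)
```

Returning the two counts rather than a single quotient avoids a
division by zero when a tool version has no false positives yet.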

Functional Requirements:
   * a collection of "reports" (not bugs):
     * interprocedural control flow, potentially across multiple source
       files (potentially with annotations, such as value of variables,
       call stack?)
       * syntax highlighting
       * capturing of all relevant source (potentially with headers as
         well?)
       * visualization of control flow so that you can see the path
         through the code that leads to the error
     * support for my cpychecker analysis
     * support for an as-yet-unreleased interprocedural static analysis
       tool I've been working on
     * support for reports from the clang static analyzer
     * ability to mark a report as:
       * a true bug (and a way to act on it, e.g. escalate to bugzilla or
         to the relevant upstream tracker)
       * a false positive (and a way for the analysis maintainer to act
         on it)
       * other bug associations with a report? (e.g. if the wording from
         the tool's message could be improved)
       * ability to have a "conversation" about a report within the UI as
         a series of comments (similar to bugzilla).
     * automated report matching between successive runs, so that the
       markings can be inherited
     * scriptable triage, so that we can write scripts that mark all
       reports matching a certain pattern e.g. as being bogus, as being
       security sensitive, etc
     * potentially: debug data (from the analysis tool) associated with a
       report, so that the maintainers of the tool can analyze a false
       positive
     * ability to store crash results where some code broke a static
       analysis tool, so that the tool can be fixed
   * association between reports and builds
   * association between builds and source packages
   * association between packages and people, so that you can see what
     reports are associated with you (perhaps via the pkgdb?)
   * prioritization of reports to be generated by the tool
   * association between reports and tools (and tool versions)
   * "quality marking" of tool versions, so that we can ignore "alpha"
     versions of tools and handle phasing in of a new static analysis
     tool without spamming everyone
   * ability to view the signal:noise ratio of a version of a tool
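
The "automated report matching" item could be sketched as follows. This
is one possible heuristic, not an established algorithm: it assumes each
report is a dict with hypothetical keys "tool", "test_id", "file",
"function" and "message", and it deliberately leaves the line number
out of the fingerprint so that unrelated edits to a file do not break
the match.

```python
import hashlib

# Hypothetical report fields used to identify "the same" report across runs.
FINGERPRINT_FIELDS = ("tool", "test_id", "file", "function", "message")


def fingerprint(report):
    """Hash the identifying fields of a report, ignoring line numbers."""
    key = "\0".join(str(report[field]) for field in FINGERPRINT_FIELDS)
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


def inherit_markings(old_reports, new_reports):
    """Copy markings from the previous run onto matching new reports.

    Returns the reports that had no match in the previous run, i.e. the
    ones that would be presented as new problems.
    """
    old = {fingerprint(r): r.get("marking") for r in old_reports}
    fresh = []
    for report in new_reports:
        fp = fingerprint(report)
        if fp in old:
            if old[fp] is not None:
                report["marking"] = old[fp]
        else:
            fresh.append(report)
    return fresh
```

A fingerprint this coarse will merge two distinct reports that share a
function and message, so a real system would likely need a finer key.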

Nonfunctional requirements:
   * Free Software
   * sanely deployable within Fedora infrastructure
   * sane code, since we're likely to want to extend it (fwiw I'd be most
     comfortable with a Python implementation).
   * able to scale to running all of Fedora through multiple tools
     repeatedly
   * many simultaneous users
   * will want an authentication system so that we can associate comments
     with users.  Eventually we may want a way of embargoing
     security-sensitive bugs found by the tool so that they're only
     visible by a trusted cabal.
   * authentication system to support FAS, but not require it, in case
     other people want to deploy such a tool.  Maybe OpenID?

Implementation ideas:
   * as well as a relational database for the usual things, perhaps a
     lookaside of source files stored gzipped, with content-addressed
     storage e.g. "0fcb0d45a6353e150e26f1fa54d11d7be86726b6" stored
     gzipped as:
       objects/0f/cb0d45a6353e150e26f1fa54d11d7be86726b6
     (yes, this looks a lot like git)
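
A minimal sketch of such a lookaside store. The example digest is 40
hex digits, which is the length of a SHA-1, so the sketch assumes SHA-1
over the raw file contents. That is an assumption, and it differs from
git, which hashes a header plus the contents and compresses with zlib
rather than gzip. The function names are hypothetical.

```python
import gzip
import hashlib
from pathlib import Path


def object_path(root, digest):
    """Map a hex digest to objects/<first 2 chars>/<remaining chars>."""
    return Path(root) / "objects" / digest[:2] / digest[2:]


def store_object(root, data):
    """Store bytes gzipped under their SHA-1 digest and return the digest."""
    digest = hashlib.sha1(data).hexdigest()
    path = object_path(root, digest)
    if not path.exists():
        # Identical content always maps to the same path, so each distinct
        # file is written only once.
        path.parent.mkdir(parents=True, exist_ok=True)
        with gzip.open(path, "wb") as f:
            f.write(data)
    return digest


def load_object(root, digest):
    """Read back and decompress the bytes stored under digest."""
    with gzip.open(object_path(root, digest), "rb") as f:
        return f.read()
```

The relational database would then only need to keep the digest for
each source file that a report refers to.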

Thoughts?  Does such a thing already exist?

It might be fun to hack on this at the next FUDcon.

Dave
