[fedora-arm] occasional rpm environment corruption

Lennert Buytenhek buytenh at wantstofly.org
Wed Aug 22 22:12:02 UTC 2007


I've been chasing a 'weird rpm messages on ARM' problem for some
time now; below is a brain dump of what I've found so far.



Very sporadically, rpm and yum on ARM get into a state where they
spit out messages like:

	Installed: unix2dos.armv5tel 0:2.2-26.2.2
	Complete!
	rpmdb: Locker still has locks
	error: db4 error(22) from db->close: Invalid argument

or:

	# yum install libsoup-devel
	Loading "installonlyn" plugin
	rpmdb: Locker is not valid
	rpmdb: Unknown locker ID: ae1
	error: db4 error(22) from db->close: Invalid argument
	error: cannot open Packages index using db3 - Invalid argument (22)
	error: cannot open Packages database in /var/lib/rpm

This doesn't happen very often, but it's annoying enough when it does.

When it happens, the rpm database itself is totally intact, but some
corruption occurs in the /var/lib/rpm/__db.00x database environment
files -- specifically, in the lock region (which is usually __db.003).
The regular rpm recovery procedure (rm __db.* and then --rebuilddb)
always gets things working again.

Note that this is a general issue seen with db4, and is not
rpm-specific.



After having spent some time looking at the relevant db4 code back
in May, at first I thought that it might be some kind of aliasing
problem between mmap() and read(), since db4 appeared to be using
both mmap() and regular read()/write() to access its data files --
note that most ARMs use virtual caches, which are susceptible to
all kinds of aliasing issues.  Testing did turn up such an aliasing
issue in the kernel:

	http://marc.info/?l=linux-arm-kernel&m=117995603418036&w=2

However, fixing that issue didn't fix the corruption.  As it turns
out:
1. db4 exclusively uses read()/write() to access its database files,
   and exclusively uses mmap() to access its environment (__db.00x)
   files, mapping only a single copy of each environment file into
   a single process, so that aliasing issues cannot occur.
2. The corruption also happens in qemu ARM system emulation
   (http://fedoraproject.org/wiki/ARM/HowToQemu), while qemu does
   not emulate virtual caches at all.  Cache aliasing issues should
   not occur in qemu.



Googling turns up a lot of reports of people seeing similar errors
from rpm.  However, in pretty much all of these cases, what is happening
is that people end up with stale locks because of Ctrl-C'ing
processes that interact with the RPM database or similar -- the db4
version that rpm uses doesn't recover such stale locks very well or
at all.  This doesn't line up with what we are seeing -- we sometimes
see these errors from rpm even while populating a new chroot from
scratch, for example using pilgrim.



There is another report of db4 corruption on arm with symptoms
similar to ours, by Andy Green:

	http://marc.info/?l=linux-arm-kernel&m=118064698109089&w=2

However, this is also not the same issue as we are seeing, for two
reasons:
- His issue is 100% reproducible, and happens on startup.  Ours
  only hits sporadically, and is very hard to reproduce.  (Rolling
  back the RPM database and repeating the offending command pretty
  much never manages to reproduce it in our case.)
- His issue disappears entirely when adding some printfs to a
  couple of db4 functions.

This suggests that he is seeing some compiler optimisation bug
(he uses gcc 4.0.3; we use the unmodified Fedora gcc (redhat branch)).
I've tried building with -O0, but that makes no difference in our
case.



After digging a lot deeper into db4, it appears that when this
surfaces, db4's internal linked list structures (the ones implemented
by db4's shqueue.h, in shared memory) that are used to keep track
of outstanding locks and lock objects end up corrupted.

Specifically, it seems as if writes to the writable mmap() of
/var/lib/rpm/__db.003 never make it to disk (i.e. a dirty page is
dropped without a writeout), and later on, a stale version of that
page is read back in from disk.

One clue is that in one instance, while walking one of those
lists, db4 ended up dereferencing an invalid userspace pointer
-- which led to this bug report:

	http://marc.info/?l=linux-arm-kernel&m=118712875312492&w=2

In another specific case, after rpm exited I ended up with the lock
region showing a negative number of active lock objects.  Some
debugging suggested that while the list of lock objects was being
walked, a stale version of one of the 'next' pointers in the list
was somehow read back in from disk, causing some of the elements
on that list to be 'deleted' twice, turning the count of elements
on the list negative.



As I said, it is very hard to reproduce this issue on demand, and
so it is even harder to debug it.  To test whether it's an issue
of lost writes to shared mmap() areas, I've made my version of
rpm mlock() its db4 environment files immediately after mapping
them.  If I don't see this issue re-surface in the next month or
so, I can be reasonably confident that it has something to do
with this.

If it is what I think it is at this point, it might be an issue
on x86 Linux kernels too -- it just might be that x86 machines
don't see it happening since they typically have more RAM, and
thus less memory pressure and less pageout/writeout activity.

Shared mmap() is certainly one of the things that has had a lot
of bugs found in it in the past (the semi-recent Linux rtorrent
corruption issues were basically due to the same type of issue),
and might just not be entirely bug-free yet.

For now, I'm not entirely sure what else I can do to debug this.
The harder I try to reproduce this, the less it occurs.  If you
run into this issue, please just perform the manual rpm recovery
procedure (after sending me a copy of your /var/lib/rpm directory,
if possible.)  Sorry for the inconvenience.



