I've been chasing a 'weird rpm messages on ARM' problem for some time now; below is a brain dump of what I've found so far.

Very sporadically, rpm/yum on ARM get into a state where they spit out things like:

    Installed: unix2dos.armv5tel 0:2.2-26.2.2
    Complete!
    rpmdb: Locker still has locks
    error: db4 error(22) from db->close: Invalid argument

or:

    # yum install libsoup-devel
    Loading "installonlyn" plugin
    rpmdb: Locker is not valid
    rpmdb: Unknown locker ID: ae1
    error: db4 error(22) from db->close: Invalid argument
    error: cannot open Packages index using db3 - Invalid argument (22)
    error: cannot open Packages database in /var/lib/rpm

This doesn't happen very often, but it's annoying enough when it does. When it happens, the rpm database itself is totally intact, but some corruption occurs in the /var/lib/rpm/__db.00x database environment files -- specifically, in the lock region (which is usually __db.003). The regular rpm recovery procedure (rm __db.* and then --rebuilddb) always gets things working again. Note that this is a general issue seen with db4, and is not rpm-specific.

After having spent some time looking at the relevant db4 code back in May, at first I thought that it might be some kind of aliasing problem between mmap() and read(), since db4 appeared to be using both mmap() and regular read()/write() to access its data files -- note that most ARMs use virtual caches, which are susceptible to all kinds of aliasing issues. Testing did turn up such an aliasing issue in the kernel (a sketch of the kind of coherency check involved is included at the end of this section):

    http://marc.info/?l=linux-arm-kernel&m=117995603418036&w=2

However, fixing that issue didn't fix the corruption. As it turns out:

1. db4 exclusively uses read()/write() to access its database files, and exclusively uses mmap() to access its environment (__db.00x) files, mapping only a single copy of each environment file into a single process, so that aliasing issues cannot occur.

2. The corruption also happens in qemu ARM system emulation (http://fedoraproject.org/wiki/ARM/HowToQemu), while qemu does not emulate virtual caches at all, so cache aliasing issues should not occur there.

Google reports a lot of cases of people seeing similar errors from rpm. However, in pretty much all of those cases, what is happening is that people end up with stale locks because they Ctrl-C'ed a process that was interacting with the RPM database, or something similar -- the db4 version that rpm uses doesn't recover from such stale locks very well, or at all. This doesn't line up with what we are seeing: we sometimes see these errors from rpm even while populating a new chroot from scratch, for example using pilgrim.

There is another report of db4 corruption on ARM with symptoms similar to ours, by Andy Green:

    http://marc.info/?l=linux-arm-kernel&m=118064698109089&w=2

However, this is also not the same issue as we are seeing, for two reasons:

- His issue is 100% reproducible, and happens on startup. Ours only hits sporadically, and is very hard to reproduce. (Rolling back the RPM database and repeating the offending command pretty much never manages to reproduce it in our case.)

- His issue disappears entirely when adding some printfs to a couple of db4 functions. This suggests that he is seeing some compiler optimisation bug (he uses gcc 4.0.3, we use unmodified Fedora gcc (redhat branch)). I've tried building with -O0, but that makes no difference in our case.
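As an aside, for anyone curious what the mmap()/read() aliasing testing looked like: the sketch below shows the basic shape of such a coherency check -- write a byte through a MAP_SHARED mapping, then read the same byte back through the read() path, and see whether the two views agree. (This is an illustrative reconstruction, not the exact test case from the thread above; the scratch file path is made up.)

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        const char *path = "/tmp/alias-test";   /* illustrative scratch file */
        long pagesize = sysconf(_SC_PAGESIZE);

        int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
        if (fd < 0) { perror("open"); return 1; }
        if (ftruncate(fd, pagesize) < 0) { perror("ftruncate"); return 1; }

        unsigned char *map = mmap(NULL, pagesize, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) { perror("mmap"); return 1; }

        /* Write through the shared mapping... */
        map[0] = 0xaa;

        /* ...then read the same byte back via read().  With a unified
         * page cache these must always agree; a virtual-cache aliasing
         * bug can make read() return stale data. */
        unsigned char buf = 0;
        if (pread(fd, &buf, 1, 0) != 1) { perror("pread"); return 1; }

        printf(buf == 0xaa ? "coherent\n"
                           : "ALIASED: read() saw stale data\n");

        munmap(map, pagesize);
        close(fd);
        return 0;
    }

On a kernel with the aliasing bug, the freshly written byte can still be sitting in a cache line indexed by the userspace virtual address while the read() path copies stale data out of the kernel's own view of the page.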
After digging a lot deeper into db4, what appears to happen when this surfaces is that db4's internal linked list structures (the ones implemented by db4's shqueue.h, in shared memory) that are used to keep track of outstanding locks and lock objects end up corrupted. Specifically, it seems as if writes to the writable mmap() of /var/lib/rpm/__db.003 never make it to disk (i.e. a dirty page is dropped without a writeout), and later on, a stale version of that page is read back in from disk.

One clue is that in one instance, while walking one of those lists, db4 ended up dereferencing an invalid userspace pointer -- which led to this bug report:

    http://marc.info/?l=linux-arm-kernel&m=118712875312492&w=2

In another specific case, after rpm exited I ended up with the lock region showing a negative number of active lock objects. Some debugging suggested that while the list of lock objects was being walked, a stale version of one of the 'next' pointers in the list was somehow read back in from disk, causing some of the elements on that list to be 'deleted' twice, which turned the count of elements on the list negative.

As I said, it is very hard to reproduce this issue on demand, which makes it even harder to debug. To test whether it's an issue of lost writes to shared mmap() areas, I've made my version of rpm mlock() its db4 environment files immediately after mapping them (see the sketch at the end of this write-up). If I don't see this issue re-surface in the next month or so, I can be reasonably confident that it has something to do with this.

If it is what I think it is at this point, it might be an issue on x86 Linux kernels too -- it just might be that x86 machines don't see it happening since they typically have more RAM, and thus typically less memory pressure, so that there is less pageout/writeout activity. Shared mmap() is certainly one of the things that has had a lot of bugs found in it in the past (the semi-recent Linux rtorrent corruption issues were basically due to the same type of issue), and might just not be entirely bug-free yet.

For now, I'm not entirely sure what else I can do to debug this. The harder I try to reproduce this, the less it occurs. If you run into this issue, please just perform the manual rpm recovery procedure (after sending me a copy of your /var/lib/rpm directory, if possible). Sorry for the inconvenience.
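For anyone who wants to try the same mlock() experiment, the change amounts to roughly the following. This is a standalone sketch under the assumption that you pin the region right where it gets mapped, not the actual rpm/db4 patch (the real change lives wherever db4 maps its region files); the path is just the lock region mentioned above.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        const char *path = "/var/lib/rpm/__db.003";   /* the lock region */
        int fd = open(path, O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        void *env = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (env == MAP_FAILED) { perror("mmap"); return 1; }

        /* Pin the mapping immediately: a pinned page can never be
         * paged out, so it can never be written back and re-read
         * stale while the region is live.  Needs CAP_IPC_LOCK or a
         * sufficient RLIMIT_MEMLOCK. */
        if (mlock(env, st.st_size) < 0)
            perror("mlock");

        /* ... db4 would use the region here ... */

        munlock(env, st.st_size);
        munmap(env, st.st_size);
        close(fd);
        return 0;
    }

If the corruption never reappears with the region pinned like this, that points strongly at lost writeback of dirty shared-mmap() pages as the culprit.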