On Thu, Mar 31, 2011 at 12:17 PM, Mark Moseley <moseleymark@xxxxxxxxx> wrote: > On Wed, Mar 30, 2011 at 6:37 PM, Jeff Layton <jlayton@xxxxxxxxxx> wrote: >> On Wed, 30 Mar 2011 14:55:00 -0700 >> Mark Moseley <moseleymark@xxxxxxxxx> wrote: >> >>> > I'm not 100% sure it's related (but I'm going to guess it is) but on >>> > these same boxes, they're not actually able to reboot at the end of a >>> > graceful shutdown. After yielding that bug and continuing with the >>> > shutdown process, it gets all the way to exec'ing: >>> > >>> > reboot -d -f -i >>> > >>> > and then just hangs forever. I'm guessing a thread is hung still >>> > trying to unmount things. On another box, I triggered that bug with a >>> > umount of one top-level mount that had subtrees. When I umount'd >>> > another top-level mount with subtrees on that same box, it's blocked >>> > and unkillable. That second umount also logged another bug to the >>> > kernel logs. >>> > >>> > In both umounts described above, the entries in /proc/mounts go away >>> > after the umount. >>> > >>> > Jeff, are you at liberty to do a graceful shutdown of the box you saw >>> > that bug on? If so, does it actually reboot? >>> >>> >>> A bit more info: On the same boxes, freshly booted but with all the >>> same mounts (even the subtrees) mounted, I don't get that bug, so it >>> seems to happen just when there's been significant usage within those >>> mounts. These are all read-only mounts, if it makes a difference. >>> >>> I was however able to trigger the bug on a box that had been running >>> (web serving) for about 15 minutes. Here's a snippet from slabinfo >>> right before umount'ing (let me know if more of it would help): >>> >> >> No, it hung. Unfortunately, I didn't collect any info that told me >> where it was hung though. >> >> I believe that's fairly typical though in situations like this. The >> problem is that we likely have a dentry refcount leak somewhere. The >> vfsmount refcount was low enough to allow an unmount though. Often >> these sorts of problems lurk in error handling code... >> >>> # grep nfs /proc/slabinfo >>> nfsd4_delegations 0 0 360 22 2 : tunables 0 0 >>> 0 : slabdata 0 0 0 >>> nfsd4_stateids 0 0 120 34 1 : tunables 0 0 >>> 0 : slabdata 0 0 0 >>> nfsd4_files 0 0 136 30 1 : tunables 0 0 >>> 0 : slabdata 0 0 0 >>> nfsd4_stateowners 0 0 424 38 4 : tunables 0 0 >>> 0 : slabdata 0 0 0 >>> nfs_direct_cache 0 0 136 30 1 : tunables 0 0 >>> 0 : slabdata 0 0 0 >>> nfs_write_data 46 46 704 23 4 : tunables 0 0 >>> 0 : slabdata 2 2 0 >>> nfs_read_data 207 207 704 23 4 : tunables 0 0 >>> 0 : slabdata 9 9 0 >>> nfs_inode_cache 23901 23901 1056 31 8 : tunables 0 0 >>> 0 : slabdata 771 771 0 >>> nfs_page 256 256 128 32 1 : tunables 0 0 >>> 0 : slabdata 8 8 0 >> >> slabinfo probably won't tell us much here. If we can narrow down the >> reproducer for this though then that would definitely help. >> Actually, /proc/self/mountstats might give us something to go on... > > Would turning on kmemleak and letting it run for a bit (luckily 15 > mins of regular use seems to be enough to trigger this bug) tell us > anything? I can do that easily enough but I don't know if it tracks > dentry entries. > > Also I'll grab mountstats as well before/after trying a umount. > Anything in particular within mountstats I should pay close attention > to? > I've been trying random things too, just to see if anything shakes out: * I tried mounting the top-level mount with the subtree mounts via TCP (was UDP), but I still get that error. * Mounting normally (UDP), if I iterate through the subtrees and unmount them *first* (so that there's zero subtrees left over) and then unmount the top-level mount, I *don't* get that error. And a umount at shutdown proceeds without error and the final call to reboot works just fine (instead of hanging indefinitely). So unmounting all the subtrees first cleans things up sufficiently. One random/irrelevant but slightly interesting thing is that when I trigger that error, it also bumps kernel.printk from "1 7 1 4" to "15 7 1 4". I've not seen that before. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html