On Sat, Oct 13, 2012 at 02:28:39AM +0000, Myklebust, Trond wrote:
> On Sat, 2012-10-13 at 10:02 +0900, Linus Torvalds wrote:
> > On Sat, Oct 13, 2012 at 9:21 AM, Larry McVoy <lm@xxxxxxxxxxxx> wrote:
> > >
> > > Ahh, I've been away from the kernel too long. I miss that delicate
> > > management touch.
> >
> > "Delicate Management Touch" is my middle name.
> >
> > > pics of the stack trace at http://www.mcvoy.com/lm/nfs-lock-crash
> >
> > Ok, that's just the normal kind of random left-over oopses due to
> > subsequent problems of a BUG_ON(). Looks like the watchdog timer ends
> > up being unhappy, almost certainly simply because some core filesystem
> > spinlock not being released.
> >
> > It used to be (a long long time ago) that we'd recover fairly
> > gracefully from BUG_ON()'s - back when the main shared lock we had was
> > the kernel lock, and we had a single per-process kernel lock counter.
> > So when we killed the process, we could clean that single lock up.
> >
> > These days, if some process dies in random kernel code due to a
> > BUG_ON() or a wild pointer or similar, and we kill it, we are seldom
> > able to do so cleanly. So the best we can hope for is that it happened
> > in some context where it held no (important) locks. Which is rare. So
> > BUG_ON()'s are often fatal, and there are these kinds of downstream
> > problems where they get flushed off the screen by subsequent issues...
>
> If that code is being called under a lock, then we have other problems.
> It is standard XDR code: it should always be called from an ordinary
> process context with no special locks being held by the callers.
>
> > Ho humm. Google doesn't seem to be finding any similar bug-reports, so
> > unless Bruce or Trond go "Ahh, I know what it's about", I do think we
> > would want to get as much more info as possible.
>
> Never seen it before, and I see no reason why it should drag the entire
> box down with it. It is part of the NLM server's callback code, so there
> is no chance of it being called as part of a memory reclaim or anything
> similarly sensitive to the rest of the box.
>
> Are we sure that this BUG_ON() really is top of the chain of Oopses
> here? All I can see it doing is crashing the lockd server process,

Can't it be called from the rpciod workqueue? I'm not sure what happens
when we hit a BUG there.

It looks like a bunch of BUG_ON's got added with an xdr rewrite in
2b061f9ef216b6d229b06267f188167fd6ab3d9b. Maybe Chuck or someone should
do a 'git grep BUG fs/lockd' and figure out what those should be
instead?

And I need to do the same for nfsd; I've been sloppy about using them as
asserts.

--b.

> which
> will seriously inconvenience all the NFS clients trying to do locking,
> but it shouldn't be affecting the swapper process as we're seeing in the
> Oops screenshots.
> If it really is the first thing to Oops, then the only thing I can think
> of there that would trigger other Oopses would be a memory corruption
> (use after free or some such thing?). Perhaps Larry could try turning on
> some of the less intrusive slab debugging options?
>
> > Doing a kernel compile really isn't that bad. The only nasty piece is
> > getting the kernel configuration right, but you can just use the
> > distro config. It's much too big and contains everything, but it will
> > work, and gets you as similar a kernel as possible. Of course, Ubuntu
> > has made installing your own kernel stupidly complicated (you have to
> > build a package and install it using the package manager), but while
> > it's an annoying extra step or two (compared to just doing a "make
> > modules_install install"), it's not rocket surgery. There's a few help
> > pages for it:
> >
> >    https://help.ubuntu.com/community/Kernel/Compile
> >
> > being the first one.
> >
> >              Linus
>
> --
> Trond Myklebust
> Linux NFS client maintainer
>
> NetApp
> Trond.Myklebust@xxxxxxxxxx
> www.netapp.com
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html