Re: kernel BUG at /build/buildd/linux-3.2.0/fs/lockd/clntxdr.c:226!

On Fri, 2012-10-12 at 19:31 -0700, Larry McVoy wrote:
> As I've said, debugging this is going to be hard, it means stopping my
> company.  We can do one more crash, you guys can tell me what to do,
> but that's about it.

Do you have a reproducer? If you do, then perhaps we can try to
duplicate the issue on our own setups without needing to deliberately
cause you more downtime.

If not, then I guess we'll have to try looking at those slab memory
debugging options. I'm no expert on slab debugging, and it's way too
late in this corner of the US for me to try to figure out exactly what
options are appropriate, but I can look into it tomorrow (unless Linus
has a suggestion).

> On Sat, Oct 13, 2012 at 02:28:39AM +0000, Myklebust, Trond wrote:
> > On Sat, 2012-10-13 at 10:02 +0900, Linus Torvalds wrote:
> > > On Sat, Oct 13, 2012 at 9:21 AM, Larry McVoy <lm@xxxxxxxxxxxx> wrote:
> > > >
> > > > Ahh, I've been away from the kernel too long.  I miss that delicate
> > > > management touch.
> > > 
> > > "Delicate Management Touch" is my middle name.
> > > 
> > > > pics of the stack trace at http://www.mcvoy.com/lm/nfs-lock-crash
> > > 
> > > Ok, that's just the normal kind of random left-over oopses due to
> > > subsequent problems of a BUG_ON(). Looks like the watchdog timer ends
> > > up being unhappy, almost certainly simply because some core filesystem
> > > spinlock is not being released.
> > > 
> > > It used to be (a long long time ago) that we'd recover fairly
> > > gracefully from BUG_ON()'s - back when the main shared lock we had was
> > > the kernel lock, and we had a single per-process kernel lock counter.
> > > So when we killed the process, we could clean that single lock up.
> > > 
> > > These days, if some process dies in random kernel code due to a
> > > BUG_ON() or a wild pointer or similar, and we kill it, we are seldom
> > > able to do so cleanly. So the best we can hope for is that it happened
> > > in some context where it held no (important) locks. Which is rare. So
> > > BUG_ON()'s are often fatal, and there are these kinds of downstream
> > > problems where they get flushed off the screen by subsequent issues...
> > 
> > If that code is being called under a lock, then we have other problems.
> > It is standard XDR code: it should always be called from an ordinary
> > process context with no special locks being held by the callers.
> > 
> > > Ho humm. Google doesn't seem to be finding any similar bug-reports, so
> > > unless Bruce or Trond go "Ahh, I know what it's about", I do think we
> > > would want to get as much more info as possible.
> > 
> > Never seen it before, and I see no reason why it should drag the entire
> > box down with it. It is part of the NLM server's callback code, so there
> > is no chance of it being called as part of a memory reclaim or anything
> > similarly sensitive to the rest of the box.
> > 
> > Are we sure that this BUG_ON() really is top of the chain of Oopses
> > here? All I can see it doing is crashing the lockd server process, which
> > will seriously inconvenience all the NFS clients trying to do locking,
> > but it shouldn't be affecting the swapper process as we're seeing in the
> > Oops screenshots.
> > If it really is the first thing to Oops, then the only thing I can think
> > of there that would trigger other Oopses would be a memory corruption
> > (use after free or some such thing?). Perhaps Larry could try turning on
> > some of the less intrusive slab debugging options?
> > 
> > > Doing a kernel compile really isn't that bad. The only nasty piece is
> > > getting the kernel configuration right, but you can just use the
> > > distro config. It's much too big and contains everything, but it will
> > > work, and gets you as similar a kernel as possible. Of course, Ubuntu
> > > has made installing your own kernel stupidly complicated (you have to
> > > build a package and install it using the package manager), but while
> > > it's an annoying extra step or two (compared to just doing a "make
> > > modules_install install"), it's not rocket surgery. There's a few help
> > > pages for it:
> > > 
> > >     https://help.ubuntu.com/community/Kernel/Compile
> > > 
> > > being the first one.
> > > 
> > >                 Linus
> > 
> > -- 
> > Trond Myklebust
> > Linux NFS client maintainer
> > 
> > NetApp
> > Trond.Myklebust@xxxxxxxxxx
> > www.netapp.com
> 

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@xxxxxxxxxx
www.netapp.com


