RE: bug #917 - deadlock on log recovery

Kirill Malkin <kirill.malkin@xxxxxxxxxxxxxxxxxxxx> · Fri, 30 Mar 2012 12:44:41 -0400

Christoph -

Thank you for getting back to me. The kernel I am using is not a vanilla
kernel.org 2.6.32 but the one shipped with RHEL/CentOS 6, which has many
bug fixes backported from later kernels, up to roughly 2.6.38. It is
the latest kernel available for that distribution.

The bug is very difficult to reproduce even on this kernel. It occurs
while mounting a snapshot of a very large (40TB) filesystem that is in
very active, continuous use. Once a filesystem snapshot is in that
state, the hang reproduces on every mount, but it's not clear what
pushes it there. Unfortunately, a kernel upgrade on that system is
currently not possible.

Note that the lockup occurs while trimming the free list in
xfs_alloc.c:xfs_alloc_fix_freelist (look for the "Make the freelist
shorter if it's too long" comment inside this function). For some
reason the buffer then gets locked a second time inside
xfs_btree_get_bufs, and the mount hangs forever. I suspect that we are
not seeing this more frequently because free list trimming is not a
typical occurrence during recovery.
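
To illustrate what I mean by a double lock with no second thread
involved, here is a minimal userspace sketch in C. It is not XFS code:
demo_buf, demo_buf_lock and demo_trim_freelist are made-up names, and it
uses a POSIX error-checking mutex only so that the second acquisition
reports EDEADLK instead of hanging the way the mount does.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for a buffer guarded by a non-recursive lock. */
struct demo_buf {
    pthread_mutex_t lock;
};

/* Set up the lock as an error-checking mutex so that a relock by the
 * owning thread is reported rather than blocking forever. */
int demo_buf_init(struct demo_buf *bp)
{
    pthread_mutexattr_t attr;
    int err = pthread_mutexattr_init(&attr);

    if (err)
        return err;
    err = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    if (!err)
        err = pthread_mutex_init(&bp->lock, &attr);
    pthread_mutexattr_destroy(&attr);
    return err;
}

/* Returns 0 on success, EDEADLK if this thread already holds the lock. */
int demo_buf_lock(struct demo_buf *bp)
{
    return pthread_mutex_lock(&bp->lock);
}

int demo_buf_unlock(struct demo_buf *bp)
{
    return pthread_mutex_unlock(&bp->lock);
}

/* Assumed shape of the problem: the caller locks the buffer, then a
 * helper looks up the same buffer and tries to lock it again. */
int demo_trim_freelist(struct demo_buf *bp)
{
    int err = demo_buf_lock(bp);        /* first acquisition */
    int rc;

    if (err)
        return err;
    err = demo_buf_lock(bp);            /* second acquisition, same thread */
    if (err == 0)
        demo_buf_unlock(bp);            /* would only run with a recursive lock */
    rc = demo_buf_unlock(bp);           /* release the first acquisition */
    assert(rc == 0);
    (void)rc;
    return err;                         /* EDEADLK: self-deadlock detected */
}
```

With a plain (non-checking) lock the second demo_buf_lock call would
simply never return, which matches the symptom: one thread, one buffer,
no concurrency required.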

I've looked through the XFS patches in the kernel.org git tree and found
virtually no changes to this particular area, nor any references to a
similar problem. I can probably do more research into it, but would
really appreciate some guidance. Would it help to obtain a metadata
backup from that system? What could possibly cause a deadlock when log
recovery has essentially no concurrency? Would it help to debug this by
somehow forcing free list trimming during recovery?

Thanks again for your help.

Kirill

-----Original Message-----
From: Christoph Hellwig [mailto:hch@xxxxxxxxxxxxx]
Sent: Friday, March 30, 2012 12:07 PM
To: Kirill Malkin
Cc: xfs@xxxxxxxxxxx; xfs-masters@xxxxxxxxxxx
Subject: Re: bug #917 - deadlock on log recovery

On Thu, Mar 22, 2012 at 01:34:00PM -0400, Kirill Malkin wrote:
> Hi,
>
> I am wondering if someone had a chance to look at the bug #917. I
> filed it a couple of weeks ago, but haven't seen any action. We are
> running into it quite a lot, and the only way out of it is to reboot
> the OS and drop the log. Below is another stack trace that is slightly
> different from the one I filed, but apparently it is the same bug.
>
> Please let me know if you need any other input.

Can you reproduce this with a recent kernel?  2.6.32 is fairly old and a
lot of things have changed in this area.  I quickly looked over the trace
and nothing obvious springs to mind.
