Hey dm-folks,

I hope this is the right place for this problem. It's kind of a weird one that I can't seem to pin down, but it's connected to adding and removing backup volumes in a volume group when using dm-snapshot. I'll attach the scripts I'm using, which demonstrate the problem pretty quickly on my hardware -- usually within a few seconds of starting the third of the three scripts (four, really, but snapshot-test.sh is invoked from step-1.sh, not by a user), but always in less than ten minutes.

What happens is I start up my tests, which are contrived versions of the behaviour observed in a live system. There's a lot of disk activity, but periodically we need to do a full filesystem backup, so we've got a setup with a single volume group where we add a backup volume to the group for a brief time. The backup is done and the backup volume is removed. (There's a stripped-down outline of this cycle at the bottom of this mail.) It seems to be the lvremove step where we get a deadlock, but I can't really tell where, and we don't get a panic or backtrace or anything. Not surprisingly, everything else in the system seems to be fine; only the shell doing the lvremove, and any other shell that tries to touch the logical volumes in the group, looks like the system has hung. The machine is still quite responsive to anything else, which kind of makes debugging this harder than it otherwise would be. I've been using things like LOCKDEP and spinlock debugging code in the kernel without much success.

The only thing I've got to go on right now, other than the test cases (which aren't pointing me in the right direction), is that this was failing in 2.6.20 but never fails in 2.6.21 under any amount of load I can generate. Yesterday and today, though, I've been trying it out on 2.6.22-rc3 and the problem is back.

Anyway, the steps to reproduce:

- from one login shell run step-1.sh
- from another login shell run filegen.sh& and killer.sh&
- wait anywhere from a few seconds to at most ten minutes; you'll see step-1.sh stop producing output before then.

I'm using a Dell Precision 390n with an Intel Core 2 Duo E6300 on the board. The problem also seems to appear on UP systems, but it's definitely a lot easier to trigger on MP hardware.

The test scripts are set up to do ext3 right now, but there are a few commented lines you can switch to work with XFS. I'm more interested in getting XFS working, but the problem happens on both, and XFS also seems to generate moderately frequent, unrelated backtraces and I don't want to confuse the matter.

In case it helps, here's the output of lvcreate --version:

  LVM version:     2.02.25 (2007-04-27)
  Library version: 1.02.19 (2007-04-27)
  Driver version:  4.11.0

I'd really appreciate any help, pointers, or even advice on which sections of code to look at, that anyone might have.

Thanks.

-- 
Joe MacDonald

:wq
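P.S. In case it's easier than opening the attachments, the cycle I'm describing boils down to something like the outline below. This is only illustrative -- the volume group, LV and snapshot names, the snapshot size and the mount point are all made up here, and the attached scripts are the actual reproducer:

#!/bin/sh
# Rough outline of the backup cycle only -- names and sizes below are
# placeholders, not the values the attached scripts actually use.
VG=vg00
ORIGIN=data
SNAP=backup
MNT=/mnt/backup

mkdir -p $MNT

while true; do
    # Add the backup volume (a snapshot of the origin) to the group.
    lvcreate --snapshot --size 512M --name $SNAP /dev/$VG/$ORIGIN || exit 1

    # Mount it read-only and do the full filesystem backup from it.
    mount -o ro /dev/$VG/$SNAP $MNT || exit 1
    # ... backup of $MNT would happen here ...
    umount $MNT

    # Remove the backup volume again.  This is the step that appears
    # to deadlock while the origin is under heavy write load.
    lvremove -f /dev/$VG/$SNAP || exit 1
done

The hang I keep hitting is at that last lvremove, with everything else on the box still responsive.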
Attachment: snapshot-test.sh
Description: Bourne shell script

Attachment: filegen.sh
Description: Bourne shell script

Attachment: step-1.sh
Description: Bourne shell script

Attachment: killer.sh
Description: Bourne shell script
--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel