Hey dm-folks,

I hope this is the right place for this problem. It's kind of a weird one that I can't seem to pin down, but it's connected to adding and removing backup volumes in a volume group when using dm-snapshot. I'll attach the scripts I'm using, which demonstrate the problem pretty quickly on my hardware -- usually within a few seconds of starting the third of the three scripts (four, really, but snapshot-test.sh is invoked from step-1.sh, not by a user), but always in less than ten minutes.

What happens is I start up my tests, which are contrived versions of the behaviour observed in a live system. There's a lot of disk activity, but periodically we need to do a full filesystem backup, so we've got a setup with a single volume group where we add a backup volume to the group for a brief time. The backup is done and the backup volume is removed. (There's a stripped-down outline of this cycle at the bottom of this mail.) It seems to be the lvremove step where we get a deadlock, but I can't really tell where, and we don't get a panic or backtrace or anything. Not surprisingly, everything else in the system seems to be fine; only the shell doing the lvremove, and any other shell that tries to touch the logical volumes in the group, looks like the system has hung. The machine is still quite responsive to anything else, which kind of makes debugging this harder than it otherwise would be. I've been using things like LOCKDEP and spinlock debugging code in the kernel without much success.

The only thing I've got to go on right now, other than the test cases (which aren't pointing me in the right direction), is that this was failing in 2.6.20 but never fails in 2.6.21 under any amount of load I can generate. Yesterday and today, though, I've been trying it out on 2.6.22-rc3 and the problem is back.

Anyway, the steps to reproduce:

- from one login shell run step-1.sh
- from another login shell run filegen.sh& and killer.sh&
- wait anywhere from a few seconds to at most ten minutes; you'll see step-1.sh stop producing output before then.

I'm using a Dell Precision 390n with an Intel Core 2 Duo E6300 on the board. The problem also seems to appear on UP systems, but it's definitely a lot easier to trigger on MP hardware.

The test scripts are set up to do ext3 right now, but there are a few commented lines you can switch to work with XFS. I'm more interested in getting XFS working, but the problem happens on both, and XFS also seems to generate moderately frequent, unrelated backtraces and I don't want to confuse the matter.

In case it helps, here's the output of lvcreate --version:

  LVM version:     2.02.25 (2007-04-27)
  Library version: 1.02.19 (2007-04-27)
  Driver version:  4.11.0

I'd really appreciate any help, pointers, or even advice on which sections of code to look at, that anyone might have.

Thanks.

-- 
Joe MacDonald

:wq
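P.S. In case it's easier than opening the attachments, the cycle I'm describing boils down to something like the outline below. This is only illustrative -- the volume group, LV and snapshot names, the snapshot size and the mount point are all made up here, and the attached scripts are the actual reproducer:

#!/bin/sh
# Rough outline of the backup cycle only -- names and sizes below are
# placeholders, not the values the attached scripts actually use.
VG=vg00
ORIGIN=data
SNAP=backup
MNT=/mnt/backup

mkdir -p $MNT

while true; do
    # Add the backup volume (a snapshot of the origin) to the group.
    lvcreate --snapshot --size 512M --name $SNAP /dev/$VG/$ORIGIN || exit 1

    # Mount it read-only and do the full filesystem backup from it.
    mount -o ro /dev/$VG/$SNAP $MNT || exit 1
    # ... backup of $MNT would happen here ...
    umount $MNT

    # Remove the backup volume again.  This is the step that appears
    # to deadlock while the origin is under heavy write load.
    lvremove -f /dev/$VG/$SNAP || exit 1
done

The hang I keep hitting is at that last lvremove, with everything else on the box still responsive.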
Attachment: snapshot-test.sh
Description: Bourne shell script

Attachment: filegen.sh
Description: Bourne shell script

Attachment: step-1.sh
Description: Bourne shell script

Attachment: killer.sh
Description: Bourne shell script
--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel