On Fri, 17 Sep 2010 15:53:55 +0100 Tim Small <tim@xxxxxxxxxxx> wrote:

> Hi,
>
> I have a box with a relatively simple setup:
>
> sda + sdb are 1TB SATA drives attached to an Intel ICH10.
> Three partitions on each drive, with three md RAID1s built on top of
> these:
>
> md0  /
> md1  swap
> md2  LVM PV
>
> During a resync about a week ago, processes seemed to deadlock on I/O;
> the machine was still alive, but with a load of 100+. A USB drive
> happened to be mounted, so I managed to save /var/log/kern.log. At the
> time of the problem, the monthly RAID check was in progress. On
> reboot, a rebuild commenced, and the same deadlock occurred between
> roughly 2 and 15 minutes after boot.
>
> At that point, the server was running on a Dell PE R300 (12G RAM,
> quad-core) with an LSI SAS controller and 2x 500G SATA drives. I
> shifted all the data onto a spare box (Dell PE R210, ICH10R, 1x 1TB
> drive, 8G RAM, quad-core+HT) with only a single drive, so I created
> the md RAID1s with just a single drive in each. The original box was
> taken offline with the idea of me debugging it "soon".
>
> This morning, I added a second 1TB drive, and during the resync
> (approx. 1 hour in), the deadlock occurred again. The resync had
> stopped, and any attempt to write to md2 would deadlock the process in
> question. I think the box was doing an rsnapshot backup to a USB drive
> at the time the initial problem occurred - this creates an LVM
> snapshot device on top of md2 for the duration of the backup, for each
> filesystem backed up (there are two at the moment), and I suppose this
> results in lots of read-copy-update operations - the mounting of the
> snapshots shows up in the logs as the fs mounts and subsequent
> orphan_cleanups. As the snapshot survives the reboot, I assume this is
> what triggers the subsequent lockup after the machine has rebooted.
>
> I got a couple of 'echo w > /proc/sysrq-trigger' sets of output this
> time... Edited copies of kern.log are attached - it looks like it's
> barrier-related. I'd guess the combination of the LVM CoW snapshot and
> the RAID resync is tickling this bug.
>
> Any thoughts? Maybe this is related to Debian bug #584881 -
> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=584881
>
> ... since the kernel is essentially the same.
>
> I can do some debugging on this out of office hours, or can probably
> resurrect the original hardware to debug that too.
>
> Logs are here:
>
> http://buttersideup.com/files/md-raid1-lockup-lvm-snapshot/
>
> I think vger binned the first version of this email (with the logs
> attached) - so apologies if you've ended up with two copies of this
> email...
>
> Tim.

Hi Tim,

unfortunately I need more than just the set of blocked tasks to
diagnose the problem. If you could get the result of

   echo t > /proc/sysrq-trigger

that might help a lot. This might be bigger than the dmesg buffer, so
you might try booting with 'log_buf_len=1M' just to be sure.

It looks a bit like a bug that was fixed prior to the release of
2.6.26, but as you are running 2.6.26, it cannot be that.

NeilBrown
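
For anyone following along, a minimal sequence for capturing the full
task dump Neil asks for might look like the following (a sketch only;
the output path is illustrative):

   # Boot with an enlarged kernel ring buffer so the dump fits,
   # by adding this to the kernel command line:
   #   log_buf_len=1M

   # Make sure all sysrq functions are enabled, then dump the
   # state of every task (not just the blocked ones):
   echo 1 > /proc/sys/kernel/sysrq
   echo t > /proc/sysrq-trigger

   # Save the dump before it scrolls out of the ring buffer; older
   # dmesg binaries may need an explicit buffer size:
   dmesg -s 1048576 > /tmp/task-dump.txt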
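
For context, the single-drive-then-completed RAID1 setup Tim describes
could be reproduced roughly as below (a sketch under assumed device
names; only the md2/partition-3 array is shown):

   # Create the RAID1 with one real member and one 'missing' slot:
   mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sda3 missing

   # Later, hot-add the second drive's partition; this starts the
   # resync during which the lockup occurred:
   mdadm /dev/md2 --add /dev/sdb3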
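
The backup workload sitting on top of md2 follows the usual
snapshot-per-filesystem pattern; again a sketch, with the volume group
name, LV name, mount point, and snapshot size all made up:

   # Create a copy-on-write snapshot of the origin LV; each write to
   # the origin now also copies the old data into the snapshot, which
   # is the read-copy-update traffic mentioned above:
   lvcreate --snapshot --size 2G --name backup-snap /dev/vg0/root

   # Mount it for the backup (this is the fs mount / orphan_cleanup
   # that shows up in kern.log), then tear it down afterwards:
   mount -o ro /dev/vg0/backup-snap /mnt/snap
   # ... backup tool reads from /mnt/snap ...
   umount /mnt/snap
   lvremove -f /dev/vg0/backup-snap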