This version of the patch seems to do the trick... at least, I haven't hit a failure so far while testing with it.

Thanks!
Nate

-----Original Message-----
From: Neil Brown [mailto:neilb@xxxxxxx]
Sent: Monday, July 21, 2008 6:56 PM
To: Dailey, Nate
Cc: linux-raid@xxxxxxxxxxxxxxx; mingo@xxxxxxxxxx
Subject: RE: crash: write_sb_page walks mddev.disks without holding reconfig_mutex

On Monday July 21, Nate.Dailey@xxxxxxxxxxx wrote:
> Quick update... I've applied your patch to the kernel I'm using. There
> were a few differences... for example, md_delayed_delete doesn't exist
> in my kernel (so I added it).
>
> Unfortunately, I'm hitting a deadlock, and it looks like md_delayed_work
> is at fault. Seems that in at least one case, code which holds the
> inode_lock is interrupted, at which point the md_delayed_delete code
> gets to run. He ends up needing the inode_lock too, and we're stuck.

Yes... I noticed yesterday that there was a problem with that patch.
Calling md_delayed_delete with call_rcu just isn't right.
md_delayed_delete needs to take a mutex, and call_rcu runs its callbacks
in a context where mutexes aren't allowed.
The problem you are seeing has exactly the same cause.

So I've changed it to call synchronize_rcu() to handle the RCU side, and
restored the use of schedule_work to run md_delayed_delete.

So unbind_rdev_from_array now ends with:

	synchronize_rcu();
	INIT_WORK(&rdev->del_work, md_delayed_delete);
	kobject_get(&rdev->kobj);
	schedule_work(&rdev->del_work);

You can see the submitted version of the full patch at
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=4b80991c6cb9efa607bc4fd6f3ecdf5511c31bb0

If you can test that (with appropriate revisions to apply to your kernel),
I'd really appreciate it.

Thanks,
NeilBrown
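For readers less familiar with kernel locking, the deadlock Nate hit and the deferral Neil describes can be mimicked in user space. This is an analogy only, assuming POSIX threads: inode_lock is modelled by an error-checking pthread mutex, delayed_delete() is a hypothetical stand-in for md_delayed_delete, and a plain worker thread stands in for the workqueue behind schedule_work. It does not model the RCU grace period that synchronize_rcu() waits for, and none of these helper names are kernel APIs.

```c
#define _POSIX_C_SOURCE 200809L /* exposes PTHREAD_MUTEX_ERRORCHECK */

#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* User-space analogy only; every name here is hypothetical.
 * Stand-in for inode_lock. Error-checking type, so a thread that
 * relocks it gets EDEADLK back instead of hanging forever. */
static pthread_mutex_t inode_lock;

/* Set up the error-checking mutex once before use. */
static void init_lock(void)
{
	pthread_mutexattr_t attr;

	pthread_mutexattr_init(&attr);
	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	pthread_mutex_init(&inode_lock, &attr);
	pthread_mutexattr_destroy(&attr);
}

/* Hypothetical stand-in for md_delayed_delete: it must take inode_lock. */
static int delayed_delete(void)
{
	int rc = pthread_mutex_lock(&inode_lock);

	if (rc != 0)
		return rc;      /* EDEADLK: this thread already holds it */
	/* ... release the object here ... */
	pthread_mutex_unlock(&inode_lock);
	return 0;
}

/* Worker thread: plays the role of the workqueue's process context. */
static void *worker(void *arg)
{
	*(int *)arg = delayed_delete();
	return NULL;
}

/* Broken pattern: the callback runs in the lock holder's own context,
 * like a callback interrupting code that already holds inode_lock. */
static int run_inline(void)
{
	pthread_mutex_lock(&inode_lock);
	int rc = delayed_delete();
	pthread_mutex_unlock(&inode_lock);
	return rc;
}

/* Fixed pattern: hand the callback to another thread (cf. schedule_work)
 * so it simply waits until the holder releases the lock. */
static int run_deferred(void)
{
	int rc = -1;
	pthread_t t;

	pthread_mutex_lock(&inode_lock);
	int err = pthread_create(&t, NULL, worker, &rc);
	assert(err == 0);
	(void)err;
	/* ... rest of the holder's critical section ... */
	pthread_mutex_unlock(&inode_lock); /* worker can now proceed */
	pthread_join(t, NULL);             /* also makes rc visible here */
	return rc;
}
```

After init_lock(), run_inline() returns EDEADLK because the callback tries to retake a lock that its own context already holds, while run_deferred() returns 0 once the holder drops the lock and the worker acquires it. With a default Linux mutex the inline case would typically just hang, which is closer to the stuck system Nate saw.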