On Tue, 12 Apr 2011 16:05:52 +0200 Thomas Jarosch
<thomas.jarosch@xxxxxxxxxxxxx> wrote:

> Hello Neil,
>
> On Wednesday, 6. April 2011 12:16:00 Tejun Heo wrote:
> > > To put it another way matching your description Tejun, the put path has
> > > a chance to run firstly while mddev_find is waiting for the spinlock,
> > > and then while flush_workqueue is waiting for the rest of the put path
> > > to complete.
> >
> > I don't think the logic is wrong per-se.  It's more likely that the
> > implemented code doesn't really follow the model described by the
> > logic.
> >
> > Probably the best way would be reproducing the problem and throwing in
> > some diagnostic code to tell the sequence of events?  If work is being
> > queued first but it still ends up busy looping, that would be a bug in
> > flush_workqueue(), but I think it's more likely that the restart
> > condition somehow triggers in an unexpected way without the work item
> > queued as expected.
>
> I can test any debug patch you want, the box is in a test lab anyway.
>
> Best regards,
> Thomas

Could you try this?

diff --git a/drivers/md/md.c b/drivers/md/md.c
index a0ccaab..07c97b1 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -6175,6 +6175,8 @@ static int md_open(struct block_device *bdev, fmode_t mode)
 	mddev_t *mddev = mddev_find(bdev->bd_dev);
 	int err;
 
+	BUG_ON(!mddev->gendisk);
+
 	if (mddev->gendisk != bdev->bd_disk) {
 		/* we are racing with mddev_put which is discarding this
 		 * bd_disk.

I don't know how it could get to the state where gendisk was NULL, but it
is the only way I can see that the looping could happen.
If the BUG_ON does trigger, I'll probably be able to find out why it happens.
If it doesn't, then I'll still be at a loss.

NeilBrown