On Thu, Jul 14 2016, Alexander Lyakas wrote: >> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c >> index 32e282f4c83c..c528102b80b6 100644 >> --- a/drivers/md/raid10.c >> +++ b/drivers/md/raid10.c >> @@ -1288,7 +1288,7 @@ read_again: >> sectors_handled; >> goto read_again; >> } else >> - generic_make_request(read_bio); >> + bio_list_add_head(¤t->bio_list, read_bio); >> return; >> } > > Unfortunately, this patch doesn't work. It is super elegant, and seems > like it really should work. But the problem is that the "rdev", to > which we want to send the READ bio, might also be a remapping device > (dm-linear, for example). This device will create its own remapped-bio > and will call generic_make_request(), which will stick the bio onto > current->bio_list TAIL:(:(:( So we are back at square one. This patch > would work if *all* the remapping drivers in the stack were doing > bio_list_add_head() instead of generic_make_request() :(:(:( > > It seems the real fix should be that generic_make_request() would use > bio_list_add_head(), but as pointed in > http://www.spinics.net/lists/raid/msg52756.html, there are some > concerns about changing the order of remapped bios. > While those concerns are valid, they are about hypothetical performance issues rather than observed deadlock issues. So I wouldn't be too worried about them. However I think you said that you didn't want to touch core code at all (maybe I misunderstood) so that wouldn't help you anyway. One option would be to punt the request requests to raidXd: diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 40b35be34f8d..f795e27b2124 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -1229,7 +1229,7 @@ read_again: sectors_handled; goto read_again; } else - generic_make_request(read_bio); + reschedule_retry(r1_bio); return; } diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c index 32e282f4c83c..eec38443075b 100644 --- a/drivers/md/raid10.c +++ b/drivers/md/raid10.c @@ -1288,7 +1288,7 @@ read_again: sectors_handled; goto read_again; } else - generic_make_request(read_bio); + reschedule_retry(r10_bio); return; } That might hurt performance, you would need to measure. The other approach would be to revert the patch that caused the problem. e.g. diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 40b35be34f8d..062bb86e5fd8 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -884,7 +884,8 @@ static bool need_to_wait_for_sync(struct r1conf *conf, struct bio *bio) wait = false; else wait = true; - } + } else if (conf->barrier) + wait = true; return wait; } > >> >> >>> >>> Since this issue is a real deadlock we are hitting in a long-term 3.18 >>> kernel, is there any chance for cc-stable fix? Currently we applied >>> the rudimentary fix I posted. It basically switches context for >>> problematic RAID1 READs, and runs them from a different context. With >>> this fix we don't see the deadlock anymore. >>> >>> Also, can you please comment on another concern I expressed: >>> freeze_array() is now not reentrant. Meaning that if two threads call >>> it in parallel (and it could happen for the same MD), the first thread >>> calling unfreeze_array will mess up things for the second thread. >> >> Yes, that is a regression. This should be enough to fix it. Do you >> agree? >> >> Thanks, >> NeilBrown >> >> >> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c >> index 40b35be34f8d..5ad25c7d7453 100644 >> --- a/drivers/md/raid1.c >> +++ b/drivers/md/raid1.c >> @@ -984,7 +984,7 @@ static void freeze_array(struct r1conf *conf, int extra) >> * we continue. >> */ >> spin_lock_irq(&conf->resync_lock); >> - conf->array_frozen = 1; >> + conf->array_frozen += 1; >> wait_event_lock_irq_cmd(conf->wait_barrier, >> conf->nr_pending == conf->nr_queued+extra, >> conf->resync_lock, >> @@ -995,7 +995,7 @@ static void unfreeze_array(struct r1conf *conf) >> { >> /* reverse the effect of the freeze */ >> spin_lock_irq(&conf->resync_lock); >> - conf->array_frozen = 0; >> + conf->array_frozen -= 1; >> wake_up(&conf->wait_barrier); >> spin_unlock_irq(&conf->resync_lock); >> } > I partially agree. The fix that you suggest makes proper accounting of > whether the array is considered frozen or not. > But the problem is that even with this fix, both threads will think > that they "own" the array. And both will do things, like writing data > or so, which might interfere. The proper fix would ensure that only > one thread "owns" the array, and the second thread waits until the > first calls unfreeze_array(), and then the second thread becomes > "owner" of the array. And of course while there are threads that want > to "freeze" the array, other threads that want to do raise_barrier > etc, should wait. Is that really a problem? A call to "freeze_array()" doesn't mean "I want to own the array", but rather "No regular IO should be happening now". Most callers of freeze_array(): raid1_add_disk(), raid1_remove_disk(), stop(), raid1_reshape() are called with the "mddev_lock()" mutex held, so they cannot interfere with each other. handle_read_error() is called with one pending request, which will block any call on freeze_array(mddev, 0); - handle_read_error() itself calls freeze_array(mddev, 1) so it gets access. So these are already locked against the first four (as lock that ->array_frozen doesn't get corrupted). That just leaves raid1_quiesce(). That is mostly called under mddev_lock() so it won't interfere with the others. The one exception is where md_do_sync() calls mddev->pers->quiesce(mddev, 1); mddev->pers->quiesce(mddev, 0); As this doesn't "claim" the array, but just needs to ensure all pending IO completes, I don't think there is a problem. So it seems to me that your concerns are not actually a problem. Did I miss something? > > I am sorry for giving two negative responses in one email:) Better a negative response than no response :-) NeilBrown
Attachment:
signature.asc
Description: PGP signature