Hello Neil,

Thank you for your response.

On Fri, Jul 15, 2016 at 2:18 AM, NeilBrown <neilb@xxxxxxxx> wrote:
> On Thu, Jul 14 2016, Alexander Lyakas wrote:
>
>>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>>> index 32e282f4c83c..c528102b80b6 100644
>>> --- a/drivers/md/raid10.c
>>> +++ b/drivers/md/raid10.c
>>> @@ -1288,7 +1288,7 @@ read_again:
>>>                          sectors_handled;
>>>                  goto read_again;
>>>          } else
>>> -                generic_make_request(read_bio);
>>> +                bio_list_add_head(&current->bio_list, read_bio);
>>>          return;
>>>  }
>>
>> Unfortunately, this patch doesn't work. It is super elegant, and it
>> seems like it really should work. But the problem is that the "rdev"
>> to which we want to send the READ bio might itself be a remapping
>> device (dm-linear, for example). That device will create its own
>> remapped bio and call generic_make_request(), which will stick the
>> bio onto the TAIL of current->bio_list :(:(:( So we are back at
>> square one. This patch would work only if *all* the remapping drivers
>> in the stack did bio_list_add_head() instead of
>> generic_make_request() :(:(:(
>>
>> It seems the real fix would be for generic_make_request() itself to
>> use bio_list_add_head(), but as pointed out in
>> http://www.spinics.net/lists/raid/msg52756.html, there are some
>> concerns about changing the order of remapped bios.
>>
>
> While those concerns are valid, they are about hypothetical performance
> issues rather than observed deadlock issues.  So I wouldn't be too
> worried about them.

I am thinking of a hypothetical driver that splits, say, a 12KB WRITE
into 3x4KB WRITEs and submits them in order, hoping they will reach the
disk in that order so the disk can work sequentially. Head insertion
would deliberately hinder this. But I see that people much smarter than
me are in this discussion, so I will leave it to them:)
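For anyone following along, here is roughly the logic we keep running
into. This is a simplified sketch of the 3.18-era generic_make_request()
(reconstructed from memory of block/blk-core.c, with the checks and
error paths omitted -- not verbatim kernel code):

void generic_make_request(struct bio *bio)
{
        struct bio_list bio_list_on_stack;

        if (current->bio_list) {
                /*
                 * We are already inside some make_request_fn, so do
                 * not recurse; queue the bio for the top-level caller
                 * to submit later.  bio_list_add() queues at the TAIL,
                 * so a bio remapped by a stacked device lands behind
                 * everything already on the list.
                 */
                bio_list_add(current->bio_list, bio);
                return;
        }

        /* Top-level caller: submit bios until the list drains. */
        bio_list_init(&bio_list_on_stack);
        current->bio_list = &bio_list_on_stack;
        do {
                struct request_queue *q = bdev_get_queue(bio->bi_bdev);

                q->make_request_fn(q, bio);
                bio = bio_list_pop(current->bio_list);
        } while (bio);
        current->bio_list = NULL;
}

So even if raid1/raid10 puts its READ at the head of the list, the
moment dm-linear underneath remaps it and calls generic_make_request(),
the remapped bio goes back to the tail, behind the bio that is about to
block on the barrier.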
> However I think you said that you didn't want to touch core code at all
> (maybe I misunderstood) so that wouldn't help you anyway.

Yes, this is correct. Recompiling the kernel is a bit of a pain for us.
We were smart enough to configure md_mod as a loadable module, so at
least now I can patch the MD code easily:)

>
> One option would be to punt the read requests to raidXd:
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 40b35be34f8d..f795e27b2124 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1229,7 +1229,7 @@ read_again:
>                          sectors_handled;
>                  goto read_again;
>          } else
> -                generic_make_request(read_bio);
> +                reschedule_retry(r1_bio);
>          return;
>  }
>
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index 32e282f4c83c..eec38443075b 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -1288,7 +1288,7 @@ read_again:
>                          sectors_handled;
>                  goto read_again;
>          } else
> -                generic_make_request(read_bio);
> +                reschedule_retry(r10_bio);
>          return;
>  }

This is more or less what my rudimentary patch is doing, except that it
punts only when we really need to wait for the barrier.
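To make that concrete, the shape of my patch is roughly the following
(pseudo-code for the READ path of make_request() in the 3.18-era
raid1.c; barrier_would_block() is a made-up name for "wait_barrier()
would have to sleep", not a real function):

        if (barrier_would_block(conf)) {
                /*
                 * raid1d runs outside make_request context, so there
                 * current->bio_list is NULL and its call to
                 * generic_make_request() submits the READ immediately
                 * instead of queueing it behind us.
                 */
                reschedule_retry(r1_bio);
        } else {
                generic_make_request(read_bio);
        }

This way the common case still submits inline, and the cost of bouncing
to another thread is paid only when we would otherwise deadlock.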
>
> That might hurt performance, you would need to measure.
>
> The other approach would be to revert the patch that caused the problem.
> e.g.
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 40b35be34f8d..062bb86e5fd8 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -884,7 +884,8 @@ static bool need_to_wait_for_sync(struct r1conf *conf, struct bio *bio)
>                          wait = false;
>                  else
>                          wait = true;
> -        }
> +        } else if (conf->barrier)
> +                wait = true;
>
>          return wait;
>  }

I am not sure how this patch helps. You added another condition, and now
READs will also have to wait in some cases. But if array_frozen is set,
everybody still waits unconditionally, and I think that is the root
cause of the deadlock. I see that there will be no magic solution for
this problem:(

>
>>
>>>
>>>
>>>>
>>>> Since this issue is a real deadlock we are hitting in a long-term 3.18
>>>> kernel, is there any chance for a cc-stable fix? Currently we have
>>>> applied the rudimentary fix I posted. It basically switches context
>>>> for the problematic RAID1 READs and runs them from a different
>>>> context. With this fix we don't see the deadlock anymore.
>>>>
>>>> Also, can you please comment on another concern I expressed:
>>>> freeze_array() is now not reentrant. Meaning that if two threads call
>>>> it in parallel (and it could happen for the same MD), the first thread
>>>> calling unfreeze_array() will mess things up for the second thread.
>>>
>>> Yes, that is a regression.  This should be enough to fix it.  Do you
>>> agree?
>>>
>>> Thanks,
>>> NeilBrown
>>>
>>>
>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>>> index 40b35be34f8d..5ad25c7d7453 100644
>>> --- a/drivers/md/raid1.c
>>> +++ b/drivers/md/raid1.c
>>> @@ -984,7 +984,7 @@ static void freeze_array(struct r1conf *conf, int extra)
>>>      * we continue.
>>>      */
>>>     spin_lock_irq(&conf->resync_lock);
>>> -   conf->array_frozen = 1;
>>> +   conf->array_frozen += 1;
>>>     wait_event_lock_irq_cmd(conf->wait_barrier,
>>>                             conf->nr_pending == conf->nr_queued+extra,
>>>                             conf->resync_lock,
>>> @@ -995,7 +995,7 @@ static void unfreeze_array(struct r1conf *conf)
>>>  {
>>>     /* reverse the effect of the freeze */
>>>     spin_lock_irq(&conf->resync_lock);
>>> -   conf->array_frozen = 0;
>>> +   conf->array_frozen -= 1;
>>>     wake_up(&conf->wait_barrier);
>>>     spin_unlock_irq(&conf->resync_lock);
>>>  }
>>
>> I partially agree. The fix that you suggest makes proper accounting of
>> whether the array is considered frozen or not. But the problem is that
>> even with this fix, both threads will think that they "own" the array,
>> and both will do things, like writing data, which might interfere. The
>> proper fix would ensure that only one thread "owns" the array, and the
>> second thread waits until the first calls unfreeze_array(), at which
>> point the second thread becomes the "owner" of the array. And of
>> course, while there are threads that want to "freeze" the array, other
>> threads that want to do raise_barrier() etc. should wait.
>
> Is that really a problem?
> A call to "freeze_array()" doesn't mean "I want to own the array", but
> rather "No regular IO should be happening now".
>
> Most callers of freeze_array():
>   raid1_add_disk(), raid1_remove_disk(), stop(), raid1_reshape()
> are called with the "mddev_lock()" mutex held, so they cannot interfere
> with each other.
>
> handle_read_error() is called with one pending request, which will block
> any call to freeze_array(mddev, 0); handle_read_error() itself calls
> freeze_array(mddev, 1), so it gets access.
> So these are already locked against the first four (at least enough
> that ->array_frozen doesn't get corrupted).
>
> That just leaves raid1_quiesce().
> That is mostly called under mddev_lock(), so it won't interfere with the
> others.
> The one exception is where md_do_sync() calls
>   mddev->pers->quiesce(mddev, 1);
>   mddev->pers->quiesce(mddev, 0);
>
> As this doesn't "claim" the array, but just needs to ensure all pending
> IO completes, I don't think there is a problem.
>
> So it seems to me that your concerns are not actually a problem.
> Did I miss something?

I think you are correct. The only exception is mddev_suspend()/
mddev_resume(), which can be called from another module; but in my case
this is not happening. (For what I meant above by a thread "owning" the
freeze, see the P.S. below.)

Thanks,
Alex.

>
>>
>> I am sorry for giving two negative responses in one email:)
>
> Better a negative response than no response :-)
>
> NeilBrown
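P.S. To clarify the "single owner" semantics I had in mind for
freeze_array(): the second freezer would wait until the first one calls
unfreeze_array() before proceeding. A sketch only, against the 3.18-era
raid1.c and untested -- not something I am proposing to merge:

static void freeze_array(struct r1conf *conf, int extra)
{
        spin_lock_irq(&conf->resync_lock);
        /* Wait until any previous owner has called unfreeze_array(). */
        wait_event_lock_irq(conf->wait_barrier,
                            !conf->array_frozen,
                            conf->resync_lock);
        conf->array_frozen = 1;
        /* Then wait for pending regular IO to drain, as today. */
        wait_event_lock_irq_cmd(conf->wait_barrier,
                                conf->nr_pending == conf->nr_queued + extra,
                                conf->resync_lock,
                                flush_pending_writes(conf));
        spin_unlock_irq(&conf->resync_lock);
}

But given your analysis of the callers, the counting fix is enough for
me.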