Hi Neil,

I see now that the deadlock problem has been addressed by your commits:

79bd99596b7305ab08109a8bf44a6a4511dbf1cd
    blk: improve order of bio handling in generic_make_request()
f5fe1b51905df7cfe4fdfd85c5fb7bc5b71a094f
    blk: Ensure users for current->bio_list can see the full list.

Thanks for fixing this! (A small userspace sketch of the reordering idea
behind these commits is at the end of this mail, below the quoted thread.)

Alex.

On Wed, Jul 20, 2016 at 1:19 AM NeilBrown <neilb@xxxxxxxx> wrote:
>
> On Tue, Jul 19 2016, Alexander Lyakas wrote:
>
> > Hello Neil,
> > Thank you for your response.
> >
> > On Fri, Jul 15, 2016 at 2:18 AM, NeilBrown <neilb@xxxxxxxx> wrote:
> >> On Thu, Jul 14 2016, Alexander Lyakas wrote:
> >>
> >>>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> >>>> index 32e282f4c83c..c528102b80b6 100644
> >>>> --- a/drivers/md/raid10.c
> >>>> +++ b/drivers/md/raid10.c
> >>>> @@ -1288,7 +1288,7 @@ read_again:
> >>>>                                 sectors_handled;
> >>>>                         goto read_again;
> >>>>                 } else
> >>>> -                       generic_make_request(read_bio);
> >>>> +                       bio_list_add_head(&current->bio_list, read_bio);
> >>>>                 return;
> >>>>         }
> >>>
> >>> Unfortunately, this patch doesn't work. It is super elegant, and seems
> >>> like it really should work. But the problem is that the "rdev", to
> >>> which we want to send the READ bio, might also be a remapping device
> >>> (dm-linear, for example). This device will create its own remapped bio
> >>> and will call generic_make_request(), which will stick the bio onto
> >>> current->bio_list TAIL :(:(:( So we are back at square one. This patch
> >>> would work if *all* the remapping drivers in the stack were doing
> >>> bio_list_add_head() instead of generic_make_request() :(:(:(
> >>>
> >>> It seems the real fix should be that generic_make_request() would use
> >>> bio_list_add_head(), but as pointed out in
> >>> http://www.spinics.net/lists/raid/msg52756.html, there are some
> >>> concerns about changing the order of remapped bios.
> >>>
> >>
> >> While those concerns are valid, they are about hypothetical performance
> >> issues rather than observed deadlock issues. So I wouldn't be too
> >> worried about them.
> > I am thinking of a hypothetical driver that splits, say, a 12Kb WRITE
> > into 3x4Kb WRITEs, and submits them in the proper order, hoping they
> > will get to the disk in the same order, and the disk will work
> > sequentially. But now we are deliberately hindering this. But I see
> > that people much smarter than me are in this discussion, so I will
> > leave it to them:)
>
> Sure, that is the concern and ideally we would keep things in order.
> But the elevator should re-order things in most cases so it shouldn't
> matter too much.
> For upstream, we obviously aim for the best possible. For stable backports,
> we sometimes accept non-ideal code when the change is less intrusive.
>
> If we do backport something to -stable, I would do it "properly" using
> something very similar to the upstream version. You don't seem to want
> that for your code, so I'm suggesting options that, while not 100% ideal,
> should suit your needs - and obviously you will test your use cases.
>
> >
> >> However I think you said that you didn't want to touch core code at all
> >> (maybe I misunderstood) so that wouldn't help you anyway.
> > Yes, this is correct. Recompiling the kernel is a bit of a pain for
> > us.
> > We were smart enough to configure the md_mod as a loadable module,
> > so at least now I can patch MD code easily:)
> >
> >>
> >> One option would be to punt the requests to raidXd:
> >>
> >> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> >> index 40b35be34f8d..f795e27b2124 100644
> >> --- a/drivers/md/raid1.c
> >> +++ b/drivers/md/raid1.c
> >> @@ -1229,7 +1229,7 @@ read_again:
> >>                                 sectors_handled;
> >>                         goto read_again;
> >>                 } else
> >> -                       generic_make_request(read_bio);
> >> +                       reschedule_retry(r1_bio);
> >>                 return;
> >>         }
> >>
> >> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> >> index 32e282f4c83c..eec38443075b 100644
> >> --- a/drivers/md/raid10.c
> >> +++ b/drivers/md/raid10.c
> >> @@ -1288,7 +1288,7 @@ read_again:
> >>                                 sectors_handled;
> >>                         goto read_again;
> >>                 } else
> >> -                       generic_make_request(read_bio);
> >> +                       reschedule_retry(r10_bio);
> >>                 return;
> >>         }
> > This is more or less what my rudimentary patch is doing, except it is
> > doing it only when we really need to wait for the barrier.
> >
> >>
> >>
> >> That might hurt performance, you would need to measure.
> >> The other approach would be to revert the patch that caused the problem.
> >> e.g.
> >> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> >> index 40b35be34f8d..062bb86e5fd8 100644
> >> --- a/drivers/md/raid1.c
> >> +++ b/drivers/md/raid1.c
> >> @@ -884,7 +884,8 @@ static bool need_to_wait_for_sync(struct r1conf *conf, struct bio *bio)
> >>                         wait = false;
> >>                 else
> >>                         wait = true;
> >> -       }
> >> +       } else if (conf->barrier)
> >> +               wait = true;
> >>
> >>         return wait;
> >>  }
> >>
> >>
> > I am not sure how this patch helps. You added another condition, and
> > now READs will also wait in some cases. But still, if array_frozen is
> > set, everybody will wait unconditionally, which is the root cause of
> > the deadlock, I think.
>
> Maybe you're right. I was thinking that array_frozen only causes
> problems if there are read requests in the generic_make_request queue,
> and this change would keep them out. But I might have dropped a ball
> somewhere.
>
> >
> > I see that there will be no magic solution for this problem:(
>
> Not really.
>
>
> NeilBrown
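
P.S. For anyone else who hits this on an older kernel and wants to see what
the reordering actually buys us, below is a small userspace toy model I put
together. It is NOT the kernel code; every name in it (bl_add, bl_pop,
make_request, CHUNK, BOTTOM, the three-level stack) is made up for the
illustration. It just mimics, as far as I understand it, the dispatch loop
of generic_make_request() in block/blk-core.c before and after the first of
the two commits above.

/*
 * Toy userspace model (NOT the kernel code) of the generic_make_request()
 * dispatch loop, illustrating the reordering idea of the commits above.
 *
 * Level 0 stands for an md-like device, level 1 for a dm-linear-like
 * device, level 2 for the disk. A non-bottom level splits anything bigger
 * than CHUNK sectors into a remapped first chunk (sent one level down) and
 * a same-level remainder, roughly like the raid1/raid10 "read_again" loop.
 */
#include <stdio.h>
#include <stdlib.h>

#define BOTTOM 2
#define CHUNK  4

struct bio { int level, offset, sectors; struct bio *next; };
struct bio_list { struct bio *head, *tail; };

static void bl_add(struct bio_list *bl, struct bio *b)   /* append at tail */
{
    b->next = NULL;
    if (bl->tail) bl->tail->next = b; else bl->head = b;
    bl->tail = b;
}

static struct bio *bl_pop(struct bio_list *bl)           /* take from head */
{
    struct bio *b = bl->head;
    if (b && !(bl->head = b->next)) bl->tail = NULL;
    return b;
}

static void bl_merge(struct bio_list *dst, struct bio_list *src)
{
    struct bio *b;
    while ((b = bl_pop(src))) bl_add(dst, b);
}

static struct bio *mkbio(int level, int offset, int sectors)
{
    struct bio *b = calloc(1, sizeof(*b));
    b->level = level; b->offset = offset; b->sectors = sectors;
    return b;
}

/* One remapping level: split big bios, queue the pieces on the shared list. */
static void make_request(struct bio_list *pending, struct bio *b)
{
    if (b->level == BOTTOM) {
        printf("    disk completes offset=%-2d len=%d\n", b->offset, b->sectors);
    } else if (b->sectors > CHUNK) {    /* split: first chunk goes one level down */
        printf("    level %d splits offset=%-2d len=%d\n", b->level, b->offset, b->sectors);
        bl_add(pending, mkbio(b->level + 1, b->offset, CHUNK));
        bl_add(pending, mkbio(b->level, b->offset + CHUNK, b->sectors - CHUNK));
    } else {                            /* small enough: remap the whole bio */
        printf("    level %d remaps offset=%-2d len=%d\n", b->level, b->offset, b->sectors);
        bl_add(pending, mkbio(b->level + 1, b->offset, b->sectors));
    }
    free(b);
}

static void submit(struct bio *bio, int lower_first)
{
    struct bio_list onstack = { NULL, NULL };   /* plays current->bio_list */

    bl_add(&onstack, bio);
    while ((bio = bl_pop(&onstack))) {
        int level = bio->level;
        struct bio_list fresh = { NULL, NULL };

        make_request(&fresh, bio);
        if (!lower_first) {             /* old behaviour: plain tail append */
            bl_merge(&onstack, &fresh);
            continue;
        }
        /* fixed behaviour: newly generated lower-level bios first, then
         * same-level ones, then whatever was already pending */
        struct bio_list lower = { NULL, NULL }, same = { NULL, NULL };
        while ((bio = bl_pop(&fresh)))
            bl_add(bio->level > level ? &lower : &same, bio);
        bl_merge(&lower, &same);
        bl_merge(&lower, &onstack);
        onstack = lower;
    }
}

int main(void)
{
    printf("old order (everything appended at the tail):\n");
    submit(mkbio(0, 0, 12), 0);
    printf("fixed order (bios for lower devices dispatched first):\n");
    submit(mkbio(0, 0, 12), 1);
    return 0;
}

Saved as, say, toy_biolist.c and built with "gcc -Wall toy_biolist.c", the
first half of the output shows the old behaviour: level 0 gets to dispatch
its split remainder before the first chunk has reached the disk, which is
roughly the window in which raid1/raid10 could sit in wait_barrier()
forever. The second half shows the fixed ordering: each remapped chunk is
driven all the way down to the disk before the same-level remainder is
picked up. In the real patch the pending list is current->bio_list and the
sorting happens inside generic_make_request() itself.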