2017-01-04 19:50 GMT+01:00 Mike Snitzer <snitzer@xxxxxxxxxx>: > On Wed, Jan 04 2017 at 12:12am -0500, > NeilBrown <neilb@xxxxxxxx> wrote: > >> On Tue, Jan 03 2017, Jack Wang wrote: >> >> > 2016-12-23 12:45 GMT+01:00 Lars Ellenberg <lars.ellenberg@xxxxxxxxxx>: >> >> On Fri, Dec 23, 2016 at 09:49:53AM +0100, Michael Wang wrote: >> >>> Dear Maintainers >> >>> >> >>> I'd like to ask for the status of this patch since we hit the >> >>> issue too during our testing on md raid1. >> >>> >> >>> Split remainder bio_A was queued ahead, following by bio_B for >> >>> lower device, at this moment raid start freezing, the loop take >> >>> out bio_A firstly and deliver it, which will hung since raid is >> >>> freezing, while the freezing never end since it waiting for >> >>> bio_B to finish, and bio_B is still on the queue, waiting for >> >>> bio_A to finish... >> >>> >> >>> We're looking for a good solution and we found this patch >> >>> already progressed a lot, but we can't find it on linux-next, >> >>> so we'd like to ask are we still planning to have this fix >> >>> in upstream? >> >> >> >> I don't see why not, I'd even like to have it in older kernels, >> >> but did not have the time and energy to push it. >> >> >> >> Thanks for the bump. >> >> >> >> Lars >> >> >> > Hi folks, >> > >> > As Michael mentioned, we hit a bug this patch is trying to fix. >> > Neil suggested another way to fix it. I attached below. >> > I personal prefer Neil's version as it's less code change, and straight forward. >> > >> > Could you share your comments, we can get one fix into mainline. >> > >> > Thanks, >> > Jinpu >> > From 69a4829a55503e496ce9c730d2c8e3dd8a08874a Mon Sep 17 00:00:00 2001 >> > From: NeilBrown <neilb@xxxxxxxx> >> > Date: Wed, 14 Dec 2016 16:55:52 +0100 >> > Subject: [PATCH] block: fix deadlock between freeze_array() and wait_barrier() >> > >> > When we call wait_barrier, we might have some bios waiting >> > in current->bio_list, which prevents the array_freeze call to >> > complete. Those can only be internal READs, which have already >> > passed the wait_barrier call (thus incrementing nr_pending), but >> > still were not submitted to the lower level, due to generic_make_request >> > logic to avoid recursive calls. In such case, we have a deadlock: >> > - array_frozen is already set to 1, so wait_barrier unconditionally waits, so >> > - internal READ bios will not be submitted, thus freeze_array will >> > never completes. >> > >> > To fix this, modify generic_make_request to always sort bio_list_on_stack >> > first with lowest level, then higher, until same level. >> > >> > Sent to linux-raid mail list: >> > https://marc.info/?l=linux-raid&m=148232453107685&w=2 >> > >> >> This should probably also have >> >> Inspired-by: Lars Ellenberg <lars.ellenberg@xxxxxxxxxx> >> >> or something that, as I was building on Lars' ideas when I wrote this. >> >> It would also be worth noting in the description that this addresses >> issues with dm and drbd as well as md. > > I never saw this patch but certainly like the relative simplicity of the > solution when compared with other approaches taken, e.g. (5 topmost > commits on this branch): > http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=wip > >> In fact, I think that with this patch in place, much of the need for the >> rescue_workqueue won't exist any more. I cannot promise it can be >> removed completely, but it should be to hard to make it optional and >> only enabled for those few block devices that will still need it. >> The rescuer should only be needed for a bioset which can be allocated >> From twice in the one call the ->make_request_fn. This would include >> raid0 for example, though raid0_make_reqest could be re-written to not >> use a loop and to just call generic_make_request(bio) if bio != split. > > Mikulas, would you be willing to try the below patch with the > dm-snapshot deadlock scenario and report back on whether it fixes that? > > Patch below looks to be the same as here: > https://marc.info/?l=linux-raid&m=148232453107685&q=p3 > > Neil and/or others if that isn't the patch that should be tested please > provide a pointer to the latest. > > Thanks, > Mike Thanks Mike, I've rebased the patch on to Linux-4.10-rc2, and updated the description as Neil suggested. If Mikulas get possitive feedback, then we can go with it. Cheers, Jinpu
From 4ffaefb719c129ed51f9fcb235b945caf56de8d1 Mon Sep 17 00:00:00 2001 From: NeilBrown <neilb@xxxxxxxx> Date: Wed, 14 Dec 2016 16:55:52 +0100 Subject: [PATCH] block: fix deadlock between freeze_array() and wait_barrier() When we call wait_barrier, we might have some bios waiting in current->bio_list, which prevents the array_freeze call to complete. Those can only be internal READs, which have already passed the wait_barrier call (thus incrementing nr_pending), but still were not submitted to the lower level, due to generic_make_request logic to avoid recursive calls. In such case, we have a deadlock: - array_frozen is already set to 1, so wait_barrier unconditionally waits, so - internal READ bios will not be submitted, thus freeze_array will never completes. To fix this, modify generic_make_request to always sort bio_list_on_stack first with lowest level, then higher, until same level. This would address issuses with dm and drbd as well as md. Sent to linux-raid mail list: https://marc.info/?l=linux-raid&m=148232453107685&w=2 Inspired-by: Lars Ellenberg <lars.ellenberg@xxxxxxxxxx> Suggested-by: NeilBrown <neilb@xxxxxxxx> Signed-off-by: Jack Wang <jinpu.wang@xxxxxxxxxxxxxxxx> --- block/blk-core.c | 21 +++++++++++++++++++++ 1 file changed, 21 insertions(+) diff --git a/block/blk-core.c b/block/blk-core.c index 61ba08c..2f74129 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2019,9 +2019,30 @@ blk_qc_t generic_make_request(struct bio *bio) struct request_queue *q = bdev_get_queue(bio->bi_bdev); if (likely(blk_queue_enter(q, false) == 0)) { + struct bio_list lower, same, hold; + + /* Create a fresh bio_list for all subordinate requests */ + bio_list_init(&hold); + bio_list_merge(&hold, &bio_list_on_stack); + bio_list_init(&bio_list_on_stack); + ret = q->make_request_fn(q, bio); blk_queue_exit(q); + /* sort new bios into those for a lower level + * and those for the same level + */ + bio_list_init(&lower); + bio_list_init(&same); + while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL) + if (q == bdev_get_queue(bio->bi_bdev)) + bio_list_add(&same, bio); + else + bio_list_add(&lower, bio); + /* now assemble so we handle the lowest level first */ + bio_list_merge(&bio_list_on_stack, &lower); + bio_list_merge(&bio_list_on_stack, &same); + bio_list_merge(&bio_list_on_stack, &hold); bio = bio_list_pop(current->bio_list); } else { -- 2.7.4