Thanks Neil, On Tue, Dec 20, 2016 at 10:23 PM, NeilBrown <neilb@xxxxxxxx> wrote: > On Tue, Dec 20 2016, Jinpu Wang wrote: > >> Hi Neil, >>> This is the problem. 'hold' hasn't been initialised. >>> We could either do: >>> bio_list_init(&hold); >>> bio_list_merge(&hold, &bio_list_on_stack); >> I tried above variant first and it lead to panic in endio path: >> > ... >> >> [exception RIP: bio_check_pages_dirty+65] > > > I can't explain that at all. Maybe if I saw the complete patch I would > be able to see what went wrong (or maybe there is a separate cause). I think I know the cause now, I did +bio_list_init(&hold); bio_list_merge(&hold, &bio_list_on_stack); - bio_list_init(&bio_list_on_stack) I didn't read your comment clear enough, just thought we maybe don't need to reinit &bio_list_on_stack twice. But your idea is same the copy before we can q->make_request_fn. > >>> >>> You didn't find 'hold' to be necessary in your testing, but I think that >>> is more complex arrangements it could make an important difference. >> >> Could you elaborate a bit more, from my understanding, in later code, >> we pop all bio from bio_list_on_stack, >> add it to either "lower" or "same" bio_list, so merge both will have >> the whole list again, right? > > If there are several bios on bio_list_on_stack, and if processing the > first one causes several calls to generic_make_request(), then we want > all the bios passed to generic_make_request() to be handled *before* the > remaining bios that were originally on bio_list_on_stack. Of these new > bios, we want to handle those for a lower level device first, and those > for the same device after that. Only when all of those have been > completely submitted (together with any subordinate bios submitted to > generic_make_request()) do we move on to the rest of bio_list_on_stack > (which are at the same level as the first bio, or higher in the device > stack). > > The usage of 'hold' would become important when you have multiple > levels. e.g. dm on md on scsi. If the first bio send to dm is split, > then the first half needs to be processed all the way down to all the > scsi devices, before the second half of the split is handled. > > Hope that helps, > NeilBrown > Thanks, it does help a lot, I attached the patch I'm still testing, but so far so good. Could you check if I got it right? Cheers, -- Jinpu Wang Linux Kernel Developer ProfitBricks GmbH Greifswalder Str. 207 D - 10405 Berlin Tel: +49 30 577 008 042 Fax: +49 30 577 008 299 Email: jinpu.wang@xxxxxxxxxxxxxxxx URL: https://www.profitbricks.de Sitz der Gesellschaft: Berlin Registergericht: Amtsgericht Charlottenburg, HRB 125506 B Geschäftsführer: Achim Weiss
From 9500d5e9a7ccd1f4397f927ffdc8d378955d477c Mon Sep 17 00:00:00 2001 From: Jack Wang <jinpu.wang@xxxxxxxxxxxxxxxx> Date: Wed, 14 Dec 2016 16:55:52 +0100 Subject: [PATCH] block: fix deadlock between freeze_array() and wait_barrier() When we call wait_barrier, we might have some bios waiting in current->bio_list, which prevents the array_freeze call to complete. Those can only be internal READs, which have already passed the wait_barrier call (thus incrementing nr_pending), but still were not submitted to the lower level, due to generic_make_request logic to avoid recursive calls. In such case, we have a deadlock: - array_frozen is already set to 1, so wait_barrier unconditionally waits, so - internal READ bios will not be submitted, thus freeze_array will never completes. To fix this, modify generic_make_request to always sort bio_list_on_stack first with lowest level, then higher, until same level. Suggested-by: Neil Brown <neil.brown.suse.com> Signed-off-by: Jack Wang <jinpu.wang@xxxxxxxxxxxxxxxx> --- block/blk-core.c | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+) diff --git a/block/blk-core.c b/block/blk-core.c index e8d15d8..5a6390a 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2064,10 +2064,30 @@ blk_qc_t generic_make_request(struct bio *bio) struct request_queue *q = bdev_get_queue(bio->bi_bdev); if (likely(blk_queue_enter(q, __GFP_DIRECT_RECLAIM) == 0)) { + struct bio_list lower, same, hold; + + /* Create a fresh bio_list for all subordinate requests */ + bio_list_init(&hold); + bio_list_merge(&hold, &bio_list_on_stack); + bio_list_init(&bio_list_on_stack); ret = q->make_request_fn(q, bio); blk_queue_exit(q); + /* sort new bios into those for a lower level + * and those for the same level + */ + bio_list_init(&lower); + bio_list_init(&same); + while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL) + if (q == bdev_get_queue(bio->bi_bdev)) + bio_list_add(&same, bio); + else + bio_list_add(&lower, bio); + /* now assemble so we handle the lowest level first */ + bio_list_merge(&bio_list_on_stack, &lower); + bio_list_merge(&bio_list_on_stack, &same); + bio_list_merge(&bio_list_on_stack, &hold); bio = bio_list_pop(current->bio_list); } else { -- 2.7.4