Re: [BUG] MD/RAID1 hung forever on freeze_array

Thanks Neil,

On Tue, Dec 20, 2016 at 10:23 PM, NeilBrown <neilb@xxxxxxxx> wrote:
> On Tue, Dec 20 2016, Jinpu Wang wrote:
>
>> Hi Neil,
>>> This is the problem.  'hold' hasn't been initialised.
>>> We could either do:
>>>   bio_list_init(&hold);
>>>   bio_list_merge(&hold, &bio_list_on_stack);
>> I tried the above variant first and it led to a panic in the endio path:
>>
> ...
>>
>>     [exception RIP: bio_check_pages_dirty+65]
>
>
> I can't explain that at all.  Maybe if I saw the complete patch I would
> be able to see what went wrong (or maybe there is a separate cause).
I think I know the cause now. What I had was:

+ bio_list_init(&hold);
  bio_list_merge(&hold, &bio_list_on_stack);
- bio_list_init(&bio_list_on_stack);

I didn't read your comment carefully enough and thought we maybe didn't
need to re-init &bio_list_on_stack a second time.
But your idea is the same: take a copy before we call q->make_request_fn.
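
For the record, the corrected sequence (a minimal sketch; it mirrors
the hunk in the attached patch):

	struct bio_list hold;

	/* move everything currently on bio_list_on_stack aside */
	bio_list_init(&hold);
	bio_list_merge(&hold, &bio_list_on_stack);
	/* leave an empty list to collect the bios that
	 * q->make_request_fn() is about to generate
	 */
	bio_list_init(&bio_list_on_stack);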

>
>>>
>>> You didn't find 'hold' to be necessary in your testing, but I think
>>> that in more complex arrangements it could make an important difference.
>>
>> Could you elaborate a bit more? From my understanding, in the later
>> code we pop every bio from bio_list_on_stack and add it to either the
>> "lower" or the "same" bio_list, so merging both gives us the whole
>> list again, right?
>
> If there are several bios on bio_list_on_stack, and if processing the
> first one causes several calls to generic_make_request(), then we want
> all the bios passed to generic_make_request() to be handled *before* the
> remaining bios that were originally on bio_list_on_stack.  Of these new
> bios, we want to handle those for a lower level device first, and those
> for the same device after that.  Only when all of those have been
> completely submitted (together with any subordinate bios submitted to
> generic_make_request()) do we move on to the rest of bio_list_on_stack
> (which are at the same level as the first bio, or higher in the device
> stack).
>
> The usage of 'hold' would become important when you have multiple
> levels.  e.g. dm on md on scsi.  If the first bio sent to dm is split,
> then the first half needs to be processed all the way down to all the
> scsi devices, before the second half of the split is handled.
>
> Hope that helps,
> NeilBrown
>
Thanks, it does help a lot. I've attached the patch I'm still testing,
but so far so good.
Could you check if I got it right?
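
To restate my understanding of the ordering with a hypothetical
dm-on-md-on-scsi trace (device names made up):

  current->bio_list:  [A, B]       (both bios target dm0)
  handling A: dm splits it, generating A1 and A2 for the underlying
  md0, so after the sorting step:
      lower = [A1, A2]   (md0 sits below dm0; submitted first, and
                          their scsi children likewise before B)
      same  = []         (no new bios for dm0 itself)
      hold  = [B]        (resumed only once A's chain is done)
  resulting bio_list_on_stack: [A1, A2, B]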

Cheers,
-- 
Jinpu Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:       +49 30 577 008  042
Fax:      +49 30 577 008 299
Email:    jinpu.wang@xxxxxxxxxxxxxxxx
URL:      https://www.profitbricks.de

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss
From 9500d5e9a7ccd1f4397f927ffdc8d378955d477c Mon Sep 17 00:00:00 2001
From: Jack Wang <jinpu.wang@xxxxxxxxxxxxxxxx>
Date: Wed, 14 Dec 2016 16:55:52 +0100
Subject: [PATCH] block: fix deadlock between freeze_array() and wait_barrier()

When we call wait_barrier, we might have some bios waiting
in current->bio_list, which prevents the freeze_array call from
completing. Those can only be internal READs, which have already
passed the wait_barrier call (thus incrementing nr_pending), but
were still not submitted to the lower level, due to the
generic_make_request logic that avoids recursion. In such a case
we have a deadlock:
- array_frozen is already set to 1, so wait_barrier unconditionally
  waits, and
- the internal READ bios will not be submitted, so freeze_array will
  never complete.
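
A sketch of the resulting cycle:

  freeze_array():   sets array_frozen = 1, then waits for the pending
                    I/O counters to drain
  submitting task:  blocks in wait_barrier() because array_frozen == 1,
                    while the internal READs that already incremented
                    nr_pending still sit, unsubmitted, in its
                    current->bio_list

Neither side can make progress, so the array hangs.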

To fix this, modify generic_make_request so that, after each call to
->make_request_fn, the bios newly added to bio_list_on_stack are sorted:
those for lower-level devices are submitted first, then those for the
same level, and only then the bios that were already on the list.
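
After each iteration, bio_list_on_stack is therefore ordered as:

  [ lower-level bios ][ same-level bios ][ bios held from before ]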

Suggested-by: Neil Brown <neil.brown@suse.com>
Signed-off-by: Jack Wang <jinpu.wang@xxxxxxxxxxxxxxxx>
---
 block/blk-core.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index e8d15d8..5a6390a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2064,10 +2064,30 @@ blk_qc_t generic_make_request(struct bio *bio)
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
 		if (likely(blk_queue_enter(q, __GFP_DIRECT_RECLAIM) == 0)) {
+			struct bio_list lower, same, hold;
+
+			/* Create a fresh bio_list for all subordinate requests */
+			bio_list_init(&hold);
+			bio_list_merge(&hold, &bio_list_on_stack);
+			bio_list_init(&bio_list_on_stack);
 
 			ret = q->make_request_fn(q, bio);
 
 			blk_queue_exit(q);
+			/* sort new bios into those for a lower level
+			 * and those for the same level
+			 */
+			bio_list_init(&lower);
+			bio_list_init(&same);
+			while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL)
+				if (q == bdev_get_queue(bio->bi_bdev))
+					bio_list_add(&same, bio);
+				else
+					bio_list_add(&lower, bio);
+			/* now assemble so we handle the lowest level first */
+			bio_list_merge(&bio_list_on_stack, &lower);
+			bio_list_merge(&bio_list_on_stack, &same);
+			bio_list_merge(&bio_list_on_stack, &hold);
 
 			bio = bio_list_pop(current->bio_list);
 		} else {
-- 
2.7.4

