Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window

On 2017/2/24 3:58 AM, Shaohua Li wrote:
> On Fri, Feb 24, 2017 at 03:31:16AM +0800, Coly Li wrote:
>> On 2017/2/24 1:34 AM, Shaohua Li wrote:
>>> On Thu, Feb 23, 2017 at 01:54:47PM +0800, Coly Li wrote:
>> [snip]
>>>>>>>> As r1bio_pool preallocates 256 entries, this is unlikely  but not 
>>>>>>>> impossible.  If 256 threads all attempt a write (or read) that
>>>>>>>> crosses a boundary, then they will consume all 256 preallocated
>>>>>>>> entries, and want more. If there is no free memory, they will block
>>>>>>>> indefinitely.
>>>>>>>>
>>>>>>>
>>>>>>> If raid1_make_request() is modified into this way,
>>>>>>> +	if (bio_data_dir(split) == READ)
>>>>>>> +		raid1_read_request(mddev, split);
>>>>>>> +	else
>>>>>>> +		raid1_write_request(mddev, split);
>>>>>>> +	if (split != bio)
>>>>>>> +		generic_make_request(bio);
>>>>>>>
>>>>>>> Then the original bio will be added into the bio_list_on_stack of top
>>>>>>> level generic_make_request(), current->bio_list is initialized, when
>>>>>>> generic_make_request() is called nested in raid1_make_request(), the
>>>>>>> split bio will be added into current->bio_list and nothing else happens.
>>>>>>>
>>>>>>> After the nested generic_make_request() returns, execution comes back to
>>>>>>> the next line of generic_make_request(),
>>>>>>> 2022                         ret = q->make_request_fn(q, bio);
>>>>>>> 2023
>>>>>>> 2024                         blk_queue_exit(q);
>>>>>>> 2025
>>>>>>> 2026                         bio = bio_list_pop(current->bio_list);
>>>>>>>
>>>>>>> bio_list_pop() will return the second half of the split bio, and it is
>>>>>>
>>>>>> So in the above sequence, current->bio_list will hold bios in this order:
>>>>>> bios to the underlying disks, then the second half of the original bio.
>>>>>>
>>>>>> bio_list_pop will pop the bios to the underlying disks first and handle
>>>>>> them, then the second half of the original bio.
>>>>>>
>>>>>> That said, this doesn't work for an array stacked 3 layers deep, because
>>>>>> in a 3-layer array, handling the middle layer bio will make the 3rd layer
>>>>>> bio hold to bio_list again.
>>>>>>
>>>>>
>>>>> Could you please give me more hints?
>>>>> - What is the meaning of "hold" in "make the 3rd layer bio hold to
>>>>> bio_list again"?
>>>>> - Why does a deadlock happen if the 3rd layer bio is held on the bio_list
>>>>> again?
>>>>
>>>> I tried to set up a 4-layer stacked md raid1 and reduced the I/O barrier
>>>> bucket size to 8MB; after running for 10 hours, no deadlock was observed.
>>>>
>>>> Here is how the 4-layer stacked raid1 is set up:
>>>> - There are 4 NVMe SSDs; on each SSD I create four 500GB partitions,
>>>>   /dev/nvme0n1:  nvme0n1p1, nvme0n1p2, nvme0n1p3, nvme0n1p4
>>>>   /dev/nvme1n1:  nvme1n1p1, nvme1n1p2, nvme1n1p3, nvme1n1p4
>>>>   /dev/nvme2n1:  nvme2n1p1, nvme2n1p2, nvme2n1p3, nvme2n1p4
>>>>   /dev/nvme3n1:  nvme3n1p1, nvme3n1p2, nvme3n1p3, nvme3n1p4
>>>> - Here is how the 4-layer stacked raid1 is assembled; level 1 is the top
>>>> level and level 4 is the bottom level of the stacked devices,
>>>>   - level 1:
>>>> 	/dev/md40: /dev/md30  /dev/md31
>>>>   - level 2:
>>>> 	/dev/md30: /dev/md20  /dev/md21
>>>> 	/dev/md31: /dev/md22  /dev/md23
>>>>   - level 3:
>>>> 	/dev/md20: /dev/md10  /dev/md11
>>>> 	/dev/md21: /dev/md12  /dev/md13
>>>> 	/dev/md22: /dev/md14  /dev/md15
>>>> 	/dev/md23: /dev/md16  /dev/md17
>>>>   - level 4:
>>>> 	/dev/md10: /dev/nvme0n1p1  /dev/nvme1n1p1
>>>> 	/dev/md11: /dev/nvme2n1p1  /dev/nvme3n1p1
>>>> 	/dev/md12: /dev/nvme0n1p2  /dev/nvme1n1p2
>>>> 	/dev/md13: /dev/nvme2n1p2  /dev/nvme3n1p2
>>>> 	/dev/md14: /dev/nvme0n1p3  /dev/nvme1n1p3
>>>> 	/dev/md15: /dev/nvme2n1p3  /dev/nvme3n1p3
>>>> 	/dev/md16: /dev/nvme0n1p4  /dev/nvme1n1p4
>>>> 	/dev/md17: /dev/nvme2n1p4  /dev/nvme3n1p4
>>>>
>>>> Here is the fio job file,
>>>> [global]
>>>> direct=1
>>>> thread=1
>>>> ioengine=libaio
>>>>
>>>> [job]
>>>> filename=/dev/md40
>>>> readwrite=write
>>>> numjobs=10
>>>> blocksize=33M
>>>> iodepth=128
>>>> time_based=1
>>>> runtime=10h
>>>>
>>>> I planned to learn how the deadlock arises by analyzing a deadlock
>>>> condition. Maybe it was because the 8MB bucket unit size is small enough;
>>>> now I will try to run with a 512KB bucket unit size and see whether I can
>>>> encounter a deadlock.
>>>
>>> I don't think raid1 could easily trigger the deadlock. Maybe you should try
>>> raid10. The resync case is hard to trigger for raid1, and the memory pressure
>>> case is hard to trigger for both raid1/10. But it's possible to trigger.
>>>
>>> The 3-layer case is something like this:
>>
>> Hi Shaohua,
>>
>> Let me try to follow your reasoning through the split-in-while-loop case
>> (this is my new I/O barrier patch). I assume the original bio is a write
>> bio, and that the original bio is split and handled in a while loop in
>> raid1_make_request().
>>
>>> 1. in level1, set current->bio_list, split bio to bio1 and bio2
>>
>> This is done in level1 raid1_make_request().
>>
>>> 2. remap bio1 to level2 disk, and queue bio1-level2 in current->bio_list
>>
>> The remap is done by raid1_write_request(), and bio1_level2 may be added
>> into one of two lists:
>> - plug->pending:
>>   bios in plug->pending may be handled in raid1_unplug(), or in
>> flush_pending_writes() of raid1d().
>>   If the current task is about to be scheduled out, raid1_unplug() will
>> merge plug->pending's bios into conf->pending_bio_list, and
>> conf->pending_bio_list will be handled in raid1d.
>>   If raid1_unplug() is triggered by blk_finish_plug(), it is also
>> handled in raid1d.
>>
>> - conf->pending_bio_list:
>>   bios in this list are handled in raid1d by calling flush_pending_writes().
>>
>>
>> So the generic_make_request() that handles bio1_level2 can only be called
>> in the context of the raid1d thread; bio1_level2 is added to raid1d's
>> bio_list_on_stack, not to that of the caller of level1
>> generic_make_request().
>>
>>> 3. queue bio2 in current->bio_list
>>
>> Likewise, bio2_level2 is on level1 raid1d's bio_list_on_stack.
>> Then control returns to level1 generic_make_request().
>>
>>> 4. generic_make_request then pops bio1-level2
>>
>> At this moment, bio1_level2 and bio2_level2 are in either plug->pending
>> or conf->pending_bio_list, so bio_list_pop() returns NULL, and level1
>> generic_make_request() returns to its caller.
>>
>> If, before bio_list_pop() is called, the kernel thread raid1d wakes up and
>> iterates conf->pending_bio_list in flush_pending_writes() or iterates
>> plug->pending in raid1_unplug() via blk_finish_plug(), that happens on
>> level1 raid1d's stack; the bios will not show up in level1
>> generic_make_request(), and bio_list_pop() still returns NULL.
>>
>>> 5. remap bio1-level2 to level3 disk, and queue bio1-level2-level3 in current->bio_list
>>
>> bio2_level2 is at the head of conf->pending_bio_list or plug->pending, so
>> bio2_level2 is handled first.
>>
>> level1 raid1 calls level2 generic_make_request(), then level2
>> raid1_make_request() is called, then level2 raid1_write_request().
>> bio2_level2 is remapped to bio2_level3 and added into plug->pending (in
>> level1 raid1d's context) or conf->pending_bio_list (level2 raid1's conf);
>> it will be handled by level2 raid1d when level2 raid1d wakes up.
>> Then control returns back to level1 raid1d, and bio1_level2
>> is handled by level2 generic_make_request() and added into level2
>> plug->pending or conf->pending_bio_list. In this case neither
>> bio2_level2 nor bio1_level2 is added into any bio_list_on_stack.
>>
>> Then level1 raid1d handles all bios in level1 conf->pending_bio_list,
>> and sleeps.
>>
>> Then level2 raid1d wakes up, and handle bio2_level3 and bio1_level3, by
>> iterate level2 plug->pending or conf->pending_bio_list, and calling
>> level3 generic_make_request().
>>
>> In level3 generic_make_request(), because this is level2 raid1d's context,
>> not level1 raid1d's context, bio2_level3 is sent into
>> q->make_request_fn() and finally added into level3 plug->pending or
>> conf->pending_bio_list; then control comes back to level3
>> generic_make_request().
>>
>> Now level2 raid1d's current->bio_list is empty, so level3
>> generic_make_request() returns to level2 raid1d, which continues to
>> iterate and sends bio1_level3 into level3 generic_make_request().
>>
>> After all bios are added into level3 plug->pending or
>> conf->pending_bio_list, level2 raid1d sleeps.
>>
>> Now level3 raid1d wakes up and continues to iterate level3 plug->pending or
>> conf->pending_bio_list, calling generic_make_request() on the underlying
>> devices (which might be real devices).
>>
>> In the whole path above, each lower-level generic_make_request() is
>> called in the context of the lower-level raid1d. No recursive call happens
>> in the normal code path.
>>
>> In the raid1 code, a recursive call of generic_make_request() only happens
>> for a READ bio, but if the array is not frozen, no barrier is required, so
>> it doesn't hurt.
>>
>>
>>> 6. generic_make_request then pops bio2, but bio1 hasn't finished yet, deadlock
>>
>> As I understand the code, it won't happen either.
>>
>>>
>>> The problem is that we add new bios to the tail of current->bio_list.
>>
>> New bios are added into other contexts' current->bio_list, which are
>> different lists. If my understanding is correct, a deadlock won't
>> happen this way.
>>
>> If my understanding is correct, I suddenly realize why raid1
>> bios are handled indirectly in another kernel thread.
>>
>> (Just for your information: while I was writing this, another run of
>> testing finished with no deadlock. This time I reduced the I/O barrier
>> bucket unit size to 512KB and set blocksize to 33MB in the fio job file.
>> It is really slow (130MB/s), but no deadlock was observed.)
>>
>>
>> Stacked raid1 devices are really confusing; if I am wrong, any
>> hint is warmly welcome.
> 
> Aha, you are correct. I missed that we never directly dispatch bios in a
> schedule-based blk-plug flush. I'll drop the patch. Thanks for the
> persistence, good discussion!

Thank you for the encouragement :-)
After thinking carefully about your patch again, I suggest letting it go
ahead and keeping your fix in -next. The reasons are:
1) For a 32-bit bi_iter.bi_size, it is safe and there is no hash conflict.
But if someone (maybe I) changes bi_iter.bi_size from 32-bit to 64-bit, a
DISCARD bio can very easily be split into more than 512 pieces, and then
there will be a lot of hash conflicts. If a resync is triggered, it will
very easily go into a deadlock.
  If this deadlock happens in the future, it will be quite hard to find
the root cause. If we have this fix now to avoid the possible future hash
conflict, life will be easier at that time.
2) If we decide to use the nested generic_make_request() fix from your
patch, then the while loop no longer exists and more CPU cycles will be
consumed, so it does not make sense to save a branch with a function
pointer while paying the cost in code readability. So removing the
function pointer is better once we take reason 1).

Please keep this patch in -next; it helps to avoid a future bug.

Thanks.

Coly

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


