Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window

Hi Coly, Hi Shaohua,


>
> Hi Shaohua,
>
> I am trying to catch up with you; let me follow your reasoning through
> the split-in-while-loop case (this is with my new I/O barrier patch
> applied). I assume the original bio is a write bio, and that it is
> split and handled in a while loop in raid1_make_request().

It's still possible for a read bio; we hit a deadlock there in the past.
See https://patchwork.kernel.org/patch/9498949/

Also:
http://www.spinics.net/lists/raid/msg52792.html

Regards,
Jack

>
>> 1. in level1, set current->bio_list, split bio to bio1 and bio2
>
> This is done in level1 raid1_make_request().
>
>> 2. remap bio1 to level2 disk, and queue bio1-level2 in current->bio_list
>
> The remap is done by raid1_write_request(), and bio1_level2 may be added
> to one of two lists:
> - plug->pending:
>   bios in plug->pending may be handled in raid1_unplug(), or in
> flush_pending_writes() of raid1d().
>   If the current task is about to be scheduled out, raid1_unplug() merges
> plug->pending's bios into conf->pending_bio_list, and
> conf->pending_bio_list is then handled by raid1d.
>   If raid1_unplug() is triggered by blk_finish_plug(), the bios are also
> handled by raid1d.
>
> - conf->pending_bio_list:
>   bios in this list are handled by raid1d, which calls flush_pending_writes().
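
For the hand-off described in the two bullets above, the key point as I
read it is that the submitting task only queues the writes and wakes the
worker, and the actual submission to the lower device happens in raid1d's
context. Here is a minimal userspace model of that hand-off, just to make
sure we mean the same thing; none of it is real md code, the names
(raid1d_model, unplug_model, fake_bio) are invented, and the shared list
only stands in for conf->pending_bio_list. Build with "cc -pthread".

/*
 * Toy model of the plug->pending / conf->pending_bio_list hand-off.
 * Not real md code; it only shows that the submitter queues and wakes,
 * while the issuing happens in the worker's ("raid1d's") context.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct fake_bio {
	int id;
	struct fake_bio *next;
};

/* models conf->pending_bio_list plus its lock and wakeup */
static struct fake_bio *pending_head;
static pthread_mutex_t conf_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t raid1d_wake = PTHREAD_COND_INITIALIZER;
static int stopping;

/* models raid1d calling flush_pending_writes(): issue in THIS context */
static void *raid1d_model(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&conf_lock);
	for (;;) {
		while (pending_head) {
			struct fake_bio *bio = pending_head;

			pending_head = bio->next;
			pthread_mutex_unlock(&conf_lock);
			/* the real code would call generic_make_request()
			 * on the lower-level device here */
			printf("raid1d: issuing bio %d\n", bio->id);
			free(bio);
			pthread_mutex_lock(&conf_lock);
		}
		if (stopping)
			break;
		pthread_cond_wait(&raid1d_wake, &conf_lock);
	}
	pthread_mutex_unlock(&conf_lock);
	return NULL;
}

/* models raid1_unplug() merging plug->pending into the conf list and
 * waking the raid1d thread (like md_wakeup_thread) */
static void unplug_model(struct fake_bio *plug_pending)
{
	pthread_mutex_lock(&conf_lock);
	while (plug_pending) {
		struct fake_bio *bio = plug_pending;

		plug_pending = bio->next;
		bio->next = pending_head;
		pending_head = bio;
	}
	pthread_mutex_unlock(&conf_lock);
	pthread_cond_signal(&raid1d_wake);
}

int main(void)
{
	pthread_t raid1d;
	struct fake_bio *plug_pending = NULL;	/* models plug->pending */
	int i;

	pthread_create(&raid1d, NULL, raid1d_model, NULL);

	/* like raid1_write_request(): remapped write bios go onto the
	 * plug list, the submitter never issues them itself */
	for (i = 1; i <= 2; i++) {
		struct fake_bio *bio = calloc(1, sizeof(*bio));

		bio->id = i;
		bio->next = plug_pending;
		plug_pending = bio;
	}

	/* blk_finish_plug() / schedule(): hand everything to raid1d */
	unplug_model(plug_pending);

	/* tell the model worker to exit once the list is drained */
	pthread_mutex_lock(&conf_lock);
	stopping = 1;
	pthread_mutex_unlock(&conf_lock);
	pthread_cond_signal(&raid1d_wake);

	pthread_join(raid1d, NULL);
	return 0;
}

The output is just the worker printing that it issued the two bios; after
the unplug the submitting context never touches them again.
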
>
>
> So the generic_make_request() that handles bio1_level2 can only be called
> in the context of the raid1d thread; bio1_level2 is added to raid1d's
> bio_list_on_stack, not to that of the caller of level1 generic_make_request().
>
>> 3. queue bio2 in current->bio_list
>
> Same here: bio2_level2 ends up on level1 raid1d's bio_list_on_stack.
> Then we are back in level1 generic_make_request().
>
>> 4. generic_make_request then pops bio1-level2
>
> At this moment, bio1_level2 and bio2_level2 are on either plug->pending
> or conf->pending_bio_list, so bio_list_pop() returns NULL, and level1
> generic_make_request() returns to its caller.
>
> If, before bio_list_pop() is called, the raid1d kernel thread wakes up and
> iterates conf->pending_bio_list in flush_pending_writes(), or iterates
> plug->pending in raid1_unplug() via blk_finish_plug(), that happens on
> level1 raid1d's stack; the bios will not show up in level1
> generic_make_request(), and bio_list_pop() still returns NULL.
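
Just to be sure we mean the same thing by "bio_list_pop() returns NULL":
the way I read generic_make_request(), a task that is already inside the
submission loop only appends new bios to its own current->bio_list, and a
bio queued on a different task's list never shows up there, so the pop can
legitimately come back empty. Below is a toy single-threaded model of only
that list handling; it is not the real block layer code, and submit_model()
and make_request_model() are invented names.

/*
 * Toy userspace model of the current->bio_list trick in
 * generic_make_request().  Not kernel code; all names are invented.
 */
#include <stdio.h>
#include <stdlib.h>

struct bio {
	int level;		/* which "device level" this bio targets */
	struct bio *next;
};

struct bio_list {
	struct bio *head, *tail;
};

/* per task in the kernel (task_struct->bio_list); a plain static is
 * enough for this single-threaded model */
static struct bio_list *current_bio_list;

static void bio_list_add(struct bio_list *bl, struct bio *bio)
{
	bio->next = NULL;
	if (bl->tail)
		bl->tail->next = bio;
	else
		bl->head = bio;
	bl->tail = bio;
}

static struct bio *bio_list_pop(struct bio_list *bl)
{
	struct bio *bio = bl->head;

	if (bio) {
		bl->head = bio->next;
		if (!bl->head)
			bl->tail = NULL;
	}
	return bio;
}

static void submit_model(struct bio *bio);

/* stand-in for q->make_request_fn(): remap and resubmit until level 3 */
static void make_request_model(struct bio *bio)
{
	printf("handling bio at level %d\n", bio->level);
	if (bio->level < 3) {
		bio->level++;
		submit_model(bio);	/* the "recursive" call */
	} else {
		free(bio);		/* reached the bottom device */
	}
}

static void submit_model(struct bio *bio)
{
	struct bio_list bio_list_on_stack = { NULL, NULL };

	if (current_bio_list) {
		/* already inside submit_model() on this task: just queue
		 * the bio and unwind, no deeper recursion */
		bio_list_add(current_bio_list, bio);
		return;
	}

	current_bio_list = &bio_list_on_stack;
	do {
		make_request_model(bio);
		/* only bios queued by THIS task show up here; a bio queued
		 * on another task's list never will, so this pop can
		 * legitimately return NULL */
		bio = bio_list_pop(&bio_list_on_stack);
	} while (bio);
	current_bio_list = NULL;
}

int main(void)
{
	struct bio *bio = calloc(1, sizeof(*bio));

	bio->level = 1;
	submit_model(bio);
	return 0;
}

Running it prints the three levels being handled one after another from
the same loop, with no stack recursion; in the real code the other task's
queue simply is not this list, which is why the pop sees NULL.
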
>
>> 5. remap bio1-level2 to level3 disk, and queue bio1-level2-level3 in current->bio_list
>
> bio2_level2 is at the head of conf->pending_bio_list or plug->pending, so
> bio2_level2 is handled first.
>
> level1 raid1d calls level2 generic_make_request(), then level2
> raid1_make_request() is called, then level2 raid1_write_request().
> bio2_level2 is remapped to bio2_level3 and added to plug->pending (in level1
> raid1d's context) or to conf->pending_bio_list (of level2 raid1's conf); it
> will be handled by level2 raid1d when level2 raid1d wakes up.
> Then control returns to level1 raid1d, and bio1_level2
> is handled by level2 generic_make_request() and added to level2
> plug->pending or conf->pending_bio_list. In this case neither
> bio2_level2 nor bio1_level2 is added to any bio_list_on_stack.
>
> Then level1 raid1d handles all bios in level1 conf->pending_bio_list,
> and sleeps.
>
> Then level2 raid1d wakes up and handles bio2_level3 and bio1_level3 by
> iterating level2 plug->pending or conf->pending_bio_list and calling
> level3 generic_make_request().
>
> In level3 generic_make_request(), because this is level2 raid1d's context,
> not level1 raid1d's context, bio2_level3 is sent into
> q->make_request_fn() and finally added to level3 plug->pending or
> conf->pending_bio_list; then control returns to level3 generic_make_request().
>
> Now level2 raid1d's current->bio_list is empty, so level3
> generic_make_request() returns to level2 raid1d, which continues to iterate
> and sends bio1_level3 into level3 generic_make_request().
>
> After all bios are added into level3 plug->pending or
> conf->pending_bio_list, level2 raid1d sleeps.
>
> Now level3 raid1d wakes up and continues to iterate level3 plug->pending or
> conf->pending_bio_list, calling generic_make_request() on the underlying
> devices (which might be real disks).
>
> In the whole path above, each lower level generic_make_request() is
> called in the context of the lower level raid1d. No recursive call happens
> in the normal code path.
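
And the stacked walkthrough above is that same hand-off repeated once per
level: each level has its own pending list and its own raid1d, so every
hop downwards runs in a different thread's context and nothing recurses.
A toy model of the cascade follows (again not md code; one worker thread
per level stands in for each level's raid1d, and the per-level arrays
stand in for conf->pending_bio_list; build with "cc -pthread").

/*
 * Toy model of stacked raid1 levels: each level has its own pending
 * list and its own "raid1d" worker, and a bio walks down one level per
 * worker context.  Not real md code.
 */
#include <pthread.h>
#include <stdio.h>

#define LEVELS 3		/* level1 -> level2 -> level3 */
#define NBIOS  2

struct level {
	int pending[NBIOS];	/* ids of queued bios */
	int count;
	pthread_mutex_t lock;
	pthread_cond_t wake;
};

static struct level levels[LEVELS + 1];	/* index 1..LEVELS */

static void queue_bio(int lvl, int id)
{
	struct level *l = &levels[lvl];

	pthread_mutex_lock(&l->lock);
	l->pending[l->count++] = id;
	pthread_mutex_unlock(&l->lock);
	pthread_cond_signal(&l->wake);
}

/* worker for one level: waits for its bios, then pushes them down */
static void *raid1d_model(void *arg)
{
	int lvl = *(int *)arg;
	struct level *l = &levels[lvl];
	int i;

	pthread_mutex_lock(&l->lock);
	while (l->count < NBIOS)
		pthread_cond_wait(&l->wake, &l->lock);
	pthread_mutex_unlock(&l->lock);

	for (i = 0; i < NBIOS; i++) {
		if (lvl < LEVELS) {
			printf("level%d raid1d: remap bio%d and queue it for level%d\n",
			       lvl, l->pending[i], lvl + 1);
			queue_bio(lvl + 1, l->pending[i]);
		} else {
			printf("level%d raid1d: issue bio%d to the real disk\n",
			       lvl, l->pending[i]);
		}
	}
	return NULL;
}

int main(void)
{
	pthread_t thr[LEVELS + 1];
	int ids[LEVELS + 1];
	int lvl;

	for (lvl = 1; lvl <= LEVELS; lvl++) {
		pthread_mutex_init(&levels[lvl].lock, NULL);
		pthread_cond_init(&levels[lvl].wake, NULL);
		ids[lvl] = lvl;
		pthread_create(&thr[lvl], NULL, raid1d_model, &ids[lvl]);
	}

	/* the original submitter only queues into level1 and returns; it
	 * never issues anything to level2 or level3 itself */
	queue_bio(1, 1);
	queue_bio(1, 2);

	for (lvl = 1; lvl <= LEVELS; lvl++)
		pthread_join(thr[lvl], NULL);
	return 0;
}
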
>
> In the raid1 code, a recursive call of generic_make_request() only happens
> for a READ bio, but if the array is not frozen, no barrier is required, so
> it doesn't hurt.
>
>
>> 6. generic_make_request then pops bio2, but bio1 hasn't finished yet, deadlock
>
> As I understand the code, it won't happen either.
>
>>
>> The problem is because we add new bio to current->bio_list tail.
>
> New bios are added to another context's current->bio_list, which is a
> different list. If my understanding is correct, a deadlock won't
> happen this way.
>
> If my understanding is correct, I suddenly realize why raid1
> bios are handled indirectly in another kernel thread.
>
> (Just for your information, while I was writing this, another test run
> finished, with no deadlock. This time I reduced the I/O barrier bucket
> unit size to 512KB and set the block size to 33MB in the fio job file. It is
> really slow (130MB/s), but no deadlock was observed.)
>
>
> The stacked raid1 devices are really, really confusing; if I am wrong, any
> hint is warmly welcome.
>
>>
>>> =============== P.S ==============
>>> When I ran the stacked raid1 testing, I felt I saw some suspicious
>>> behavior; it is resync.
>>>
>>> The second time, when I rebuilt all the raid1 devices with "mdadm -C
>>> /dev/mdXX -l 1 -n 2 /dev/xxx /dev/xxx", I saw that the top level raid1
>>> device /dev/md40 had already completed 50%+ of its resync. I don't think
>>> it could be that fast...
>>
>> no idea, is this reproducible?
>
> It can be reliably reproduced. I need to check whether the bitmap is cleared
> when creating a stacked raid1. This is a little off topic for this thread;
> once I have some idea, I will start a separate thread. Hopefully resync
> really is just that fast.
>
> Coly
> --
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


