On Fri, Feb 24, 2017 at 03:31:16AM +0800, Coly Li wrote:
> On 2017/2/24 at 1:34 AM, Shaohua Li wrote:
> > On Thu, Feb 23, 2017 at 01:54:47PM +0800, Coly Li wrote:
> [snip]
> >>>>>> As r1bio_pool preallocates 256 entries, this is unlikely but not
> >>>>>> impossible. If 256 threads all attempt a write (or read) that
> >>>>>> crosses a boundary, then they will consume all 256 preallocated
> >>>>>> entries, and want more. If there is no free memory, they will block
> >>>>>> indefinitely.
> >>>>>>
> >>>>>
> >>>>> If raid1_make_request() is modified in this way,
> >>>>> +        if (bio_data_dir(split) == READ)
> >>>>> +                raid1_read_request(mddev, split);
> >>>>> +        else
> >>>>> +                raid1_write_request(mddev, split);
> >>>>> +        if (split != bio)
> >>>>> +                generic_make_request(bio);
> >>>>>
> >>>>> then the original bio will be added into the bio_list_on_stack of the
> >>>>> top level generic_make_request(), where current->bio_list is
> >>>>> initialized; when generic_make_request() is called nested inside
> >>>>> raid1_make_request(), the split bio will be added into
> >>>>> current->bio_list and nothing else happens.
> >>>>>
> >>>>> After the nested generic_make_request() returns, execution goes back
> >>>>> to the following lines of generic_make_request():
> >>>>> 2022                 ret = q->make_request_fn(q, bio);
> >>>>> 2023
> >>>>> 2024                 blk_queue_exit(q);
> >>>>> 2025
> >>>>> 2026                 bio = bio_list_pop(current->bio_list);
> >>>>>
> >>>>> bio_list_pop() will return the second half of the split bio, and it is
> >>>>
> >>>> So in the above sequence, current->bio_list will hold bios in the
> >>>> following order:
> >>>> bios to underlying disks, second half of the original bio
> >>>>
> >>>> bio_list_pop() will pop the bios to the underlying disks first, handle
> >>>> them, then the second half of the original bio.
> >>>>
> >>>> That said, this doesn't work for an array stacked 3 layers deep,
> >>>> because in a 3-layer array, handling the middle layer bio will make the
> >>>> 3rd layer bio hold to bio_list again.
> >>>>
> >>>
> >>> Could you please give me more of a hint:
> >>> - What is the meaning of "hold" in "make the 3rd layer bio hold to
> >>>   bio_list again"?
> >>> - Why does a deadlock happen if the 3rd layer bio is held on the
> >>>   bio_list again?
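
(For reference, the drain loop being discussed above can be modeled in
user space roughly as below. The names mimic generic_make_request(),
current->bio_list and bio_list_pop() from the quoted code, but this is
only a sketch of the queuing order, not the real block layer source.)

/* Toy model of the generic_make_request() drain loop quoted above
 * (lines 2022-2026): a nested call only queues the bio on
 * current->bio_list; the top-level loop pops and handles it later. */
#include <stdio.h>
#include <string.h>

struct bio { const char *name; struct bio *next; };
struct bio_list { struct bio *head, *tail; };

static struct bio_list *current_bio_list;    /* models current->bio_list */

static void bio_list_add(struct bio_list *bl, struct bio *bio)
{
        bio->next = NULL;
        if (bl->tail)
                bl->tail->next = bio;
        else
                bl->head = bio;
        bl->tail = bio;
}

static struct bio *bio_list_pop(struct bio_list *bl)
{
        struct bio *bio = bl->head;

        if (bio) {
                bl->head = bio->next;
                if (!bl->head)
                        bl->tail = NULL;
        }
        return bio;
}

static void generic_make_request(struct bio *bio);

/* Stand-in for q->make_request_fn: the "orig" bio is split, the first
 * half is remapped to two child bios, and the second half is
 * re-submitted with generic_make_request(), like the patch hunk above. */
static void make_request_fn(struct bio *bio)
{
        static struct bio child_a = { "child-of-first-half-A" };
        static struct bio child_b = { "child-of-first-half-B" };
        static struct bio second  = { "second-half-of-orig" };

        printf("handling %s\n", bio->name);
        if (strcmp(bio->name, "orig") == 0) {
                generic_make_request(&child_a); /* remapped first half  */
                generic_make_request(&child_b);
                generic_make_request(&second);  /* "if (split != bio)"  */
        }
}

static void generic_make_request(struct bio *bio)
{
        struct bio_list bio_list_on_stack = { NULL, NULL };

        if (current_bio_list) {          /* nested call: queue only     */
                bio_list_add(current_bio_list, bio);
                return;
        }
        current_bio_list = &bio_list_on_stack;
        do {
                make_request_fn(bio);    /* may recurse into us         */
                bio = bio_list_pop(current_bio_list);
        } while (bio);
        current_bio_list = NULL;
}

int main(void)
{
        struct bio orig = { "orig" };

        generic_make_request(&orig);
        return 0;
}

(Running it handles "orig", then both children of the first half, then
"second-half-of-orig": the bios to the underlying disks are popped before
the second half of the original bio, which is the ordering the discussion
above relies on.)
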
> >>
> >> I tried to set up a 4 layer stacked md raid1 and reduced the I/O barrier
> >> bucket size to 8MB; running for 10 hours, there was no deadlock observed.
> >>
> >> Here is how the 4 layer stacked raid1 is set up,
> >> - There are 4 NVMe SSDs; on each SSD I create four 500GB partitions,
> >>   /dev/nvme0n1: nvme0n1p1, nvme0n1p2, nvme0n1p3, nvme0n1p4
> >>   /dev/nvme1n1: nvme1n1p1, nvme1n1p2, nvme1n1p3, nvme1n1p4
> >>   /dev/nvme2n1: nvme2n1p1, nvme2n1p2, nvme2n1p3, nvme2n1p4
> >>   /dev/nvme3n1: nvme3n1p1, nvme3n1p2, nvme3n1p3, nvme3n1p4
> >> - Here is how the 4 layer stacked raid1 is assembled; level 1 means the
> >>   top level, level 4 means the bottom level in the stacked devices,
> >>   - level 1:
> >>     /dev/md40: /dev/md30 /dev/md31
> >>   - level 2:
> >>     /dev/md30: /dev/md20 /dev/md21
> >>     /dev/md31: /dev/md22 /dev/md23
> >>   - level 3:
> >>     /dev/md20: /dev/md10 /dev/md11
> >>     /dev/md21: /dev/md12 /dev/md13
> >>     /dev/md22: /dev/md14 /dev/md15
> >>     /dev/md23: /dev/md16 /dev/md17
> >>   - level 4:
> >>     /dev/md10: /dev/nvme0n1p1 /dev/nvme1n1p1
> >>     /dev/md11: /dev/nvme2n1p1 /dev/nvme3n1p1
> >>     /dev/md12: /dev/nvme0n1p2 /dev/nvme1n1p2
> >>     /dev/md13: /dev/nvme2n1p2 /dev/nvme3n1p2
> >>     /dev/md14: /dev/nvme0n1p3 /dev/nvme1n1p3
> >>     /dev/md15: /dev/nvme2n1p3 /dev/nvme3n1p3
> >>     /dev/md16: /dev/nvme0n1p4 /dev/nvme1n1p4
> >>     /dev/md17: /dev/nvme2n1p4 /dev/nvme3n1p4
> >>
> >> Here is the fio job file,
> >> [global]
> >> direct=1
> >> thread=1
> >> ioengine=libaio
> >>
> >> [job]
> >> filename=/dev/md40
> >> readwrite=write
> >> numjobs=10
> >> blocksize=33M
> >> iodepth=128
> >> time_based=1
> >> runtime=10h
> >>
> >> I planned to learn how the deadlock comes about by analyzing a deadlocked
> >> condition. Maybe it was because the 8MB bucket unit size is small enough;
> >> now I will try to run with a 512KB bucket unit size and see whether I can
> >> encounter a deadlock.
> >
> > Don't think raid1 could easily trigger the deadlock. Maybe you should try
> > raid10. The resync case is hard to trigger for raid1. The memory pressure case
> > is hard to trigger for both raid1/10. But it's possible to trigger.
> >
> > The 3-layer case is something like this:
>
> Hi Shaohua,
>
> I am trying to catch up with you; let me try to follow your reasoning for
> the split-in-while-loop condition (this is my new I/O barrier patch). I
> assume the original BIO is a write bio, and the original bio is split and
> handled in a while loop in raid1_make_request().
>
> > 1. in level1, set current->bio_list, split bio to bio1 and bio2
>
> This is done in level1 raid1_make_request().
>
> > 2. remap bio1 to level2 disk, and queue bio1-level2 in current->bio_list
>
> The remap is done by raid1_write_request(), and bio1_level2 may be added
> into one of two lists:
> - plug->pending:
>   bios in plug->pending may be handled in raid1_unplug(), or in
>   flush_pending_writes() of raid1d().
>   If the current task is about to be scheduled, raid1_unplug() will merge
>   plug->pending's bios into conf->pending_bio_list, and
>   conf->pending_bio_list will be handled in raid1d.
>   If raid1_unplug() is triggered by blk_finish_plug(), it is also
>   handled in raid1d.
>
> - conf->pending_bio_list:
>   bios in this list are handled in raid1d by calling flush_pending_writes().
>
> So generic_make_request() for bio1_level2 can only be called in the
> context of the raid1d thread; bio1_level2 is added into raid1d's
> bio_list_on_stack, not that of the caller of level1 generic_make_request().
>
> > 3. queue bio2 in current->bio_list
>
> Same here, bio2_level2 ends up in level1 raid1d's bio_list_on_stack.
> Then we are back in level1 generic_make_request().
>
> > 4. generic_make_request then pops bio1-level2
>
> At this moment, bio1_level2 and bio2_level2 are in either plug->pending
> or conf->pending_bio_list, bio_list_pop() returns NULL, and level1
> generic_make_request() returns to its caller.
>
> If, before bio_list_pop() is called, the kernel thread raid1d wakes up and
> iterates conf->pending_bio_list in flush_pending_writes(), or iterates
> plug->pending in raid1_unplug() via blk_finish_plug(), that happens on
> level1 raid1d's stack; the bios will not show up in level1
> generic_make_request(), and bio_list_pop() still returns NULL.
>
> > 5. remap bio1-level2 to level3 disk, and queue bio1-level2-level3 in current->bio_list
>
> bio2_level2 is at the head of conf->pending_bio_list or plug->pending, so
> bio2_level2 is handled first.
>
> level1 raid1 calls level2 generic_make_request(), then level2
> raid1_make_request() is called, then level2 raid1_write_request().
> bio2_level2 is remapped to bio2_level3 and added into plug->pending (in
> level1 raid1d's context) or conf->pending_bio_list (level2 raid1's conf);
> it will be handled by level2 raid1d when level2 raid1d wakes up.
> Then control returns back to level1 raid1, and bio1_level2
> is handled by level2 generic_make_request() and added into level2
> plug->pending or conf->pending_bio_list. In this case neither
> bio2_level2 nor bio1_level2 is added into any bio_list_on_stack.
>
> Then level1 raid1d handles all bios in level1 conf->pending_bio_list,
> and sleeps.
>
> Then level2 raid1d wakes up and handles bio2_level3 and bio1_level3, by
> iterating level2 plug->pending or conf->pending_bio_list and calling
> level3 generic_make_request().
>
> In level3 generic_make_request(), because this is level2 raid1d's context,
> not level1 raid1d's context, bio2_level3 is sent into
> q->make_request_fn() and finally added into level3 plug->pending or
> conf->pending_bio_list, then control is back in level3
> generic_make_request().
>
> Now level2 raid1d's current->bio_list is empty, so level3
> generic_make_request() returns to level2 raid1d, which continues to
> iterate and send bio1_level3 into level3 generic_make_request().
>
> After all bios are added into level3 plug->pending or
> conf->pending_bio_list, level2 raid1d sleeps.
>
> Now level3 raid1d wakes up and continues to iterate level3 plug->pending
> or conf->pending_bio_list, calling generic_make_request() on the
> underlying devices (which might be real devices).
>
> Along the above whole path, each lower level generic_make_request() is
> called in the context of the lower level raid1d. No recursive call happens
> on the normal code path.
>
> In the raid1 code, a recursive call of generic_make_request() only happens
> for a READ bio, but if the array is not frozen, no barrier is required, so
> it doesn't hurt.
>
> > 6. generic_make_request then pops bio2, but bio1 hasn't finished yet, deadlock
>
> As I understand the code, that won't happen either.
>
> >
> > The problem is because we add new bio to current->bio_list tail.
>
> New bios are added into other contexts' current->bio_list, which are
> different lists. If what I understand is correct, a deadlock won't
> happen this way.
>
> If my understanding is correct, I suddenly come to realize why raid1
> bios are handled indirectly in another kernel thread.
>
> (Just for your information, by the time I wrote this part, another run
> of testing had finished with no deadlock. This time I reduced the I/O
> barrier bucket unit size to 512KB, and set blocksize to 33MB in the fio
> job file.
> It is really slow (130MB/s), but no deadlock was observed.)
>
> The stacked raid1 devices are really, really confusing; if I am wrong, any
> hint is warmly welcome.

Aha, you are correct. I missed that we never directly dispatch bios in a
schedule-based blk-plug flush. I'll drop the patch. Thanks for the
insistence, good discussion!

Thanks,
Shaohua
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
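
(Appendix: the write-path hand-off described above, i.e. raid1_write_request()
never submitting the remapped bio from the caller's context but parking it on
plug->pending / conf->pending_bio_list for the raid1d thread to submit later
from its own stack, can be modeled in user space roughly as below. The struct
and function names are borrowed from the md raid1 code for readability; this
is only an illustration of the hand-off, not the real implementation, and the
thread switch is collapsed into two ordinary function calls.)

/* Toy model: the make_request context only queues WRITE bios; a separate
 * raid1d context drains the pending list and submits to the lower level. */
#include <stdio.h>
#include <stddef.h>

struct bio { const char *name; struct bio *next; };

struct r1conf {
        struct bio *pending_head, *pending_tail; /* conf->pending_bio_list */
};

/* Called from raid1_make_request() context: queue only, never dispatch.
 * (In the kernel the bio may first sit on plug->pending and be merged
 * into conf->pending_bio_list by raid1_unplug().) */
static void raid1_write_request(struct r1conf *conf, struct bio *remapped)
{
        remapped->next = NULL;
        if (conf->pending_tail)
                conf->pending_tail->next = remapped;
        else
                conf->pending_head = remapped;
        conf->pending_tail = remapped;
        printf("caller context: queued %s on pending_bio_list\n",
               remapped->name);
        /* the real code would now wake the raid1d thread */
}

/* Stand-in for generic_make_request() towards the lower level device. */
static void submit_to_lower_level(struct bio *bio)
{
        printf("raid1d context: submits %s\n", bio->name);
}

/* Models flush_pending_writes() running in the raid1d kernel thread. */
static void raid1d_flush(struct r1conf *conf)
{
        struct bio *bio = conf->pending_head;

        conf->pending_head = conf->pending_tail = NULL;
        while (bio) {
                struct bio *next = bio->next;

                submit_to_lower_level(bio); /* uses raid1d's own bio_list */
                bio = next;
        }
}

int main(void)
{
        struct r1conf conf = { NULL, NULL };
        struct bio bio1 = { "bio1-level2" }, bio2 = { "bio2-level2" };

        raid1_write_request(&conf, &bio1); /* caller's context: queue only */
        raid1_write_request(&conf, &bio2);
        raid1d_flush(&conf);               /* later, in the raid1d thread  */
        return 0;
}

(Because the lower-level submission happens from raid1d's stack, with its
own current->bio_list, the remapped write bios never queue up behind the
not-yet-handled second half of the split bio on the original caller's
current->bio_list, which is why the walkthrough above does not deadlock.)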