On 2017/2/24 1:34 AM, Shaohua Li wrote:
> On Thu, Feb 23, 2017 at 01:54:47PM +0800, Coly Li wrote:
[snip]
>>>>>> As r1bio_pool preallocates 256 entries, this is unlikely but not
>>>>>> impossible. If 256 threads all attempt a write (or read) that
>>>>>> crosses a boundary, then they will consume all 256 preallocated
>>>>>> entries, and want more. If there is no free memory, they will block
>>>>>> indefinitely.
>>>>>>
>>>>>
>>>>> If raid1_make_request() is modified in this way,
>>>>> +	if (bio_data_dir(split) == READ)
>>>>> +		raid1_read_request(mddev, split);
>>>>> +	else
>>>>> +		raid1_write_request(mddev, split);
>>>>> +	if (split != bio)
>>>>> +		generic_make_request(bio);
>>>>>
>>>>> then the original bio will be added into the bio_list_on_stack of the top
>>>>> level generic_make_request(); current->bio_list is initialized, and when
>>>>> generic_make_request() is called nested in raid1_make_request(), the
>>>>> split bio will be added into current->bio_list and nothing else happens.
>>>>>
>>>>> After the nested generic_make_request() returns, the code goes back to
>>>>> the next line of generic_make_request(),
>>>>> 2022                 ret = q->make_request_fn(q, bio);
>>>>> 2023
>>>>> 2024                 blk_queue_exit(q);
>>>>> 2025
>>>>> 2026                 bio = bio_list_pop(current->bio_list);
>>>>>
>>>>> bio_list_pop() will return the second half of the split bio, and it is
>>>>
>>>> So in the above sequence, current->bio_list will have bios in the below
>>>> sequence:
>>>> bios to underlying disks, second half of original bio
>>>>
>>>> bio_list_pop() will pop the bios to the underlying disks first, handle
>>>> them, then the second half of the original bio.
>>>>
>>>> That said, this doesn't work for an array stacked 3 layers, because in a
>>>> 3-layer array, handling the middle layer bio will make the 3rd layer bio
>>>> hold to bio_list again.
>>>>
>>>
>>> Could you please give me more hints:
>>> - What is the meaning of "hold" in "make the 3rd layer bio hold to
>>>   bio_list again"?
>>> - Why does a deadlock happen if the 3rd layer bio holds to bio_list again?
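
(To make the loop quoted above concrete, here is a tiny self-contained
userspace model of the generic_make_request()/current->bio_list dispatch.
The struct and function names only mimic the kernel ones and the bodies are
simplified stand-ins, so please read it as an illustration of the requeue
ordering rather than as the real block layer code.)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* toy stand-ins for struct bio and struct bio_list */
struct bio { char name[32]; struct bio *next; };
struct bio_list { struct bio *head, *tail; };

/* models the per-task current->bio_list pointer */
static struct bio_list *current_bio_list;

static void bio_list_add(struct bio_list *bl, struct bio *bio)
{
	bio->next = NULL;
	if (bl->tail)
		bl->tail->next = bio;
	else
		bl->head = bio;
	bl->tail = bio;
}

static struct bio *bio_list_pop(struct bio_list *bl)
{
	struct bio *bio = bl->head;

	if (bio) {
		bl->head = bio->next;
		if (!bl->head)
			bl->tail = NULL;
	}
	return bio;
}

static struct bio *new_bio(const char *name)
{
	struct bio *bio = calloc(1, sizeof(*bio));

	snprintf(bio->name, sizeof(bio->name), "%s", name);
	return bio;
}

static void generic_make_request(struct bio *bio);

/* toy ->make_request_fn: the original bio is split into one child per mirror
 * leg plus a requeued second half; every other bio is simply completed here */
static void raid1_make_request(struct bio *bio)
{
	if (!strcmp(bio->name, "orig")) {
		generic_make_request(new_bio("split-to-leg0"));
		generic_make_request(new_bio("split-to-leg1"));
		generic_make_request(new_bio("orig-second-half"));
	} else {
		printf("handled %s\n", bio->name);
	}
}

static void generic_make_request(struct bio *bio)
{
	struct bio_list bio_list_on_stack = { NULL, NULL };

	/* nested call: only queue the bio on the caller's list and return */
	if (current_bio_list) {
		bio_list_add(current_bio_list, bio);
		return;
	}

	/* top level call: dispatch queued bios one at a time, FIFO */
	current_bio_list = &bio_list_on_stack;
	do {
		raid1_make_request(bio);	/* q->make_request_fn(q, bio) */
		free(bio);
		bio = bio_list_pop(current_bio_list);
	} while (bio);
	current_bio_list = NULL;
}

int main(void)
{
	generic_make_request(new_bio("orig"));
	return 0;
}

Running it prints the two legs of the split before the requeued second half,
which is exactly the ordering described in the quoted text.
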
>>
>> I tried to set up a 4 layer stacked md raid1 and reduce the I/O barrier
>> bucket size to 8MB; after running for 10 hours, there is no deadlock
>> observed.
>>
>> Here is how the 4 layer stacked raid1 is set up:
>> - There are 4 NVMe SSDs; on each SSD I create four 500GB partitions,
>>   /dev/nvme0n1: nvme0n1p1, nvme0n1p2, nvme0n1p3, nvme0n1p4
>>   /dev/nvme1n1: nvme1n1p1, nvme1n1p2, nvme1n1p3, nvme1n1p4
>>   /dev/nvme2n1: nvme2n1p1, nvme2n1p2, nvme2n1p3, nvme2n1p4
>>   /dev/nvme3n1: nvme3n1p1, nvme3n1p2, nvme3n1p3, nvme3n1p4
>> - Here is how the 4 layer stacked raid1 is assembled; level 1 means the
>>   top level, level 4 means the bottom level in the stacked devices,
>>   - level 1:
>>     /dev/md40: /dev/md30 /dev/md31
>>   - level 2:
>>     /dev/md30: /dev/md20 /dev/md21
>>     /dev/md31: /dev/md22 /dev/md23
>>   - level 3:
>>     /dev/md20: /dev/md10 /dev/md11
>>     /dev/md21: /dev/md12 /dev/md13
>>     /dev/md22: /dev/md14 /dev/md15
>>     /dev/md23: /dev/md16 /dev/md17
>>   - level 4:
>>     /dev/md10: /dev/nvme0n1p1 /dev/nvme1n1p1
>>     /dev/md11: /dev/nvme2n1p1 /dev/nvme3n1p1
>>     /dev/md12: /dev/nvme0n1p2 /dev/nvme1n1p2
>>     /dev/md13: /dev/nvme2n1p2 /dev/nvme3n1p2
>>     /dev/md14: /dev/nvme0n1p3 /dev/nvme1n1p3
>>     /dev/md15: /dev/nvme2n1p3 /dev/nvme3n1p3
>>     /dev/md16: /dev/nvme0n1p4 /dev/nvme1n1p4
>>     /dev/md17: /dev/nvme2n1p4 /dev/nvme3n1p4
>>
>> Here is the fio job file,
>> [global]
>> direct=1
>> thread=1
>> ioengine=libaio
>>
>> [job]
>> filename=/dev/md40
>> readwrite=write
>> numjobs=10
>> blocksize=33M
>> iodepth=128
>> time_based=1
>> runtime=10h
>>
>> I planned to learn how the deadlock comes about by analyzing a deadlock
>> condition. Maybe it is because the 8MB bucket unit size is not small
>> enough; now I will try to run with a 512KB bucket unit size and see
>> whether I can encounter a deadlock.
>
> Don't think raid1 could easily trigger the deadlock. Maybe you should try
> raid10. The resync case is hard to trigger for raid1. The memory pressure case
> is hard to trigger for both raid1/10. But it's possible to trigger.
>
> The 3-layer case is something like this:

Hi Shaohua,

I am trying to catch up with you; let me follow your reasoning through the
split-in-while-loop condition (this is my new I/O barrier patch). I assume
the original bio is a write bio, and that the original bio is split and
handled in a while loop in raid1_make_request().

> 1. in level1, set current->bio_list, split bio to bio1 and bio2

This is done in the level1 raid1_make_request().

> 2. remap bio1 to level2 disk, and queue bio1-level2 in current->bio_list

The remap is done by raid1_write_request(), and bio1_level2 may be added to
one of two lists:
- plug->pending: bios in plug->pending may be handled in raid1_unplug(), or
  in flush_pending_writes() of raid1d(). If the current task is about to be
  scheduled out, raid1_unplug() will merge the bios of plug->pending into
  conf->pending_bio_list, and conf->pending_bio_list will be handled in
  raid1d. If raid1_unplug() is triggered by blk_finish_plug(), it is also
  handled in raid1d.
- conf->pending_bio_list: bios in this list are handled in raid1d by calling
  flush_pending_writes().

So the generic_make_request() that handles bio1_level2 can only be called in
the context of the raid1d thread; bio1_level2 is added to raid1d's
bio_list_on_stack, not to that of the caller of the level1
generic_make_request().

> 3. queue bio2 in current->bio_list

The same applies: bio2_level2 ends up on the level1 raid1d's
bio_list_on_stack. Then we are back in the level1 generic_make_request().
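
(Again purely as an illustration of the hand-off described above: in the
sketch below the writer only queues the remapped bio, and a dedicated
raid1d-like thread takes the pending list and performs the actual submission
on its own stack. The names pending list, device_lock, raid1d and so on just
mirror the kernel ones; the code is a toy model under those assumptions,
compiled with -pthread, not md code.)

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct bio { char name[48]; struct bio *next; };

/* models conf->pending_bio_list, conf->device_lock and the raid1d wakeup */
static struct bio *pending_head, *pending_tail;
static pthread_mutex_t device_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t wakeup = PTHREAD_COND_INITIALIZER;
static int stopping;

/* the writer does not submit: it only appends to the pending list and
 * wakes the raid1d thread (md_wakeup_thread() in the real code) */
static void raid1_write_request(const char *name)
{
	struct bio *bio = calloc(1, sizeof(*bio));

	snprintf(bio->name, sizeof(bio->name), "%s", name);
	pthread_mutex_lock(&device_lock);
	if (pending_tail)
		pending_tail->next = bio;
	else
		pending_head = bio;
	pending_tail = bio;
	pthread_mutex_unlock(&device_lock);
	pthread_cond_signal(&wakeup);
}

/* models raid1d() + flush_pending_writes(): the submission to the lower
 * level happens on this thread's stack, not on the original writer's */
static void *raid1d(void *unused)
{
	(void)unused;
	for (;;) {
		pthread_mutex_lock(&device_lock);
		while (!pending_head && !stopping)
			pthread_cond_wait(&wakeup, &device_lock);
		struct bio *bio = pending_head;
		pending_head = pending_tail = NULL;
		pthread_mutex_unlock(&device_lock);

		while (bio) {
			struct bio *next = bio->next;

			printf("raid1d submits %s to the lower level\n",
			       bio->name);
			free(bio);
			bio = next;
		}
		if (stopping)
			return NULL;
	}
}

int main(void)
{
	pthread_t thread;

	pthread_create(&thread, NULL, raid1d, NULL);
	raid1_write_request("bio1_level2");	/* the caller only queues */
	raid1_write_request("bio2_level2");

	pthread_mutex_lock(&device_lock);
	stopping = 1;
	pthread_mutex_unlock(&device_lock);
	pthread_cond_signal(&wakeup);
	pthread_join(thread, NULL);
	return 0;
}

Whatever the raid1d-like thread submits here lands on its own stack and, in
the kernel case, on its own bio_list_on_stack, which is the point of the
explanation above.
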
> 4. generic_make_request then pops bio1-level2

At this moment, bio1_level2 and bio2_level2 are in either plug->pending or
conf->pending_bio_list, bio_list_pop() returns NULL, and the level1
generic_make_request() returns to its caller. If, before bio_list_pop() is
called, the kernel thread raid1d wakes up and iterates conf->pending_bio_list
in flush_pending_writes(), or iterates plug->pending in raid1_unplug() via
blk_finish_plug(), that happens on the level1 raid1d's stack; those bios will
not show up in the level1 generic_make_request(), and bio_list_pop() still
returns NULL.

> 5. remap bio1-level2 to level3 disk, and queue bio1-level2-level3 in
> current->bio_list

bio2_level2 is at the head of conf->pending_bio_list or plug->pending, so
bio2_level2 is handled first. The level1 raid1d calls the level2
generic_make_request(), then the level2 raid1_make_request() is called, then
the level2 raid1_write_request(). bio2_level2 is remapped to bio2_level3 and
added to plug->pending (in the level1 raid1d's context) or to
conf->pending_bio_list (of the level2 raid1's conf); it will be handled by
the level2 raid1d when the level2 raid1d wakes up. Then control returns to
the level1 raid1d, and bio1_level2 is handled by the level2
generic_make_request() and added to the level2 plug->pending or
conf->pending_bio_list. In this case neither bio2_level2 nor bio1_level2 is
added to any bio_list_on_stack.

Then the level1 raid1d handles all bios in the level1 conf->pending_bio_list,
and sleeps. Then the level2 raid1d wakes up and handles bio2_level3 and
bio1_level3, by iterating the level2 plug->pending or conf->pending_bio_list
and calling the level3 generic_make_request(). In the level3
generic_make_request(), because this is the level2 raid1d's context, not the
level1 raid1d's context, bio2_level3 is sent into q->make_request_fn() and
finally added to the level3 plug->pending or conf->pending_bio_list, then
control goes back to the level3 generic_make_request(). Now the level2
raid1d's current->bio_list is empty, so the level3 generic_make_request()
returns to the level2 raid1d, which continues to iterate and sends
bio1_level3 into the level3 generic_make_request(). After all bios are added
to the level3 plug->pending or conf->pending_bio_list, the level2 raid1d
sleeps.

Now the level3 raid1d wakes up and continues to iterate the level3
plug->pending or conf->pending_bio_list, calling generic_make_request() to
the underlying devices (which might be real devices). Along this whole path,
each lower level generic_make_request() is called in the context of the lower
level raid1d; no recursive call happens in the normal code path. In the raid1
code, a recursive call of generic_make_request() only happens for a READ bio,
but if the array is not frozen, no barrier is required, so it doesn't hurt.

> 6. generic_make_request then pops bio2, but bio1 hasn't finished yet,
> deadlock

As I understand the code, this won't happen either.

>
> The problem is because we add new bio to current->bio_list tail.

New bios are added to another context's current->bio_list, which is a
different list. If my understanding is correct, a deadlock won't happen in
this way. And if my understanding is correct, I suddenly realize why raid1
bios are handled indirectly in another kernel thread.

(Just for your information: by the time I wrote up to this point, another run
of the testing had finished, with no deadlock. This time I reduced the I/O
barrier bucket unit size to 512KB and set blocksize to 33MB in the fio job
file. It is really slow (130MB/s), but no deadlock is observed.)

The stacked raid1 devices are really confusing; if I am wrong, any hint is
warmly welcomed.
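
(A last toy example for the "different lists" point above: current->bio_list
is a per-task field, so whatever a lower level raid1d queues on its own
on-stack list is invisible to the task that originally submitted the I/O. The
thread-local variable below is a made-up stand-in for that per-task pointer,
not how the kernel spells it; compile with -pthread.)

#include <pthread.h>
#include <stdio.h>

/* made-up stand-in for the per-task current->bio_list pointer */
static _Thread_local const char *current_bio_list;

static void show(const char *who)
{
	printf("%s: current->bio_list = %s\n",
	       who, current_bio_list ? current_bio_list : "NULL");
}

/* models a lower level raid1d thread entering its own top level
 * generic_make_request(): it gets its own bio_list_on_stack */
static void *raid1d_level2(void *unused)
{
	(void)unused;
	show("level2 raid1d (before)");
	current_bio_list = "level2 raid1d's bio_list_on_stack";
	show("level2 raid1d (after)");
	return NULL;
}

int main(void)
{
	pthread_t thread;

	current_bio_list = "submitter's bio_list_on_stack";
	show("submitter (before)");

	pthread_create(&thread, NULL, raid1d_level2, NULL);
	pthread_join(thread, NULL);

	/* the raid1d thread never touched the submitter's list */
	show("submitter (after)");
	return 0;
}

Each thread only ever sees its own value, which is why bios queued by the
level2 raid1d never end up on the level1 submitter's bio_list_on_stack.
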
>
>> =============== P.S ==============
>> When I ran the stacked raid1 testing, I felt I saw some suspicious
>> behavior: the resync.
>>
>> The second time, when I rebuilt all the raid1 devices with "mdadm -C
>> /dev/mdXX -l 1 -n 2 /dev/xxx /dev/xxx", I saw the top level raid1 device
>> /dev/md40 had already accomplished 50%+ of its resync. I don't think it
>> could be that fast...
>
> no idea, is this reproducible?

It can be stably reproduced. I need to check whether the bitmap is cleaned
when a stacked raid1 is created. This is a little off topic in this thread;
once I have some idea, I will start another thread about it. Hopefully it
really is just that fast.

Coly