On 2017/2/22 4:09 AM, Coly Li wrote:
> On 2017/2/22 1:45 AM, Shaohua Li wrote:
>> On Tue, Feb 21, 2017 at 05:45:53PM +0800, Coly Li wrote:
>>> On 2017/2/21 8:29 AM, NeilBrown wrote:
>>>> On Mon, Feb 20 2017, Coly Li wrote:
>>>>
>>>>>> On 2017/2/20 3:04 PM, Shaohua Li <shli@xxxxxxxxxx> wrote:
>>>>>>
>>>>>>> On Mon, Feb 20, 2017 at 01:51:22PM +1100, Neil Brown wrote:
>>>>>>>> On Mon, Feb 20 2017, NeilBrown wrote:
>>>>>>>>
>>>>>>>>> On Fri, Feb 17 2017, Coly Li wrote:
>>>>>>>>>
>>>>>>>>>> On 2017/2/16 3:04 PM, NeilBrown wrote:
>>>>>>>>>> I know you are going to change this as Shaohua wants the splitting to happen in a separate function, which I agree with, but there is something else wrong here. Calling bio_split/bio_chain repeatedly in a loop is dangerous. It is OK for simple devices, but when one request can wait for another request to the same device it can deadlock. This can happen with raid1. If a resync request calls raise_barrier() between one request and the next, then the next has to wait for the resync request, which has to wait for the first request. As the first request will be stuck in the queue in generic_make_request(), you get a deadlock.
>>>>>>>>>
>>>>>>>>> For md raid1, the queue in generic_make_request(): can I understand it as bio_list_on_stack in this function? And the queue in the underlying device: can I understand it as data structures like plug->pending and conf->pending_bio_list?
>>>>>>>>
>>>>>>>> Yes, the queue in generic_make_request() is the bio_list_on_stack. That is the only queue I am talking about. I'm not referring to plug->pending or conf->pending_bio_list at all.
>>>>>>>>
>>>>>>>>> I still don't get the point of the deadlock, so let me try to explain why I don't see the possible deadlock. If a bio is split, and the first part is processed by make_request_fn(), and then a resync comes and it raises a barrier, there are 3 possible conditions:
>>>>>>>>> - the resync I/O tries to raise the barrier on the same bucket as the first regular bio. Then the resync task has to wait until the first bio drops its conf->nr_pending[idx]
>>>>>>>>
>>>>>>>> Not quite. First, the resync task (in raise_barrier()) will wait for ->nr_waiting[idx] to be zero. We can assume this happens immediately. Then the resync task will increment ->barrier[idx]. Only then will it wait for the first bio to drop ->nr_pending[idx]. The processing of that first bio will have submitted bios to the underlying device, and they will be in the bio_list_on_stack queue, and will not be processed until raid1_make_request() completes.
>>>>>>>>
>>>>>>>> The loop in raid1_make_request() will then call make_request_fn(), which will call wait_barrier(), which will wait for ->barrier[idx] to be zero.
>>>>>>>
>>>>>>> Thinking more carefully about this... the 'idx' that the second bio will wait for will normally be different, so there won't be a deadlock after all.
>>>>>>>
>>>>>>> However it is possible for hash_long() to produce the same idx for two consecutive barrier_units, so there is still the possibility of a deadlock, though it isn't as likely as I thought at first.
>>>>>>
>>>>>> Wrapped the function pointer issue Neil pointed out into Coly's original patch. Also fixed a 'use-after-free' bug. For the deadlock issue, I'll add the patch below, please check.
>>>>>>
>>>>>> Thanks,
>>>>>> Shaohua
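For reference, the bucket index ('idx') being discussed is derived roughly as in the sketch below. The constants follow the parameters mentioned in this thread (64MB barrier units hashed into 1024 buckets); this is only an illustration, and the actual barrier-bucket patch may differ in detail.

#include <linux/types.h>
#include <linux/hash.h>

/* 64MB barrier unit expressed in 512-byte sectors: 2^17 sectors. */
#define BARRIER_UNIT_SECTOR_BITS	17
/* 1024 barrier buckets, as discussed in this thread. */
#define BARRIER_BUCKETS_NR_BITS		10
#define BARRIER_BUCKETS_NR		(1 << BARRIER_BUCKETS_NR_BITS)

/* Map a bio's start sector to the barrier bucket that protects it. */
static inline int sector_to_idx(sector_t sector)
{
	/* sector >> 17 is the 64MB barrier-unit number; hash_long()
	 * folds that unit number into one of the 1024 buckets. */
	return hash_long(sector >> BARRIER_UNIT_SECTOR_BITS,
			 BARRIER_BUCKETS_NR_BITS);
}

Two bios in adjacent 64MB units normally map to different idx values, but because hash_long() behaves like a random function, roughly one adjacent pair in 1024 will collide, which is the residual deadlock window Neil points out above.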
>>>>>
>>>
>>> Neil,
>>>
>>> Thanks for your patient explanation, I feel I am starting to follow what you mean. Let me try to re-tell what I understand; correct me if I am wrong.
>>>
>>>>> Hmm, please hold, I am still thinking about it. With the barrier buckets and hash_long(), I don't see the deadlock yet. For raid10 it might happen, but once we have barrier buckets on it, there will be no deadlock.
>>>>>
>>>>> My question is, this deadlock only happens when a big bio is split, the split small bios are contiguous, and the resync I/O visits the barrier buckets in sequential order too. If adjacent split regular bios or resync bios hit the same barrier bucket, it would be a very big failure of the hash design, and it should have been found already. But no one has complained about it, so I cannot convince myself the deadlock is real with the I/O barrier buckets (this is what Neil is concerned about).
>>>>
>>>> I think you are wrong about the design goal of a hash function. When fed a sequence of inputs, with any stride (i.e. with any constant difference between consecutive inputs), the output of the hash function should appear to be random. A random sequence can produce the same number twice in a row. If the hash function produces a number from 0 to N-1, you would expect two consecutive outputs to be the same about once every N inputs.
>>>
>>> Yes, you are right. But when I mentioned hash conflicts, I limited the integers to the range [0, 1<<38]. 38 is (64-17-9): when a 64-bit LBA address is divided by the 64MB I/O barrier unit size, its value range is reduced to [0, 1<<38].
>>>
>>> The maximum size of a normal bio is 1MB, so it can be split into 2 bios at most. For a DISCARD bio, the maximum size is 4GB, so it can be split into 65 bios at most.
>>>
>>> Then in this patch, the hash question is reduced to: for any 65 consecutive integers in the range [0, 1<<38], when hash_long() maps these 65 integers into the range [0, 1023], will any hash conflict happen among them?
>>>
>>> I tried a half range [0, 1<<37] to check for hash conflicts, by writing a simple program to emulate the hash calculation in the new I/O barrier patch and iterate over all runs of {2, 65, 128, 512} consecutive integers in the range [0, 1<<37] looking for conflicts.
>>>
>>> On a 20-core CPU each run took 7+ hours; in the end I found no hash conflict up to 512 consecutive integers under the above conditions. For 1024, a lot of hash conflicts were detected.
>>>
>>> The [0, 1<<37] range maps back to a 63-bit LBA range, which is large enough for almost all existing md raid configurations. So for the current kernel implementation and real-world devices, for a single bio there is no possible hash conflict with the new I/O barrier patch.
>>>
>>> If bi_iter.bi_size changes from unsigned int to unsigned long in the future, the above assumption will be wrong. There will be hash conflicts, and a potential deadlock, which is quite implicit. Yes, I agree with you. No, bio split inside the loop is not perfect.
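The conflict check described above can be emulated in user space with a small program along the lines of the sketch below. It is only an approximation of that experiment: the hash is modeled with the multiplicative hash_64() variant (val * GOLDEN_RATIO_64 >> (64 - bits)), which may not match the exact hash_long() implementation of every kernel version, and it runs single-threaded, so it is far slower than the 20-core run described above.

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define GOLDEN_RATIO_64	0x61C8864680B583EBull
#define BUCKET_BITS	10			/* 1024 buckets */

/* User-space stand-in for hash_long()/hash_64(). */
static inline unsigned int hash64(uint64_t val, unsigned int bits)
{
	return (unsigned int)((val * GOLDEN_RATIO_64) >> (64 - bits));
}

/* Return true if any two of the n consecutive barrier-unit numbers
 * starting at 'base' fall into the same bucket. */
static bool window_has_conflict(uint64_t base, int n)
{
	bool seen[1 << BUCKET_BITS] = { false };
	int i;

	for (i = 0; i < n; i++) {
		unsigned int idx = hash64(base + i, BUCKET_BITS);

		if (seen[idx])
			return true;
		seen[idx] = true;
	}
	return false;
}

int main(void)
{
	const uint64_t limit = 1ULL << 37;	/* range checked in the thread */
	const int window = 65;			/* worst case for a 4GB DISCARD bio */
	uint64_t base;

	for (base = 0; base + window <= limit; base++) {
		if (window_has_conflict(base, window)) {
			printf("conflict in window starting at %llu\n",
			       (unsigned long long)base);
			return 1;
		}
	}
	printf("no conflict found\n");
	return 0;
}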
>>>> Even if there was no possibility of a deadlock from a resync request happening between two bios, there are other possibilities.
>>>
>>> The text below helps me learn more about the raid1 code, but it confuses me as well. Here come my questions:
>>>
>>>> It is not, in general, safe to call mempool_alloc() twice in a row without first ensuring that the first allocation will get freed by some other thread. raid1_write_request() allocates from r1bio_pool, and then submits bios to the underlying device, which get queued on bio_list_on_stack. They will not be processed until after raid1_make_request() completes, so when raid1_make_request() loops around and calls raid1_write_request() again, it will try to allocate another r1bio from r1bio_pool, and this might end up waiting for the r1bio which is trapped and cannot complete.
>>>
>>> Can I say that this is because blk_finish_plug() won't be called before raid1_make_request() returns? Then in raid1_write_request(), mbio will be added into plug->pending, but before blk_finish_plug() is called, they won't be handled.
>>
>> blk_finish_plug is called if raid1_make_request sleeps. The bio is held in current->bio_list, not in the plug list.
>
> Oops, I mixed them up, thank you for the clarification :-)
>
>>>> As r1bio_pool preallocates 256 entries, this is unlikely but not impossible. If 256 threads all attempt a write (or read) that crosses a boundary, then they will consume all 256 preallocated entries and want more. If there is no free memory, they will block indefinitely.
>>>
>>> If raid1_make_request() is modified in this way,
>>> +	if (bio_data_dir(split) == READ)
>>> +		raid1_read_request(mddev, split);
>>> +	else
>>> +		raid1_write_request(mddev, split);
>>> +	if (split != bio)
>>> +		generic_make_request(bio);
>>>
>>> then the original bio will be added into the bio_list_on_stack of the top-level generic_make_request(); current->bio_list is already initialized, so when generic_make_request() is called nested inside raid1_make_request(), the split bio will be added into current->bio_list and nothing else happens.
>>>
>>> After the nested generic_make_request() returns, the code goes back to the following lines of generic_make_request():
>>> 2022		ret = q->make_request_fn(q, bio);
>>> 2023
>>> 2024		blk_queue_exit(q);
>>> 2025
>>> 2026		bio = bio_list_pop(current->bio_list);
>>>
>>> bio_list_pop() will return the second half of the split bio, and it is
>>
>> So in the above sequence, current->bio_list will have bios in this order:
>> bios to underlying disks, second half of original bio
>>
>> bio_list_pop will pop the bios to the underlying disks first and handle them, then the second half of the original bio.
>>
>> That said, this doesn't work for an array stacked 3 layers deep. Because in a 3-layer array, handling the middle layer bio will make the 3rd layer bio hold to bio_list again.
>
> Could you please give me more hints:
> - What is the meaning of "hold" in "make the 3rd layer bio hold to bio_list again"?
> - Why does a deadlock happen if the 3rd layer bio is held on bio_list again?
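For context, both code fragments quoted above sit inside the dispatch loop of generic_make_request(). The sketch below is a heavily simplified rendering of that loop (4.10-era block layer, with queue-enter and error handling omitted), not the real code; it only illustrates why a bio submitted from inside ->make_request_fn() sits on current->bio_list until the outermost call unwinds.

#include <linux/blkdev.h>
#include <linux/bio.h>
#include <linux/sched.h>

blk_qc_t generic_make_request(struct bio *bio)
{
	struct bio_list bio_list_on_stack;

	if (current->bio_list) {
		/* Nested call: queue the bio behind whatever is already
		 * on the caller's list and return immediately. */
		bio_list_add(current->bio_list, bio);
		return BLK_QC_T_NONE;
	}

	/* Outermost call: own the on-stack list and drain it in FIFO order. */
	bio_list_init(&bio_list_on_stack);
	current->bio_list = &bio_list_on_stack;
	do {
		struct request_queue *q = bdev_get_queue(bio->bi_bdev);

		q->make_request_fn(q, bio);	/* may add more bios to the list */
		bio = bio_list_pop(current->bio_list);
	} while (bio);
	current->bio_list = NULL;

	return BLK_QC_T_NONE;
}

Lines 2022-2026 quoted above correspond to the body of this do/while loop.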
I tried to set up a 4-layer stacked md raid1, reduced the I/O barrier bucket size to 8MB, and ran it for 10 hours; no deadlock was observed.

Here is how the 4-layer stacked raid1 is set up:
- There are 4 NVMe SSDs, and on each SSD I create four 500GB partitions:
  /dev/nvme0n1: nvme0n1p1, nvme0n1p2, nvme0n1p3, nvme0n1p4
  /dev/nvme1n1: nvme1n1p1, nvme1n1p2, nvme1n1p3, nvme1n1p4
  /dev/nvme2n1: nvme2n1p1, nvme2n1p2, nvme2n1p3, nvme2n1p4
  /dev/nvme3n1: nvme3n1p1, nvme3n1p2, nvme3n1p3, nvme3n1p4
- Here is how the 4-layer stacked raid1 is assembled; level 1 means the top level, level 4 means the bottom level of the stacked devices:
  - level 1: /dev/md40: /dev/md30 /dev/md31
  - level 2: /dev/md30: /dev/md20 /dev/md21
             /dev/md31: /dev/md22 /dev/md23
  - level 3: /dev/md20: /dev/md10 /dev/md11
             /dev/md21: /dev/md12 /dev/md13
             /dev/md22: /dev/md14 /dev/md15
             /dev/md23: /dev/md16 /dev/md17
  - level 4: /dev/md10: /dev/nvme0n1p1 /dev/nvme1n1p1
             /dev/md11: /dev/nvme2n1p1 /dev/nvme3n1p1
             /dev/md12: /dev/nvme0n1p2 /dev/nvme1n1p2
             /dev/md13: /dev/nvme2n1p2 /dev/nvme3n1p2
             /dev/md14: /dev/nvme0n1p3 /dev/nvme1n1p3
             /dev/md15: /dev/nvme2n1p3 /dev/nvme3n1p3
             /dev/md16: /dev/nvme0n1p4 /dev/nvme1n1p4
             /dev/md17: /dev/nvme2n1p4 /dev/nvme3n1p4

Here is the fio job file:

[global]
direct=1
thread=1
ioengine=libaio

[job]
filename=/dev/md40
readwrite=write
numjobs=10
blocksize=33M
iodepth=128
time_based=1
runtime=10h

I planned to learn how the deadlock comes about by analyzing a deadlock condition. Maybe it was because the 8MB bucket unit size is small enough; now I am trying a run with a 512KB bucket unit size to see whether I can encounter a deadlock.

=============== P.S ==============
While running the stacked raid1 testing, I saw behavior that looks suspicious to me: the resync. The second time I rebuilt all the raid1 devices with "mdadm -C /dev/mdXX -l 1 -n 2 /dev/xxx /dev/xxx", I saw the top-level raid1 device /dev/md40 had already completed 50%+ of its resync. I don't think it could be that fast...

Coly
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html