Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window


 



On Thu, Feb 23, 2017 at 01:54:47PM +0800, Coly Li wrote:
> > On 2017/2/22 4:09 AM, Coly Li wrote:
> > > On 2017/2/22 1:45 AM, Shaohua Li wrote:
> >> On Tue, Feb 21, 2017 at 05:45:53PM +0800, Coly Li wrote:
> >>> On 2017/2/21 8:29 AM, NeilBrown wrote:
> >>>> On Mon, Feb 20 2017, Coly Li wrote:
> >>>>
> >>>>>> On Feb 20, 2017, at 3:04 PM, Shaohua Li <shli@xxxxxxxxxx> wrote:
> >>>>>>
> >>>>>>> On Mon, Feb 20, 2017 at 01:51:22PM +1100, Neil Brown wrote:
> >>>>>>>> On Mon, Feb 20 2017, NeilBrown wrote:
> >>>>>>>>
> >>>>>>>>> On Fri, Feb 17 2017, Coly Li wrote:
> >>>>>>>>>
> >>>>>>>>>> On 2017/2/16 3:04 PM, NeilBrown wrote: I know you are
> >>>>>>>>>> going to change this as Shaohua wants the splitting to
> >>>>>>>>>> happen in a separate function, which I agree with, but
> >>>>>>>>>> there is something else wrong here. Calling
> >>>>>>>>>> bio_split/bio_chain repeatedly in a loop is dangerous.
> >>>>>>>>>> It is OK for simple devices, but when one request can
> >>>>>>>>>> wait for another request to the same device it can 
> >>>>>>>>>> deadlock. This can happen with raid1.  If a resync
> >>>>>>>>>> request calls raise_barrier() between one request and
> >>>>>>>>>> the next, then the next has to wait for the resync
> >>>>>>>>>> request, which has to wait for the first request. As
> >>>>>>>>>> the first request will be stuck in the queue in 
> >>>>>>>>>> generic_make_request(), you get a deadlock.
> >>>>>>>>>
> >>>>>>>>> For md raid1, queue in generic_make_request(), can I
> >>>>>>>>> understand it as bio_list_on_stack in this function? And
> >>>>>>>>> queue in underlying device, can I understand it as the
> >>>>>>>>> data structures like plug->pending and 
> >>>>>>>>> conf->pending_bio_list ?
> >>>>>>>>
> >>>>>>>> Yes, the queue in generic_make_request() is the
> >>>>>>>> bio_list_on_stack.  That is the only queue I am talking
> >>>>>>>> about.  I'm not referring to plug->pending or
> >>>>>>>> conf->pending_bio_list at all.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> I still don't get the point of the deadlock, let me try to
> >>>>>>>>> explain why I don't see the possible deadlock. If a bio
> >>>>>>>>> is split, and the first part is processed by
> >>>>>>>>> make_request_fn(), and then a resync comes and raises a
> >>>>>>>>> barrier, there are 3 possible conditions: - the resync
> >>>>>>>>> I/O tries to raise a barrier on the same bucket as the
> >>>>>>>>> first regular bio. Then the resync task has to wait until
> >>>>>>>>> the first bio drops its conf->nr_pending[idx]
> >>>>>>>>
> >>>>>>>> Not quite. First, the resync task (in raise_barrier()) will
> >>>>>>>> wait for ->nr_waiting[idx] to be zero.  We can assume this
> >>>>>>>> happens immediately. Then the resync_task will increment
> >>>>>>>> ->barrier[idx]. Only then will it wait for the first bio to
> >>>>>>>> drop ->nr_pending[idx]. The processing of that first bio
> >>>>>>>> will have submitted bios to the underlying device, and they
> >>>>>>>> will be in the bio_list_on_stack queue, and will not be
> >>>>>>>> processed until raid1_make_request() completes.
> >>>>>>>>
> >>>>>>>> The loop in raid1_make_request() will then call
> >>>>>>>> make_request_fn() which will call wait_barrier(), which
> >>>>>>>> will wait for ->barrier[idx] to be zero.
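> >>>>>>>>
> >>>>>>>> If it helps, the cycle can be sketched in userspace (a toy model
> >>>>>>>> only, not the md code: two counters and a condvar stand in for
> >>>>>>>> one barrier bucket, assuming both halves of the split map to the
> >>>>>>>> same idx; build with "cc -pthread"):
> >>>>>>>>
> >>>>>>>> #include <pthread.h>
> >>>>>>>> #include <stdio.h>
> >>>>>>>> #include <unistd.h>
> >>>>>>>>
> >>>>>>>> static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
> >>>>>>>> static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
> >>>>>>>> static int barrier;            /* raised by the resync task */
> >>>>>>>> static int nr_pending = 1;     /* held by the first half of the split bio */
> >>>>>>>>
> >>>>>>>> static void *resync_task(void *arg)
> >>>>>>>> {
> >>>>>>>> 	pthread_mutex_lock(&lock);
> >>>>>>>> 	/* nr_waiting[idx] is assumed zero, so raise_barrier() proceeds
> >>>>>>>> 	 * and raises the barrier... */
> >>>>>>>> 	barrier = 1;
> >>>>>>>> 	printf("resync: barrier raised, waiting for nr_pending == 0\n");
> >>>>>>>> 	/* ...then waits for the first half to drop nr_pending, which
> >>>>>>>> 	 * cannot happen while that bio sits on bio_list_on_stack. */
> >>>>>>>> 	while (nr_pending)
> >>>>>>>> 		pthread_cond_wait(&cond, &lock);
> >>>>>>>> 	pthread_mutex_unlock(&lock);
> >>>>>>>> 	return NULL;
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> int main(void)
> >>>>>>>> {
> >>>>>>>> 	pthread_t t;
> >>>>>>>>
> >>>>>>>> 	pthread_create(&t, NULL, resync_task, NULL);
> >>>>>>>> 	sleep(1);	/* let the resync task raise the barrier */
> >>>>>>>>
> >>>>>>>> 	/* Submitting task, second half of the split, same bucket idx:
> >>>>>>>> 	 * wait_barrier() would wait for barrier == 0, but only the
> >>>>>>>> 	 * resync task can clear it, and it is waiting for a bio that
> >>>>>>>> 	 * is queued behind us -- a circular wait. */
> >>>>>>>> 	pthread_mutex_lock(&lock);
> >>>>>>>> 	if (barrier)
> >>>>>>>> 		printf("submitter: wait_barrier() would block; first half "
> >>>>>>>> 		       "still on bio_list_on_stack -> deadlock\n");
> >>>>>>>> 	pthread_mutex_unlock(&lock);
> >>>>>>>> 	return 0;	/* report the cycle instead of actually hanging */
> >>>>>>>> }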
> >>>>>>>
> >>>>>>> Thinking more carefully about this.. the 'idx' that the
> >>>>>>> second bio will wait for will normally be different, so there
> >>>>>>> won't be a deadlock after all.
> >>>>>>>
> >>>>>>> However it is possible for hash_long() to produce the same
> >>>>>>> idx for two consecutive barrier_units so there is still the
> >>>>>>> possibility of a deadlock, though it isn't as likely as I
> >>>>>>> thought at first.
> >>>>>>
> >>>>>> I folded the fix for the function pointer issue Neil pointed out
> >>>>>> into Coly's original patch, and also fixed a 'use-after-free' bug.
> >>>>>> For the deadlock issue, I'll add the patch below, please check.
> >>>>>>
> >>>>>> Thanks, Shaohua
> >>>>>>
> >>>>>
> >>>
> >>> Neil,
> >>>
> >>> Thanks for your patient explanation, I think I am starting to follow
> >>> what you mean. Let me try to restate what I understand; correct me if
> >>> I am wrong.
> >>>
> >>>
> >>>>> Hmm, please hold, I am still thinking about it. With the barrier
> >>>>> buckets and hash_long(), I don't see a deadlock yet. For raid10 it
> >>>>> might happen, but once we have barrier buckets on it, there will be
> >>>>> no deadlock.
> >>>>>
> >>>>> My point is, this deadlock only happens when a big bio is split,
> >>>>> the split small bios are contiguous, and the resync I/O visits the
> >>>>> barrier buckets in sequential order too. If adjacent split regular
> >>>>> bios or resync bios hit the same barrier bucket, it would be a very
> >>>>> big failure of the hash design, and should have been found already.
> >>>>> But no one has complained about it, so I can't convince myself the
> >>>>> deadlock is real with I/O barrier buckets (this is what Neil is
> >>>>> concerned about).
> >>>>
> >>>> I think you are wrong about the design goal of a hash function. 
> >>>> When fed a sequence of inputs, with any stride (i.e. with any
> >>>> constant difference between consecutive inputs), the output of the
> >>>> hash function should appear to be random. A random sequence can
> >>>> produce the same number twice in a row. If the hash function
> >>>> produces a number from 0 to N-1, you would expect two consecutive
> >>>> outputs to be the same about once every N inputs.
> >>>>
> >>>
> >>> Yes, you are right. But when I mentioned hash conflicts, I limited the
> >>> integers to the range [0, 1<<38]. 38 is (64 - 17 - 9): when a 64-bit
> >>> LBA address is divided by the 64MB I/O barrier unit size, its value
> >>> range is reduced to [0, 1<<38].
> >>>
> >>> The maximum size of a normal bio is 1MB, so it can be split into 2
> >>> bios at most.
> >>>
> >>> For a DISCARD bio, the maximum size is 4GB; since 4GB covers at most
> >>> 65 consecutive 64MB barrier units when unaligned, it can be split into
> >>> 65 bios at most.
> >>>
> >>> Then in this patch, the hash question reduces to: for any 65
> >>> consecutive integers in the range [0, 1<<38], when hash_long() hashes
> >>> them into the range [0, 1023], will any hash conflict happen among
> >>> these integers?
> >>>
> >>> I tried the half range [0, 1<<37] to check for hash conflicts, by
> >>> writing a simple program to emulate the hash calculation in the new
> >>> I/O barrier patch, iterating over all runs of {2, 65, 128, 512}
> >>> consecutive integers in the range [0, 1<<37] and checking for hash
> >>> conflicts.
> >>>
> >>> On a 20-core CPU each run took 7+ hours; in the end no hash conflict
> >>> was detected for runs of up to 512 consecutive integers under the
> >>> above constraints. For 1024, a lot of hash conflicts are detected.
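> >>>
> >>> A minimal userspace sketch of this kind of check (not the exact
> >>> program I ran; it assumes the hash_64()-style multiplicative hash from
> >>> include/linux/hash.h, whose constant may differ between kernel
> >>> versions, and scans a much smaller range so it finishes quickly):
> >>>
> >>> #include <stdio.h>
> >>> #include <stdint.h>
> >>> #include <string.h>
> >>>
> >>> #define GOLDEN_RATIO_64	0x61C8864680B583EBull	/* assumed hash constant */
> >>> #define BUCKET_BITS	10			/* 1024 barrier buckets */
> >>> #define NR_BUCKETS	(1 << BUCKET_BITS)
> >>> #define RUN		65			/* worst-case DISCARD split count */
> >>>
> >>> static unsigned int hash_bucket(uint64_t unit)
> >>> {
> >>> 	return (unsigned int)((unit * GOLDEN_RATIO_64) >> (64 - BUCKET_BITS));
> >>> }
> >>>
> >>> int main(void)
> >>> {
> >>> 	/* Sliding window of the last RUN bucket indices plus a per-bucket
> >>> 	 * use count, so each step is O(1).  LIMIT is shrunk from 1<<37 to
> >>> 	 * keep the demo quick. */
> >>> 	const uint64_t LIMIT = 1ULL << 24;
> >>> 	unsigned int window[RUN];
> >>> 	unsigned int count[NR_BUCKETS];
> >>> 	uint64_t unit, conflicts = 0;
> >>>
> >>> 	memset(count, 0, sizeof(count));
> >>> 	for (unit = 0; unit < LIMIT; unit++) {
> >>> 		unsigned int b = hash_bucket(unit);
> >>>
> >>> 		if (unit >= RUN)	/* drop the unit leaving the window */
> >>> 			count[window[unit % RUN]]--;
> >>> 		if (count[b])		/* two of the last RUN units share a bucket */
> >>> 			conflicts++;
> >>> 		count[b]++;
> >>> 		window[unit % RUN] = b;
> >>> 	}
> >>> 	printf("conflicts within %d consecutive units below %llu: %llu\n",
> >>> 	       RUN, (unsigned long long)LIMIT, (unsigned long long)conflicts);
> >>> 	return 0;
> >>> }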
> >>>
> >>> The [0, 1<<37] barrier-unit range maps back to a 2^63-byte LBA range,
> >>> which is large enough for almost all existing md raid configurations.
> >>> So for the current kernel implementation and real-world devices, for a
> >>> single bio there is no possible hash conflict with the new I/O barrier
> >>> patch.
> >>>
> >>> If bi_iter.bi_size changes from unsigned int to unsigned long in the
> >>> future, the above assumption will be wrong. There will be hash
> >>> conflicts, and a potential deadlock, which is quite subtle. So yes, I
> >>> agree with you: splitting the bio inside the loop is not perfect.
> >>>
> >>>> Even if there was no possibility of a deadlock from a resync
> >>>> request happening between two bios, there are other possibilities.
> >>>>
> >>>
> >>> The text below helps me understand more about the raid1 code, but it
> >>> confuses me more as well. Here come my questions:
> >>>
> >>>> It is not, in general, safe to call mempool_alloc() twice in a
> >>>> row, without first ensuring that the first allocation will get
> >>>> freed by some other thread.  raid1_write_request() allocates from
> >>>> r1bio_pool, and then submits bios to the underlying device, which
> >>>> get queued on bio_list_on_stack.  They will not be processed until
> >>>> after raid1_make_request() completes, so when raid1_make_request
> >>>> loops around and calls raid1_write_request() again, it will try to
> >>>> allocate another r1bio from r1bio_pool, and this might end up
> >>>> waiting for the r1bio which is trapped and cannot complete.
> >>>>
> >>>
> >>> Can I say that this is because blk_finish_plug() won't be called before
> >>> raid1_make_request() returns? Then in raid1_write_request(), mbio
> >>> will be added into plug->pending, but before blk_finish_plug() is
> >>> called, it won't be handled.
> >>
> >> blk_finish_plug is called if raid1_make_request sleeps. The bio is held
> >> in current->bio_list, not in the plug list.
> >>  
> > 
> > Oops, I messed them up, thank you for clarifying :-)
> > 
> >>>> As r1bio_pool preallocates 256 entries, this is unlikely  but not 
> >>>> impossible.  If 256 threads all attempt a write (or read) that
> >>>> crosses a boundary, then they will consume all 256 preallocated
> >>>> entries, and want more. If there is no free memory, they will block
> >>>> indefinitely.
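> >>>>
> >>>> To see the shape of that failure in isolation, here is a userspace
> >>>> analogy (purely illustrative, not the raid1 code: the pool is a
> >>>> counting semaphore and each boundary-crossing write is a thread that
> >>>> needs two allocations before it can free either; build with
> >>>> "cc -pthread"):
> >>>>
> >>>> #include <pthread.h>
> >>>> #include <semaphore.h>
> >>>> #include <stdio.h>
> >>>>
> >>>> #define POOL_SIZE 8	/* stands in for the 256 preallocated r1bios */
> >>>>
> >>>> static sem_t pool;			/* models r1bio_pool */
> >>>> static pthread_barrier_t all_hold_one;
> >>>>
> >>>> static void *writer(void *arg)
> >>>> {
> >>>> 	long id = (long)arg;
> >>>>
> >>>> 	sem_wait(&pool);			/* r1bio for the first half */
> >>>> 	pthread_barrier_wait(&all_hold_one);	/* every writer holds one entry */
> >>>>
> >>>> 	/* r1bio for the second half: the pool is empty and no writer can
> >>>> 	 * free its first entry before getting the second, so this would
> >>>> 	 * block forever; sem_trywait() lets the demo report it instead. */
> >>>> 	if (sem_trywait(&pool) != 0)
> >>>> 		printf("writer %ld: pool exhausted, would block indefinitely\n", id);
> >>>> 	else
> >>>> 		sem_post(&pool);
> >>>>
> >>>> 	sem_post(&pool);			/* release the first entry */
> >>>> 	return NULL;
> >>>> }
> >>>>
> >>>> int main(void)
> >>>> {
> >>>> 	pthread_t tid[POOL_SIZE];
> >>>> 	long i;
> >>>>
> >>>> 	sem_init(&pool, 0, POOL_SIZE);
> >>>> 	pthread_barrier_init(&all_hold_one, NULL, POOL_SIZE);
> >>>> 	for (i = 0; i < POOL_SIZE; i++)
> >>>> 		pthread_create(&tid[i], NULL, writer, (void *)i);
> >>>> 	for (i = 0; i < POOL_SIZE; i++)
> >>>> 		pthread_join(tid[i], NULL);
> >>>> 	return 0;
> >>>> }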
> >>>>
> >>>
> >>> If raid1_make_request() is modified in this way,
> >>> +	if (bio_data_dir(split) == READ)
> >>> +		raid1_read_request(mddev, split);
> >>> +	else
> >>> +		raid1_write_request(mddev, split);
> >>> +	if (split != bio)
> >>> +		generic_make_request(bio);
> >>>
> >>> Then the remaining part of the original bio will be added to the
> >>> bio_list_on_stack of the top-level generic_make_request(), where
> >>> current->bio_list is initialized; when generic_make_request() is called
> >>> nested inside raid1_make_request(), the bio is just added to
> >>> current->bio_list and nothing else happens.
> >>>
> >>> After the nested generic_make_request() returns, control goes back to
> >>> the next lines of the top-level generic_make_request(),
> >>> 2022                         ret = q->make_request_fn(q, bio);
> >>> 2023
> >>> 2024                         blk_queue_exit(q);
> >>> 2025
> >>> 2026                         bio = bio_list_pop(current->bio_list);
> >>>
> >>> bio_list_pop() will return the second half of the split bio, and it is
> >>
> >> So in the above sequence, current->bio_list will have bios in the
> >> following order: bios to the underlying disks, then the second half of
> >> the original bio.
> >>
> >> bio_list_pop will pop the bios to the underlying disks first and handle
> >> them, then the second half of the original bio.
> >>
> >> That said, this doesn't work for an array stacked 3 layers deep, because
> >> in a 3-layer array, handling the middle layer bio will make the 3rd layer
> >> bio hold to bio_list again.
> >>
> > 
> > Could you please give me more hints?
> > - What is the meaning of "hold" in "make the 3rd layer bio hold to
> > bio_list again"?
> > - Why does a deadlock happen if the 3rd layer bio is held in bio_list
> > again?
> 
> I set up a 4-layer stacked md raid1, reduced the I/O barrier bucket size
> to 8MB, and ran it for 10 hours; no deadlock was observed.
> 
> Here is how the 4-layer stacked raid1 is set up:
> - There are 4 NVMe SSDs; on each SSD I create four 500GB partitions:
>   /dev/nvme0n1:  nvme0n1p1, nvme0n1p2, nvme0n1p3, nvme0n1p4
>   /dev/nvme1n1:  nvme1n1p1, nvme1n1p2, nvme1n1p3, nvme1n1p4
>   /dev/nvme2n1:  nvme2n1p1, nvme2n1p2, nvme2n1p3, nvme2n1p4
>   /dev/nvme3n1:  nvme3n1p1, nvme3n1p2, nvme3n1p3, nvme3n1p4
> - Here is how the 4-layer stacked raid1 is assembled; level 1 is the top
>   level and level 4 is the bottom level of the stacked devices:
>   - level 1:
> 	/dev/md40: /dev/md30  /dev/md31
>   - level 2:
> 	/dev/md30: /dev/md20  /dev/md21
> 	/dev/md31: /dev/md22  /dev/md23
>   - level 3:
> 	/dev/md20: /dev/md10  /dev/md11
> 	/dev/md21: /dev/md12  /dev/md13
> 	/dev/md22: /dev/md14  /dev/md15
> 	/dev/md23: /dev/md16  /dev/md17
>   - level 4:
> 	/dev/md10: /dev/nvme0n1p1  /dev/nvme1n1p1
> 	/dev/md11: /dev/nvme2n1p1  /dev/nvme3n1p1
> 	/dev/md12: /dev/nvme0n1p2  /dev/nvme1n1p2
> 	/dev/md13: /dev/nvme2n1p2  /dev/nvme3n1p2
> 	/dev/md14: /dev/nvme0n1p3  /dev/nvme1n1p3
> 	/dev/md15: /dev/nvme2n1p3  /dev/nvme3n1p3
> 	/dev/md16: /dev/nvme0n1p4  /dev/nvme1n1p4
> 	/dev/md17: /dev/nvme2n1p4  /dev/nvme3n1p4
> 
> Here is the fio job file,
> [global]
> direct=1
> thread=1
> ioengine=libaio
> 
> [job]
> filename=/dev/md40
> readwrite=write
> numjobs=10
> blocksize=33M
> iodepth=128
> time_based=1
> runtime=10h
> 
> I planned to learn how the deadlock comes about by analyzing a deadlocked
> condition. Maybe it was because the 8MB bucket unit size is small enough;
> now I will try to run with a 512KB bucket unit size and see whether I can
> encounter a deadlock.

I don't think raid1 can easily trigger the deadlock; maybe you should try
raid10. The resync case is hard to trigger for raid1. The memory pressure case
is hard to trigger for both raid1 and raid10, but it is possible to trigger.

The 3-layer case is something like this:
1. in level1, set current->bio_list, split the bio into bio1 and bio2
2. remap bio1 to the level2 disk, and queue bio1-level2 in current->bio_list
3. queue bio2 in current->bio_list
4. generic_make_request then pops bio1-level2
5. remap bio1-level2 to the level3 disk, and queue bio1-level2-level3 in current->bio_list
6. generic_make_request then pops bio2, but bio1 hasn't finished yet, so we deadlock

The problem is that we add new bios to the tail of current->bio_list.
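
Here is a rough userspace sketch of that ordering (illustrative only: bios
are just labels and a FIFO array mimics the tail-append behaviour of
bio_list_on_stack):

#include <stdio.h>
#include <string.h>

#define MAXQ 16

static const char *queue[MAXQ];
static int head, tail;

static void q_add_tail(const char *bio) { queue[tail++] = bio; }
static const char *q_pop(void) { return head < tail ? queue[head++] : NULL; }

int main(void)
{
	const char *bio;

	/* level1 make_request: split the original bio */
	q_add_tail("bio1@level2");	/* first half, remapped to the level2 disk */
	q_add_tail("bio2@level1");	/* second half, requeued for level1 */

	while ((bio = q_pop()) != NULL) {
		printf("dispatch %s\n", bio);
		/* a stacked level remaps its bio to the next level down */
		if (!strcmp(bio, "bio1@level2"))
			q_add_tail("bio1@level3");
	}
	return 0;
}

The printed order is bio1@level2, bio2@level1, bio1@level3: bio2 is
dispatched before bio1's level3 child, i.e. before bio1 can possibly
complete. If handling bio2 has to wait for bio1 (same barrier bucket, same
mempool, ...), bio1@level3 is never dispatched and we deadlock.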

> =============== P.S ==============
> While running the stacked raid1 testing, I feel I see some suspicious
> behavior; it is about resync.
> 
> The second time I rebuilt all the raid1 devices with "mdadm -C
> /dev/mdXX -l 1 -n 2 /dev/xxx /dev/xxx", I saw that the top level raid1
> device /dev/md40 had already accomplished 50%+ of its resync. I don't
> think it could be that fast...

no idea, is this reproducible?

Thanks,
Shaohua


