Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window

Sent from my iPhone
> On Feb 20, 2017, at 3:04 PM, Shaohua Li <shli@xxxxxxxxxx> wrote:
> 
>> On Mon, Feb 20, 2017 at 01:51:22PM +1100, Neil Brown wrote:
>>> On Mon, Feb 20 2017, NeilBrown wrote:
>>> 
>>>> On Fri, Feb 17 2017, Coly Li wrote:
>>>> 
>>>>> On 2017/2/16 3:04 PM, NeilBrown wrote:
>>>>> I know you are going to change this as Shaohua wants the splitting to
>>>>> happen in a separate function, which I agree with, but there is 
>>>>> something else wrong here. Calling bio_split/bio_chain repeatedly
>>>>> in a loop is dangerous. It is OK for simple devices, but when one
>>>>> request can wait for another request to the same device it can
>>>>> deadlock. This can happen with raid1.  If a resync request calls
>>>>> raise_barrier() between one request and the next, then the next has
>>>>> to wait for the resync request, which has to wait for the first
>>>>> request. As the first request will be stuck in the queue in 
>>>>> generic_make_request(), you get a deadlock.
>>>> 
>>>> For md raid1, can I understand the queue in generic_make_request() as
>>>> bio_list_on_stack in this function? And the queue in the underlying device
>>>> as data structures like plug->pending and
>>>> conf->pending_bio_list?
>>> 
>>> Yes, the queue in generic_make_request() is the bio_list_on_stack.  That
>>> is the only queue I am talking about.  I'm not referring to
>>> plug->pending or conf->pending_bio_list at all.
>>> 
>>>> 
>>>> I still don't see the deadlock; let me try to explain why I
>>>> don't see it. If a bio is split, and the first part
>>>> is processed by make_request_fn(), and then a resync comes and
>>>> raises a barrier, there are 3 possible conditions,
>>>> - the resync I/O tries to raise a barrier on the same bucket as the first
>>>> regular bio. Then the resync task has to wait until the first bio drops
>>>> its conf->nr_pending[idx]
>>> 
>>> Not quite.
>>> First, the resync task (in raise_barrier()) will wait for ->nr_waiting[idx]
>>> to be zero.  We can assume this happens immediately.
>>> Then the resync_task will increment ->barrier[idx].
>>> Only then will it wait for the first bio to drop ->nr_pending[idx].
>>> The processing of that first bio will have submitted bios to the
>>> underlying device, and they will be in the bio_list_on_stack queue, and
>>> will not be processed until raid1_make_request() completes.
>>> 
>>> The loop in raid1_make_request() will then call make_request_fn() which
>>> will call wait_barrier(), which will wait for ->barrier[idx] to be
>>> zero.
>> 
>> Thinking more carefully about this... the 'idx' that the second bio will
>> wait for will normally be different, so there won't be a deadlock after
>> all.
>> 
>> However it is possible for hash_long() to produce the same idx for two
>> consecutive barrier_units so there is still the possibility of a
>> deadlock, though it isn't as likely as I thought at first.
> 
> Wrapped the function pointer issue Neil pointed out into Coly's original patch.
> Also fix a 'use-after-free' bug. For the deadlock issue, I'll add below patch,
> please check.
> 
> Thanks,
> Shaohua
> 

Hmm, please hold, I am still thinking about it. With the barrier buckets and hash_long(), I don't see a deadlock yet. For raid10 it might happen, but once we have barrier buckets there as well, there will be no deadlock.

My question is, this deadlock only happens when a big bio is split, the split small bios are contiguous, and the resync I/O visits the barrier buckets in sequential order too. If adjacent split regular bios or resync bios hit the same barrier bucket, that would be a very big failure of the hash design, and should have been found already. But no one has complained about it, so I can't convince myself the deadlock is real with I/O barrier buckets (which is what Neil is concerned about).

For the function pointer assignment, it is because I see a branch happening inside a loop. If I use a function pointer, I can avoid the redundant branch inside the loop. raid1_read_request() and raid1_write_request() are not simple functions; I don't know whether gcc will inline them or not, so I am on my way to checking the disassembled code.

The loop in raid1_make_request() is quite high level, and I am not sure whether CPU branch prediction works well there, especially when it is a big DISCARD bio; using a function pointer may remove a possible branch.

So I need to check what we gain and lose when using a function pointer or not. If it is not urgent, please hold this patch for a while.

The only thing I worry about in the patch below is: if a very big DISCARD bio comes in, will the kernel stack tend to overflow?

Thanks.

Coly





> From ee9c98138bcdf8bceef384a68f49258b6b8b8c6d Mon Sep 17 00:00:00 2001
> Message-Id: <ee9c98138bcdf8bceef384a68f49258b6b8b8c6d.1487573888.git.shli@xxxxxx>
> From: Shaohua Li <shli@xxxxxx>
> Date: Sun, 19 Feb 2017 22:18:32 -0800
> Subject: [PATCH] md/raid1/10: fix potential deadlock
> 
> Neil Brown pointed out a potential deadlock in the raid10 code with
> bio_split/chain. The raid1 code could have the same issue, but the recent
> barrier rework makes it less likely to happen. The deadlock happens in
> the sequence below:
> 
> 1. generic_make_request(bio), this will set current->bio_list
> 2. raid10_make_request will split bio to bio1 and bio2
> 3. __make_request(bio1), wait_barrier, add underlying disk bio to
> current->bio_list
> 4. __make_request(bio2), wait_barrier
> 
> If raise_barrier happens between 3 & 4, since wait_barrier runs at 3,
> raise_barrier waits for IO completion from 3. And since raise_barrier
> sets barrier, 4 waits for raise_barrier. But IO from 3 can't be
> dispatched because raid10_make_request() isn't finished yet.
> 
> The solution is to adjust the IO ordering. Quotes from Neil:
> "
> It is much safer to:
> 
>    if (need to split) {
>        split = bio_split(bio, ...)
>        bio_chain(...)
>        make_request_fn(split);
>        generic_make_request(bio);
>   } else
>        make_request_fn(mddev, bio);
> 
> This way we first process the initial section of the bio (in 'split')
> which will queue some requests to the underlying devices.  These
> requests will be queued in generic_make_request.
> Then we queue the remainder of the bio, which will be added to the end
> of the generic_make_request queue.
> Then we return.
> generic_make_request() will pop the lower-level device requests off the
> queue and handle them first.  Then it will process the remainder
> of the original bio once the first section has been fully processed.
> "
> 
> Cc: Coly Li <colyli@xxxxxxx>
> Cc: stable@xxxxxxxxxxxxxxx (v3.14+, only the raid10 part)
> Suggested-by: NeilBrown <neilb@xxxxxxxx>
> Signed-off-by: Shaohua Li <shli@xxxxxx>
> ---
> drivers/md/raid1.c  | 28 ++++++++++++++--------------
> drivers/md/raid10.c | 41 ++++++++++++++++++++---------------------
> 2 files changed, 34 insertions(+), 35 deletions(-)
> 
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 676f72d..e55d865 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1566,21 +1566,21 @@ static void raid1_make_request(struct mddev *mddev, struct bio *bio)
>    sector_t sectors;
> 
>    /* if bio exceeds barrier unit boundary, split it */
> -    do {
> -        sectors = align_to_barrier_unit_end(
> -                bio->bi_iter.bi_sector, bio_sectors(bio));
> -        if (sectors < bio_sectors(bio)) {
> -            split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);
> -            bio_chain(split, bio);
> -        } else {
> -            split = bio;
> -        }
> +    sectors = align_to_barrier_unit_end(
> +            bio->bi_iter.bi_sector, bio_sectors(bio));
> +    if (sectors < bio_sectors(bio)) {
> +        split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);
> +        bio_chain(split, bio);
> +    } else {
> +        split = bio;
> +    }
> 
> -        if (bio_data_dir(split) == READ)
> -            raid1_read_request(mddev, split);
> -        else
> -            raid1_write_request(mddev, split);
> -    } while (split != bio);
> +    if (bio_data_dir(split) == READ)
> +        raid1_read_request(mddev, split);
> +    else
> +        raid1_write_request(mddev, split);
> +    if (split != bio)
> +        generic_make_request(bio);
> }
> 
> static void raid1_status(struct seq_file *seq, struct mddev *mddev)
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index a1f8e98..b495049 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -1551,28 +1551,27 @@ static void raid10_make_request(struct mddev *mddev, struct bio *bio)
>        return;
>    }
> 
> -    do {
> -
> -        /*
> -         * If this request crosses a chunk boundary, we need to split
> -         * it.
> -         */
> -        if (unlikely((bio->bi_iter.bi_sector & chunk_mask) +
> -                 bio_sectors(bio) > chunk_sects
> -                 && (conf->geo.near_copies < conf->geo.raid_disks
> -                 || conf->prev.near_copies <
> -                 conf->prev.raid_disks))) {
> -            split = bio_split(bio, chunk_sects -
> -                      (bio->bi_iter.bi_sector &
> -                       (chunk_sects - 1)),
> -                      GFP_NOIO, fs_bio_set);
> -            bio_chain(split, bio);
> -        } else {
> -            split = bio;
> -        }
> +    /*
> +     * If this request crosses a chunk boundary, we need to split
> +     * it.
> +     */
> +    if (unlikely((bio->bi_iter.bi_sector & chunk_mask) +
> +             bio_sectors(bio) > chunk_sects
> +             && (conf->geo.near_copies < conf->geo.raid_disks
> +             || conf->prev.near_copies <
> +             conf->prev.raid_disks))) {
> +        split = bio_split(bio, chunk_sects -
> +                  (bio->bi_iter.bi_sector &
> +                   (chunk_sects - 1)),
> +                  GFP_NOIO, fs_bio_set);
> +        bio_chain(split, bio);
> +    } else {
> +        split = bio;
> +    }
> 
> -        __make_request(mddev, split);
> -    } while (split != bio);
> +    __make_request(mddev, split);
> +    if (split != bio)
> +        generic_make_request(bio);
> 
>    /* In case raid10d snuck in to freeze_array */
>    wake_up(&conf->wait_barrier);
> -- 
> 2.9.3
> 
