On Mon, Feb 20, 2017 at 01:51:22PM +1100, Neil Brown wrote:
> On Mon, Feb 20 2017, NeilBrown wrote:
>
> > On Fri, Feb 17 2017, Coly Li wrote:
> >
> >> On 2017/2/16 3:04 PM, NeilBrown wrote:
> >>> I know you are going to change this as Shaohua wants the splitting to
> >>> happen in a separate function, which I agree with, but there is
> >>> something else wrong here.  Calling bio_split/bio_chain repeatedly
> >>> in a loop is dangerous.  It is OK for simple devices, but when one
> >>> request can wait for another request to the same device it can
> >>> deadlock.  This can happen with raid1.  If a resync request calls
> >>> raise_barrier() between one request and the next, then the next has
> >>> to wait for the resync request, which has to wait for the first
> >>> request.  As the first request will be stuck in the queue in
> >>> generic_make_request(), you get a deadlock.
> >>
> >> For md raid1, queue in generic_make_request(), can I understand it as
> >> bio_list_on_stack in this function? And queue in underlying device,
> >> can I understand it as the data structures like plug->pending and
> >> conf->pending_bio_list ?
> >
> > Yes, the queue in generic_make_request() is the bio_list_on_stack. That
> > is the only queue I am talking about.  I'm not referring to
> > plug->pending or conf->pending_bio_list at all.
> >
> >> I still don't get the point of deadlock, let me try to explain why I
> >> don't see the possible deadlock. If a bio is split, and the first part
> >> is processed by make_request_fn(), and then a resync comes and it will
> >> raise a barrier, there are 3 possible conditions,
> >> - the resync I/O tries to raise barrier on same bucket of the first
> >> regular bio. Then the resync task has to wait until the first bio drops
> >> its conf->nr_pending[idx]
> >
> > Not quite.
> > First, the resync task (in raise_barrier()) will wait for ->nr_waiting[idx]
> > to be zero.  We can assume this happens immediately.
> > Then the resync task will increment ->barrier[idx].
> > Only then will it wait for the first bio to drop ->nr_pending[idx].
> > The processing of that first bio will have submitted bios to the
> > underlying device, and they will be in the bio_list_on_stack queue, and
> > will not be processed until raid1_make_request() completes.
> >
> > The loop in raid1_make_request() will then call make_request_fn() which
> > will call wait_barrier(), which will wait for ->barrier[idx] to be
> > zero.
>
> Thinking more carefully about this.. the 'idx' that the second bio will
> wait for will normally be different, so there won't be a deadlock after
> all.
>
> However it is possible for hash_long() to produce the same idx for two
> consecutive barrier_units, so there is still the possibility of a
> deadlock, though it isn't as likely as I thought at first.

Wrapped the function pointer issue Neil pointed out into Coly's original
patch. Also fixed a use-after-free bug. For the deadlock issue, I'll add
the patch below, please check.

Thanks,
Shaohua

From ee9c98138bcdf8bceef384a68f49258b6b8b8c6d Mon Sep 17 00:00:00 2001
Message-Id: <ee9c98138bcdf8bceef384a68f49258b6b8b8c6d.1487573888.git.shli@xxxxxx>
From: Shaohua Li <shli@xxxxxx>
Date: Sun, 19 Feb 2017 22:18:32 -0800
Subject: [PATCH] md/raid1/10: fix potential deadlock

Neil Brown pointed out a potential deadlock in raid10 code with
bio_split/chain. The raid1 code could have the same issue, but recent
barrier rework makes it less likely to happen. The deadlock happens in
the sequence below:

1. generic_make_request(bio), this will set current->bio_list
2. raid10_make_request() will split bio to bio1 and bio2
3. __make_request(bio1), wait_barrier(), add underlying disk bio to
   current->bio_list
4. __make_request(bio2), wait_barrier()

If raise_barrier() happens between 3 & 4: since wait_barrier() runs at 3,
raise_barrier() waits for IO completion from 3. And since raise_barrier()
sets the barrier, 4 waits for raise_barrier().
But the IO from 3 can't be dispatched because raid10_make_request()
hasn't finished yet.

The solution is to adjust the IO ordering. Quoting Neil:

" It is much safer to:

    if (need to split) {
        split = bio_split(bio, ...)
        bio_chain(...)
        make_request_fn(split);
        generic_make_request(bio);
    } else
        make_request_fn(mddev, bio);

This way we first process the initial section of the bio (in 'split')
which will queue some requests to the underlying devices.  These
requests will be queued in generic_make_request.
Then we queue the remainder of the bio, which will be added to the end
of the generic_make_request queue.
Then we return.
generic_make_request() will pop the lower-level device requests off the
queue and handle them first.  Then it will process the remainder
of the original bio once the first section has been fully processed.
"

Cc: Coly Li <colyli@xxxxxxx>
Cc: stable@xxxxxxxxxxxxxxx (v3.14+, only the raid10 part)
Suggested-by: NeilBrown <neilb@xxxxxxxx>
Signed-off-by: Shaohua Li <shli@xxxxxx>
---
 drivers/md/raid1.c  | 28 ++++++++++++++--------------
 drivers/md/raid10.c | 41 ++++++++++++++++++++---------------------
 2 files changed, 34 insertions(+), 35 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 676f72d..e55d865 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1566,21 +1566,21 @@ static void raid1_make_request(struct mddev *mddev, struct bio *bio)
 	sector_t sectors;

 	/* if bio exceeds barrier unit boundary, split it */
-	do {
-		sectors = align_to_barrier_unit_end(
-				bio->bi_iter.bi_sector, bio_sectors(bio));
-		if (sectors < bio_sectors(bio)) {
-			split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);
-			bio_chain(split, bio);
-		} else {
-			split = bio;
-		}
+	sectors = align_to_barrier_unit_end(
+			bio->bi_iter.bi_sector, bio_sectors(bio));
+	if (sectors < bio_sectors(bio)) {
+		split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);
+		bio_chain(split, bio);
+	} else {
+		split = bio;
+	}

-		if (bio_data_dir(split) == READ)
-			raid1_read_request(mddev, split);
-		else
-			raid1_write_request(mddev, split);
-	} while (split != bio);
+	if (bio_data_dir(split) == READ)
+		raid1_read_request(mddev, split);
+	else
+		raid1_write_request(mddev, split);
+	if (split != bio)
+		generic_make_request(bio);
 }

 static void raid1_status(struct seq_file *seq, struct mddev *mddev)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index a1f8e98..b495049 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1551,28 +1551,27 @@ static void raid10_make_request(struct mddev *mddev, struct bio *bio)
 		return;
 	}

-	do {
-
-		/*
-		 * If this request crosses a chunk boundary, we need to split
-		 * it.
-		 */
-		if (unlikely((bio->bi_iter.bi_sector & chunk_mask) +
-			     bio_sectors(bio) > chunk_sects
-			     && (conf->geo.near_copies < conf->geo.raid_disks
-				 || conf->prev.near_copies <
-				 conf->prev.raid_disks))) {
-			split = bio_split(bio, chunk_sects -
-					  (bio->bi_iter.bi_sector &
-					   (chunk_sects - 1)),
-					  GFP_NOIO, fs_bio_set);
-			bio_chain(split, bio);
-		} else {
-			split = bio;
-		}
+	/*
+	 * If this request crosses a chunk boundary, we need to split
+	 * it.
+	 */
+	if (unlikely((bio->bi_iter.bi_sector & chunk_mask) +
+		     bio_sectors(bio) > chunk_sects
+		     && (conf->geo.near_copies < conf->geo.raid_disks
+			 || conf->prev.near_copies <
+			 conf->prev.raid_disks))) {
+		split = bio_split(bio, chunk_sects -
+				  (bio->bi_iter.bi_sector &
+				   (chunk_sects - 1)),
+				  GFP_NOIO, fs_bio_set);
+		bio_chain(split, bio);
+	} else {
+		split = bio;
+	}

-		__make_request(mddev, split);
-	} while (split != bio);
+	__make_request(mddev, split);
+	if (split != bio)
+		generic_make_request(bio);

 	/* In case raid10d snuck in to freeze_array */
 	wake_up(&conf->wait_barrier);
-- 
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html