On Sat, Oct 17 2015 at 12:04pm -0400,
Ming Lei <tom.leiming@xxxxxxxxx> wrote:

> On Thu, Oct 15, 2015 at 4:47 AM, Mike Snitzer <snitzer@xxxxxxxxxx> wrote:
> > From: Mikulas Patocka <mpatocka@xxxxxxxxxx>
> >
> > The block layer uses per-process bio list to avoid recursion in
> > generic_make_request.  When generic_make_request is called recursively,
> > the bio is added to current->bio_list and generic_make_request returns
> > immediately.  The top-level instance of generic_make_request takes bios
> > from current->bio_list and processes them.
> >
> > Commit df2cb6daa4 ("block: Avoid deadlocks with bio allocation by
> > stacking drivers") created a workqueue for every bio set and code
> > in bio_alloc_bioset() that tries to resolve some low-memory deadlocks by
> > redirecting bios queued on current->bio_list to the workqueue if the
> > system is low on memory.  However another deadlock (see below **) may
> > happen, without any low memory condition, because generic_make_request
> > is queuing bios to current->bio_list (rather than submitting them).
> >
> > Fix this deadlock by redirecting any bios on current->bio_list to the
> > bio_set's rescue workqueue on every schedule call.  Consequently, when
> > the process blocks on a mutex, the bios queued on current->bio_list are
> > dispatched to independent workqueues and they can complete without
> > waiting for the mutex to be available.
>
> It isn't common to acquire mutex/semaphore inside .make_request()
> or .request_fn(), so I am wondering it is good to reuse the rescuing
> workqueue for this unusual case.

Which specific locking are you concerned about?

> Also sometimes it can hurt performance by converting I/O submission
> from one context into concurrent contexts of workqueue, especially
> in case of sequential I/O, since plug & plug merge can't be used any
> more.

True, plug and plug merge wouldn't be usable but this recursive call
to generic_make_request isn't expected to be common.
This patch was to fix a relatively obscure bio splitting scenario that
fell out of the complexity of dm-snapshot.

> > diff --git a/block/bio.c b/block/bio.c
> > index ad3f276..99f5a2ad 100644
> > --- a/block/bio.c
> > +++ b/block/bio.c
> > @@ -354,35 +354,35 @@ static void bio_alloc_rescue(struct work_struct *work)
> >          }
> >  }
> >
> > -static void punt_bios_to_rescuer(struct bio_set *bs)
> > +/**
> > + * blk_flush_bio_list
> > + * @tsk: task_struct whose bio_list must be flushed
> > + *
> > + * Pop bios queued on @tsk->bio_list and submit each of them to
> > + * their rescue workqueue.
> > + *
> > + * If the bio doesn't have a bio_set, we leave it on @tsk->bio_list.
> > + * However, stacking drivers should use bio_set, so this shouldn't be
> > + * an issue.
> > + */
> > +void blk_flush_bio_list(struct task_struct *tsk)
...
> > +        while ((bio = bio_list_pop(&list))) {
> > +                struct bio_set *bs = bio->bi_pool;
> > +                if (unlikely(!bs)) {
> > +                        bio_list_add(tsk->bio_list, bio);
> > +                        continue;
> > +                }
> >
> > -        queue_work(bs->rescue_workqueue, &bs->rescue_work);
> > +                spin_lock(&bs->rescue_lock);
> > +                bio_list_add(&bs->rescue_list, bio);
> > +                queue_work(bs->rescue_workqueue, &bs->rescue_work);
> > +                spin_unlock(&bs->rescue_lock);
> > +        }
>
> Not like rescuing path, schedule out can be quite frequent, and the
> above change will switch to submit these I/Os from wq concurrently,
> which might hurt performance for sequential I/O.
>
> Also I am wondering why not submit these I/Os in 'current' context
> just like what flush plug does?

Flush plug during schedule makes use of kblockd so I'm not sure what
you're referring to here.
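
For anyone following along, the flush loop quoted above can be modelled
in plain userspace C.  This is only a rough sketch under stated
assumptions: the struct layouts and the names model_flush_bio_list(),
list_add_tail() and list_pop() are hypothetical stand-ins, not kernel
APIs, and the rescue_lock and the queue_work() call are left out.  It
only shows which bios move to the rescue list and which stay behind.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model only: these types imitate, but are not, the kernel's. */
struct bio_set;

struct bio {
        struct bio *next;
        struct bio_set *pool;   /* NULL models a bio without a bio_set */
};

struct bio_list {
        struct bio *head;
        struct bio *tail;
};

struct bio_set {
        struct bio_list rescue_list;    /* bios handed over to the rescuer */
};

/* Append @bio to the tail of @bl. */
static void list_add_tail(struct bio_list *bl, struct bio *bio)
{
        bio->next = NULL;
        if (bl->tail)
                bl->tail->next = bio;
        else
                bl->head = bio;
        bl->tail = bio;
}

/* Remove and return the first bio of @bl, or NULL if it is empty. */
static struct bio *list_pop(struct bio_list *bl)
{
        struct bio *bio = bl->head;

        if (bio) {
                bl->head = bio->next;
                if (!bl->head)
                        bl->tail = NULL;
                bio->next = NULL;
        }
        return bio;
}

/*
 * Hypothetical model of the flush: every bio that has a bio_set is moved
 * to that set's rescue list; bios without one are put back on the task
 * list in their original order.  Returns the number of bios moved.
 * The real code would take bs->rescue_lock and queue the rescue work.
 */
static size_t model_flush_bio_list(struct bio_list *task_list)
{
        struct bio_list pending;
        struct bio *bio;
        size_t punted = 0;

        assert(task_list != NULL);

        /* Detach the whole list first, then walk the private copy. */
        pending = *task_list;
        task_list->head = task_list->tail = NULL;

        while ((bio = list_pop(&pending))) {
                if (!bio->pool) {
                        list_add_tail(task_list, bio);
                        continue;
                }
                list_add_tail(&bio->pool->rescue_list, bio);
                punted++;
        }
        return punted;
}
```

With three queued bios where only the middle one lacks a bio_set, the
model moves the first and third to the rescue list and leaves the middle
one on the task list.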

> >  }
> >
> >  /**
> > @@ -422,7 +422,6 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
> >   */
> >  struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
> >  {
> > -        gfp_t saved_gfp = gfp_mask;
> >          unsigned front_pad;
> >          unsigned inline_vecs;
> >          unsigned long idx = BIO_POOL_NONE;
> > @@ -457,23 +456,11 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
> >                   * reserve.
> >                   *
> >                   * We solve this, and guarantee forward progress, with a rescuer
> > -                 * workqueue per bio_set. If we go to allocate and there are
> > -                 * bios on current->bio_list, we first try the allocation
> > -                 * without __GFP_WAIT; if that fails, we punt those bios we
> > -                 * would be blocking to the rescuer workqueue before we retry
> > -                 * with the original gfp_flags.
> > +                 * workqueue per bio_set. If an allocation would block (due to
> > +                 * __GFP_WAIT) the scheduler will first punt all bios on
> > +                 * current->bio_list to the rescuer workqueue.
> >                   */
> > -
> > -                if (current->bio_list && !bio_list_empty(current->bio_list))
> > -                        gfp_mask &= ~__GFP_WAIT;
> > -
> >                  p = mempool_alloc(bs->bio_pool, gfp_mask);
> > -                if (!p && gfp_mask != saved_gfp) {
> > -                        punt_bios_to_rescuer(bs);
> > -                        gfp_mask = saved_gfp;
> > -                        p = mempool_alloc(bs->bio_pool, gfp_mask);
> > -                }
> > -
> >                  front_pad = bs->front_pad;
> >                  inline_vecs = BIO_INLINE_VECS;
> >          }
> > @@ -486,12 +473,6 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
> >
> >          if (nr_iovecs > inline_vecs) {
> >                  bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, bs->bvec_pool);
> > -                if (!bvl && gfp_mask != saved_gfp) {
> > -                        punt_bios_to_rescuer(bs);
> > -                        gfp_mask = saved_gfp;
> > -                        bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, bs->bvec_pool);
> > -                }
> > -
>
> Looks you touched rescuing path for bio allocation, and better to just
> do one thing in one patch.

Yes, good point, I've split the patches, you can see the result in the 2
topmost commits in this branch:
http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=wip

Definitely helped clean up the patches and make them more
approachable/reviewable.  Thanks for the suggestion.

I'll hold off on sending out v4 of these patches until I can better
understand your concerns about using the rescue workqueue during
schedule.

Mike

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel