On Fri, Jun 24, 2016 at 10:27 PM, Lars Ellenberg <lars.ellenberg@xxxxxxxxxx> wrote: > On Fri, Jun 24, 2016 at 07:36:57PM +0800, Ming Lei wrote: >> > >> > This is not a theoretical problem. >> > At least int DRBD, and an unfortunately high IO concurrency wrt. the >> > "max-buffers" setting, without this patch we have a reproducible deadlock. >> >> Is there any log about the deadlock? And is there any lockdep warning >> if it is enabled? > > In DRBD, to avoid potentially very long internal queues as we wait for > our replication peer device and local backend, we limit the number of > in-flight bios we accept, and block in our ->make_request_fn() if that > number exceeds a configured watermark ("max-buffers"). > > Works fine, as long as we could assume that once our make_request_fn() > returns, any bios we "recursively" submitted against the local backend > would be dispatched. Which used to be the case. > > commit 54efd50 block: make generic_make_request handle arbitrarily sized bios > introduced blk_queue_split(), which will split any bio that is > violating the queue limits into a smaller piece to be processed > right away, and queue the "rest" on current->bio_list to be > processed by the iteration in generic_make_request() once the > current ->make_request_fn() returns. > > Before that, any user was supposed to go through bio_add_page(), > which would call our merge bvec function, and that should already > be sufficient to not violate queue limits. > > Previously, typically the next in line bio to be processed would > be the cloned one we passed down to our backend device, which in > case of further stacking devices (e.g. logical volume, raid1 or > similar) may again push further bios on that list. > All of which would simply be processed in turn. Could you clarify if the issue is triggered in drbd without dm/md involved? Or the issue is always triggered with dm/md over drbd? As Mike mentioned, there is one known issue with dm-snapshot. > > Now, with blk_queue_split(), the next-in-line bio would be the > next piece of the "too big" original bio, so we end up calling > several times into our ->make_request_fn() without even > dispatching one bio to the backend. If your ->make_request_fn() is called several times, each time at least you should dispatch one bio returnd from blk_queue_split(), so I don't understand why even one bio isn't dispatched out. > > With highly concurrent IO and/or very small "max-buffers" setting, > this can deadlock, where the current submit path would wait for > the completion of the bios supposedly already passed to the > backend, but which actually are not even dispatched yet, rotting > away on current->bio_list. Suppose your ->make_request_fn only handles the bio returned from blk_queue_split(), I don't understand why your driver waits for the 'rest' bios from the blk_queue_split(). Maybe drbd driver calls generic_make_request() in ->make_request_fn() and supposes the bio is always dispatched out once generic_make_request() returns, is it right? > > I could code around that in various ways inside the DRBD make_request_fn, > or always dispatch the backend bios to a different (thread or work_queue) > context, but I'd rather have the distinction between "recursively > submitted" bios and "pushed back part of originally passed in bio" as > implemented in this patch. > > Thanks, > > Lars > >> > I'm unsure if it could cause problems in md raid in some corner cases >> > or when a rebuild/scrub/... starts in a "bad" moment. >> > >> > It also may increase "stress" on any involved bio_set, >> > because it may increase the number of bios that are allocated >> > before the first of them is actually processed, causing more >> > frequent calls of punt_bios_to_rescuer(). >> > >> > Alternatively, >> > a simple one-line change to generic_make_request() would also "fix" it, >> > by processing all recursive bios in LIFO order. >> > But that may change the order in which bios reach the "physical" layer, >> > in case they are in fact split into many smaller pieces, for example in >> > a striping setup, which may have other performance implications. >> > >> > | diff --git a/block/blk-core.c b/block/blk-core.c >> > | index 2475b1c7..a5623f6 100644 >> > | --- a/block/blk-core.c >> > | +++ b/block/blk-core.c >> > | @@ -2048,7 +2048,7 @@ blk_qc_t generic_make_request(struct bio *bio) >> > | * should be added at the tail >> > | */ >> > | if (current->bio_list) { >> > | - bio_list_add(current->bio_list, bio); >> > | + bio_list_add_head(current->bio_list, bio); >> > | goto out; >> > | } > >> > diff --git a/block/blk-core.c b/block/blk-core.c >> > index 2475b1c7..f03ff4c 100644 >> > --- a/block/blk-core.c >> > +++ b/block/blk-core.c >> > @@ -2031,7 +2031,7 @@ end_io: >> > */ >> > blk_qc_t generic_make_request(struct bio *bio) >> > { >> > - struct bio_list bio_list_on_stack; >> > + struct recursion_to_iteration_bio_lists bio_lists_on_stack; >> > blk_qc_t ret = BLK_QC_T_NONE; >> > >> > if (!generic_make_request_checks(bio)) >> > @@ -2040,15 +2040,18 @@ blk_qc_t generic_make_request(struct bio *bio) > >> > - if (current->bio_list) { >> > - bio_list_add(current->bio_list, bio); >> > + if (current->bio_lists) { >> > + bio_list_add(¤t->bio_lists->recursion, bio); >> > goto out; >> > } >> > > >> > @@ -2067,8 +2070,9 @@ blk_qc_t generic_make_request(struct bio *bio) >> > * bio_list, and call into ->make_request() again. >> > */ >> > BUG_ON(bio->bi_next); >> > - bio_list_init(&bio_list_on_stack); >> > - current->bio_list = &bio_list_on_stack; >> > + bio_list_init(&bio_lists_on_stack.recursion); >> > + bio_list_init(&bio_lists_on_stack.remainder); >> > + current->bio_lists = &bio_lists_on_stack; >> > do { >> > struct request_queue *q = bdev_get_queue(bio->bi_bdev); >> > >> > @@ -2076,16 +2080,14 @@ blk_qc_t generic_make_request(struct bio *bio) >> > ret = q->make_request_fn(q, bio); >> > >> > blk_queue_exit(q); >> > - >> > - bio = bio_list_pop(current->bio_list); >> > } else { >> > - struct bio *bio_next = bio_list_pop(current->bio_list); >> > - >> > bio_io_error(bio); >> > - bio = bio_next; >> > } >> > + bio = bio_list_pop(¤t->bio_lists->recursion); >> > + if (!bio) >> > + bio = bio_list_pop(¤t->bio_lists->remainder); >> > } while (bio); >> > - current->bio_list = NULL; /* deactivate */ >> > + current->bio_lists = NULL; /* deactivate */ >> > >> > out: >> > return ret; >> > diff --git a/block/blk-merge.c b/block/blk-merge.c >> > index 2613531..8da0c22 100644 >> > --- a/block/blk-merge.c >> > +++ b/block/blk-merge.c >> > @@ -172,6 +172,7 @@ void blk_queue_split(struct request_queue *q, struct bio **bio, >> > struct bio *split, *res; >> > unsigned nsegs; >> > >> > + BUG_ON(!current->bio_lists); >> > if ((*bio)->bi_rw & REQ_DISCARD) >> > split = blk_bio_discard_split(q, *bio, bs, &nsegs); >> > else if ((*bio)->bi_rw & REQ_WRITE_SAME) >> > @@ -190,7 +191,7 @@ void blk_queue_split(struct request_queue *q, struct bio **bio, >> > >> > bio_chain(split, *bio); >> > trace_block_split(q, split, (*bio)->bi_iter.bi_sector); >> > - generic_make_request(*bio); >> > + bio_list_add_head(¤t->bio_lists->remainder, *bio); >> > *bio = split; > > -- > To unsubscribe from this list: send the line "unsubscribe linux-block" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-bcache" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html