On Mon, Sep 04 2017, Mikulas Patocka wrote:

> On Mon, 4 Sep 2017, NeilBrown wrote:
>
>> On Fri, Sep 01 2017, Mikulas Patocka wrote:
>>
>> > On Fri, 1 Sep 2017, NeilBrown wrote:
>> >
>> >> On Thu, Aug 31 2017, Mikulas Patocka wrote:
>> >>
>> >> >>
>> >> >> Note that only current->bio_list[0] is offloaded. current->bio_list[1]
>> >> >> contains bios that were scheduled *before* the current one started, so
>> >> >
>> >> > These bios need to be offloaded too, otherwise you re-introduce this
>> >> > deadlock: https://www.redhat.com/archives/dm-devel/2014-May/msg00089.html
>> >>
>> >> Thanks for the link.
>> >> In the example, the bio that is stuck was created in step 4. The call
>> >> to generic_make_request() will have placed it on current->bio_list[0].
>> >> The top-level generic_make_request() call by Process A is still running,
>> >> so nothing will have moved the bio to ->bio_list[1]. That only happens
>> >> after the ->make_request_fn completes, which must be after step 7.
>> >> So the problem bio is on ->bio_list[0], and the code in my patch will
>> >> pass it to a workqueue for handling.
>> >>
>> >> So I don't think the deadlock would be reintroduced. Can you see
>> >> something that I am missing?
>> >>
>> >> Thanks,
>> >> NeilBrown
>> >
>> > Offloading only current->bio_list[0] will work in the simple case described
>> > above, but not in the general case.
>> >
>> > For example, suppose that we have a dm device where the first part is
>> > linear and the second part is a snapshot.
>> >
>> > * We receive bio A that crosses the linear/snapshot boundary.
>> > * DM core creates bio B, submits it to the linear target and adds it to
>> >   current->bio_list[0].
>> > * DM core creates bio C, submits it to the snapshot target; the snapshot
>> >   target calls track_chunk for this bio and appends the bio to
>> >   current->bio_list[0].
>> > * Now we return to generic_make_request().
>> > * We pop bio B from current->bio_list[0].
>> > * We execute bio_list_on_stack[1] = bio_list_on_stack[0], which moves bio C
>> >   to bio_list_on_stack[1] - and now we lose any possibility of offloading
>> >   bio C to the rescue thread.
>> > * The kcopyd thread for the snapshot takes the snapshot lock and waits for
>> >   bio C to finish.
>>
>> Thanks for the explanation.
>> I cannot find this last step in the code. "Waits for bio C to finish"
>> is presumably the call to __check_for_conflicting_io(). This is
>> called from kcopyd in merge_callback() -> snapshot_merge_next_chunks().
>> What lock is held at that time?
>
> The function pending_complete() is called from the kcopyd callback. It takes
> "down_write(&s->lock)" and calls __check_for_conflicting_io(), which waits
> for I/Os to finish.
>
>> > * We process bio B - and if processing bio B reaches something that takes
>> >   the snapshot lock (for example, an origin target for the snapshot), a
>> >   deadlock will happen. The deadlock could be avoided by offloading bio C
>> >   to the rescue thread, but bio C is already on bio_list_on_stack[1] and
>> >   so it won't be offloaded.
>>
>> I think the core issue behind this deadlock is that the same volume
>> can appear twice in the top-level device, in different regions. An
>> outstanding request to one region can then interfere with a new request
>> to a different region. That seems like a reasonable thing to do.
>> I cannot immediately see a generic way to handle this case that I am
>> happy with. I'll keep hunting.
>
> You can have a snapshot and an origin for the same device in the same
> tree - it is not common, but it is possible.
>
> Offloading bios queued on current->bio_list avoids the deadlock, but your
> patch breaks it by offloading only the first list and not the second.
>

Thanks for the extra explanation. I've thought some more about this and
I can see a way forward that I am comfortable with. Please let me know
what you think.

I haven't tested this patch yet, but I believe that applying it first
will clear the way for my other patch to work without reintroducing
deadlocks.

Thanks,
NeilBrown

From: NeilBrown <neilb@xxxxxxxx>
Date: Wed, 6 Sep 2017 09:14:52 +1000
Subject: [PATCH] dm: ensure bio submission follows a depth-first tree walk

A dm device can, in general, represent a tree of targets, each of which
handles a sub-range of the range of blocks handled by the parent.

The bio sequencing managed by generic_make_request() requires that bios
are generated and handled in a depth-first manner.  Each call to a
make_request_fn() may submit bios to a single member device, and may
submit bios for a reduced region of the same device as the
make_request_fn.

In particular, any bios submitted to member devices must be expected to
be processed in order, so a later one must never wait for an earlier
one.  This ordering is usually achieved by using bio_split() to reduce a
bio to a size that can be completely handled by one target, and
resubmitting the remainder to the originating device.  blk_queue_split()
shows the canonical approach.

dm doesn't follow this approach, largely because it has needed to split
bios since long before bio_split() was available.  It currently can
submit bios to separate targets within the one dm_make_request() call.
Dependencies between these targets, as can happen with dm-snap, can
cause deadlocks if either bio gets stuck behind the other in the queues
managed by generic_make_request().  This requires the 'rescue'
functionality provided by dm_offload*.
Some of this requirement can be removed by changing the order of bio
submission to follow the canonical approach.  That is, if dm finds that
it needs to split a bio, the remainder should be sent to
generic_make_request() rather than being handled immediately.  This
delays the handling until the first part is completely processed, so the
deadlock problems do not occur.

__split_and_process_bio() can be called both from dm_make_request() and
from dm_wq_work().  When called from dm_wq_work() the current approach
is perfectly satisfactory as each bio will be processed immediately.
When called from dm_make_request(), current->bio_list will be non-NULL,
and in this case it is best to create a separate "clone" bio for the
remainder.

To provide the clone bio when splitting, we use q->bio_split.  This was
previously being freed to avoid having excess rescuer threads.  As
bio_split bio sets no longer create rescuer threads, there is little
cost and much gain from leaving the ->bio_split bio set in place.

Signed-off-by: NeilBrown <neilb@xxxxxxxx>
---
 drivers/md/dm.c | 27 +++++++++++++++++----------
 1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index d669fddd9290..81395f7550c0 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1007,7 +1007,7 @@ static void dm_dax_flush(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
 
 /*
  * A target may call dm_accept_partial_bio only from the map routine.  It is
- * allowed for all bio types except REQ_PREFLUSH.
+ * allowed for all bio types except REQ_PREFLUSH and REQ_OP_ZONE_RESET.
 *
 * dm_accept_partial_bio informs the dm that the target only wants to process
 * additional n_sectors sectors of the bio and the rest of the data should be
@@ -1508,8 +1508,21 @@ static void __split_and_process_bio(struct mapped_device *md,
 	} else {
 		ci.bio = bio;
 		ci.sector_count = bio_sectors(bio);
-		while (ci.sector_count && !error)
+		while (ci.sector_count && !error) {
 			error = __split_and_process_non_flush(&ci);
+			if (current->bio_list && ci.sector_count && !error) {
+				/* Remainder must be passed to generic_make_request()
+				 * so that it gets handled *after* bios already submitted
+				 * have been completely processed.
+				 */
+				struct bio *b = bio_clone_bioset(bio, GFP_NOIO,
+								 md->queue->bio_split);
+				bio_advance(b, (bio_sectors(b) - ci.sector_count) << 9);
+				bio_chain(b, bio);
+				generic_make_request(b);
+				break;
+			}
+		}
 	}
 
 	/* drop the extra reference count */
@@ -1520,8 +1533,8 @@ static void __split_and_process_bio(struct mapped_device *md,
  *---------------------------------------------------------------*/
 
 /*
- * The request function that just remaps the bio built up by
- * dm_merge_bvec.
+ * The request function that remaps the bio to one target and
+ * splits off any remainder.
  */
 static blk_qc_t dm_make_request(struct request_queue *q, struct bio *bio)
 {
@@ -2056,12 +2069,6 @@ int dm_setup_md_queue(struct mapped_device *md, struct dm_table *t)
 	case DM_TYPE_DAX_BIO_BASED:
 		dm_init_normal_md_queue(md);
 		blk_queue_make_request(md->queue, dm_make_request);
-		/*
-		 * DM handles splitting bios as needed. Free the bio_split bioset
-		 * since it won't be used (saves 1 process per bio-based DM device).
-		 */
-		bioset_free(md->queue->bio_split);
-		md->queue->bio_split = NULL;
 
 		if (type == DM_TYPE_DAX_BIO_BASED)
 			queue_flag_set_unlocked(QUEUE_FLAG_DAX, md->queue);
-- 
2.14.0
-- 
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel