Re: dm-snap deadlock in pending_complete()

NeilBrown <neilb@xxxxxxxx> · Tue, 18 Aug 2015 16:29:48 +1000

On Mon, 17 Aug 2015 09:48:57 -0400 (EDT) Mikulas Patocka
<mpatocka@xxxxxxxxxx> wrote:

> 
> 
> On Thu, 13 Aug 2015, NeilBrown wrote:

> > Or we could change __split_and_process_bio to use bio_split() to split
> > the bio, then handle the first and call generic_make_request on the
> > second.  That would effectively put the second half on the end of
> > bio_list so it wouldn't be tried until all requests derived from the
> > first half have been handled.
> 
> I don't think it will fix the bug - even if you put the second half of the 
> bio at the end of bio_list, it will still wait until other entries on the 
> bio list are processed.
> 
> For example - device 1 gets a bio, splits it to bio1 and bio2, forwards 
> them to device 2 and put them on current->bio_list
> 
> Device 2 receives bio1 (popped from curretn->bio_list), splits it to bio3 
> and bio4, forwards them to device 3 and puts them at the end of 
> current->bio_list
> 
> Device 2 receives bio2 (popped from current->bio_list), waits until bio1 
> finishes, but bio1 won't ever finish because it depends on bio3 and bio4 
> that are on current->bio_list.
>

Yes, you are right.  I think I was thinking of a slightly different sort
of problem.

That is more insidious than I imagined, but maybe it leads to a fairly
straight forward solution.

Suppose we treated current->bio_list like a stack instead of a queue.

In your example, bio would be split into bio1 and bio2 that are both
pushed onto the stack - lets push them in reverse order to maintain a
similar set of dependencies.  So push bio2 then bio1.

Then bio1 is popped off, split into bio3 and bio4, and pushed back.

bio3 is popped and handled - nothing new pushed.
bio4 is popped and handled - nothing new pushed.
bio2 is popped, waits for bio1 - which will complete as nothing stops it
- and then proceed (probably gets split into bio4 and bio5).

The key idea here is that a bio for a lower-level device (lower in the
stacking order, where filesystem is at the top and hardware at the
bottom) is never placed after (beneath) a bio for an upper level device
in the stack.  So we always handle the lower-level bios first.

This makes sense as upper level devices may want to wait for lower
level devices, but never the reverse.

We probably don't want a strict stack discipline as if one driver
submits two bios, they should still be processed in that order.
Rather: each call to a make_request_fn gets a new queue to attach
bios to, and when it completes, this queue is pushed onto the stack of
other pending bios.

Something like this.

diff --git a/block/blk-core.c b/block/blk-core.c
index 627ed0c593fb..b4663eed7615 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1961,12 +1961,15 @@ void generic_make_request(struct bio *bio)
 	 * bio_list, and call into ->make_request() again.
 	 */
 	BUG_ON(bio->bi_next);
-	bio_list_init(&bio_list_on_stack);
-	current->bio_list = &bio_list_on_stack;
 	do {
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+		struct bio_list remainder;
 
+		remainder = bio_list_on_stack;
+		bio_list_init(&bio_list_on_stack);
+		current->bio_list = &bio_list_on_stack;
 		q->make_request_fn(q, bio);
+		bio_list_merge(&bio_list_on_stack, &remainder);
 
 		bio = bio_list_pop(current->bio_list);
 	} while (bio);


So before handling a bio we put all other delayed bios (which must be
for devices on the same or a higher level) in 'remainder' and initialize
bio_list_on_stack which becomes current->bio_list.
bios submitted by that make_request_fn call (all for lower-level
devices) are collected in bio_list_on_stack which is then pushed onto
the remainder (actually remainder is spliced underneath
bio_list_on_stack).

Then the top bio is handled.  If  the most recent make_request_fn
submitted any bios, this will be the first of those, otherwise it will
whatever is next from some previous call.


This isn't sufficient by itself.  It still needs dm_make_request to
split a bio of the front of the original bio, map that, then submit the
mapped bio and the remains of the split bio.

I imagine something vaguely like this:

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index ab37ae114e94..977678ff8d82 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1711,8 +1711,11 @@ static void __split_and_process_bio(struct mapped_device *md,
 	} else {
 		ci.bio = bio;
 		ci.sector_count = bio_sectors(bio);
-		while (ci.sector_count && !error)
-			error = __split_and_process_non_flush(&ci);
+		error = __split_and_process_non_flush(&ci);
+		if (!error && ci.sector_count) {
+			bio->bi_iter.bi_size = to_bytes(ci.sector_count);
+			generic_make_request(bio);
+		}
 	}
 
 	/* drop the extra reference count */


Though that is wrong at least because it doesn't create a proper linkage
between the old 'bio' and the clone created by clone_bio().
This would result in cloned part being fully processed before the
remainder is looked at at all.

NeilBrown

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel