Hi!

We should never reorder BIOs (at least not around barriers), as Jens Axboe confirmed. This will cause problems as soon as there are BIO users that set BIO_RW_BARRIER. This patch is an attempt to fix the problem in the device-mapper core.

Currently reordering can happen while the device is suspended, because md->deferred implements a stack instead of a queue. We must also block incoming requests while the queue is being flushed, because that could likewise lead to reordering.

I considered two solutions:

1. While the queue is being flushed, add new incoming requests to the tail of the queue.

2. While the queue is being flushed, simply block dm_request until flushing is finished.

The first has a problem: under heavy IO we might never leave dm_resume, because dm_request keeps flooding the queue. So I implemented the second. The queue is flushed while md->lock is held for writing, so dm_request blocks. Because we then cannot requeue the deferred requests via generic_make_request (that would deadlock in dm_request when trying to acquire the read lock), the flush starts processing the requests directly. The last few lines of dm_request are therefore moved into a new __dm_request.

BTW: an up_read(&md->lock); was missing in dm_request in the (!md->map) error case.

diff -Nur linux.orig/drivers/md/dm.c linux/drivers/md/dm.c
--- linux.orig/drivers/md/dm.c	2004-01-01 21:49:16.118235408 +0100
+++ linux/drivers/md/dm.c	2004-01-01 21:54:35.537676272 +0100
@@ -63,7 +63,8 @@
 	 */
 	atomic_t pending;
 	wait_queue_head_t wait;
-	struct bio *deferred;
+	struct bio *deferred_head;
+	struct bio *deferred_tail;
 
 	/*
 	 * The current mapping.
@@ -231,8 +232,13 @@
 		return 1;
 	}
 
-	bio->bi_next = md->deferred;
-	md->deferred = bio;
+	bio->bi_next = NULL;
+
+	if (md->deferred_tail)
+		md->deferred_tail->bi_next = bio;
+	else
+		md->deferred_head = bio;
+	md->deferred_tail = bio;
 
 	up_write(&md->lock);
 	return 0;	/* deferred successfully */
@@ -512,6 +518,16 @@
  *---------------------------------------------------------------*/
 
+static inline void __dm_request(struct mapped_device *md, struct bio *bio)
+{
+	if (!md->map) {
+		bio_io_error(bio, bio->bi_size);
+		return;
+	}
+
+	__split_bio(md, bio);
+}
+
 /*
  * The request function that just remaps the bio built up by
  * dm_merge_bvec.
  */
@@ -550,12 +566,7 @@
 		down_read(&md->lock);
 	}
 
-	if (!md->map) {
-		bio_io_error(bio, bio->bi_size);
-		return 0;
-	}
-
-	__split_bio(md, bio);
+	__dm_request(md, bio);
 	up_read(&md->lock);
 	return 0;
 }
@@ -788,16 +799,16 @@
 }
 
 /*
- * Requeue the deferred bios by calling generic_make_request.
+ * Process the deferred bios
  */
-static void flush_deferred_io(struct bio *c)
+static void flush_deferred_io(struct mapped_device *md, struct bio *c)
 {
 	struct bio *n;
 
 	while (c) {
 		n = c->bi_next;
 		c->bi_next = NULL;
-		generic_make_request(c);
+		__dm_request(md, c);
 		c = n;
 	}
 }
@@ -891,12 +902,14 @@
 	dm_table_resume_targets(md->map);
 	clear_bit(DMF_SUSPENDED, &md->flags);
 
+	def = md->deferred_head;
+	md->deferred_head = md->deferred_tail = NULL;
+
+	flush_deferred_io(md, def);
+
 	clear_bit(DMF_BLOCK_IO, &md->flags);
-	def = md->deferred;
-	md->deferred = NULL;
 	up_write(&md->lock);
-	flush_deferred_io(def);
 
 	blk_run_queues();
 
 	return 0;