On Wed, Feb 14 2018, Mike Snitzer wrote: > On Wed, Feb 14 2018 at 3:39pm -0500, > NeilBrown <neilb@xxxxxxxx> wrote: > >> On Wed, Feb 14 2018, Milan Broz wrote: >> >> > Hi, >> > >> > the commit (found by bisect) >> > >> > commit 18a25da84354c6bb655320de6072c00eda6eb602 >> > Author: NeilBrown <neilb@xxxxxxxx> >> > Date: Wed Sep 6 09:43:28 2017 +1000 >> > >> > dm: ensure bio submission follows a depth-first tree walk >> > >> > cause serious regression while reading from DM device. >> > >> > The reproducer is below, basically it tries to simulate failure we see in cryptsetup >> > regression test: we have DM device with error and zero target and try to read >> > "passphrase" from it (it is test for 64 bit offset error path): >> > >> > Test device: >> > # dmsetup table test >> > 0 10000000 error >> > 10000000 1000000 zero >> > >> > We try to run this operation: >> > lseek64(fd, 5119999988, SEEK_CUR); // this should seek to error target sector >> > read(fd, buf, 13); // this should fail, if we seek to error part of the device >> > >> > While on 4.15 the read properly fails: >> > Seek returned 5119999988. >> > Read returned -1. >> > >> > for 4.16 it actually succeeds returning some random data >> > (perhaps kernel memory, so this bug is even more dangerous): >> > Seek returned 5119999988. >> > Read returned 13. >> > >> > Full reproducer below: >> > >> > #define _GNU_SOURCE >> > #define _LARGEFILE64_SOURCE >> > #include <stdio.h> >> > #include <stddef.h> >> > #include <stdint.h> >> > #include <stdlib.h> >> > #include <unistd.h> >> > #include <fcntl.h> >> > #include <inttypes.h> >> > >> > int main (int argc, char *argv[]) >> > { >> > char buf[13]; >> > int fd; >> > //uint64_t offset64 = 5119999999; >> > uint64_t offset64 = 5119999988; >> > off64_t r; >> > ssize_t bytes; >> > >> > system("echo -e \'0 10000000 error\'\\\\n\'10000000 1000000 zero\' | dmsetup create test"); >> > >> > fd = open("/dev/mapper/test", O_RDONLY); >> > if (fd == -1) { >> > printf("open fail\n"); >> > return 1; >> > } >> > >> > r = lseek64(fd, offset64, SEEK_CUR); >> > printf("Seek returned %" PRIu64 ".\n", r); >> > if (r < 0) { >> > printf("seek fail\n"); >> > close(fd); >> > return 2; >> > } >> > >> > bytes = read(fd, buf, 13); >> > printf("Read returned %d.\n", (int)bytes); >> > >> > close(fd); >> > return 0; >> > } >> > >> > >> > Please let me know if you need more info to reproduce it. >> >> Thanks for the detailed report. I haven't tried to reproduce, but the >> code looks very weird. >> The patch I posted "Date: Wed, 06 Sep 2017 09:43:28 +1000" had: >> + struct bio *b = bio_clone_bioset(bio, GFP_NOIO, >> + md->queue->bio_split); >> + bio_advance(b, (bio_sectors(b) - ci.sector_count) << 9); >> + bio_chain(b, bio); >> + generic_make_request(b); >> + break; >> >> The code in Linux has: >> >> struct bio *b = bio_clone_bioset(bio, GFP_NOIO, >> md->queue->bio_split); >> ci.io->orig_bio = b; >> bio_advance(bio, (bio_sectors(bio) - ci.sector_count) << 9); >> bio_chain(b, bio); >> ret = generic_make_request(bio); >> break; >> >> So the wrong bio is sent to generic_make_request(). >> Mike: do you remember how that change happened? I think there were >> discussions following the patch, but I cannot find anything about making >> that change. > > Mikulas had this feedback: > https://www.redhat.com/archives/dm-devel/2017-November/msg00159.html > > You replied with: > https://www.redhat.com/archives/dm-devel/2017-November/msg00165.html > > Where you said "Yes, you are right something like that would be better." > > And you then provided a follow-up patch: > https://www.redhat.com/archives/dm-devel/2017-November/msg00175.html > > That we discussed and I said I'd just fold it into the original, and you > agreed and thanked me for checking with you ;) > https://www.redhat.com/archives/dm-devel/2017-November/msg00208.html > > Anyway, I'm just about to switch to Daddy Daycare mode (need to get my > daughter up from her nap, feed her dinner, etc) so: I'll circle back to > this tomorrow morning. > Thanks for the reminder - it's coming back to me now. And looking again at the code, it doesn't seem so bad. I was worried about reassigning ci.io->orig_bio after calling __split_and_process_non_flush(), but the important thing is to set it before the last dec_pending() call, and the code gets that right. I think I've found the problem though: /* done with normal IO or empty flush */ bio->bi_status = io_error; When we have chained bios, there are multiple sources for error which need to be merged into bi_status: all bios that were chained to this one, and this bio itself. __bio_chain_endio uses: if (!parent->bi_status) parent->bi_status = bio->bi_status; so the new status won't over-ride the old status (unless there are races....), but dm doesn't do that - I guess it didn't have to worry about chains before. So this: diff --git a/drivers/md/dm.c b/drivers/md/dm.c index d6de00f367ef..68136806d365 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -903,7 +903,8 @@ static void dec_pending(struct dm_io *io, blk_status_t error) queue_io(md, bio); } else { /* done with normal IO or empty flush */ - bio->bi_status = io_error; + if (io_error) + bio->bi_status = io_error; bio_endio(bio); } } should fix it. And __bio_chain_endio should probably be diff --git a/block/bio.c b/block/bio.c index e1708db48258..ad77140edc6f 100644 --- a/block/bio.c +++ b/block/bio.c @@ -312,7 +312,7 @@ static struct bio *__bio_chain_endio(struct bio *bio) { struct bio *parent = bio->bi_private; - if (!parent->bi_status) + if (bio->bi_status) parent->bi_status = bio->bi_status; bio_put(bio); return parent; to remove the chance that a race will hide a real error. Milan: could you please check that the patch fixes the dm-crypt bug? I've confirmed that it fixes your test case. When you confirm, I'll send a proper patch. I guess I should audit all code that assigns to ->bi_status and make sure nothing ever assigns 0 to it. Thanks, NeilBrown
Attachment:
signature.asc
Description: PGP signature
-- dm-devel mailing list dm-devel@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/dm-devel