Hi,

I've opened the following public BZ:

  https://bugzilla.redhat.com/show_bug.cgi?id=1607527

Feel free to add comments to that BZ if you have a Red Hat bugzilla account. Otherwise I'm happy to get as much feedback and discussion as possible going purely on the relevant lists.

I've taken ~1.5 weeks to categorize and isolate this issue, but I've reached a point of diminishing returns and could _really_ use the collective eyeballs and expertise of the community. This is by far one of the nastiest cases of corruption I've seen in a while. I'm not sure where the ultimate cause of the corruption lies (that's the money question), but it _feels_ rooted in NVMe and is unique to this particular workload, which I stumbled onto via a customer escalation and then by trying to replicate an rbd device with a more approachable one (request-based DM multipath in this case).

From the BZ's comment#0:

The following occurs with the latest v4.18-rc3 and v4.18-rc6, and also occurs with v4.15.

When corruption occurs from this test it also destroys the DOS partition table (created during step 0 below)... yeah, the corruption is _that_ bad. Almost like the corruption is temporal (recently accessed regions of the NVMe device)?

Anyway: I stumbled onto rampant corruption when using request-based DM multipath on top of an NVMe device (not exclusive to a particular drive either; it happens with NVMe devices from multiple vendors). But the corruption only occurs if the request-based multipath IO is issued to an NVMe device in parallel with other IO issued to the _same_ underlying NVMe device by the DM cache target. See the topology detailed below (at the very end of this comment): basically all 3 devices that are used to create a DM cache device need to be backed by the same NVMe device (via partitions or linear volumes). Again, using request-based DM multipath for dm-cache's "slow" device is _required_ to reproduce. Not 100% clear why really... other than that request-based DM multipath builds large IOs (due to merging).

--- Additional comment from Mike Snitzer on 2018-07-20 10:14:09 EDT ---

To reproduce this issue using device-mapper-test-suite:

0) Partition an NVMe device. The first primary partition should be at least 5GB, the second primary partition at least 48GB.
   NOTE: larger partitions (e.g. 1: 50GB, 2: >= 220GB) can be used to reproduce the XFS corruption much more quickly.

1) Create a request-based multipath device on top of an NVMe device, e.g.:

   #!/bin/sh
   modprobe dm-service-time
   DEVICE=/dev/nvme1n1p2
   SIZE=`blockdev --getsz $DEVICE`
   echo "0 $SIZE multipath 2 queue_mode mq 0 1 1 service-time 0 1 2 $DEVICE 1000 1" | dmsetup create nvme_mpath

   # Just a note for how to fail/reinstate the path:
   # dmsetup message nvme_mpath 0 "fail_path $DEVICE"
   # dmsetup message nvme_mpath 0 "reinstate_path $DEVICE"

2) Check out device-mapper-test-suite from my github repo:

   git clone git://github.com/snitm/device-mapper-test-suite.git
   cd device-mapper-test-suite
   git checkout -b devel origin/devel

3) Follow device-mapper-test-suite's README.md to get it all set up.

4) Configure /root/.dmtest/config with something like:

   profile :nvme_shared do
     metadata_dev '/dev/nvme1n1p1'
     #data_dev '/dev/nvme1n1p2'
     data_dev '/dev/mapper/nvme_mpath'
   end

   default_profile :nvme_shared

   NOTE: the configured 'metadata_dev' gets carved up by device-mapper-test-suite to provide both dm-cache's metadata device and its "fast" data device. The configured 'data_dev' is used for dm-cache's "slow" data device.
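   To make that carving a bit more concrete, the tables below are only a
   sketch of roughly what the suite ends up creating (device names,
   offsets and sizes are illustrative, taken from the topology at the
   end of this mail; they are not the suite's exact commands):

   # Sketch only: approximate dm-linear carving of nvme1n1p1 into the
   # cache metadata and "fast" data devices, plus the "slow" device on
   # top of the multipath device from step 1.  Offsets are illustrative.

   # 40M (81920 sectors) cache metadata volume at the start of nvme1n1p1
   echo "0 81920 linear /dev/nvme1n1p1 0" | dmsetup create sketch_cache_md

   # 4G (8388608 sectors) "fast" data volume right after it
   echo "0 8388608 linear /dev/nvme1n1p1 81920" | dmsetup create sketch_cache_fast

   # 48G (100663296 sectors) "slow" data volume on the request-based
   # multipath device
   echo "0 100663296 linear /dev/mapper/nvme_mpath 0" | dmsetup create sketch_cache_slow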
5) Run the test:

   # tail -f /var/log/messages &
   # time dmtest run --suite cache -n /split_large_file/

6) If the multipath device failed its lone NVMe path, you'll need to reinstate the path before the next iteration of the test, e.g. (from step 1 above):

   dmsetup message nvme_mpath 0 "reinstate_path $DEVICE"

--- Additional comment from Mike Snitzer on 2018-07-20 12:02:45 EDT ---

(In reply to Mike Snitzer from comment #6)
> So, seems pretty clear something is still wrong with request-based DM
> multipath on top of NVMe... sadly we don't have any negative check in
> blk-core, NVMe or elsewhere to offer any clue :(

Building on this comment:

"Anyway, the fact that I'm getting this corruption on multiple different NVMe drives: I am definitely concerned that this BZ is due to a bug somewhere in NVMe core (or block core code that is specific to NVMe)."

I'm left thinking that request-based DM multipath is somehow causing NVMe's SG lists or other infrastructure to be "wrong", and that this results in corruption.

I get corruption on dm-cache's metadata device (which is theoretically unrelated, as it is a separate device from the "slow" dm-cache data device) if the dm-cache "slow" data device is backed by request-based dm-multipath on top of NVMe (a partition from the _same_ NVMe device that is used by the dm-cache metadata device).

Basically I'm back to thinking NVMe is corrupting the data due to the IO pattern, or the nature of the cloned requests dm-multipath is issuing, and that this is causing corruption on other NVMe partitions of the same parent NVMe device. Certainly that is a concerning hypothesis, but I'm not seeing much else that would explain this weird corruption. If I don't use the same NVMe device (with multiple partitions) for _all_ 3 sub-devices that dm-cache needs, I don't see the corruption.

It is almost like the mix of IO issued to dm-cache's metadata device (on nvme1n1p1 via a dm-linear volume) and "fast" data device (also on nvme1n1p1 via a dm-linear volume), in conjunction with the IO issued by request-based DM multipath to NVMe for the "slow" device (on nvme1n1p2), is triggering NVMe to respond negatively.
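To help check that hypothesis, one thing worth capturing is how large the merged/cloned requests handed to NVMe actually get. A quick diagnostic sketch (only a sketch; dm-2 is whatever minor nvme_mpath was assigned, see the 253:2 entry in the lsblk output below) is to compare the queue limits the multipath device advertises against the underlying NVMe device, and to trace what reaches the NVMe queue while the test runs:

   # Compare queue limits advertised by the raw NVMe device and by the
   # request-based multipath device stacked on top of it (dm-2 here,
   # per the 253:2 minor in the lsblk output below).
   for f in max_sectors_kb max_segments max_segment_size; do
       echo "$f: nvme1n1=$(cat /sys/block/nvme1n1/queue/$f)" \
            "nvme_mpath=$(cat /sys/block/dm-2/queue/$f)"
   done

   # Trace the IO that actually hits the NVMe device during the test to
   # inspect the sizes/offsets of the cloned requests:
   blktrace -d /dev/nvme1n1 -o - | blkparse -i -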
But this same observation can be made on completely different hardware using 2 totally different NVMe devices:

testbed1: Intel Corporation Optane SSD 900P Series (2700)
testbed2: Samsung Electronics Co Ltd NVMe SSD Controller 171X (rev 03)

Which is why it feels like some bug in Linux (be it dm-rq.c, blk-core.c, blk-merge.c or the common NVMe driver).

Topology before starting the device-mapper-test-suite test:

# lsblk /dev/nvme1n1
NAME            MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
nvme1n1         259:1    0 745.2G  0 disk
├─nvme1n1p2     259:5    0 695.2G  0 part
│ └─nvme_mpath  253:2    0 695.2G  0 dm
└─nvme1n1p1     259:4    0    50G  0 part

Topology during the device-mapper-test-suite test:

# lsblk /dev/nvme1n1
NAME                      MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
nvme1n1                   259:1    0 745.2G  0 disk
├─nvme1n1p2               259:5    0 695.2G  0 part
│ └─nvme_mpath            253:2    0 695.2G  0 dm
│   └─test-dev-458572     253:5    0    48G  0 dm
│     └─test-dev-613083   253:6    0    48G  0 dm   /root/snitm/git/device-mapper-test-suite/kernel_builds
└─nvme1n1p1               259:4    0    50G  0 part
  ├─test-dev-126378       253:4    0     4G  0 dm
  │ └─test-dev-613083     253:6    0    48G  0 dm   /root/snitm/git/device-mapper-test-suite/kernel_builds
  └─test-dev-652491       253:3    0    40M  0 dm
    └─test-dev-613083     253:6    0    48G  0 dm   /root/snitm/git/device-mapper-test-suite/kernel_builds

Pruning that tree a bit (removing the dm-cache device, 253:6) for clarity:

# lsblk /dev/nvme1n1
NAME                  MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
nvme1n1               259:1    0 745.2G  0 disk
├─nvme1n1p2           259:5    0 695.2G  0 part
│ └─nvme_mpath        253:2    0 695.2G  0 dm
│   └─test-dev-458572 253:5    0    48G  0 dm
└─nvme1n1p1           259:4    0    50G  0 part
  ├─test-dev-126378   253:4    0     4G  0 dm
  └─test-dev-652491   253:3    0    40M  0 dm

The 40M device is the dm-cache "metadata" device.
The 4G device is the dm-cache "fast" data device.
The 48G device is the dm-cache "slow" data device.
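For completeness, mapping the test-dev-* names above back to their dm-cache roles can be done with dmsetup while the test is running (a sketch; the test-dev-* names change from run to run, so treat these as examples and resolve them each time):

   # Show the whole stack and confirm which partition backs each device.
   dmsetup ls --tree

   # Per-device tables (names taken from the lsblk output above):
   dmsetup table test-dev-652491   # 40M: cache metadata, linear on nvme1n1p1
   dmsetup table test-dev-126378   # 4G:  cache "fast" data, linear on nvme1n1p1
   dmsetup table test-dev-458572   # 48G: cache "slow" data, linear on nvme_mpath
   dmsetup table test-dev-613083   # 48G: the dm-cache device itself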