On Wed, Nov 29, 2023 at 10:10 AM Eric Wheeler <dm-devel@xxxxxxxxxxxxxxxxxx> wrote:
>
> Hi Joe,
>
> I'm not sure who else to CC on this issue, feel free to add others.
>
> Recently we tried putting dm-integrity on NVMe's under MD RAID1 with
> dm-thin metadata (tmeta) on that raid1 mirror (Linux v6.5.7). It worked
> fine for ~12 hours and then, at ~3:30 am, when we do a
> reserve_metadata_snap and thin_dump, IOs began to hang.
>
> For reference, we have been using the same block device stack for years,
> and it is quite stable. We've been running Linux v6.5.7 for about 6 weeks
> without issue until we tried dm-integrity under dm-thin. The stack
> _without_ dm-integrity looks like this:
>
>   nvme{0,1} =>
>     raid mirror /dev/md0 =>
>       LVM PV for VG "data" =>
>         tmeta for dm-thin
>
>   /dev/sda (LSI-based Hardware RAID)
>     bcache =>
>       LVM PV for VG "data" =>
>         tdata for dm-thin "pool0" =>
>           single thin volume from the pool =>
>             drbd to another host over the network
>
> After adding dm-integrity, the stack is the same as above except that
> the RAID mirror is above dm-integrity:
>
>   nvme{0,1} =>
>     dm-integrity =>
>       raid mirror /dev/md0 =>
>         LVM PV for VG "data" =>
>           tmeta for dm-thin
>
> The first time we had the thin_dump+IO hang problem we thought that an
> underlying NVMe drive was bad, so we replaced both: we rebooted and did
> a `pvmove` of the tmeta volume to a different NVMe+md-raid1+dm-integrity
> stack on disks that were brand new, but had the same issue that night
> during reserve_metadata_snap+thin_dump.
>
> Since we had the same problem on a different set of new NVMe disks, we
> rebooted, did a `pvmove` back to the original MD RAID1 mirror stack
> _without_ dm-integrity, and everything was back to normal. Thus, I do not
> think it is a disk or tmeta corruption issue, but an obscure race or
> deadlock bug.
>
> We tried to reproduce this in a virtual environment, but so far it has not
> triggered. In about a week when we get downtime for this system, I will
> take a copy of its thin metadata to inspect further. For now, I have not
> yet been able to reproduce the problem outside our semi-production testing
> environment.
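In case it helps with reproducing this in a VM, here is a rough sketch of a
comparable minimal stack. All device names and sizes below are placeholders,
and bcache/DRBD are left out:

    # dm-integrity under both raid1 legs (default crc32c checksums)
    integritysetup format /dev/vdb
    integritysetup format /dev/vdc
    integritysetup open /dev/vdb int0
    integritysetup open /dev/vdc int1

    # md raid1 on top of the two integrity devices
    mdadm --create /dev/md0 --level=1 --raid-devices=2 \
        /dev/mapper/int0 /dev/mapper/int1

    # VG with pool data on a plain device and pool metadata on md0
    pvcreate /dev/md0 /dev/vdd
    vgcreate data /dev/md0 /dev/vdd
    lvcreate -n pool0      -L 20G data /dev/vdd    # becomes tdata
    lvcreate -n pool0_meta -L  1G data /dev/md0    # becomes tmeta on integrity+raid1
    lvconvert -y --type thin-pool --poolmetadata data/pool0_meta data/pool0
    lvcreate -T data/pool0 -V 10G -n thin1

    # run write load against /dev/data/thin1 (fio, dd, ...) and then do
    # what the 3:30 am job does: metadata snapshot + thin_dump
    dmsetup message data-pool0-tpool 0 reserve_metadata_snap
    thin_dump -f xml -m /dev/mapper/data-pool0_tmeta > /dev/null
    dmsetup message data-pool0-tpool 0 release_metadata_snap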
> Note the "12924 callbacks suppressed" message: > > [ +0.584816] raid1_end_read_request: 12924 callbacks suppressed > [ +0.000006] md/raid1:md0: dm-175: rescheduling sector 1979682216 > [ +0.020011] md/raid1:md0: dm-175: rescheduling sector 1979682216 > [ +0.019462] md/raid1:md0: dm-175: rescheduling sector 1979682216 > [ +0.018672] md/raid1:md0: dm-175: rescheduling sector 1979682216 > [ +0.006242] md/raid1:md0: dm-175: rescheduling sector 1979682216 > [ +0.002399] md/raid1:md0: dm-175: rescheduling sector 1979682216 > [ +0.000934] md/raid1:md0: dm-175: rescheduling sector 1979682216 > [ +0.000688] md/raid1:md0: dm-175: rescheduling sector 1979682216 > [ +0.001161] md/raid1:md0: dm-175: rescheduling sector 1979682216 > [ +0.006404] md/raid1:md0: dm-175: rescheduling sector 1979682216 > [ +3.594380] integrity_metadata: 13095 callbacks suppressed > [ +0.000006] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0 > [ +0.001410] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0 > [ +0.000824] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0 > [ +0.003587] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0 > [ +0.011065] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0 > [ +0.004176] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0 > [ +0.000604] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0 > [ +0.006153] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0 > [ +0.008797] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0 > [ +0.002350] device-mapper: integrity: dm-175: Checksum failed at sector 0x760395b0 > [ +0.668866] raid1_read_request: 12747 callbacks suppressed > [ +0.000005] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175 > [ +0.024708] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175 > [ +0.024441] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175 > [ +0.018321] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175 > [ +0.000850] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175 > [ +0.004479] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175 > [ +0.001225] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175 > [ +0.001024] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175 > [ +0.000741] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175 > [ +0.001094] md/raid1:md0: redirecting sector 1979682216 to other mirror: dm-175 Hi all The process is as follows and I add some logs 1. raid1_read_request submits bio to read data 2. raid1_end_read finds the read request fails and try to fix the read error 3. handle_read_error>fix_read_error tries to read data in 4K unit in another disk, write to the failed disk in 4K unit and read from the failed disk in 4K unit. fix_read_error finishes successfully. 4. handle_read_error re-read data from the failed disk. And the read request fails again. [ 1067.149829] device-mapper: integrity: dm-3: Checksum failed at sector 0x17d8 [ 1067.149842] md/raid1:md9: dm-3: rescheduling sector 3912, sectors 248 [ 1067.150437] md/raid1:md9: dm-3: fix read error 3912, sectors 248 [ 1067.150439] md/raid1:md9: dm-3: re read 3912, sectors 248 It looks like the bio size is suspicious. Because it reads successfully when bio size is 4K. Best Regards Xiao > > Here are two traces of hung tasks that the kernel reported in dmesg. 
Best Regards
Xiao

>
> Here are two traces of hung tasks that the kernel reported in dmesg. The
> decoded line numbers are at the bottom of this email.
>
> It looks like dmeventd is checking metadata transaction IDs, and then
> hangs. We believe that a snapshot was being made at the moment that IOs
> hung. This trace only appears once:
>
> INFO: task dmeventd:3718459 blocked for more than 122 seconds.
>       Tainted: G            E      6.5.7 #23
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:dmeventd        state:D stack:0 pid:3718459 ppid:1 flags:0x00000002
> Call Trace:
>  <TASK>
>  __schedule+0x2d7/0x800
>  schedule+0x5e/0xe0
>  schedule_preempt_disabled+0x15/0x30
>  rwsem_down_read_slowpath+0x2c4/0x5f0
>  ? lock_acquire+0x10b/0x120
>  down_read+0xff/0x260
>  dm_pool_get_metadata_transaction_id+0x24/0x60 [dm_thin_pool]
>  pool_status+0x133/0x630 [dm_thin_pool]
>  retrieve_status+0xc4/0x210 [dm_mod]
>  table_status+0x9c/0x150 [dm_mod]
>  ctl_ioctl+0x248/0x370 [dm_mod]
>  ? __pfx_table_status+0x10/0x10 [dm_mod]
>  dm_ctl_ioctl+0xe/0x20 [dm_mod]
>  __x64_sys_ioctl+0x99/0xe0
>  do_syscall_64+0x3f/0xa0
>  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
>
> In the next trace below, you can see that DRBD is trying to write to its
> thin volume backing device, and thin metadata is being manipulated by
> dm-thin (which is above md raid1 with dm-integrity), but hangs during
> `dm_thin_find_block`:
>
> This trace appears several times for different DRBD volumes that are
> backed by thin volumes in the same pool. Except that the thin volume
> differs, the traces are the same:
>
> INFO: task drbd_r_secServe:1072212 blocked for more than 122 seconds.
>       Tainted: G            E      6.5.7 #23
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:drbd_r_secServe state:D stack:0 pid:1072212 ppid:2 flags:0x00004000
> Call Trace:
>  <TASK>
>  __schedule+0x2d7/0x800
>  schedule+0x5e/0xe0
>  schedule_preempt_disabled+0x15/0x30
>  rwsem_down_read_slowpath+0x2c4/0x5f0
>  ? dm_thin_find_block+0x32/0x90 [dm_thin_pool]
>  down_read+0xff/0x260
>  dm_thin_find_block+0x32/0x90 [dm_thin_pool]
>  thin_bio_map.isra.0+0x169/0x3a0 [dm_thin_pool]
>  __map_bio+0x49/0x1d0 [dm_mod]
>  dm_split_and_process_bio+0x21c/0x580 [dm_mod]
>  dm_submit_bio+0x64/0x170 [dm_mod]
>  __submit_bio+0x7e/0xd0
>  __submit_bio_noacct+0x90/0x200
>  drbd_submit_peer_request+0x138/0x2d0 [drbd]
>  receive_DataRequest+0x244/0x710 [drbd]
>  ? __pfx_receive_DataRequest+0x10/0x10 [drbd]
>  drbdd+0x174/0x280 [drbd]
>  ? __pfx_drbd_thread_setup+0x10/0x10 [drbd]
>  drbd_receiver+0x51/0xd0 [drbd]
>  drbd_thread_setup+0xa3/0x1b0 [drbd]
>  kthread+0xe9/0x120
>  ? __pfx_kthread+0x10/0x10
>  ret_from_fork+0x34/0x50
>  ? __pfx_kthread+0x10/0x10
>  ret_from_fork_asm+0x1b/0x30
>  </TASK>
>
> This is the call stack of the thin_dump command; it's just waiting:
>
> # thin_dump -f xml --dev-id 671125 -m /dev/mapper/data-pool0_tmeta
> ...
>
> # cat /proc/4016388/stack
> [<0>] submit_bio_wait+0x7a/0xc0
> [<0>] __blkdev_direct_IO_simple+0x130/0x280
> [<0>] blkdev_read_iter+0xb8/0x150
> [<0>] do_iter_readv_writev+0x14e/0x160
> [<0>] do_iter_read+0x125/0x1c0
> [<0>] vfs_readv+0xa8/0xf0
> [<0>] __x64_sys_preadv+0xc0/0x100
> [<0>] do_syscall_64+0x3f/0xa0
> [<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
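Both hung tasks above are blocked in down_read() of what looks like the same
thin-pool metadata lock, so the interesting question is which task is holding
it for write at that point. If the hang recurs, it might be worth dumping all
blocked tasks rather than only the ones the hung-task detector prints, e.g.:

    # dump the stacks of all tasks in uninterruptible sleep to dmesg
    # (requires kernel.sysrq to permit it)
    echo w > /proc/sysrq-trigger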
>
> These are the line number decodes for the above traces:
>
> trace 1:
> kernel: INFO: task dmeventd:3718459 blocked for more than 122 seconds.
>  __schedule (kernel/sched/core.c:5381 kernel/sched/core.c:6710)
>  schedule (./arch/x86/include/asm/preempt.h:85 (discriminator 1) kernel/sched/core.c:6787 (discriminator 1))
>  schedule_preempt_disabled (./arch/x86/include/asm/preempt.h:80 kernel/sched/core.c:6846)
>  rwsem_down_read_slowpath (kernel/locking/rwsem.c:1073)
>  ? lock_acquire (./include/trace/events/lock.h:24 kernel/locking/lockdep.c:5724)
>  down_read (kernel/locking/rwsem.c:1250 kernel/locking/rwsem.c:1263 kernel/locking/rwsem.c:1522)
>  dm_pool_get_metadata_transaction_id (drivers/md/dm-thin-metadata.c:1323) dm_thin_pool
>  pool_status (drivers/md/dm-thin.c:3944) dm_thin_pool
>  retrieve_status (./include/linux/fortify-string.h:214 drivers/md/dm-ioctl.c:1316) dm_mod
>  table_status (drivers/md/dm-ioctl.c:1707) dm_mod
>  ctl_ioctl (drivers/md/dm-ioctl.c:2083) dm_mod
>  ? __pfx_table_status (drivers/md/dm-ioctl.c:1693) dm_mod
>  dm_ctl_ioctl (drivers/md/dm-ioctl.c:2103) dm_mod
>  __x64_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:870 fs/ioctl.c:856 fs/ioctl.c:856)
>  do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
>  entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
>
>
> trace 2:
> kernel: INFO: task drbd_r_secServe:1072212 blocked for more than 122 seconds.
>  __schedule (kernel/sched/core.c:5381 kernel/sched/core.c:6710)
>  schedule (./arch/x86/include/asm/preempt.h:85 (discriminator 1) kernel/sched/core.c:6787 (discriminator 1))
>  schedule_preempt_disabled (./arch/x86/include/asm/preempt.h:80 kernel/sched/core.c:6846)
>  rwsem_down_read_slowpath (kernel/locking/rwsem.c:1073)
>  ? dm_thin_find_block (drivers/md/dm-thin-metadata.c:1574) dm_thin_pool
>  down_read (kernel/locking/rwsem.c:1250 kernel/locking/rwsem.c:1263 kernel/locking/rwsem.c:1522)
>  dm_thin_find_block (drivers/md/dm-thin-metadata.c:1574) dm_thin_pool
>  thin_bio_map.isra.0 (drivers/md/dm-thin.c:2764) dm_thin_pool
>  __map_bio (drivers/md/dm.c:1427) dm_mod
>  dm_split_and_process_bio (drivers/md/dm.c:1729 drivers/md/dm.c:1782) dm_mod
>  dm_submit_bio (drivers/md/dm.c:1835) dm_mod
>  __submit_bio (block/blk-core.c:373 block/blk-core.c:603)
>  __submit_bio_noacct (./include/linux/bio.h:624 block/blk-core.c:653)
>  drbd_submit_peer_request (drivers/block/drbd/drbd_receiver.c:1723) drbd
>  receive_DataRequest (drivers/block/drbd/drbd_receiver.c:2993) drbd
>  ? __pfx_receive_DataRequest (drivers/block/drbd/drbd_receiver.c:2803) drbd
>  drbdd (drivers/block/drbd/drbd_receiver.c:5092) drbd
>  ? __pfx_drbd_thread_setup (drivers/block/drbd/drbd_main.c:310) drbd
>  drbd_receiver (drivers/block/drbd/drbd_receiver.c:5561) drbd
>  drbd_thread_setup (drivers/block/drbd/drbd_main.c:325) drbd
>  kthread (kernel/kthread.c:389)
>  ? __pfx_kthread (kernel/kthread.c:342)
>  ret_from_fork (arch/x86/kernel/process.c:151)
>  ? __pfx_kthread (kernel/kthread.c:342)
>  ret_from_fork_asm (arch/x86/entry/entry_64.S:312)
>
> trace 3:
> # thin_dump -f xml --dev-id 671125 -m /dev/mapper/data-pool0_tmeta
> <hung>
>
> # cat /proc/4016388/stack
>  submit_bio_wait (block/bio.c:1370 (discriminator 1))
>  __blkdev_direct_IO_simple (./include/linux/bio.h:495 block/fops.c:99)
>  blkdev_read_iter (block/fops.c:608)
>  do_iter_readv_writev (./include/linux/fs.h:1871)
>  do_iter_read (fs/read_write.c:795)
>  vfs_readv (fs/read_write.c:916)
>  __x64_sys_preadv (fs/read_write.c:1007 fs/read_write.c:1057 fs/read_write.c:1052 fs/read_write.c:1052)
>  do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
>  entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
>
>
> --
> Eric Wheeler
>