Hi,

On 2023/11/22 1:54, Lakshmi Narasimhan Sundararajan wrote:
Good day team! I observe a deadlock condition being hit rather consistently in the following scenario.
We fixed several deadlocks related to raid456 reshape running concurrently with normal IO. Can you try the latest kernel (v6.6)?

Thanks,
Kuai
I have tried my best to keep this email short; this is the shortest I could make it without missing key details.

The storage layout is as below:

multiple disks (usually virtual) -> md array (raid0) -> LVM PV, part of a VG -> hosting a thin pool -> thin volumes - IOs active.

When the md array gets full, or on need, the array is expanded by temporarily putting the array into raid4, adding a spare drive and growing the array; once reshape is done, the array is put back to raid0. This is only to support capacity expansion; it is never a bad-disk scenario.

a/ initial array creation:
mdadm -C /dev/mdX -f -n 1 -l0 -c 1024 /dev/sdc   // nothing special, a raid0 array with provisioned drives

b/ when capacity expansion is needed:
mdadm -G -l4 /dev/mdX
mdadm -G -n () -a <new disk>   <<< this phase gets stuck
wait for reshape to complete, then put the array back to raid0:
mdadm -G -l0 /dev/mdX
resize the PV: lvm pvresize /dev/mdX
resize the thin pool on the VG.

<dmesg snip begin - a dynamic drive sdf is added to the VM for capacity expansion>
[Mon Nov 20 11:41:33 2023] vmw_pvscsi: msg type: 0x0 - MSG RING: 5/4 (5)
[Mon Nov 20 11:41:33 2023] vmw_pvscsi: msg: device added at scsi0:5:0
[Mon Nov 20 11:41:33 2023] scsi 0:0:5:0: Direct-Access     VMware   Virtual disk     2.0  PQ: 0 ANSI: 6
[Mon Nov 20 11:41:33 2023] sd 0:0:5:0: Attached scsi generic sg5 type 0
[Mon Nov 20 11:41:33 2023] sd 0:0:5:0: [sdf] 419430400 512-byte logical blocks: (215 GB/200 GiB)
[Mon Nov 20 11:41:33 2023] sd 0:0:5:0: [sdf] Write Protect is off
[Mon Nov 20 11:41:33 2023] sd 0:0:5:0: [sdf] Mode Sense: 3b 00 00 00
[Mon Nov 20 11:41:33 2023] sd 0:0:5:0: [sdf] Write cache: disabled, read cache: disabled, doesn't support DPO or FUA
[Mon Nov 20 11:41:33 2023] sd 0:0:5:0: [sdf] Attached SCSI disk
[Mon Nov 20 11:41:39 2023] md/raid:md127: device sdc operational as raid disk 0
[Mon Nov 20 11:41:39 2023] md/raid:md127: raid level 4 active with 1 out of 2 devices, algorithm 5
<snip end>

There are 2 sets of tracebacks
seen:

1. The mdadm CLI that issued the grow command to the array is stuck as below:

mdadm -G -l4 /dev/md127
mdadm -G /dev/md127 -n 3 -a /dev/sdf   <<< stuck

[Mon Nov 20 11:44:21 2023] INFO: task mdadm:65108 blocked for more than 122 seconds.
[Mon Nov 20 11:44:21 2023]       Tainted: G           OE     -------- ---  5.14.0-284.36.1.el9_2.x86_64 #1
[Mon Nov 20 11:44:21 2023] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Mon Nov 20 11:44:21 2023] task:mdadm           state:D stack:    0 pid:65108 ppid: 50296 flags:0x00004002
[Mon Nov 20 11:44:21 2023] Call Trace:
[Mon Nov 20 11:44:21 2023]  <TASK>
[Mon Nov 20 11:44:21 2023]  __schedule+0x248/0x620
[Mon Nov 20 11:44:21 2023]  ? kmem_cache_alloc+0x17d/0x300
[Mon Nov 20 11:44:21 2023]  schedule+0x2d/0x60
[Mon Nov 20 11:44:21 2023]  resize_stripes+0x623/0x710 [raid456]
[Mon Nov 20 11:44:21 2023]  ? cpuacct_percpu_seq_show+0x10/0x10
[Mon Nov 20 11:44:21 2023]  update_raid_disks+0xd4/0x140
[Mon Nov 20 11:44:21 2023]  raid_disks_store+0x5c/0x130
[Mon Nov 20 11:44:21 2023]  md_attr_store+0x80/0xf0
[Mon Nov 20 11:44:21 2023]  kernfs_fop_write_iter+0x121/0x1b0
[Mon Nov 20 11:44:21 2023]  new_sync_write+0xfc/0x190
[Mon Nov 20 11:44:21 2023]  vfs_write+0x1ef/0x280
[Mon Nov 20 11:44:21 2023]  ksys_write+0x5f/0xe0
[Mon Nov 20 11:44:21 2023]  do_syscall_64+0x59/0x90
[Mon Nov 20 11:44:21 2023]  ? exit_to_user_mode_prepare+0xb6/0x100
[Mon Nov 20 11:44:21 2023]  ? syscall_exit_to_user_mode+0x12/0x30
[Mon Nov 20 11:44:21 2023]  ? do_syscall_64+0x69/0x90
[Mon Nov 20 11:44:21 2023]  ? do_syscall_64+0x69/0x90
[Mon Nov 20 11:44:21 2023]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[Mon Nov 20 11:44:21 2023] RIP: 0033:0x7fb0ce53e917
[Mon Nov 20 11:44:21 2023] RSP: 002b:00007ffc0f551a18 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[Mon Nov 20 11:44:21 2023] RAX: ffffffffffffffda RBX: 00007ffc0f552430 RCX: 00007fb0ce53e917
[Mon Nov 20 11:44:21 2023] RDX: 0000000000000001 RSI: 00007ffc0f551b20 RDI: 0000000000000004
[Mon Nov 20 11:44:21 2023] RBP: 0000000000000004 R08: 0000000000000000 R09: 00007ffc0f5518a0
[Mon Nov 20 11:44:21 2023] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffc0f551b20
[Mon Nov 20 11:44:21 2023] R13: 00007ffc0f551b20 R14: 00005604131f4228 R15: 00007ffc0f551bf0
[Mon Nov 20 11:44:21 2023]  </TASK>

2. Multiple apps performing IO on the LVM lvols (which subsequently ends up in the md array) get stuck as below:

submit_bio -> md_handle_request -> raid5_make_request -> make_stripe_request -> raid5_get_active_stripe*

Any new operation that issues an IO to the md volume (e.g. blkid) gets stuck with the same signature as below.

[Mon Nov 20 11:44:21 2023] Call Trace:
[Mon Nov 20 11:44:21 2023]  <TASK>
[Mon Nov 20 11:44:21 2023]  __schedule+0x248/0x620
[Mon Nov 20 11:44:21 2023]  schedule+0x2d/0x60
[Mon Nov 20 11:44:21 2023]  raid5_get_active_stripe+0x25a/0x2f0 [raid456]
[Mon Nov 20 11:44:21 2023]  ? cpuacct_percpu_seq_show+0x10/0x10
[Mon Nov 20 11:44:21 2023]  make_stripe_request+0x9b/0x490 [raid456]
[Mon Nov 20 11:44:21 2023]  ? bdev_start_io_acct+0x47/0x100
[Mon Nov 20 11:44:21 2023]  raid5_make_request+0x16f/0x3e0 [raid456]
[Mon Nov 20 11:44:21 2023]  ? sched_show_numa+0xf0/0xf0
[Mon Nov 20 11:44:21 2023]  md_handle_request+0x132/0x1e0
[Mon Nov 20 11:44:21 2023]  ? bio_split_to_limits+0x51/0x90
[Mon Nov 20 11:44:21 2023]  __submit_bio+0x86/0x130
[Mon Nov 20 11:44:21 2023]  __submit_bio_noacct+0x81/0x1f0
[Mon Nov 20 11:44:21 2023]  submit_bio_wait+0x54/0xb0
[Mon Nov 20 11:44:21 2023]  __blkdev_direct_IO_simple+0x105/0x200
[Mon Nov 20 11:44:21 2023]  ? xlog_wait_on_iclog+0x176/0x180 [xfs]
[Mon Nov 20 11:44:21 2023]  ? wake_up_q+0x90/0x90
[Mon Nov 20 11:44:21 2023]  ? submit_bio_wait+0xb0/0xb0
[Mon Nov 20 11:44:21 2023]  generic_file_direct_write+0x99/0x1e0
[Mon Nov 20 11:44:21 2023]  __generic_file_write_iter+0x9e/0x1a0
[Mon Nov 20 11:44:21 2023]  blkdev_write_iter+0xe4/0x160
[Mon Nov 20 11:44:21 2023]  do_iter_readv_writev+0x124/0x190
[Mon Nov 20 11:44:21 2023]  do_iter_write+0x81/0x150
[Mon Nov 20 11:44:21 2023]  vfs_writev+0xae/0x170
[Mon Nov 20 11:44:21 2023]  ? syscall_exit_work+0x11a/0x150
[Mon Nov 20 11:44:21 2023]  ? syscall_exit_to_user_mode+0x12/0x30
[Mon Nov 20 11:44:21 2023]  ? do_syscall_64+0x69/0x90
[Mon Nov 20 11:44:21 2023]  __x64_sys_pwritev+0xb1/0x100
[Mon Nov 20 11:44:21 2023]  do_syscall_64+0x59/0x90
[Mon Nov 20 11:44:21 2023]  ? syscall_exit_to_user_mode+0x12/0x30
[Mon Nov 20 11:44:21 2023]  ? do_syscall_64+0x69/0x90
[Mon Nov 20 11:44:21 2023]  ? exit_to_user_mode_prepare+0xec/0x100
[Mon Nov 20 11:44:21 2023]  ? syscall_exit_to_user_mode+0x12/0x30
[Mon Nov 20 11:44:21 2023]  ? do_syscall_64+0x69/0x90
[Mon Nov 20 11:44:21 2023]  ? sysvec_apic_timer_interrupt+0x3c/0x90
[Mon Nov 20 11:44:21 2023]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[Mon Nov 20 11:44:21 2023] RIP: 0033:0x7fc906522638
[Mon Nov 20 11:44:21 2023] RSP: 002b:00007fc8d37fd560 EFLAGS: 00000246 ORIG_RAX: 0000000000000128
[Mon Nov 20 11:44:21 2023] RAX: ffffffffffffffda RBX: 0000000608252000 RCX: 00007fc906522638
[Mon Nov 20 11:44:21 2023] RDX: 0000000000000005 RSI: 00007fc8c84008d0 RDI: 0000000000000096
[Mon Nov 20 11:44:21 2023] RBP: 00007fc8d37fd640 R08: 0000000000000000 R09: 0000000000000000
[Mon Nov 20 11:44:21 2023] R10: 0000000608252000 R11: 0000000000000246 R12: 0000000000000096
[Mon Nov 20 11:44:21 2023] R13: 00007fc8c84008d0 R14: 00007fc800000005 R15: 0000000000000000
[Mon Nov 20 11:44:21 2023]  </TASK>

In this particular case, the md array is as below:

md127 : active raid4 sdf[2](S) sdc[0]
      209583104 blocks super 1.2 level 4, 1024k chunk, algorithm 0 [2/1] [U_]

root# cd /sys/block/md127/
root# cd md/
[root md]# cat degraded
1
[root md]# cat fail_last_dev
0
[root md]# cat raid_disks
2
[root md]# cat max_read_errors
20
[root md]# cat level
raid4
[root md]# cat layout
0
[root md]# cat stripe_cache_active
1025
[root md]# cat stripe_cache_size
1025
[root md]# cat sync_action
frozen
[root md]# ps -ef | grep mdadm
root     65108  34728  0 11:41 ?
        00:00:00 mdadm -G /dev/md127 --assume-clean -n 3 --add /dev/sdf
root    590277 486946  0 16:21 pts/2    00:00:00 grep --color=auto mdadm
[root md]#
[root md]# mdadm -D /dev/md127
/dev/md127:
           Version : 1.2
     Creation Time : Mon Nov 20 11:29:34 2023
        Raid Level : raid4
        Array Size : 209583104 (199.87 GiB 214.61 GB)
     Used Dev Size : 209583104 (199.87 GiB 214.61 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent
       Update Time : Mon Nov 20 11:41:43 2023
             State : active, degraded
    Active Devices : 1
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 1
        Chunk Size : 1024K
Consistency Policy : resync
              Name : any:pwx0
              UUID : 0cde4499:2772c2e2:707cfb4b:f88f9a5d
            Events : 16

    Number   Major   Minor   RaidDevice State
       0       8       32        0      active sync   /dev/sdc
       -       0        0        1      removed
       2       8       80        -      spare   /dev/sdf
[root md]#
[root md]# mdadm -E /dev/sdf
/dev/sdf:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 0cde4499:2772c2e2:707cfb4b:f88f9a5d
           Name : any:pwx0
  Creation Time : Mon Nov 20 11:29:34 2023
     Raid Level : raid4
   Raid Devices : 2
 Avail Dev Size : 419166208 sectors (199.87 GiB 214.61 GB)
     Array Size : 209583104 KiB (199.87 GiB 214.61 GB)
    Data Offset : 264192 sectors
   Super Offset : 8 sectors
   Unused Space : before=264040 sectors, after=0 sectors
          State : active
    Device UUID : cbe1abb1:0e9f0b05:6b1af8d2:98e202c5
    Update Time : Mon Nov 20 11:41:43 2023
  Bad Block Log : 512 entries available at offset 136 sectors
       Checksum : 8526998e - correct
         Events : 16
     Chunk Size : 1024K
    Device Role : spare
    Array State : A. ('A' == active, '.' == missing, 'R' == replacing)
[root md]#

I found several recent tickets that sound similar to this path:
1/ md array reconfiguration with active IOs causing deadlocks:
https://www.spinics.net/lists/kernel/msg4786173.html
2/ a stripe cache sizing issue? In this case I observed that the stripe cache is fully utilized. I can't find the reference to this discussion.
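On the stripe-cache angle: stripe_cache_active equal to stripe_cache_size (1025/1025 above) does suggest the cache is exhausted. A small probe like the one below can flag this condition across arrays; the helper names are hypothetical, but the sysfs files are the standard md ones, and the saturation check itself is plain arithmetic.

```shell
# Hypothetical helper: decide whether a raid456 stripe cache looks saturated.
# Pure comparison, so it can be exercised with plain numbers.
stripe_cache_saturated() {
    # $1 = stripe_cache_size, $2 = stripe_cache_active
    # Treat the cache as saturated when active >= size (as seen here: 1025/1025).
    [ "$2" -ge "$1" ]
}

# Read the standard md sysfs attributes for one array and report its state.
check_md() {
    md=$1
    size=$(cat /sys/block/"$md"/md/stripe_cache_size)
    active=$(cat /sys/block/"$md"/md/stripe_cache_active)
    if stripe_cache_saturated "$size" "$active"; then
        echo "$md: stripe cache saturated ($active/$size)"
    else
        echo "$md: stripe cache ok ($active/$size)"
    fi
}

# usage (on a live system): check_md md127
```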
I did try updating stripe_cache_size, but the write to the sysfs file got stuck and the setting did not change.

<snip begin>
[root@pwx-ocp-152-245-m2dnx-worker-0-mb9gb md]# cat stripe_cache_active
1025
[root@pwx-ocp-152-245-m2dnx-worker-0-mb9gb md]# cat stripe_cache_size
1025
[root@pwx-ocp-152-245-m2dnx-worker-0-mb9gb md]# echo 2048 > stripe_cache_size
^C
[root@pwx-ocp-152-245-m2dnx-worker-0-mb9gb md]# cat stripe_cache_size
1025
<snip end>

My setup is an OCP environment as below. I would appreciate any pointers/suggestions on understanding this further.

## setup/environment details.

root# uname -a
Linux pwx-ocp-152-245-m2dnx-worker-0-mb9gb 5.14.0-284.36.1.el9_2.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Oct 5 08:11:31 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux

root# cat /etc/os-release
NAME="Red Hat Enterprise Linux CoreOS"
ID="rhcos"
ID_LIKE="rhel fedora"
VERSION="413.92.202310210500-0"
VERSION_ID="4.13"
VARIANT="CoreOS"
VARIANT_ID=coreos
PLATFORM_ID="platform:el9"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 413.92.202310210500-0 (Plow)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:9::coreos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://docs.openshift.com/container-platform/4.13/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.13"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.13"
OPENSHIFT_VERSION="4.13"
RHEL_VERSION="9.2"
OSTREE_VERSION="413.92.202310210500-0"
root#

root# mdadm --version
mdadm - v4.2 - 2021-12-30 - 8

root# lvm version
  LVM version:     2.03.17(2) (2022-11-10)
  Library version: 1.02.187 (2022-11-10)
  Driver version:  4.47.0
  Configuration:   ./configure --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-default-dm-run-dir=/run --with-default-run-dir=/run/lvm --with-default-pid-dir=/run --with-default-locking-dir=/run/lock/lvm --with-usrlibdir=/usr/lib64 --enable-fsadm --enable-write_install --with-user= --with-group= --with-device-uid=0 --with-device-gid=6 --with-device-mode=0660 --enable-pkgconfig --enable-cmdlib --enable-dmeventd --enable-blkid_wiping --with-cluster=internal --with-udevdir=/usr/lib/udev/rules.d --enable-udev_sync --with-thin=internal --with-cache=internal --enable-lvmpolld --enable-lvmlockd-dlm --enable-lvmlockd-dlmcontrol --enable-lvmlockd-sanlock --enable-dbus-service --enable-notify-dbus --enable-dmfilemapd --with-writecache=internal --with-vdo=internal --with-vdo-format=/usr/bin/vdoformat --with-integrity=internal --with-default-use-devices-file=1 --disable-silent-rules --enable-app-machineid --enable-editline --disable-readline
root#

Best regards,
LN
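P.S. For completeness, the expansion flow from step b/ above, written as one script. The device names and disk count are taken from this report; this is a sketch of the sequence as described, not a recommendation, and `mdadm --wait` is used here to make the "wait for reshape to complete" step explicit before dropping back to raid0.

```shell
# Sketch of the raid0 -> raid4 -> grow -> raid0 capacity-expansion flow.
# Set RUN=echo for a dry run that only prints the commands.
RUN=${RUN:-}

expand_array() {
    md=$1; new_disk=$2; ndisks=$3
    $RUN mdadm -G -l 4 "$md"                            # raid0 -> raid4
    $RUN mdadm -G "$md" -n "$ndisks" --add "$new_disk"  # grow; reshape starts (the phase that hangs here)
    $RUN mdadm --wait "$md"                             # block until reshape finishes
    $RUN mdadm -G -l 0 "$md"                            # back to raid0
    $RUN pvresize "$md"                                 # let LVM see the new capacity
    # ...then resize the thin pool on the VG.
}

# Dry run:
# RUN=echo expand_array /dev/md127 /dev/sdf 3
```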