On Wed, Jan 18 2023 at 7:29P -0500, Nikos Tsironis <ntsironis@xxxxxxxxxxx> wrote: > Under certain conditions, swapping a table, that includes a dm-era > target, with a new table, causes a deadlock. > > This happens when a status (STATUSTYPE_INFO) or message IOCTL is blocked > in the suspended dm-era target. > > dm-era executes all metadata operations in a worker thread, which stops > processing requests when the target is suspended, and resumes again when > the target is resumed. > > So, running 'dmsetup status' or 'dmsetup message' for a suspended dm-era > device blocks, until the device is resumed. > > If we then load a new table to the device, while the aforementioned > dmsetup command is blocked in dm-era, and resume the device, we > deadlock. > > The problem is that the 'dmsetup status' and 'dmsetup message' commands > hold a reference to the live table, i.e., they hold an SRCU read lock on > md->io_barrier, while they are blocked. > > When the device is resumed, the old table is replaced with the new one > by dm_swap_table(), which ends up calling synchronize_srcu() on > md->io_barrier. > > Since the blocked dmsetup command is holding the SRCU read lock, and the > old table is never resumed, 'dmsetup resume' blocks too, and we have a > deadlock. > > The only way to recover is by rebooting. > > Steps to reproduce: > > 1. Create device with dm-era target > > # dmsetup create eradev --table "0 1048576 era /dev/datavg/erameta /dev/datavg/eradata 8192" > > 2. Suspend the device > > # dmsetup suspend eradev > > 3. Load new table to device, e.g., to resize the device. Note, that, we > must load the new table _after_ suspending the device to ensure the > constructor of the new target instance reads up-to-date metadata, as > committed by the post-suspend hook of dm-era. > > # dmsetup load eradev --table "0 2097152 era /dev/datavg/erameta /dev/datavg/eradata 8192" > > 4. Device now has LIVE and INACTIVE tables > > # dmsetup info eradev > Name: eradev > State: SUSPENDED > Read Ahead: 16384 > Tables present: LIVE & INACTIVE > Open count: 0 > Event number: 0 > Major, minor: 252, 2 > Number of targets: 1 > > 5. Retrieve the status of the device. This blocks because the device is > suspended. Equivalently, any 'dmsetup message' operation would block > too. This command holds the SRCU read lock on md->io_barrier. > > # dmsetup status eradev I'll have a look at this flow, it seems to me we shouldn't stack up 'dmsetup status' and 'dmsetup message' commands if the table is suspended. I think it is unique to dm-era that you don't allow to _read_ metadata operations while a device is suspended. But messages really shouldn't be sent when the device is suspended. As-is DM is pretty silently cutthroat about that constraint. Resulting in deadlock is obviously cutthroat... > 6. Resume the device. The resume operation tries to swap the old table > with the new one and deadlocks, because it synchronizes SRCU for the > old table, while the blocked 'dmsetup status' holds the SRCU read > lock. And the old table is never resumed again at this point. > > # dmsetup resume eradev > > 7. The relevant dmesg logs are: > > [ 7093.345486] dm-2: detected capacity change from 1048576 to 2097152 > [ 7250.875665] INFO: task dmsetup:1986 blocked for more than 120 seconds. > [ 7250.875722] Not tainted 5.16.0-rc2-release+ #16 > [ 7250.875756] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [ 7250.875803] task:dmsetup state:D stack: 0 pid: 1986 ppid: 1313 flags:0x00000000 > [ 7250.875809] Call Trace: > [ 7250.875812] <TASK> > [ 7250.875816] __schedule+0x330/0x8b0 > [ 7250.875827] schedule+0x4e/0xc0 > [ 7250.875831] schedule_timeout+0x20f/0x2e0 > [ 7250.875836] ? do_set_pte+0xb8/0x120 > [ 7250.875843] ? prep_new_page+0x91/0xa0 > [ 7250.875847] wait_for_completion+0x8c/0xf0 > [ 7250.875854] perform_rpc+0x95/0xb0 [dm_era] > [ 7250.875862] in_worker1.constprop.20+0x48/0x70 [dm_era] > [ 7250.875867] ? era_iterate_devices+0x30/0x30 [dm_era] > [ 7250.875872] ? era_status+0x64/0x1e0 [dm_era] > [ 7250.875877] era_status+0x64/0x1e0 [dm_era] > [ 7250.875882] ? dm_get_live_or_inactive_table.isra.11+0x20/0x20 [dm_mod] > [ 7250.875900] ? __mod_node_page_state+0x82/0xc0 > [ 7250.875909] retrieve_status+0xbc/0x1e0 [dm_mod] > [ 7250.875921] ? dm_get_live_or_inactive_table.isra.11+0x20/0x20 [dm_mod] > [ 7250.875932] table_status+0x61/0xa0 [dm_mod] > [ 7250.875942] ctl_ioctl+0x1b5/0x4f0 [dm_mod] > [ 7250.875956] dm_ctl_ioctl+0xa/0x10 [dm_mod] > [ 7250.875966] __x64_sys_ioctl+0x8e/0xd0 > [ 7250.875970] do_syscall_64+0x3a/0xd0 > [ 7250.875974] entry_SYSCALL_64_after_hwframe+0x44/0xae > [ 7250.875980] RIP: 0033:0x7f20b7cd4017 > [ 7250.875984] RSP: 002b:00007ffd443874b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 > [ 7250.875988] RAX: ffffffffffffffda RBX: 000055d69d6bd0e0 RCX: 00007f20b7cd4017 > [ 7250.875991] RDX: 000055d69d6bd0e0 RSI: 00000000c138fd0c RDI: 0000000000000003 > [ 7250.875993] RBP: 000000000000001e R08: 00007f20b81df648 R09: 00007ffd44387320 > [ 7250.875996] R10: 00007f20b81deb53 R11: 0000000000000246 R12: 000055d69d6bd110 > [ 7250.875998] R13: 00007f20b81deb53 R14: 000055d69d6bd000 R15: 0000000000000000 > [ 7250.876002] </TASK> > [ 7250.876004] INFO: task dmsetup:1987 blocked for more than 120 seconds. > [ 7250.876046] Not tainted 5.16.0-rc2-release+ #16 > [ 7250.876083] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [ 7250.876129] task:dmsetup state:D stack: 0 pid: 1987 ppid: 1385 flags:0x00000000 > [ 7250.876134] Call Trace: > [ 7250.876136] <TASK> > [ 7250.876138] __schedule+0x330/0x8b0 > [ 7250.876142] schedule+0x4e/0xc0 > [ 7250.876145] schedule_timeout+0x20f/0x2e0 > [ 7250.876149] ? __queue_work+0x226/0x420 > [ 7250.876156] wait_for_completion+0x8c/0xf0 > [ 7250.876160] __synchronize_srcu.part.19+0x92/0xc0 > [ 7250.876167] ? __bpf_trace_rcu_stall_warning+0x10/0x10 > [ 7250.876173] ? dm_swap_table+0x2f4/0x310 [dm_mod] > [ 7250.876185] dm_swap_table+0x2f4/0x310 [dm_mod] > [ 7250.876198] ? table_load+0x360/0x360 [dm_mod] > [ 7250.876207] dev_suspend+0x95/0x250 [dm_mod] > [ 7250.876217] ctl_ioctl+0x1b5/0x4f0 [dm_mod] > [ 7250.876231] dm_ctl_ioctl+0xa/0x10 [dm_mod] > [ 7250.876240] __x64_sys_ioctl+0x8e/0xd0 > [ 7250.876244] do_syscall_64+0x3a/0xd0 > [ 7250.876247] entry_SYSCALL_64_after_hwframe+0x44/0xae > [ 7250.876252] RIP: 0033:0x7f15e9254017 > [ 7250.876254] RSP: 002b:00007ffffdc59458 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 > [ 7250.876257] RAX: ffffffffffffffda RBX: 000055d4d99560e0 RCX: 00007f15e9254017 > [ 7250.876260] RDX: 000055d4d99560e0 RSI: 00000000c138fd06 RDI: 0000000000000003 > [ 7250.876261] RBP: 000000000000000f R08: 00007f15e975f648 R09: 00007ffffdc592c0 > [ 7250.876263] R10: 00007f15e975eb53 R11: 0000000000000246 R12: 000055d4d9956110 > [ 7250.876265] R13: 00007f15e975eb53 R14: 000055d4d9956000 R15: 0000000000000001 > [ 7250.876269] </TASK> > > Fix this by allowing metadata operations triggered by user space to run > in the corresponding user space thread, instead of queueing them for > execution by the dm-era worker thread. Allowing them to run while the device is suspended is _not_ the correct way to work-around this deadlock situation. I think it'd be useful to understand why your userspace is tools are allowing these messages and status to a device that is suspended. FYI, status shouldn't trigger write IOs if the device is suspended. Both dm-cache and dm-thin have this in status as a guard around its work to ensure updated accounting (as a side-effect of committing metadata), e.g.: /* Commit to ensure statistics aren't out-of-date */ if (!(status_flags & DM_STATUS_NOFLUSH_FLAG) && !dm_suspended(ti)) (void) commit(cache, false); > This way, even if the device is suspended, the metadata operations will > run and release the SRCU read lock, preventing a subsequent table reload > (dm_swap_table()) from deadlocking. > > To make this possible use a mutex to serialize access to the metadata > between the worker thread and the user space threads. > > This is a follow up on https://listman.redhat.com/archives/dm-devel/2021-December/msg00044.html. > > Nikos Tsironis (2): > dm era: allocate in-core writesets when loading metadata > dm era: avoid deadlock when swapping table > > drivers/md/dm-era-target.c | 78 ++++++++++++++++++++++++++++++-------- > 1 file changed, 63 insertions(+), 15 deletions(-) > > -- > 2.30.2 > -- dm-devel mailing list dm-devel@xxxxxxxxxx https://listman.redhat.com/mailman/listinfo/dm-devel