Re: [PATCH 0/2] dm era: avoid deadlock when swapping table with dm-era target

Mike Snitzer <snitzer@xxxxxxxxxx> · Wed, 18 Jan 2023 11:28:19 -0500

On Wed, Jan 18 2023 at  7:29P -0500,
Nikos Tsironis <ntsironis@xxxxxxxxxxx> wrote:

> Under certain conditions, swapping a table, that includes a dm-era
> target, with a new table, causes a deadlock.
> 
> This happens when a status (STATUSTYPE_INFO) or message IOCTL is blocked
> in the suspended dm-era target.
> 
> dm-era executes all metadata operations in a worker thread, which stops
> processing requests when the target is suspended, and resumes again when
> the target is resumed.
> 
> So, running 'dmsetup status' or 'dmsetup message' for a suspended dm-era
> device blocks, until the device is resumed.
> 
> If we then load a new table to the device, while the aforementioned
> dmsetup command is blocked in dm-era, and resume the device, we
> deadlock.
> 
> The problem is that the 'dmsetup status' and 'dmsetup message' commands
> hold a reference to the live table, i.e., they hold an SRCU read lock on
> md->io_barrier, while they are blocked.
> 
> When the device is resumed, the old table is replaced with the new one
> by dm_swap_table(), which ends up calling synchronize_srcu() on
> md->io_barrier.
> 
> Since the blocked dmsetup command is holding the SRCU read lock, and the
> old table is never resumed, 'dmsetup resume' blocks too, and we have a
> deadlock.
> 
> The only way to recover is by rebooting.
> 
> Steps to reproduce:
> 
> 1. Create device with dm-era target
> 
>     # dmsetup create eradev --table "0 1048576 era /dev/datavg/erameta /dev/datavg/eradata 8192"
> 
> 2. Suspend the device
> 
>     # dmsetup suspend eradev
> 
> 3. Load new table to device, e.g., to resize the device. Note, that, we
>    must load the new table _after_ suspending the device to ensure the
>    constructor of the new target instance reads up-to-date metadata, as
>    committed by the post-suspend hook of dm-era.
> 
>     # dmsetup load eradev --table "0 2097152 era /dev/datavg/erameta /dev/datavg/eradata 8192"
> 
> 4. Device now has LIVE and INACTIVE tables
> 
>     # dmsetup info eradev
>     Name:              eradev
>     State:             SUSPENDED
>     Read Ahead:        16384
>     Tables present:    LIVE & INACTIVE
>     Open count:        0
>     Event number:      0
>     Major, minor:      252, 2
>     Number of targets: 1
> 
> 5. Retrieve the status of the device. This blocks because the device is
>    suspended. Equivalently, any 'dmsetup message' operation would block
>    too. This command holds the SRCU read lock on md->io_barrier.
> 
>     # dmsetup status eradev

I'll have a look at this flow, it seems to me we shouldn't stack up
'dmsetup status' and 'dmsetup message' commands if the table is
suspended.

I think it is unique to dm-era that you don't allow to _read_ metadata
operations while a device is suspended.  But messages really shouldn't
be sent when the device is suspended.  As-is DM is pretty silently
cutthroat about that constraint.

Resulting in deadlock is obviously cutthroat...

> 6. Resume the device. The resume operation tries to swap the old table
>    with the new one and deadlocks, because it synchronizes SRCU for the
>    old table, while the blocked 'dmsetup status' holds the SRCU read
>    lock. And the old table is never resumed again at this point.
> 
>     # dmsetup resume eradev
> 
> 7. The relevant dmesg logs are:
> 
> [ 7093.345486] dm-2: detected capacity change from 1048576 to 2097152
> [ 7250.875665] INFO: task dmsetup:1986 blocked for more than 120 seconds.
> [ 7250.875722]       Not tainted 5.16.0-rc2-release+ #16
> [ 7250.875756] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 7250.875803] task:dmsetup         state:D stack:    0 pid: 1986 ppid:  1313 flags:0x00000000
> [ 7250.875809] Call Trace:
> [ 7250.875812]  <TASK>
> [ 7250.875816]  __schedule+0x330/0x8b0
> [ 7250.875827]  schedule+0x4e/0xc0
> [ 7250.875831]  schedule_timeout+0x20f/0x2e0
> [ 7250.875836]  ? do_set_pte+0xb8/0x120
> [ 7250.875843]  ? prep_new_page+0x91/0xa0
> [ 7250.875847]  wait_for_completion+0x8c/0xf0
> [ 7250.875854]  perform_rpc+0x95/0xb0 [dm_era]
> [ 7250.875862]  in_worker1.constprop.20+0x48/0x70 [dm_era]
> [ 7250.875867]  ? era_iterate_devices+0x30/0x30 [dm_era]
> [ 7250.875872]  ? era_status+0x64/0x1e0 [dm_era]
> [ 7250.875877]  era_status+0x64/0x1e0 [dm_era]
> [ 7250.875882]  ? dm_get_live_or_inactive_table.isra.11+0x20/0x20 [dm_mod]
> [ 7250.875900]  ? __mod_node_page_state+0x82/0xc0
> [ 7250.875909]  retrieve_status+0xbc/0x1e0 [dm_mod]
> [ 7250.875921]  ? dm_get_live_or_inactive_table.isra.11+0x20/0x20 [dm_mod]
> [ 7250.875932]  table_status+0x61/0xa0 [dm_mod]
> [ 7250.875942]  ctl_ioctl+0x1b5/0x4f0 [dm_mod]
> [ 7250.875956]  dm_ctl_ioctl+0xa/0x10 [dm_mod]
> [ 7250.875966]  __x64_sys_ioctl+0x8e/0xd0
> [ 7250.875970]  do_syscall_64+0x3a/0xd0
> [ 7250.875974]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [ 7250.875980] RIP: 0033:0x7f20b7cd4017
> [ 7250.875984] RSP: 002b:00007ffd443874b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [ 7250.875988] RAX: ffffffffffffffda RBX: 000055d69d6bd0e0 RCX: 00007f20b7cd4017
> [ 7250.875991] RDX: 000055d69d6bd0e0 RSI: 00000000c138fd0c RDI: 0000000000000003
> [ 7250.875993] RBP: 000000000000001e R08: 00007f20b81df648 R09: 00007ffd44387320
> [ 7250.875996] R10: 00007f20b81deb53 R11: 0000000000000246 R12: 000055d69d6bd110
> [ 7250.875998] R13: 00007f20b81deb53 R14: 000055d69d6bd000 R15: 0000000000000000
> [ 7250.876002]  </TASK>
> [ 7250.876004] INFO: task dmsetup:1987 blocked for more than 120 seconds.
> [ 7250.876046]       Not tainted 5.16.0-rc2-release+ #16
> [ 7250.876083] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 7250.876129] task:dmsetup         state:D stack:    0 pid: 1987 ppid:  1385 flags:0x00000000
> [ 7250.876134] Call Trace:
> [ 7250.876136]  <TASK>
> [ 7250.876138]  __schedule+0x330/0x8b0
> [ 7250.876142]  schedule+0x4e/0xc0
> [ 7250.876145]  schedule_timeout+0x20f/0x2e0
> [ 7250.876149]  ? __queue_work+0x226/0x420
> [ 7250.876156]  wait_for_completion+0x8c/0xf0
> [ 7250.876160]  __synchronize_srcu.part.19+0x92/0xc0
> [ 7250.876167]  ? __bpf_trace_rcu_stall_warning+0x10/0x10
> [ 7250.876173]  ? dm_swap_table+0x2f4/0x310 [dm_mod]
> [ 7250.876185]  dm_swap_table+0x2f4/0x310 [dm_mod]
> [ 7250.876198]  ? table_load+0x360/0x360 [dm_mod]
> [ 7250.876207]  dev_suspend+0x95/0x250 [dm_mod]
> [ 7250.876217]  ctl_ioctl+0x1b5/0x4f0 [dm_mod]
> [ 7250.876231]  dm_ctl_ioctl+0xa/0x10 [dm_mod]
> [ 7250.876240]  __x64_sys_ioctl+0x8e/0xd0
> [ 7250.876244]  do_syscall_64+0x3a/0xd0
> [ 7250.876247]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [ 7250.876252] RIP: 0033:0x7f15e9254017
> [ 7250.876254] RSP: 002b:00007ffffdc59458 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [ 7250.876257] RAX: ffffffffffffffda RBX: 000055d4d99560e0 RCX: 00007f15e9254017
> [ 7250.876260] RDX: 000055d4d99560e0 RSI: 00000000c138fd06 RDI: 0000000000000003
> [ 7250.876261] RBP: 000000000000000f R08: 00007f15e975f648 R09: 00007ffffdc592c0
> [ 7250.876263] R10: 00007f15e975eb53 R11: 0000000000000246 R12: 000055d4d9956110
> [ 7250.876265] R13: 00007f15e975eb53 R14: 000055d4d9956000 R15: 0000000000000001
> [ 7250.876269]  </TASK>
> 
> Fix this by allowing metadata operations triggered by user space to run
> in the corresponding user space thread, instead of queueing them for
> execution by the dm-era worker thread.

Allowing them to run while the device is suspended is _not_ the
correct way to work-around this deadlock situation.  I think it'd be
useful to understand why your userspace is tools are allowing these
messages and status to a device that is suspended.

FYI, status shouldn't trigger write IOs if the device is suspended.
Both dm-cache and dm-thin have this in status as a guard around its
work to ensure updated accounting (as a side-effect of committing
metadata), e.g.:

                /* Commit to ensure statistics aren't out-of-date */
                if (!(status_flags & DM_STATUS_NOFLUSH_FLAG) && !dm_suspended(ti))
                        (void) commit(cache, false);

> This way, even if the device is suspended, the metadata operations will
> run and release the SRCU read lock, preventing a subsequent table reload
> (dm_swap_table()) from deadlocking.
> 
> To make this possible use a mutex to serialize access to the metadata
> between the worker thread and the user space threads.
> 
> This is a follow up on https://listman.redhat.com/archives/dm-devel/2021-December/msg00044.html.
> 
> Nikos Tsironis (2):
>   dm era: allocate in-core writesets when loading metadata
>   dm era: avoid deadlock when swapping table
> 
>  drivers/md/dm-era-target.c | 78 ++++++++++++++++++++++++++++++--------
>  1 file changed, 63 insertions(+), 15 deletions(-)
> 
> -- 
> 2.30.2
> 

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://listman.redhat.com/mailman/listinfo/dm-devel