Dne 05. 11. 24 v 13:27 wangzhiqiang (Q) napsal(a):
Hi Team,
Here's a hung-task issue that occurs in a dm-snapshot scenario,
reproduced by concurrently running vgchange --refresh and dmsetup -f remove vg-snap.
vgchange --refresh:   table_load (load snapshot)
dmsetup (1):          table_load snapshot to error
dmsetup (1):          remove snapshot
vgchange --refresh:   suspend origin/cow/real
vgchange --refresh:   table_load (snapshot already removed),
                      takes type_lock and issues io to cow in snapshot_ctr
dmsetup (2):          table_load (waits on type_lock)
[root@localhost ~]# ps aux | grep D
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1818066 0.0 0.0 0 0 ? D Nov04 0:03 [kworker/3:2+ksnaphd]
root 2972729 0.5 2.1 87256 73032 pts/1 D<L 20:17 0:00 vgchange --refresh vg
root 2972761 0.0 0.3 23464 10636 pts/1 D 20:17 0:00 dmsetup -f remove vg-snap
The snapshot has been removed after origin/cow/real were suspended during
vgchange --refresh, and the subsequent snapshot table load then takes type_lock
and issues io to the cow device in snapshot_ctr. That io is processed by a
kworker, but the cow device is suspended, which leads to a hung task in the kernel.
Do we have some way to fix it?
It's like guessing from a crystal ball what you were doing and what the state
of the system in use is.
Usually you will get the most info from 'dmsetup info -c'.
If any device there is in a suspended state, it is likely blocking the progress
of other commands which might be waiting on a device resume.
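As a sketch of that kind of check: the long-form 'dmsetup info' output prints a
State: line per device, which can be filtered for suspended devices. The sample
output embedded below is hypothetical (so the filter can be shown standalone);
on a live system you would pipe the real 'dmsetup info' output in instead:

```shell
# Pick out suspended devices from 'dmsetup info' output.
# On a live system:  dmsetup info | list_suspended
list_suspended() {
    awk '/^Name:/  { name = $2 }
         /^State:/ && $2 == "SUSPENDED" { print name }'
}

# Hypothetical sample resembling 'dmsetup info' output:
sample='Name:              vg-origin
State:             SUSPENDED
Tables present:    LIVE

Name:              vg-snap
State:             ACTIVE
Tables present:    LIVE'

printf '%s\n' "$sample" | list_suspended
```

Any name this prints is a device stuck in suspend and therefore a likely
candidate for what is blocking the other commands.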
In practice you are doing something which is not supportable in any way: you
can't interfere with the DM tables of devices which are being manipulated by
an lvm2 command (there is a good reason we use locked sections to ensure
exclusive access to those devices).
To recover from this case you would need to know where the lvm2 command was
interfered with, and reload & resume those devices that are already expected
to be there and functional. This might be a non-trivial operation if you have
not grabbed the 'dmsetup table' state prior to your interfering manipulation
command, which in practice 'replaces' any existing target with the 'error'
target. This can possibly create a combination of devices that was never
tested before, thus causing some unexpected code flow.
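A minimal sketch of the "grab the table state first" idea: before any
interfering command, save the live tables with 'dmsetup table', and later
turn that dump back into the load/resume commands needed to restore the
devices. The device names and table lines below are hypothetical examples of
the "name: start length target args" format that 'dmsetup table' prints, and
the script only generates the commands (a dry run) rather than executing them:

```shell
# Dry-run recovery helper: before interfering, save state with
#   dmsetup table > /tmp/dm-tables.txt
# and later regenerate the load/resume commands from that dump.
gen_recovery() {
    while IFS= read -r line; do
        dev=${line%%:*}      # device name before the first colon
        table=${line#*: }    # the raw table line after "name: "
        printf 'dmsetup load %s --table "%s"\n' "$dev" "$table"
        printf 'dmsetup resume %s\n' "$dev"
    done
}

# Hypothetical saved dump in 'dmsetup table' format:
dump='vg-real: 0 204800 linear 8:16 2048
vg-snap: 0 204800 snapshot 253:0 253:1 P 8'

printf '%s\n' "$dump" | gen_recovery
# Prints, e.g.:
#   dmsetup load vg-real --table "0 204800 linear 8:16 2048"
#   dmsetup resume vg-real
```

Review the generated commands against the state the lvm2 command left behind
before running any of them; blindly resuming a half-manipulated stack can make
things worse.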
It's also good to know which kernel version you are working with; over time
many DM kernel bugs were fixed, so please make sure you are testing on a
6.11 kernel.
Regards
Zdenek