Hi,
Thanks for the insight. Please refer below.
On 2024/05/31 14:34, Zdenek Kabelac wrote:
Dne 30. 05. 24 v 12:21 Jaco Kroon napsal(a):
Hi,
Possible lvm2 command deadlock scenario:
crowsnest [12:15:47] /run/lvm # fuser //run/lock/lvm/*
/run/lock/lvm/P_global: 17231
/run/lock/lvm/V_lvm: 16087 17231
crowsnest [12:15:54] /run/lvm # ps axf | grep -E '16087|17231'
24437 pts/1 S+ 0:00 | \_ grep --colour=auto -E 16087|17231
16087 ? S 0:00 | | \_ /sbin/lvcreate -kn -An
-s -n fsck_cerberus /dev/lvm/backup_cerberus
17231 ? S 0:00 | \_ /sbin/lvs --noheadings
--nameprefixes
crowsnest [12:17:40] /run/lvm # dmsetup udevcookies
Cookie Semid Value Last semop time Last change time
0xd4d2051 10 1 Thu May 30 02:34:05 2024 Thu May
30 02:32:22 2024
This was almost 10 hours ago.
crowsnest [12:17:44] /run/lvm # dmsetup udevcomplete 0xd4d2051
DM_COOKIE_COMPLETED=0xd4d2051
crowsnest [12:18:43] /run/lvm # ps axf | grep -E '16087|17231'
27252 pts/1 S+ 0:00 | \_ grep --colour=auto -E 16087|17231
crowsnest [12:18:45] /run/lvm #
Allows progress again.
Hi
I'm kind of missing here to see your 'deadlock' scenario from this
description.
Well, stuff blocks, until the cookie is released by using the dmset
udevcomplete command, so wrong wording perhaps?
Lvm2 takes the VG lock - creates LV - waits for udev till it's
finished with its job and confirms all the udev work with dmsetup
udevcomplete.
So what I understand from this is that udevcomplete ends up never
executing? Is there some way of confirming this?
If something 'kills' your udev worker (which may eventually happen
on some 'very very very' busy system - you may need to set up longer
timeout for systemd to kill udev worker (I believe it's just 30seconds
by default).
Well, I guess upwards of 1GB/s of IO at times qualify. No systemd, so
more likely the udev configuration. When this happens we see load
averages upwards of 50 ... so based on that I'd say busy is probably a
reasonable assessment. Normally doesn't exceed load average values
10-15 at most.
Even so ... event_timeout = 180 should be a VERY, VERY long time in
terms of udev event processing.
I'm not seeing anything for udev in /var/log/messages (which is
explicitly configured to log everything logged to syslog). But it may
also be a case of "not logging because LVM isn't processing any IO (/var
is stored on it's own LV).
If it happens your cookies blocks your lvm2 command - you can
'unblock' them with 'dmsetup udevcomplete_all' - but that's a sign
your system is already in very bad state.
It's also unclear which OS are you using - Debian, Fedora, ???
Gentoo.
Version of your packages ?
I thought I did provide this:
Kernel version was 6.4.12 when this hapened, is now 6.9.3.
crowsnest [12:19:47] /run/lvm # udevadm --version
254
aka systemd-utils-254.10
lvm2-2.03.22
Thanks for the feedback, what you say makes perfect sense, and the
implication is that there are only a few options:
1. Something is resulting in the udev trigger to take longer than three
minutes, and the dmsetup udevcomplete never being executed.
2. Something goes horribly wrong during the udev trigger (which invokes
dmsetup a few times) processing and the process crashes, never executing
dmsetup udevcomplete.
This could potentially be due to extremely heavy disk IO, or LVM itself
freezing IO.
Given the rulesets the only way I see this happening is if dmsetup
command takes very long to load - and even in the degraded (most
filesystems blocked) state it was fast to execute.
Or if udevd itself has problems accessing /sys - which I find extremely
unlikely.
I don't see the default value for udev_log from the config. Explicitly
set to debug now, but still not seeing anything logged to syslog.
Running with udevd --debug, which logs to a ramdisk on /run. Hopefully
(if/when this happens again) that may shed some light. There is 256GB
of RAM available, so as long as the log doesn't grow too quickly should
be fine.
Kind regards,
Jaco