Hi,
Please refer below.
On 2024/06/03 21:25, Zdenek Kabelac wrote:
On 03. 06. 24 at 14:56, Jaco Kroon wrote:
Hi,
Thanks for the insight. Please refer below.
On 2024/05/31 14:34, Zdenek Kabelac wrote:
On 30. 05. 24 at 12:21, Jaco Kroon wrote:
Hi,
I'm kind of missing your 'deadlock' scenario in this
description.
Well, stuff blocks until the cookie is released using the dmsetup
udevcomplete command, so perhaps 'deadlock' was the wrong wording?
Lvm2 takes the VG lock - creates the LV - waits for udev till it has
finished its job - and confirms all the udev work with dmsetup
udevcomplete.
So what I understand from this is that udevcomplete ends up never
executing? Is there some way of confirming this?
udevcomplete needs someone to create the 'semaphore' for completion in
the first place.
I'm not familiar with the LVM internals or the flows of the different
processes, even though you can probably safely consider me a "semi power
user". I do have a compsci background, so I understand most of the
principles of locking and so on, but I'm clueless as to how they are
applied in the LVM environment.
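For what it's worth, when this happens again my plan is to look for
outstanding cookies/semaphores with something along these lines (this is
just my reading of dmsetup(8), so please tell me if it's the wrong
approach):

dmsetup udevcookies        # list the udev cookies dmsetup is still waiting on
ipcs -s                    # the cookies are SysV semaphores, so they should show up here too
dmsetup udevcomplete_all   # last resort: release all outstanding cookies (defeats the synchronisation)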
It's also unclear which OS you are using - Debian, Fedora, ???
Gentoo.
Version of your packages?
I thought I did provide this: the kernel version was 6.4.12 when this
happened; it is now 6.9.3.
crowsnest [12:19:47] /run/lvm # udevadm --version
254
aka systemd-utils-254.10
lvm2-2.03.22
Since this is most likely your personal build - please provide the full
output of the 'lvm version' command.
crowsnest [09:46:04] ~ # lvm version
LVM version: 2.03.22(2) (2023-08-02)
Library version: 1.02.196 (2023-08-02)
Driver version: 4.48.0
Configuration: ./configure --prefix=/usr
--build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu
--mandir=/usr/share/man --infodir=/usr/share/info --datadir=/usr/share
--sysconfdir=/etc --localstatedir=/var/lib --datarootdir=/usr/share
--disable-dependency-tracking --disable-silent-rules
--docdir=/usr/share/doc/lvm2-2.03.22-r5
--htmldir=/usr/share/doc/lvm2-2.03.22-r5/html --enable-dmfilemapd
--enable-dmeventd --enable-cmdlib --enable-fsadm --enable-lvmpolld
--with-mirrors=internal --with-snapshots=internal --with-thin=internal
--with-cache=internal --with-thin-check=/usr/sbin/thin_check
--with-cache-check=/usr/sbin/cache_check
--with-thin-dump=/usr/sbin/thin_dump
--with-cache-dump=/usr/sbin/cache_dump
--with-thin-repair=/usr/sbin/thin_repair
--with-cache-repair=/usr/sbin/cache_repair
--with-thin-restore=/usr/sbin/thin_restore
--with-cache-restore=/usr/sbin/cache_restore --with-symvers=gnu
--enable-readline --disable-selinux --enable-pkgconfig
--with-confdir=/etc --exec-prefix= --sbindir=/sbin
--with-staticdir=/sbin --libdir=/lib64 --with-usrlibdir=/usr/lib64
--with-default-dm-run-dir=/run --with-default-run-dir=/run/lvm
--with-default-locking-dir=/run/lock/lvm --with-default-pid-dir=/run
--enable-udev_rules --enable-udev_sync --with-udevdir=/lib/udev/rules.d
--disable-lvmlockd-sanlock --disable-notify-dbus --disable-app-machineid
--disable-systemd-journal --without-systemd-run --disable-valgrind-pool
--with-systemdsystemunitdir=/lib/systemd/system CLDFLAGS=-Wl,-O1
-Wl,--as-needed
For the 'udev' synchronization, the '--enable-udev_sync' configure
option is needed. So let's check which configure/build options were
used here.
And also preferably upstream udev rules.
--enable-udev_sync is there.
To the best of my knowledge the udev rules are stock; certainly neither
I nor any of my colleagues modified them. They would generally defer to
me, and I won't touch those unless I understand the implications, which
in this case I just don't.
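If it helps, I could double-check that against the package; my assumption
is that gentoolkit's checksum verification would flag any locally
modified rule files:

equery check sys-fs/lvm2              # verify installed files against recorded checksums
ls /lib/udev/rules.d/ | grep -i dm    # eyeball which dm/lvm rules are installed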
Thanks for the feedback; what you say makes perfect sense, and the
implication is that there are only a few options:
1. Something is causing the udev trigger to take longer than three
minutes, with the result that dmsetup udevcomplete never gets executed.
systemd simply kills the udev worker if it takes too long.
However, on a properly running system it would be very, very unusual to
hit these timeouts - you would need to be working with thousands of
devices....
32 physical NL-SAS drives, combined into 3 RAID6 arrays using mdadm.
These three md devices serve as PVs for LVM, in a single VG.
There are 73 LVs, just over half of which are mounted. Most of those are
thin volumes inside this pool:
crowsnest [09:54:04] ~ # lvdisplay /dev/lvm/thin_pool
--- Logical volume ---
LV Name thin_pool
VG Name lvm
LV UUID twLSE1-3ckG-WRSO-5eHc-G3fY-YS2v-as4ABC
LV Write Access read/write (activated read only)
LV Creation host, time crowsnest, 2020-02-19 12:26:00 +0200
LV Pool metadata thin_pool_tmeta
LV Pool data thin_pool_tdata
LV Status available
# open 0
LV Size 125.00 TiB
Allocated pool data 73.57%
Allocated metadata 9.05%
Current LE 32768000
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 1024
Block device 253:11
The rest are snapshots of the LVs that are mounted, so that we have a
roll-back destination in case of filesystem corruption (these snaps are
made in multiple steps: first a snap of the origin is made, this is then
fsck'ed; if that's successful it's fstrim'ed before being renamed into
the final "save" location - any previously saved copy is first lvremove'd).
I'd describe that as a few tens of devices, not a few hundred, and
certainly not thousands of devices.
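On the three-minute figure: I'm assuming that is udevd's default event
timeout of 180 seconds, in which case I could presumably stretch it (as
a band-aid rather than a fix) via /etc/udev/udev.conf or the kernel
command line:

# /etc/udev/udev.conf
event_timeout=600

# or on the kernel command line
udev.event_timeout=600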
This could potentially be due to extremely heavy disk IO, or LVM
itself freezing IO.
Well, reducing the percentage in '/proc/sys/vm/dirty_ratio' may
possibly help when your disk system is too slow and you build up very
lengthy 'sync' IO queues...
crowsnest [09:52:21] /proc/sys/vm # cat dirty_ratio
20
Happy to lower that even more if it would help.
The internet (Red Hat) states:
Starts active writeback of dirty data at this percentage of total memory
for the generator of dirty data, via pdflush. The default value is 40.
I'm assuming the default is 20 though, not 40, since I can't find any
indication that I've reconfigured this value.
It should probably remain higher than dirty_background_ratio (which is
currently 10); dirty_background_bytes is 0.
I don't see the default value for udev_log in the config.
It's explicitly set to debug now, but I'm still not seeing anything
logged to syslog. I'm now running with udevd --debug, which logs to a
ramdisk on /run. Hopefully (if/when this happens again) that may shed
some light.
There is 256GB of RAM available, so as long as the log doesn't grow
too quickly this should be fine.
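For reference, what I believe I changed (the persistent setting, plus
what I understand to be the runtime equivalent that avoids a udevd
restart):

# /etc/udev/udev.conf
udev_log=debug

# at runtime
udevadm control --log-level=debug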
A lot of RAM may possibly create a huge amount of dirty pages...
May I safely interpret this as "lower the dirty_ratio even further"?
Given values of 10 and 20, I'm assuming that pdflush will start flushing
out in the background when >~26GB of in-memory data is dirty, or when
data has been dirty for more than 5 seconds (dirty_writeback_centisecs =
500).
I don't mind lowering dirty_background_ratio as low as 1, even. But
won't the primary dirty_ratio start blocking processes from writing once
>40% of the caches/buffers is considered dirty?
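For my own sanity, the arithmetic I'm working from, plus the byte-based
knobs which I assume would cap dirty data independently of the 256GB of
RAM (my understanding is that writing a *_bytes value zeroes the
corresponding *_ratio; the values below are illustrative only):

# 256GiB * 10% (dirty_background_ratio) ~= 25.6GiB before background writeback starts
# 256GiB * 20% (dirty_ratio)            ~= 51.2GiB before writers start blocking
sysctl -w vm.dirty_background_bytes=$((4 * 1024 * 1024 * 1024))    # e.g. 4GiB
sysctl -w vm.dirty_bytes=$((16 * 1024 * 1024 * 1024))              # e.g. 16GiB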
Kind regards,
Jaco