Hi,
Please refer below.
On 2024/06/03 21:25, Zdenek Kabelac wrote:
On 03. 06. 24 at 14:56, Jaco Kroon wrote:
Hi,
Thanks for the insight. Please refer below.
On 2024/05/31 14:34, Zdenek Kabelac wrote:
On 30. 05. 24 at 12:21, Jaco Kroon wrote:
Hi,
I'm kind of missing your 'deadlock' scenario in this
description.
Well, stuff blocks until the cookie is released using the dmsetup
udevcomplete command, so perhaps 'deadlock' was the wrong wording?
Lvm2 takes the VG lock - creates the LV - waits for udev till it has
finished its job - and confirms all the udev work with dmsetup
udevcomplete.
So what I understand from this is that udevcomplete ends up never
executing? Is there some way of confirming this?
udevcomplete needs someone to create the 'semaphore' for completion in
the first place.
I'm not familiar with the LVM internals or the flows of the different
processes, even though you can probably safely consider me a "semi power
user". I do have a compsci background, so I understand most of the
principles of locking and so on, but I'm clueless as to how they are
applied in the LVM environment.
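For what it's worth, when this happens again my plan is to look for
outstanding cookies/semaphores with something along these lines (this is
just my reading of dmsetup(8), so please tell me if it's the wrong
approach):

dmsetup udevcookies        # list the udev cookies dmsetup is still waiting on
ipcs -s                    # the cookies are SysV semaphores, so they should show up here too
dmsetup udevcomplete_all   # last resort: release all outstanding cookies (defeats the synchronisation)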
It's also unclear which OS you are using - Debian, Fedora, ???
Gentoo.
Version of your packages?
I thought I did provide this: the kernel version was 6.4.12 when this
happened; it is now 6.9.3.
crowsnest [12:19:47] /run/lvm # udevadm --version
254
aka systemd-utils-254.10
lvm2-2.03.22
Since this is most likely your personal build - please provide the full
output of the 'lvm version' command.
crowsnest [09:46:04] ~ # lvm version
LVM version: 2.03.22(2) (2023-08-02)
Library version: 1.02.196 (2023-08-02)
Driver version: 4.48.0
Configuration: ./configure --prefix=/usr
--build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu
--mandir=/usr/share/man --infodir=/usr/share/info --datadir=/usr/share
--sysconfdir=/etc --localstatedir=/var/lib --datarootdir=/usr/share
--disable-dependency-tracking --disable-silent-rules
--docdir=/usr/share/doc/lvm2-2.03.22-r5
--htmldir=/usr/share/doc/lvm2-2.03.22-r5/html --enable-dmfilemapd
--enable-dmeventd --enable-cmdlib --enable-fsadm --enable-lvmpolld
--with-mirrors=internal --with-snapshots=internal --with-thin=internal
--with-cache=internal --with-thin-check=/usr/sbin/thin_check
--with-cache-check=/usr/sbin/cache_check
--with-thin-dump=/usr/sbin/thin_dump
--with-cache-dump=/usr/sbin/cache_dump
--with-thin-repair=/usr/sbin/thin_repair
--with-cache-repair=/usr/sbin/cache_repair
--with-thin-restore=/usr/sbin/thin_restore
--with-cache-restore=/usr/sbin/cache_restore --with-symvers=gnu
--enable-readline --disable-selinux --enable-pkgconfig
--with-confdir=/etc --exec-prefix= --sbindir=/sbin
--with-staticdir=/sbin --libdir=/lib64 --with-usrlibdir=/usr/lib64
--with-default-dm-run-dir=/run --with-default-run-dir=/run/lvm
--with-default-locking-dir=/run/lock/lvm --with-default-pid-dir=/run
--enable-udev_rules --enable-udev_sync --with-udevdir=/lib/udev/rules.d
--disable-lvmlockd-sanlock --disable-notify-dbus --disable-app-machineid
--disable-systemd-journal --without-systemd-run --disable-valgrind-pool
--with-systemdsystemunitdir=/lib/systemd/system CLDFLAGS=-Wl,-O1
-Wl,--as-needed
For the 'udev' synchronization, the '--enable-udev_sync' configure
option is needed. So let's check which configure/build options were
used here.
And also preferably upstream udev rules.
--enable-udev_sync is there.
To the best of my knowledge the udev rules are stock; certainly neither
I nor any of my colleagues modified them. They would generally defer to
me, and I won't touch those unless I understand the implications, which
in this case I just don't.
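If it helps, I could double-check that against the package; my assumption
is that gentoolkit's checksum verification would flag any locally
modified rule files:

equery check sys-fs/lvm2              # verify installed files against recorded checksums
ls /lib/udev/rules.d/ | grep -i dm    # eyeball which dm/lvm rules are installed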
Thanks for the feedback; what you say makes perfect sense, and the
implication is that there are only a few options:
1. Something is causing the udev trigger to take longer than three
minutes, with the result that dmsetup udevcomplete never gets executed.
systemd simply kills the udev worker if it takes too long.
However, on a properly running system it would be very, very unusual to
hit these timeouts - you would need to be working with thousands of
devices....
32 physical NL-SAS drives, combined into 3 RAID6 arrays using mdadm.
These three md devices serve as PVs for LVM, in a single VG.
There are 73 LVs, just over half of which are mounted. Most of those are
thin volumes inside this pool:
crowsnest [09:54:04] ~ # lvdisplay /dev/lvm/thin_pool
--- Logical volume ---
LV Name thin_pool
VG Name lvm
LV UUID twLSE1-3ckG-WRSO-5eHc-G3fY-YS2v-as4ABC
LV Write Access read/write (activated read only)
LV Creation host, time crowsnest, 2020-02-19 12:26:00 +0200
LV Pool metadata thin_pool_tmeta
LV Pool data thin_pool_tdata
LV Status available
# open 0
LV Size 125.00 TiB
Allocated pool data 73.57%
Allocated metadata 9.05%
Current LE 32768000
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 1024
Block device 253:11
The rest are snapshots of the LVs that are mounted, so that we have a
roll-back destination in case of filesystem corruption (these snaps are
made in multiple steps: first a snap of the origin is made, this is then
fsck'ed; if that's successful it's fstrim'ed before being renamed into
the final "save" location - any previously saved copy is first lvremove'd).
I'd describe that as a few tens of devices, not a few hundred, and
certainly not thousands of devices.
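On the three-minute figure: I'm assuming that is udevd's default event
timeout of 180 seconds, in which case I could presumably stretch it (as
a band-aid rather than a fix) via /etc/udev/udev.conf or the kernel
command line:

# /etc/udev/udev.conf
event_timeout=600

# or on the kernel command line
udev.event_timeout=600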
This could potentially be due to extremely heavy disk IO, or LVM
itself freezing IO.
Well, reducing the percentage in '/proc/sys/vm/dirty_ratio' may
possibly help when your disk system is too slow and you build up very
lengthy 'sync' IO queues...
crowsnest [09:52:21] /proc/sys/vm # cat dirty_ratio
20
Happy to lower that even more if it would help.
The internet (Red Hat) states:
Starts active writeback of dirty data at this percentage of total memory
for the generator of dirty data, via pdflush. The default value is 40.
I'm assuming the default is 20 though, not 40, since I can't find any
indication that I've reconfigured this value.
It should probably remain higher than dirty_background_ratio (which is
currently 10); dirty_background_bytes is 0.
I don't see the default value for udev_log in the config.
It's explicitly set to debug now, but I'm still not seeing anything
logged to syslog. I'm now running with udevd --debug, which logs to a
ramdisk on /run. Hopefully (if/when this happens again) that may shed
some light.
There is 256GB of RAM available, so as long as the log doesn't grow
too quickly this should be fine.
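For reference, what I believe I changed (the persistent setting, plus
what I understand to be the runtime equivalent that avoids a udevd
restart):

# /etc/udev/udev.conf
udev_log=debug

# at runtime
udevadm control --log-level=debug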
A lot of RAM may possibly create a huge amount of dirty pages...
May I safely interpret this as "lower the dirty_ratio even further"?
Given values of 10 and 20, I'm assuming that pdflush will start flushing
out in the background when >~26GB of in-memory data is dirty, or when
data has been dirty for more than 5 seconds (dirty_writeback_centisecs =
500).
I don't mind lowering dirty_background_ratio as low as 1, even. But
won't the primary dirty_ratio start blocking processes from writing once
>40% of the caches/buffers is considered dirty?
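For my own sanity, the arithmetic I'm working from, plus the byte-based
knobs which I assume would cap dirty data independently of the 256GB of
RAM (my understanding is that writing a *_bytes value zeroes the
corresponding *_ratio; the values below are illustrative only):

# 256GiB * 10% (dirty_background_ratio) ~= 25.6GiB before background writeback starts
# 256GiB * 20% (dirty_ratio)            ~= 51.2GiB before writers start blocking
sysctl -w vm.dirty_background_bytes=$((4 * 1024 * 1024 * 1024))    # e.g. 4GiB
sysctl -w vm.dirty_bytes=$((16 * 1024 * 1024 * 1024))              # e.g. 16GiB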
Kind regards,
Jaco