Re: lvm2 deadlock

Hi,

Please refer below.

On 2024/06/03 21:25, Zdenek Kabelac wrote:
On 03. 06. 24 at 14:56, Jaco Kroon wrote:
Hi,

Thanks for the insight.  Please refer below.

On 2024/05/31 14:34, Zdenek Kabelac wrote:
On 30. 05. 24 at 12:21, Jaco Kroon wrote:
Hi,

I'm kind of failing to see your 'deadlock' scenario from this description.
Well, stuff blocks until the cookie is released using the dmsetup udevcomplete command, so perhaps 'deadlock' is the wrong wording?

Lvm2 takes the VG lock, creates the LV, then waits until udev has finished its job and confirmed all the udev work with dmsetup udevcomplete.

So what I understand from this is that udevcomplete ends up never executing?  Is there some way of confirming this?

udevcomplete needs someone to create the 'semaphore' for completion in the first place.


I'm not familiar with the LVM internals and the flows of the different processes, even though you can probably safely consider me a "semi power user".  I do have a compsci background, so I understand most of the principles of locking etc., but I'm clueless as to how they are applied in the LVM environment.
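
That said, to answer my own question about confirming this the next time it happens: I assume the outstanding cookie (if any) would be visible with something along these lines, the actual cookie value being whatever dmsetup reports at the time:

  dmsetup udevcookies            # list the cookies lvm2 is still waiting on
  ipcs -s                        # the cookies are SysV semaphores, so they should show up here as well
  dmsetup udevcomplete <cookie>  # what we currently end up running by hand to unblock things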




It's also unclear which OS you are using - Debian, Fedora, ...?

Gentoo.

Version of your packages ?

I thought I did provide this:

Kernel version was 6.4.12 when this happened; it is now 6.9.3.

crowsnest [12:19:47] /run/lvm # udevadm --version
254

aka systemd-utils-254.10

lvm2-2.03.22

Since this is most likely your personal build - please provide the full output of the 'lvm version' command.


crowsnest [09:46:04] ~ # lvm version
  LVM version:     2.03.22(2) (2023-08-02)
  Library version: 1.02.196 (2023-08-02)
  Driver version:  4.48.0
  Configuration:   ./configure --prefix=/usr --build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu --mandir=/usr/share/man --infodir=/usr/share/info --datadir=/usr/share --sysconfdir=/etc --localstatedir=/var/lib --datarootdir=/usr/share --disable-dependency-tracking --disable-silent-rules --docdir=/usr/share/doc/lvm2-2.03.22-r5 --htmldir=/usr/share/doc/lvm2-2.03.22-r5/html --enable-dmfilemapd --enable-dmeventd --enable-cmdlib --enable-fsadm --enable-lvmpolld --with-mirrors=internal --with-snapshots=internal --with-thin=internal --with-cache=internal --with-thin-check=/usr/sbin/thin_check --with-cache-check=/usr/sbin/cache_check --with-thin-dump=/usr/sbin/thin_dump --with-cache-dump=/usr/sbin/cache_dump --with-thin-repair=/usr/sbin/thin_repair --with-cache-repair=/usr/sbin/cache_repair --with-thin-restore=/usr/sbin/thin_restore --with-cache-restore=/usr/sbin/cache_restore --with-symvers=gnu --enable-readline --disable-selinux --enable-pkgconfig --with-confdir=/etc --exec-prefix= --sbindir=/sbin --with-staticdir=/sbin --libdir=/lib64 --with-usrlibdir=/usr/lib64 --with-default-dm-run-dir=/run --with-default-run-dir=/run/lvm --with-default-locking-dir=/run/lock/lvm --with-default-pid-dir=/run --enable-udev_rules --enable-udev_sync --with-udevdir=/lib/udev/rules.d --disable-lvmlockd-sanlock --disable-notify-dbus --disable-app-machineid --disable-systemd-journal --without-systemd-run --disable-valgrind-pool --with-systemdsystemunitdir=/lib/systemd/system CLDFLAGS=-Wl,-O1 -Wl,--as-needed



For the 'udev' synchronization, the '--enable-udev_sync' configure option is needed. So let's check which configure/build options were used here, and also preferably that the upstream udev rules are in place.


--enable-udev_sync is there.

To the best of my knowledge the udev rules are stock; certainly neither I nor any of my colleagues modified them.  They would generally defer to me, and I won't touch those unless I understand the implications, which in this case I just don't.
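
If it's useful, I assume the runtime side can be double-checked with something like the following (lvmconfig paths as I understand them):

  lvmconfig --type full activation/udev_sync
  lvmconfig --type full activation/udev_rules

both of which I'd expect to report 1 if udev synchronisation is actually in effect.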



Thanks for the feedback; what you say makes perfect sense, and the implication is that there are only a few options:

1.  Something is causing the udev trigger to take longer than three minutes, so that dmsetup udevcomplete never gets executed.

systemd simply kills the udev worker if it takes too long.

However, on a properly running system it would be very unusual to hit these timeouts - you would need to be working with thousands of devices...
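
Noted. For reference, the three minutes I mentioned presumably corresponds to systemd-udevd's default event timeout of 180 seconds; if it ever comes to that, I believe it can be raised via the kernel command line or udev.conf, e.g. (300 is just an illustrative value):

  udev.event_timeout=300    # kernel command line
  event_timeout=300         # /etc/udev/udev.conf

As for the device counts: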


32 physical NL-SAS drives, combined into 3 RAID6 arrays using mdadm.

These three md devices serve as PVs for LVM, single VG.

73 LVs, just over half of which are mounted.  Most of those are thin volumes inside the following thin pool:

crowsnest [09:54:04] ~ # lvdisplay /dev/lvm/thin_pool
  --- Logical volume ---
  LV Name                thin_pool
  VG Name                lvm
  LV UUID                twLSE1-3ckG-WRSO-5eHc-G3fY-YS2v-as4ABC
  LV Write Access        read/write (activated read only)
  LV Creation host, time crowsnest, 2020-02-19 12:26:00 +0200
  LV Pool metadata       thin_pool_tmeta
  LV Pool data           thin_pool_tdata
  LV Status              available
  # open                 0
  LV Size                125.00 TiB
  Allocated pool data    73.57%
  Allocated metadata     9.05%
  Current LE             32768000
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     1024
  Block device           253:11

The rest are snapshots of the mounted LVs, kept so that we have a roll-back destination in case of filesystem corruption.  These snaps are made in multiple steps (roughly as sketched below): first a snap of the origin is made, which is then fsck'ed; if that's successful it's fstrim'ed before being renamed into the final "save" location, with any previously saved copy lvremove'd first.
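
This is not our exact script, and the LV names here are made up for illustration, but the sequence is essentially:

  lvcreate -s -n data_check lvm/data
  lvchange -ay -K lvm/data_check        # thin snapshots skip activation by default
  fsck -f /dev/lvm/data_check           # abort the whole cycle if this fails
  mount /dev/lvm/data_check /mnt/check  # fstrim wants a mounted filesystem
  fstrim /mnt/check
  umount /mnt/check
  lvremove -f lvm/data_save             # drop the previously saved copy, if any
  lvrename lvm/data_check lvm/data_save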

I'd describe that as a few tens of devices, not a few hundred, and certainly not thousands.




This could potentially be due to extremely heavy disk IO, or LVM itself freezing IO.

Well, reducing the percentage in '/proc/sys/vm/dirty_ratio' may possibly help when your disk system is too slow and you create very lengthy 'sync' IO queues...


crowsnest [09:52:21] /proc/sys/vm # cat dirty_ratio
20

Happy to lower that even more if it would help.

Internet (Redhat) states:

Starts active writeback of dirty data at this percentage of total memory for the generator of dirty data, via pdflush. The default value is 40.

I'm assuming the default is 20 though, not 40, since I can't find that I've reconfigured this value.

It should probably remain higher than dirty_background_ratio (which is currently 10); dirty_background_bytes is 0.
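
If lowering these does turn out to be the way to go, I assume the change itself is just the usual sysctl dance, e.g. (values picked purely for illustration):

  sysctl -w vm.dirty_background_ratio=5
  sysctl -w vm.dirty_ratio=10

with the same two settings dropped into something like a (hypothetical) /etc/sysctl.d/90-writeback.conf to persist across reboots.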



I don't see a default value for udev_log in the config. It is explicitly set to debug now, but I'm still not seeing anything logged to syslog. We are also running with udevd --debug, which logs to a ramdisk on /run.  Hopefully (if/when this happens again) that may shed some light.  There is 256GB of RAM available, so as long as the log doesn't grow too quickly it should be fine.
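
(For completeness, my understanding is that the log level can also be changed on the already-running daemon and persisted via udev.conf, i.e. roughly:

  udevadm control --log-level=debug    # running daemon
  udev_log=debug                       # /etc/udev/udev.conf

which is more or less what is in place now.)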

A lot of RAM may possibly create a huge amount of dirty pages...


May I safely interpret this as "lower the dirty_ratio even further"?

Given values of 10 and 20, I'm assuming the flusher threads will start writing out in the background once >~26GB of in-memory data is dirty, or once data has been dirty for longer than the expiry threshold (the flusher itself waking every 5 seconds, dirty_writeback_centisecs = 500).

I don't mind lowering dirty_background_ratio as low as 1, even. But won't the primary dirty_ratio start blocking processes from writing once more than that percentage (currently 20%) of memory is considered dirty?
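
For concreteness, my understanding of the current thresholds on this box with 256GB of RAM (ignoring that the kernel applies the percentages to available rather than total memory) is roughly:

  dirty_background_ratio = 10  ->  ~25.6 GB before background writeback starts
  dirty_ratio            = 20  ->  ~51.2 GB before writers themselves start getting throttled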

Kind regards,
Jaco




