Re: lvmpolld causes high cpu load issue

Zdenek Kabelac <zdenek.kabelac@xxxxxxxxx> · Wed, 17 Aug 2022 11:46:16 +0200

On 17. 08. 22 at 10:43, Heming Zhao wrote:
On Wed, Aug 17, 2022 at 10:06:35AM +0200, Zdenek Kabelac wrote:
On 17. 08. 22 at 4:03, Heming Zhao wrote:
On Tue, Aug 16, 2022 at 12:26:51PM +0200, Zdenek Kabelac wrote:
On 16. 08. 22 at 12:08, Heming Zhao wrote:
Oops, very sorry, the subject is wrong: pvmove triggers high CPU load, not an
IO performance problem.

On Tue, Aug 16, 2022 at 11:38:52AM +0200, Zdenek Kabelac wrote:
On 16. 08. 22 at 11:28, Heming Zhao wrote:
Hello maintainers & list,

Here is the story:
A SUSE customer hit an lvmpolld issue that causes a dramatic decrease in IO
performance.

How to trigger:
When the machine is connected to a large number of LUNs (e.g. 80~200) and pvmove
is run (e.g. moving a single disk to a new one with a command like: pvmove disk1
disk2), the system suffers high CPU load. With only ~10 LUNs connected, the
performance is fine.

We found two workarounds:
1. set 'activation/polling_interval=120' in lvm.conf.
2. write a special udev rule that makes udev ignore the events for mpath devices:
       echo 'ENV{DM_UUID}=="mpath-*", OPTIONS+="nowatch"' >\
        /etc/udev/rules.d/90-dm-watch.rules

Applying either one of the two makes the performance issue disappear.

** the root cause **

lvmpolld periodically requests status information to update the pvmove progress.

On every polling_interval, lvm2 updates the VG metadata. The update calls
sys_close, which triggers a systemd-udevd IN_CLOSE_WRITE event, e.g.:
      2022-<time>-xxx <hostname> systemd-udevd[pid]: dm-179: Inotify event: 8 for /dev/dm-179
(8 is IN_CLOSE_WRITE.)

The devices underlying these VGs are multipath devices. So when lvm2 updates the
metadata, even if pvmove has written only a little data, the sys_close action
triggers udev's "watch" mechanism, which is notified every time a process has
written to the device and closed it. This causes frequent, pointless
re-evaluation of the udev rules for these devices.
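
As a small illustration of the "8 is IN_CLOSE_WRITE" note above, here is a
minimal Python sketch that decodes an inotify mask into flag names. The bit
values are the ones defined in <sys/inotify.h>; the helper name
decode_inotify_mask is made up for this example and is not part of udev or lvm2.

```python
# Event bits as defined in <sys/inotify.h> on Linux.
INOTIFY_FLAGS = {
    0x001: "IN_ACCESS",
    0x002: "IN_MODIFY",
    0x004: "IN_ATTRIB",
    0x008: "IN_CLOSE_WRITE",
    0x010: "IN_CLOSE_NOWRITE",
    0x020: "IN_OPEN",
    0x040: "IN_MOVED_FROM",
    0x080: "IN_MOVED_TO",
    0x100: "IN_CREATE",
    0x200: "IN_DELETE",
    0x400: "IN_DELETE_SELF",
    0x800: "IN_MOVE_SELF",
}


def decode_inotify_mask(mask):
    """Return the names of all known inotify event bits set in mask."""
    return [name for bit, name in sorted(INOTIFY_FLAGS.items()) if mask & bit]


# The udevd log line above reports "Inotify event: 8".
print(decode_inotify_mask(8))
```

A mask of 8 has only the IN_CLOSE_WRITE bit set, i.e. a file descriptor that
was opened for writing has been closed.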

My question: do the LVM2 maintainers have any idea how to fix this bug?

In my view, could lvm2 drop the VGs' device fds until pvmove finishes?

Hi

Please provide more info about the lvm2 metadata and also some 'lvs -avvvvv'
trace so we can get a better picture of the layout - also the versions of
lvm2, systemd and kernel in use.

pvmove progresses by mirroring each segment of an LV - so if there are
a lot of segments, each such update may trigger a udev watch rule
event.

But ATM I can hardly imagine how this could cause some 'dramatic'
performance decrease - maybe there is something wrong with the udev rules on
the system?

What is the actual impact?

Note - pvmove was never designed as a high-performance operation (in fact it
tries not to eat all the disk bandwidth).

Regards
Zdenek

My mistake, I will state it here again:
The subject is wrong: pvmove triggers high CPU load, not an IO performance problem.

There is no IO performance issue.

When the system is connected to 80~200 LUNs, the CPU load increases by 15~20 and
the CPU usage by ~20%, which corresponds to about 5-6 cores and at times
leads to the cores being fully utilized.
In other words: a single pvmove process costs 5-6 (sometimes 10) cores of
utilization. This is abnormal and unacceptable.

lvm2 is 2.03.05, kernel is 5.3, systemd is v246.

BTW:
I changed this mail's subject from: lvmpolld causes IO performance issue
to: lvmpolld causes high cpu load issue
Please use this mail for further discussion.

Hi

Could you please retest with a recent version of lvm2? There have certainly
been some improvements in scanning - older releases might have had higher
CPU usage with a larger set of devices.

Regards

Zdenek

The highest lvm2 version in SUSE products is lvm2-2.03.15. Does this
version include those improvements?
Would you mind pointing out which commits are related to the improvements?
I don't have a reproducer environment, so I need a few more details before
asking the customer to try a new version.

Please try to reproduce your customer's problem and see if the newer version
solves the issue. Otherwise we could waste hours on theoretical
discussions about what might or might not have helped with this problem. Having
a reproducer is the starting point for fixing it, if the problem is still there.

Here is one commit that may possibly affect CPU load:

d2522f4a05aa027bcc911ecb832450bc19b7fb57

Regards

Zdenek

I gave a short explanation of the root cause in my previous mail, and
workaround <2> also matches my analysis.

The machine is connected to lots of LUNs. A pvmove of one disk triggers lvm2 to
update all underlying mpath devices (80~200). I guess the update job is
vg_commit(), which writes the latest metadata, and the metadata is located on
all PVs. The update job finishes with close(2), which triggers udevd
IN_CLOSE_WRITE events for hundreds of devices. Every IN_CLOSE_WRITE triggers
the mpathd udev rules (11-dm-mpath.rules) to start scanning devices. So in
practice the system is flooded with hundreds of multipath processes and the CPU
load becomes high.

Your 'guess explanation' is not as useful as you might think - we do not
know the layout of the lvm2 metadata, how many disks are involved in the
operation, the number of segments and many other things (in RHEL we have
'sosreport' to harvest all the needed info).

ATM I'm not even sure whether you are complaining about the CPU usage of
lvmpolld or just about huge udev rules processing overhead.

If you have too many disks in the VG (again, it is unclear how many paths and
how many distinct PVs there are), the user may *significantly* reduce the burden
of metadata updating by reducing the number of 'actively' maintained metadata
areas in the VG - i.e. if you have 100 PVs in the VG, you may keep metadata only
on 5-10 PVs to have 'enough' duplicate copies of the lvm2 metadata within the VG
(vgchange --metadatacopies X) - clearly it depends on the use case and how
many PVs are added to/removed from the VG over its lifetime....
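
To put rough numbers on this, here is a back-of-the-envelope sketch in Python.
It assumes the hypothesis raised earlier in this thread (one metadata commit per
polling interval, and one IN_CLOSE_WRITE udev event per PV that carries a
metadata area); that model is not confirmed by any trace here, and the function
name and the sample figures are purely illustrative.

```python
def watch_events_per_hour(metadata_pvs, polling_interval_s, commits_per_poll=1):
    """Estimate udev watch events per hour under the assumed model.

    metadata_pvs       -- number of PVs holding an active metadata area
    polling_interval_s -- polling interval in seconds
    commits_per_poll   -- assumed metadata commits per poll (hypothetical)
    """
    polls_per_hour = 3600 / polling_interval_s
    return metadata_pvs * commits_per_poll * polls_per_hour


# Sample values only: 200 vs. 10 metadata-carrying PVs, 15 s vs. 120 s interval.
for interval in (15, 120):
    for pvs in (200, 10):
        events = watch_events_per_hour(pvs, interval)
        print(f"interval={interval}s metadata_pvs={pvs}: ~{events:.0f} events/hour")
```

Under these assumptions the event count scales linearly with the number of
metadata areas, so going from 200 to 10 metadata-carrying PVs would cut it by a
factor of 20, independent of the polling interval.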

There are IMHO still too many variations to guess from - so if you can't reveal
more physical info about the customer case, it's easier to create a reproducer
as similar to it as possible. The lvm2 test suite has a lot of power to emulate
most of your system setup combinations (it's easy to put 100 fake PVs there and
prepare a metadata set similar to the customer's one) - once we have a local
reproducer it's easier to look for a solution.

Zdenek

_______________________________________________
linux-lvm mailing list
linux-lvm@xxxxxxxxxx
https://listman.redhat.com/mailman/listinfo/linux-lvm
read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/