Hi,
On 2024/06/04 18:07, Zdenek Kabelac wrote:
On 2024/06/04 13:52, Jaco Kroon wrote:
Hi,
On 2024/06/04 12:48, Roger Heflin wrote:
Use the *_bytes values. If they are non-zero then they are used, and
that allows setting even below 1% (1% is quite large on anything with
a lot of RAM).
I have been using this for quite a while:
vm.dirty_background_bytes = 3000000
vm.dirty_bytes = 5000000
What I am noticing immediately is that the "free" value as per "free
-m" is definitely much higher, which to me indicates that we're not
caching as aggressively as we could. Will monitor this for the time
being:
crowsnest [13:50:09] ~ # free -m
               total        used        free      shared  buff/cache   available
Mem:          257661        6911      105313           7      145436      248246
Swap:              0           0           0
The Total DISK WRITE and Current DISK Write values in iotop seem to
have a tighter correlation now (no longer seeing constant Total DISK
WRITE with spikes in current; things seem more even now).
The free value has now dropped drastically anyway, so it looks like the
increase in free was a temporary situation.
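For the archives, a minimal sketch of how these values can be applied
(the sysctl.d file name below is my own choice; note the kernel zeroes
the corresponding *_ratio knob when a *_bytes value is set):

    # apply at runtime
    sysctl -w vm.dirty_background_bytes=3000000
    sysctl -w vm.dirty_bytes=5000000

    # persist across reboots
    cat > /etc/sysctl.d/99-dirty-bytes.conf <<'EOF'
    vm.dirty_background_bytes = 3000000
    vm.dirty_bytes = 5000000
    EOF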
Hi
So now, while we are solving various system settings, there are more
things to think through.
Yea. Realised we derailed, but given that the theory is that "stuff" is
blocking completion (probably due to backlogged IO?), it's not
completely unrelated, is it?
A big 'range' of unwritten data may put it at risk in case of 'power'
failure.
I'd be more worried about a host crash in this case to be honest (dual
PSU, and in several years we've not had a single phase or PDU failure).
On the other hand, large 'dirty pages' allow the system to 'optimize'
and even bypass storing them on disk if they are frequently changed -
so in this case a 'lower' dirty ratio may cause a significant performance
impact - so please check what the typical workload is and what the result is...
Based on observations from task timings last night, I reckon workloads
are around 25% faster on average. Tasks that used to run just shy of 20
hours (they would still have been busy right now) completed last night
in just under 15. This would need to be monitored over time though, as
a single run is definitely not authoritative. This was with the _bytes
settings as suggested by Roger.
For the specific use-case I doubt "frequently changed" applies, and it's
probably best to get the data persisted as soon as possible, allowing
for improved "future IO capacity" (hope my wording makes sense).
It's worth mentioning that lvm2 supports the writecache target, to kind
of offload dirty pages to fast storage...
We normally use raid controller battery backup for this in other
environments; it's not relevant in this specific case though. In other
environments we are using dm-cache on NVMe, mostly as a read-cache (ie,
write-through strategy), because the raid controller, whilst buffering
writes, really sucks at serving reads - which, given the nature of
spinning drives, makes perfect sense. Given the amount of READ on those
two hosts, the NVMe setup more than quadrupled throughput there.
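For anyone reading along: a rough sketch of what attaching a writecache
looks like, assuming a fast PV (e.g. an NVMe device) is already in the
VG; the vg/lv names are purely illustrative:

    # create the cache volume on the fast device
    lvcreate -n fast -L 10G vg /dev/nvme0n1
    # attach it as a writecache in front of the origin LV
    lvconvert --type writecache --cachevol fast vg/slow
    # detach again, flushing dirty blocks back to the origin
    lvconvert --splitcache vg/slow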
Last but not least - disk scheduling policies also have an impact -
i.e. to ensure better fairness - at the price of lower throughput...
We normally use mq-deadline. In this setup I notice this has been
updated to "none"; the plan was to revert. This was done following a
discussion with Bart van Assche. Happy to revert this to be honest.
https://lore.kernel.org/all/07d8b189-9379-560b-3291-3feb66d98e5c@xxxxxxx/
relates.
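For reference, checking and reverting the scheduler is a runtime
operation (sdX is a placeholder):

    # show available schedulers; the active one is in brackets
    cat /sys/block/sdX/queue/scheduler
    # revert from none back to mq-deadline
    echo mq-deadline > /sys/block/sdX/queue/scheduler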
So now let's get back to the lvm2 'possible' deadlock - which I'm still
not fully certain we have deciphered in this thread yet.
So if you happen to 'spot' stuck commands - do you notice anything
strange in the systemd journal? Usually when systemd decides to kill a
udevd worker task, it's briefly stated in the journal. With this check
we would kind of know that the reason for your problems was a killed
worker that was not able to 'finalize' the lvm command, which is waiting
for confirmation from udev (currently without any timeout limits).
Not using systemd, but udev does come from the systemd package. Nothing
in the logs at all for udev, as mentioned previously. Don't seem to be
able to get normal logs working, but I have set up the debug log now.
This logs in great detail, except there are no timestamps. So *if*
this happens again, hopefully we'll be able to look for some worker that
was killed rather than merely exited. What I can see is that it looks
like a single forked worker can perform multiple tasks and execute
multiple other calls, so I believe that the three minute timeout is
*overall*, not on just a single RUN command, which implies that the
theory that udevcomplete is never signalled is very much valid.
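For completeness, the runtime knob I used to raise the verbosity (newer
udevd versions also accept --log-level as an alias):

    # raise udevd log verbosity at runtime
    udevadm control --log-priority=debug
    # and back down again once done
    udevadm control --log-priority=info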
To unstick such a command, 'udevcomplete_all' is a cure - but as said,
the system is already kind of 'damaged', since udev is failing and has
'invalid' information about devices...
Agreed. It gets things going again, which really just allows for a
cleaner reboot rather than echo b > /proc/sysrq-trigger or remotely
yanking the power (which is where we normally end up if we don't catch
it early enough).
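For the record, the unstick sequence is just dmsetup:

    # list the udev cookies lvm2 is still waiting on
    dmsetup udevcookies
    # complete them all, releasing any stuck lvm commands
    dmsetup udevcomplete_all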
So maybe you could check whether your journal around the date & time of
the problem has some 'interesting' 'killing action' record?
If we can get normal udev logging working correctly that would be great,
but this is not your responsibility, so let me figure out how I can get
udevd to log to syslog (if that is even possible given the way things
seem to be moving with systemd).
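In case it helps anyone searching the archives later: the persistent
knob I'm aware of is udev_log in /etc/udev/udev.conf; whether udevd on a
systemd-less setup will actually deliver that to syslog is exactly what
I still need to verify:

    # /etc/udev/udev.conf
    udev_log=debug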
Kind regards,
Jaco