Re: lvm2 deadlock

Hi,

On 2024/06/04 18:07, Zdenek Kabelac wrote:
On 04. 06. 24 at 13:52, Jaco Kroon wrote:
Hi,

On 2024/06/04 12:48, Roger Heflin wrote:

Use the *_bytes values.  If they are non-zero then they are used and
that allows setting even below 1% (quite large on anything with a lot
of ram).

I have been using this for quite a while:
vm.dirty_background_bytes = 3000000
vm.dirty_bytes = 5000000
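
A minimal sketch of applying and persisting these (the sysctl.d filename is just an example):

# apply at runtime
sysctl -w vm.dirty_background_bytes=3000000
sysctl -w vm.dirty_bytes=5000000
# persist across reboots
cat > /etc/sysctl.d/90-dirty-bytes.conf <<'EOF'
vm.dirty_background_bytes = 3000000
vm.dirty_bytes = 5000000
EOF
# note: writing a *_bytes knob zeroes its *_ratio counterpart, and vice versa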


What I noticed immediately is that the "free" value as per "free -m" is much higher, which to me indicates that we're no longer caching as aggressively as we could.  Will monitor this for the time being:

crowsnest [13:50:09] ~ # free -m
                total        used        free      shared  buff/cache   available
Mem:           257661        6911      105313           7      145436      248246
Swap:              0           0           0

The Total DISK WRITE and Current DISK WRITE values in iotop seem to correlate more tightly now (no longer seeing a constant Total DISK WRITE with spikes in Current; it looks more even now).
The free value has now dropped drastically anyway, so it looks like the increase in free was a temporary situation.
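
For what it's worth, the outstanding dirty data can be watched directly rather than inferred from "free"; a minimal sketch using the standard /proc interface:

# watch how much data is dirty / queued for writeback, refreshed every second
watch -n1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'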

Hi

So now while we are solving various system setting - there are more things to think through.
Yea.  Realised we derailed, but given that the theory is that "stuff" is blocking the udev complete (probably due to backlogged IO?), it's not completely unrelated, is it?

A big 'range' of unwritten data may put it at risk in case of 'power' failure.

I'd be more worried about a host crash in this case to be honest (dual PSU, and in several years we've not had a single phase or PDU failure).

On the other hand, a large 'dirty pages' allowance lets the system 'optimize' and even bypass storing them on disk if they are frequently changed - so in this case a 'lower' dirty ratio may cause a significant performance impact - so please check what the typical workload is and what the result is...

Based on observations from task timings last night I reckon workloads are around 25% faster on average.  Tasks that used to run just shy of 20 hours (they would still have been busy right now) completed last night in just under 15 hours.  This would need to be monitored over time though, as a single run is definitely not authoritative.  This was with the _bytes settings as suggested by Roger.

For the specific use-case I doubt "frequently changed" applies, and it's probably best to get the data persisted as soon as possible, allowing for improved "future IO capacity" (hope my wording makes sense).


It's worth mentioning that lvm2 supports the writecache target to kind of offload dirty pages to fast storage...
We normally use raid controller battery backup for this in other environments; not relevant in this specific case though.  We do use dm-cache in other environments, mostly as a read-cache (ie, write-through strategy) on NVMe, because the raid controller, whilst buffering writes, really sucks at serving reads - which given the nature of spinning drives makes perfect sense.  Given the amount of READ on those two hosts, the NVMe setup more than quadrupled throughput there.
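
For reference, both variants can be set up with stock lvm2 commands; a minimal sketch, assuming a VG named vg, a slow LV named data and an NVMe PV already in the VG (all names and sizes illustrative):

# carve a cache volume out of the NVMe PV
lvcreate -n fast -L 100G vg /dev/nvme0n1
# writecache: absorbs writes on the fast device, flushes to the slow LV
lvconvert --type writecache --cachevol fast vg/data
# or dm-cache in writethrough mode, i.e. effectively a read cache:
# lvconvert --type cache --cachevol fast --cachemode writethrough vg/data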

Last but not least - disk scheduling policies also have an impact - i.e. they ensure better fairness - at the price of lower throughput...
We normally use mq-deadline; in this setup I notice it has been changed to "none".  That was done following a discussion with Bart van Assche, and the plan was to revert.  Happy to revert this, to be honest. https://lore.kernel.org/all/07d8b189-9379-560b-3291-3feb66d98e5c@xxxxxxx/ relates.
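
For completeness, the active scheduler can be checked and reverted per device at runtime (device name illustrative):

# the bracketed entry is the scheduler currently in effect
cat /sys/block/sda/queue/scheduler
# switch this device back to mq-deadline
echo mq-deadline > /sys/block/sda/queue/scheduler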

So now let's get back to the lvm2 'possible' deadlock - which I'm still not fully certain we have deciphered in this thread yet.

So if you happen to 'spot' stuck commands - do you notice anything strange in the systemd journal?  Usually when systemd decides to kill a udevd worker task, it's briefly stated in the journal - with this check we would kind of know that the reason for your problems was a killed worker that was not able to 'finalize' the lvm command, which is waiting for confirmation from udev (currently without any timeout limits).

Not using systemd, but udev does come from the systemd package. Nothing in the logs at all for udev, as mentioned previously. I don't seem to be able to get normal logs working, but I have set up the debug log now.  This logs in great detail, except there are no timestamps.  So *if* this happens again, hopefully we'll be able to look for some worker that was killed rather than merely exited.  What I can see is that a single forked worker can process multiple events and execute multiple other calls, so I believe the three minute timeout is *overall*, not just on a single RUN command, which implies that the theory that udevcomplete is never signalled is very much valid.
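
For the record, the debug log level can also be toggled at runtime without restarting the daemon (udevadm ships in the same systemd package, and the three minute figure matches udevd's default 180s event timeout):

# raise the daemon's log level on the fly
udevadm control --log-level=debug
# and back down when done - the debug output is voluminous
udevadm control --log-level=info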


To unstick such a command, 'udevcomplete_all' is a cure - but as said - the system is already kind of 'damaged', since udev is failing and has 'invalid' information about devices...
Agreed.  It gets things going again, which really just allows for a cleaner reboot rather than echo b > /proc/sysrq-trigger or remotely yanking the power (which is where we normally end up if we don't catch it early enough).
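
For anyone hitting this later, the unstick sequence is roughly:

# list any outstanding udev cookies lvm is still waiting on
dmsetup udevcookies
# release them all so the stuck lvm command(s) can return
dmsetup udevcomplete_all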

So maybe you could check whether your journal around the date & time of the problem has some 'interesting' 'killing action' record?

If we can get normal udev logging working correctly that would be great, but this is not your responsibility, so let me figure out how I can get udevd to log to syslog (if that is even possible given the way things seem to be moving with systemd).
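
In the meantime, timestamped event traces can at least be captured from the outside with udevadm:

# prints kernel uevents and udev-processed events, each line timestamped
udevadm monitor --kernel --udev --property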

Kind regards,
Jaco




