Hi,
On 2024/06/04 18:07, Zdenek Kabelac wrote:
On 2024/06/04 13:52, Jaco Kroon wrote:
Hi,
On 2024/06/04 12:48, Roger Heflin wrote:
Use the *_bytes values. If they are non-zero then they are used, and
that allows setting even below 1% (1% is quite large on anything with
a lot of RAM).
I have been using this for quite a while:
vm.dirty_background_bytes = 3000000
vm.dirty_bytes = 5000000
What I am noticing immediately is that the "free" value as per "free
-m" is definitely much higher, which to me indicates that we're not
caching as aggressively as we could. Will monitor this for the time
being:
crowsnest [13:50:09] ~ # free -m
               total        used        free      shared  buff/cache   available
Mem:          257661        6911      105313           7      145436      248246
Swap:              0           0           0
The Total DISK WRITE and Current DISK Write values in iotop seem to
have a tighter correlation now (no longer seeing constant Total DISK
WRITE with spikes in current; things seem more even now).
The free value has now dropped drastically anyway, so it looks like the
increase in free was a temporary situation.
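For the archives, a minimal sketch of how these values can be applied
(the sysctl.d file name below is my own choice; note the kernel zeroes
the corresponding *_ratio knob when a *_bytes value is set):

    # apply at runtime
    sysctl -w vm.dirty_background_bytes=3000000
    sysctl -w vm.dirty_bytes=5000000

    # persist across reboots
    cat > /etc/sysctl.d/99-dirty-bytes.conf <<'EOF'
    vm.dirty_background_bytes = 3000000
    vm.dirty_bytes = 5000000
    EOF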
Hi
So now, while we are solving various system settings, there are more
things to think through.
Yea. Realised we derailed, but given that the theory is that "stuff" is
blocking completion (probably due to backlogged IO?), it's not
completely unrelated, is it?
A big 'range' of unwritten data may put it at risk in case of 'power'
failure.
I'd be more worried about a host crash in this case to be honest (dual
PSU, and in several years we've not had a single phase or PDU failure).
On the other hand, large 'dirty pages' allow the system to 'optimize'
and even bypass storing them on disk if they are frequently changed -
so in this case a 'lower' dirty ratio may cause a significant performance
impact - so please check what the typical workload is and what the result is...
Based on observations from task timings last night, I reckon workloads
are around 25% faster on average. Tasks that used to run just shy of 20
hours (they would still have been busy right now) completed last night
in just under 15. This would need to be monitored over time though, as
a single run is definitely not authoritative. This was with the _bytes
settings as suggested by Roger.
For the specific use-case I doubt "frequently changed" applies, and it's
probably best to get the data persisted as soon as possible, allowing
for improved "future IO capacity" (hope my wording makes sense).
It's worth mentioning that lvm2 supports the writecache target, to kind
of offload dirty pages to fast storage...
We normally use raid controller battery backup for this in other
environments; it's not relevant in this specific case though. In other
environments we are using dm-cache on NVMe, mostly as a read-cache (ie,
write-through strategy), because the raid controller, whilst buffering
writes, really sucks at serving reads - which, given the nature of
spinning drives, makes perfect sense. Given the amount of READ on those
two hosts, the NVMe setup more than quadrupled throughput there.
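For anyone reading along: a rough sketch of what attaching a writecache
looks like, assuming a fast PV (e.g. an NVMe device) is already in the
VG; the vg/lv names are purely illustrative:

    # create the cache volume on the fast device
    lvcreate -n fast -L 10G vg /dev/nvme0n1
    # attach it as a writecache in front of the origin LV
    lvconvert --type writecache --cachevol fast vg/slow
    # detach again, flushing dirty blocks back to the origin
    lvconvert --splitcache vg/slow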
Last but not least - disk scheduling policies also have an impact -
i.e. to ensure better fairness - at the price of lower throughput...
We normally use mq-deadline. In this setup I notice this has been
updated to "none"; the plan was to revert. This was done following a
discussion with Bart van Assche. Happy to revert this to be honest.
https://lore.kernel.org/all/07d8b189-9379-560b-3291-3feb66d98e5c@xxxxxxx/
relates.
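For reference, checking and reverting the scheduler is a runtime
operation (sdX is a placeholder):

    # show available schedulers; the active one is in brackets
    cat /sys/block/sdX/queue/scheduler
    # revert from none back to mq-deadline
    echo mq-deadline > /sys/block/sdX/queue/scheduler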
So now let's get back to the lvm2 'possible' deadlock - which I'm still
not fully certain we have deciphered in this thread yet.
So if you happen to 'spot' stuck commands - do you notice anything
strange in the systemd journal? Usually when systemd decides to kill a
udevd worker task, it's briefly stated in the journal. With this check
we would kind of know that the reason for your problems was a killed
worker that was not able to 'finalize' the lvm command, which is waiting
for confirmation from udev (currently without any timeout limits).
Not using systemd, but udev does come from the systemd package. Nothing
in the logs at all for udev, as mentioned previously. Don't seem to be
able to get normal logs working, but I have set up the debug log now.
This logs in great detail, except there are no timestamps. So *if*
this happens again, hopefully we'll be able to look for some worker that
was killed rather than merely exited. What I can see is that it looks
like a single forked worker can perform multiple tasks and execute
multiple other calls, so I believe that the three minute timeout is
*overall*, not on just a single RUN command, which implies that the
theory that udevcomplete is never signalled is very much valid.
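For completeness, the runtime knob I used to raise the verbosity (newer
udevd versions also accept --log-level as an alias):

    # raise udevd log verbosity at runtime
    udevadm control --log-priority=debug
    # and back down again once done
    udevadm control --log-priority=info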
To unstick such a command, 'udevcomplete_all' is a cure - but as said,
the system is already kind of 'damaged', since udev is failing and has
'invalid' information about devices...
Agreed. It gets things going again, which really just allows for a
cleaner reboot rather than echo b > /proc/sysrq-trigger or remotely
yanking the power (which is where we normally end up if we don't catch
it early enough).
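For the record, the unstick sequence is just dmsetup:

    # list the udev cookies lvm2 is still waiting on
    dmsetup udevcookies
    # complete them all, releasing any stuck lvm commands
    dmsetup udevcomplete_all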
So maybe you could check whether your journal around the date & time of
the problem has some 'interesting' 'killing action' record?
If we can get normal udev logging working correctly that would be great,
but this is not your responsibility, so let me figure out how I can get
udevd to log to syslog (if that is even possible given the way things
seem to be moving with systemd).
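In case it helps anyone searching the archives later: the persistent
knob I'm aware of is udev_log in /etc/udev/udev.conf; whether udevd on a
systemd-less setup will actually deliver that to syslog is exactly what
I still need to verify:

    # /etc/udev/udev.conf
    udev_log=debug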
Kind regards,
Jaco