Hi,
On 2024/06/04 15:30, Roger Heflin wrote:
My experience is that heavy disk io / batch disk io systems work better
with these values being smallish, i.e. both even under 10MB or so. About
all that having the numbers larger has done is trick io benchmarks that
don't force a sync at the end, and/or make large saves appear to happen
faster.
There is also the freeze/pause of roughly (outstanding writes in MB) /
(io rate in MB/s) seconds; smaller values shorten the freeze.
I don't see a use case for having large values; it seems to have no
real upside and several downsides. Get the buffer size small enough
and you will still get pauses to clear the writes, but the pauses will
be short enough not to be a problem.
Thanks, this is extremely insightful. So with the original values there
could be "up to" ~50GB outstanding for write. Let's assume that's all
destined for one disk (extremely unlikely) at 100MB/s (which is
optimistic if it's random access): that would take upwards of 500 seconds,
which is a hellishly long time in our world.
With the value of 500MB I've set now, I think a sync should almost never
exceed 10s or so, even if everything is targeted at a single drive. I
think we're OK with that on this specific host.
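
A rough sanity check, using assumed per-drive rates (the 50MB/s figure is
my own pessimistic assumption, not a measurement from this host):

    500 MB / 100 MB/s =  5 s   (optimistic, mostly sequential)
    500 MB /  50 MB/s = 10 s   (more pessimistic, random-ish access)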
Kind regards,
Jaco
On Tue, Jun 4, 2024 at 6:52 AM Jaco Kroon <jaco@xxxxxxxxx> wrote:
Hi,
On 2024/06/04 12:48, Roger Heflin wrote:
Use the *_bytes values. If they are non-zero then they are used, and
that allows setting the limit even below 1% (which is quite large on
anything with a lot of RAM).
I have been using this for quite a while:
vm.dirty_background_bytes = 3000000
vm.dirty_bytes = 5000000
crowsnest [13:32:48] ~ # sysctl vm.dirty_background_bytes=3000000
vm.dirty_background_bytes = 3000000
crowsnest [13:32:59] ~ # sysctl vm.dirty_bytes=500000000
vm.dirty_bytes = 500000000
And persisted via /etc/sysctl.conf
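
For reference, persisting those settings would look something like the
following in /etc/sysctl.conf (the 500MB dirty limit mirrors the command
above and is a host-specific choice, not a general recommendation):

    # keep the amount of dirty page cache small so flushes stay short
    vm.dirty_background_bytes = 3000000
    vm.dirty_bytes = 500000000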
Thank you. It must be noted that this host doesn't do much other than
disk IO, so I'm hoping the 500MB value will be OK; this is mostly so that
IO won't block tasks that happen to be CPU-heavy at the time.
The purpose of the 256GB of RAM was so that we could have ~250GB worth of
disk cache (obviously we don't want all of that to be dirty; OS and "used"
memory used to be below 4GB and is now generally around 8-12GB, currently
a bit lower since we're in a "quiet" period, just busy running some
background compression). As per iostat:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           7.73   18.43   18.96   37.86    0.00   17.01

Device             tps    MB_read/s    MB_wrtn/s    MB_dscd/s      MB_read      MB_wrtn    MB_dscd
md2             392.13        10.00         5.11         0.00      4244888      2167644          0
md3            2270.12        43.88        56.82         0.00     18626309     24120982          0
md4            1406.06        30.47        16.83         0.00     12934654      7143330          0
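
For context, these read like cumulative since-boot figures from iostat in
megabyte mode; the exact invocation is my assumption, but something along
the lines of:

    iostat -m md2 md3 md4

should produce per-device totals like the above (the MB_dscd column needs
a reasonably recent sysstat).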
That's a total of 35805851 MB (34.1TB) read and 33431956 MB (31.9TB)
written in just under 5 days.
What I am noticing immediately is that the "free" value as per "free -m"
is definitely much higher, which to me indicates that we're not caching
as aggressively as we could. Will monitor this for the time being:
crowsnest [13:50:09] ~ # free -m
               total        used        free      shared  buff/cache   available
Mem:          257661        6911      105313           7      145436      248246
Swap:              0           0           0
The Total DISK WRITE and Current DISK Write values in iotop seem to
have a tighter correlation now (no longer seeing a constant Total DISK
WRITE with spikes in Current; it seems more even now).
Kind regards,
Jaco