Hi Robert,
We are definitely aware of this issue. It often appears to be related
to snap trimming, and we believe it may be caused by excessive thrashing
of the rocksdb block cache. I suspect that when bluefs_buffered_io is
enabled it hides the issue, so people don't notice the problem, but that
may also be related to why we see the other kernel issue with rgw
workloads. If you didn't see problems while bluefs_buffered_io was
enabled, I would recommend re-enabling it and periodically checking
that you aren't running into kernel swap issues. Unfortunately we are
somewhat between a rock and a hard place on this one until we solve
the root cause.
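If it helps, re-enabling it cluster-wide and keeping an eye on swap can be
roughly as simple as the following sketch (the restart step depends on how
your OSDs are deployed, so treat these as illustrative commands rather than
a tested procedure):

    # re-enable buffered IO for all OSDs and restart them
    ceph config set osd bluefs_buffered_io true
    systemctl restart ceph-osd.target        # run on each OSD node

    # periodically check for swap pressure on the OSD nodes
    free -m
    vmstat 1 10        # watch the si/so columns for swap-in/swap-out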
Right now we're looking at reducing thrashing in the rocksdb block
cache(s) by splitting the onode and omap (and potentially pglog and
allocator) data into their own distinct block cache instances. My hope is
that we can finesse the situation so that the system page cache is no
longer required to avoid excessive reads, assuming enough memory has been
assigned to the OSD via osd_memory_target.
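In the meantime, if you want to sanity-check how much memory an OSD is
allowed to use and what its caches are actually consuming, something along
these lines can help (osd.0 is just an example id):

    ceph config get osd.0 osd_memory_target    # effective memory target for that OSD
    ceph daemon osd.0 dump_mempools            # per-pool memory usage, incl. bluestore cache/onodes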
Mark
On 1/11/21 9:47 AM, Robert Sander wrote:
Hi,
bluefs_buffered_io was disabled in Ceph version 14.2.11.
The cluster started last year with 14.2.5 and was upgraded over the year; it is now running 14.2.16.
Performance was OK at first but became abysmal towards the end of 2020.
We checked the components, and the HDDs and SSDs seem to be fine. Single-disk benchmarks showed performance according to the specs.
Today we (re-)enabled bluefs_buffered_io and restarted all OSD processes on 248 HDDs distributed over 12 nodes.
Now the benchmarks are fine again: 434MB/s write instead of 60MB/s, 960MB/s read instead of 123MB/s.
This setting was disabled in 14.2.11 because "in some test cases it appears to cause excessive swap utilization by the linux kernel and a large negative performance impact after several hours of run time."
We will have to monitor whether this happens in our cluster. Are there any other known negative side effects?
Here are the rados bench values, first with bluefs_buffered_io=false, then with bluefs_buffered_io=true:
                          bluefs_buffered_io=false                bluefs_buffered_io=true
                          write      seq         rand             write      seq         rand
Total time run (s)        33.081     15.8226     38.2615          30.4612    13.7628     30.1007
Total writes/reads made   490        490         2131             3308       3308        8247
Op/object size (bytes)    4194304    4194304     4194304          4194304    4194304     4194304
Bandwidth (MB/s)          59.2485    123.874     222.782          434.389    961.429     1095.92
Stddev bandwidth (MB/s)   71.3829    -           -                26.0323    -           -
Max bandwidth (MB/s)      264        -           -                480        -           -
Min bandwidth (MB/s)      0          -           -                376        -           -
Average IOPS              14         30          55               108        240         273
Stddev IOPS               17.8702    46.8659     109.374          6.50809    22.544      25.5066
Max IOPS                  66         174         415              120        280         313
Min IOPS                  0          0           0                94         184         213
Average latency (s)       1.07362    0.51453     0.28191          0.14683    0.06528     0.05719
Stddev latency (s)        2.83017    -           -                0.07368    -           -
Max latency (s)           20.71      9.53873     12.1039          0.99791    0.88676     0.99140
Min latency (s)           0.0741089  0.00343417  0.00327948       0.0751249  0.00338191  0.00325295

(Fields marked "-" are not reported by rados bench for seq/rand runs.)
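For anyone who wants to repeat the comparison, the generic rados bench
sequence for these three run types looks roughly like this (the pool name
is a placeholder, and this is the standard invocation, not necessarily the
exact options we used):

    rados bench -p <testpool> 30 write --no-cleanup   # keep the objects for the read tests
    rados bench -p <testpool> 30 seq
    rados bench -p <testpool> 30 rand
    rados -p <testpool> cleanup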
Regards
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx