And to add some references, there is a PR on hold here:
https://github.com/ceph/ceph/pull/38044 which links some relevant
tracker entries. Outside of large block.db removals (e.g. from
backfilling or snap trimming) we didn't notice a huge difference --
though that is not conclusive. There are several PG removal
optimizations in the pipeline which hopefully fix the issues in a
different way, rather than needing buffered io.

-- dan

On Mon, Jan 11, 2021 at 5:20 PM Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>
> Hi Robert,
>
> We are definitely aware of this issue. It appears to often be related
> to snap trimming and we believe possibly related to excessive thrashing
> of the rocksdb block cache. I suspect that when bluefs_buffered_io is
> enabled it hides the issue and people don't notice the problem, but that
> might be related to why we see the other issue with the kernel with rgw
> workloads. I would recommend that if you didn't see issues with
> bluefs_buffered_io enabled, you can re-enable it and periodically check
> to make sure you aren't hitting issues with kernel swap. Unfortunately
> we are sort of between a rock and a hard place on this one until we
> solve the root cause.
>
> Right now we're looking at trying to reduce thrashing in the rocksdb
> block cache(s) by splitting up onode and omap (and potentially pglog and
> allocator) block cache into their own distinct entities. My hope is
> that we can finesse the situation so that the overall system page cache
> is no longer required to avoid excessive reads, assuming enough memory
> has been assigned to the osd_memory_target.
>
> Mark
>
> On 1/11/21 9:47 AM, Robert Sander wrote:
> > Hi,
> >
> > bluefs_buffered_io was disabled in Ceph version 14.2.11.
> >
> > The cluster started last year with 14.2.5 and was upgraded over the
> > year; it is now running 14.2.16.
> >
> > The performance was OK at first but got abysmally bad at the end of
> > 2020.
> >
> > We checked the components, and the HDDs and SSDs seem to be fine.
> > Single disk benchmarks showed performance according to the specs.
> >
> > Today we (re-)enabled bluefs_buffered_io and restarted all OSD
> > processes on 248 HDDs distributed over 12 nodes.
> >
> > Now the benchmarks are fine again: 434MB/s write instead of 60MB/s,
> > 960MB/s read instead of 123MB/s.
> >
> > This setting was disabled in 14.2.11 because "in some test cases it
> > appears to cause excessive swap utilization by the linux kernel and a
> > large negative performance impact after several hours of run time."
> >
> > We have to monitor if this will happen in our cluster. Is there any
> > other negative side effect currently known?
> >
> > Here are the rados bench values, first with
> > bluefs_buffered_io=false, then with bluefs_buffered_io=true:
> >
> > bluefs_buffered_io       false       false       false       true        true        true
> > Bench                    write       seq         rand        write       seq         rand
> > Total time run           33,081      15,8226     38,2615     30,4612     13,7628     30,1007
> > Total writes/reads made  490         490         2131        3308        3308        8247
> > Write/read size          4194304     4194304     4194304     4194304     4194304     4194304
> > Object size              4194304     4194304     4194304     4194304     4194304     4194304
> > Bandwidth (MB/sec)       59,2485     123,874     222,782     434,389     961,429     1095,92
> > Stddev bandwidth         71,3829     -           -           26,0323     -           -
> > Max bandwidth            264         -           -           480         -           -
> > Min bandwidth            0           -           -           376         -           -
> > Average IOPS             14          30          55          108         240         273
> > Stddev IOPS              17,8702     46,8659     109,374     6,50809     22,544      25,5066
> > Max IOPS                 66          174         415         120         280         313
> > Min IOPS                 0           0           0           94          184         213
> > Average latency (s)      1,07362     0,51453     0,28191     0,14683     0,06528     0,05719
> > Stddev latency           2,83017     -           -           0,07368     -           -
> > Max latency              20,71       9,53873     12,1039     0,99791     0,88676     0,99140
> > Min latency              0,0741089   0,00343417  0,00327948  0,0751249   0,00338191  0,00325295
> >
> > Regards
> >
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
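
For the periodic swap check suggested in the thread, something along these lines could be run on each OSD host. This is only a rough sketch: it assumes a Linux host that exposes /proc/meminfo, and the function names and the 10% threshold are made up for illustration; none of it comes from Ceph itself.

```python
"""Rough swap-usage check for an OSD host (illustrative sketch only)."""


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integers.

    Most fields are reported in kB; the unit suffix is dropped.
    """
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def swap_used_fraction(meminfo):
    """Return the fraction of swap in use (0.0 if no swap is configured)."""
    total = meminfo.get("SwapTotal", 0)
    if total == 0:
        return 0.0
    return (total - meminfo.get("SwapFree", 0)) / total


def check_swap(path="/proc/meminfo", threshold=0.10):
    """Read meminfo and report (used_fraction, above_threshold).

    The 10% default threshold is an arbitrary placeholder, not a
    recommendation from the Ceph developers.
    """
    with open(path) as f:
        info = parse_meminfo(f.read())
    used = swap_used_fraction(info)
    return used, used > threshold
```

Calling check_swap() from a cron job or a monitoring agent and alerting when the second value is true would give an early warning if swap usage starts to climb after bluefs_buffered_io is re-enabled. Watching the trend over several hours matters more than any single reading, since the reported problem only showed up after extended run time.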