Re: bluefs_buffered_io=false performance regression

And to add some references, there is a PR on hold here:
https://github.com/ceph/ceph/pull/38044 which links some relevant
tracker entries.
Outside of large block.db removals (e.g. from backfilling or snap
trimming) we didn't notice a huge difference -- though that is not
conclusive.
There are several PG removal optimizations in the pipeline which will
hopefully fix the issue in a different way, without needing buffered
I/O.

-- dan

On Mon, Jan 11, 2021 at 5:20 PM Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>
> Hi Robert,
>
>
> We are definitely aware of this issue.  It often appears to be related
> to snap trimming, and we believe it may stem from excessive thrashing
> of the rocksdb block cache.  I suspect that when bluefs_buffered_io is
> enabled it hides the issue so people don't notice the problem, though
> that may also be why we see the other issue with the kernel under rgw
> workloads.  If you didn't see issues with bluefs_buffered_io enabled,
> I would recommend re-enabling it and periodically checking that you
> aren't hitting problems with kernel swap.  Unfortunately, we are
> somewhat between a rock and a hard place on this one until we solve
> the root cause.
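>
> If it helps, re-enabling the option and watching for swap pressure
> could look roughly like this (a sketch assuming the standard ceph CLI
> and procps tools; an OSD restart may be needed for the setting to take
> effect):
>
>      # revert to buffered reads for bluefs on all OSDs
>      ceph config set osd bluefs_buffered_io true
>
>      # watch the si/so (swap-in/swap-out) columns on the OSD hosts
>      vmstat 5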
>
>
> Right now we're looking at reducing thrashing in the rocksdb block
> cache(s) by splitting the onode and omap (and potentially pglog and
> allocator) caches into their own distinct entities.  My hope is that
> we can finesse the situation so that the system page cache is no
> longer required to avoid excessive reads, assuming enough memory has
> been assigned to the osd_memory_target.
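>
> As a rough illustration of that last point, the per-OSD memory budget
> can be checked and raised via the config database (the 8 GiB value
> below is purely an example, not a recommendation):
>
>      ceph config get osd osd_memory_target
>      ceph config set osd osd_memory_target 8589934592   # 8 GiB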
>
>
> Mark
>
>
> On 1/11/21 9:47 AM, Robert Sander wrote:
> > Hi,
> >
> > bluefs_buffered_io was disabled in Ceph version 14.2.11.
> >
> > The cluster started last year with 14.2.5 and was upgraded over the year; it is now running 14.2.16.
> >
> > The performance was OK at first but became abysmally bad at the end of 2020.
> >
> > We checked the components; the HDDs and SSDs seem to be fine. Single-disk benchmarks showed performance according to the specs.
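> >
> > (For anyone wanting to reproduce such a single-disk check, a basic
> > fio run is one option -- a sketch only, with illustrative parameters;
> > note it writes directly to the raw device and destroys its contents:
> >
> >      fio --name=seqwrite --filename=/dev/sdX --ioengine=libaio \
> >          --direct=1 --rw=write --bs=4M --iodepth=16 \
> >          --runtime=60 --time_based
> > )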
> >
> > Today we (re-)enabled bluefs_buffered_io and restarted all OSD processes on 248 HDDs distributed over 12 nodes.
> >
> > Now the benchmarks are fine again: 434MB/s write instead of 60MB/s, 960MB/s read instead of 123MB/s.
> >
> > This setting was disabled in 14.2.11 because "in some test cases it appears to cause excessive swap utilization by the linux kernel and a large negative performance impact after several hours of run time."
> > We will have to monitor whether this happens in our cluster. Are there any other known negative side effects?
> >
> > Here are the rados bench values, first with bluefs_buffered_io=false, then with bluefs_buffered_io=true:
> >
> > Bench        Time     Ops   Op/obj    Bandwidth  BW stddev  BW max  BW min  IOPS  IOPS      IOPS  IOPS  Lat avg  Lat      Lat max   Lat min
> >              run (s)  made  size (B)  (MB/s)     (MB/s)     (MB/s)  (MB/s)  avg   stddev    max   min   (s)      stddev   (s)       (s)
> > false write  33.081    490  4194304     59.2485  71.3829    264     0        14   17.8702    66     0  1.07362  2.83017  20.71     0.0741089
> > false seq    15.8226   490  4194304    123.874   -          -       -        30   46.8659   174     0  0.51453  -         9.53873  0.00343417
> > false rand   38.2615  2131  4194304    222.782   -          -       -        55  109.374    415     0  0.28191  -        12.1039   0.00327948
> > true  write  30.4612  3308  4194304    434.389   26.0323    480     376     108    6.50809  120    94  0.14683  0.07368   0.99791  0.0751249
> > true  seq    13.7628  3308  4194304    961.429   -          -       -       240   22.544    280   184  0.06528  -         0.88676  0.00338191
> > true  rand   30.1007  8247  4194304   1095.92    -          -       -       273   25.5066   313   213  0.05719  -         0.99140  0.00325295
> > ("Ops made" is total writes or reads; op size and object size were both 4194304 bytes; "-" means not reported by rados bench for read tests.)
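> >
> > For reference, these come from the standard rados bench
> > write/seq/rand runs, along these lines (the pool name and the 30 s
> > duration here are placeholders):
> >
> >      rados bench -p <pool> 30 write --no-cleanup
> >      rados bench -p <pool> 30 seq
> >      rados bench -p <pool> 30 rand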
> >
> > Regards
> >
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


