Hello Josh,
thank you very much for your answer. We do not use encryption on our
OSDs, but the symptoms you describe are quite similar to what I
observed after the upgrade to 14.2.22.
We also run with the new default bluefs_buffered_io=true, which
probably causes the higher OSD latencies we see. Since we do not use
dmcrypt but still see higher latencies as well as higher throughput,
dmcrypt does not seem to be the only way to run into these symptoms.
How did you find out that dmcrypt causes this situation in your case?
Thank you also for the description of what bluefs_buffered_io=true
actually does. I googled a bit but did not find any documentation about
this buffer. Do you know where more information can be found?
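For reference, the value an OSD is actually running with should be
checkable via its admin socket on the OSD's host (a sketch only; osd.0
is just an example id):
# ceph daemon osd.0 config get bluefs_buffered_io   # osd.0 is just an example; run on that OSD's host
# ceph daemon osd.0 config diff                     # shows options that differ from the built-in defaults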
Thanks a lot
Rainer
On 16.07.21 at 16:35, Josh Baergen wrote:
Hi Rainer,
Are you using dmcrypt on your OSDs? I ask because I'm wondering if
you're seeing something similar to what we saw in our systems with
bluefs_buffered_io=true (as it is by default in 14.2.22, whereas it's
false in 14.2.16): With this set, bluefs writes go through Linux's
buffer cache, and large writes actually get chopped up at buffer
boundaries, with the expectation that the I/O scheduler will merge those
writes into larger ones again. However, because the dmcrypt layer is a
bit slower, we found that the individual writes would leak through to
the scheduler at a slow enough pace that they wouldn't all be merged,
resulting in a bunch of smaller writes to the HDDs that were missing
revolutions, increasing OSD commit/apply latency. Like you, we didn't
see a major effect on end user performance for most of our systems, but
the effect was pretty drastic for one of them. Not as big of an issue
for SSDs, of course, because most of them have some sort of cache (SLC
or otherwise) that can absorb these smaller writes and internally commit
as larger ones. (Theoretically a writeback cache in front of HDDs should
help with this as well, but we tend to avoid those.)
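If you want to check whether the same splitting is happening on one of
your HDDs, watching the per-device merge counters and the average write
request size should show it (a rough sketch; sdc is just an example
device, and the exact column names depend on your sysstat version):
# iostat -x 1 sdc   # sdc = example device; watch wrqm/s and the average write size, lots of small unmerged writes match the behaviour described above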
bluefs_buffered_io=true helps with some rocksdb iterate-and-delete
workloads in particular, such as snaptrim. PG removal was optimized in
14.2.17 to avoid the performance issues that bluefs_buffered_io is
needed to solve. We run HDDs only in some of our RGW clusters, and so we
chose to set bluefs_buffered_io=false for all of our HDD nodes to avoid
the latency hit, since PG removal is the only workload of this type in a
typical RGW system.
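For reference, on Nautilus something along these lines should do it (a
sketch only; the device-class mask is optional, osd.0 is just an example
id, and the OSDs may need a restart for the change to take effect):
# ceph config set osd/class:hdd bluefs_buffered_io false
# ceph config show osd.0 | grep bluefs_buffered_io   # osd.0 = example id; verify the running value on one of the HDD OSDs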
Josh
On Fri, Jul 16, 2021 at 6:58 AM Rainer Krienke <krienke@xxxxxxxxxxxxxx> wrote:
Hello,
Today I upgraded a Ceph (HDD) cluster consisting of 9 hosts with 16
OSDs each (144 in total) to the latest Nautilus version 14.2.22. The
upgrade proceeded without problems and the cluster is healthy. After all
hosts were on 14.2.22 I saw in Grafana that OSD latencies were at about
85 ms; after an hour they dropped to about 45 ms. Now, probably because
the cluster faces a somewhat higher IO demand from the Proxmox client
side, the OSD latencies are at 57 ms again.
Before the upgrade, running 14.2.16, this value was about 33 ms.
I looked at ceph osd perf, where I can see an ever-changing set of OSDs
with latencies of about 300 ms; right after the upgrade some had up to
800 ms. Now there are always, say, 20 OSDs that are between 100 and
400 ms. They are not all from one host, and this high-latency OSD set
has members that stay in the high state for longer and others that
change back to a lower value more often:
# ceph osd perf|sort -n -k 2|tail -30
134 37 37
19 38 38
112 39 39
12 42 42
75 42 42
67 43 43
51 45 45
81 45 45
92 50 50
40 56 56
63 60 60
59 61 61
128 65 65
135 65 65
124 66 66
117 94 94
35 94 94
26 112 112
14 127 127
56 135 135
100 164 164
83 168 168
62 177 177
82 182 182
30 186 186
72 186 186
102 203 203
131 211 211
121 247 247
46 254 254
137 340 340
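For reference, the per-OSD counters of one of the consistently slow
OSDs (for example osd.137 from the list above) can be dumped on its
host like this (a sketch only):
# ceph daemon osd.137 perf dump        # osd.137 is just the worst entry above; bluestore/bluefs counters of that OSD
# ceph daemon osd.137 dump_historic_ops   # recent slow ops and where they spent their time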
On the other hand, if I test performance on a Linux VM running on
Proxmox that uses this cluster as a storage backend, I do not have the
feeling that it is slower than before, e.g. when I test IO performance
using bonnie++. It actually seems to be faster. But why then the higher
OSD latencies?
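For reference, a bonnie++ run of the kind mentioned above might look
roughly like this (directory, size and user are just example values;
the size should be well above the VM's RAM so the page cache does not
mask the storage backend):
# bonnie++ -d /mnt/testdir -s 16384 -n 0 -u root   # example values: 16 GB data set, file-creation tests disabled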
Does anyone have an idea why those latencies could have nearly doubled?
How can I find out more about this strangeness? Any ideas?
Thanks
Rainer
--
Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse 1
56070 Koblenz, Web: http://www.uni-koblenz.de/~krienke, Tel: +49261287 1312
PGP: http://www.uni-koblenz.de/~krienke/mypgp.html, Fax: +49261287 1001312
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx