Re: slow requests and high i/o / read rate on bluestore osds after upgrade 12.2.8 -> 12.2.10

On 1/18/19 9:22 AM, Nils Fahldieck - Profihost AG wrote:
Hello Mark,

I'm answering on behalf of Stefan.
On 18.01.19 at 00:22, Mark Nelson wrote:
On 1/17/19 4:06 PM, Stefan Priebe - Profihost AG wrote:
Hello Mark,

after reading
http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/

again I'm really confused about how exactly the memory behaviour
differs between 12.2.8 and 12.2.10.

Also, I stumbled upon "When tcmalloc and cache autotuning is enabled," -
we're compiling against and using jemalloc. What happens in this case?

Hi Stefan,


The autotuner uses the existing in-tree perfglue code that grabs the
tcmalloc heap and unmapped memory statistics to determine how to tune
the caches.  Theoretically we might be able to do the same thing for
jemalloc and maybe even glibc malloc, but there's no perfglue code for
those yet.  If the autotuner can't get heap statistics it won't try to
tune the caches and should instead revert to using the
bluestore_cache_size and whatever the ratios are (the same as if you set
bluestore_cache_autotune to false).
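As a rough ceph.conf sketch of that fallback mode (the ratio values
below are purely illustrative, not recommendations):

[osd]
bluestore_cache_autotune = false
bluestore_cache_size = 1073741824
# illustrative split: meta = onode metadata, kv = rocksdb block cache
bluestore_cache_meta_ratio = 0.4
bluestore_cache_kv_ratio = 0.4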

Thank you for that information on the difference between tcmalloc and
jemalloc. We compiled a new 12.2.10 version using tcmalloc. I upgraded a
cluster which was running _our_ old 12.2.10 version (the one using
jemalloc). This cluster has a very low load, so the jemalloc build
didn't trigger any performance problems. Prior to upgrading, no OSD
ever used more than 1 GB of RAM. After upgrading there are OSDs using
approx. 5.7 GB right now.

I also removed the 'osd_memory_target' option, which we falsely believed
had replaced 'bluestore_cache_size'.
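
To double-check which of the two options an OSD actually picked up, we
can query the running values via the admin socket (osd.NNN as a
placeholder):

ceph daemon osd.NNN config get osd_memory_target
ceph daemon osd.NNN config get bluestore_cache_size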

We still have to test this on a cluster generating more I/O load.

For now, this seems to be working fine. Thanks.

Also, I now see that 12.2.10 uses at most 1GB of memory while 12.2.8
uses 6-7GB (with bluestore_cache_size = 1073741824).

If you are using the autotuner (but it sounds like maybe you are not if
jemalloc is being used?) you'll want to set the osd_memory_target at
least 1GB higher than what you previously had the bluestore_cache_size
set to.  It's likely that trying to set the OSD to stay within 1GB of
memory will cause the cache to sit at osd_memory_cache_min because the
tuner simply can't shrink the cache enough to meet the target (too much
other memory consumed by pglog, rocksdb WAL buffers, random other stuff).
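
As a rough sketch of what I mean (the values here are illustrative):

[osd]
# the old bluestore_cache_size was 1GB, so give the whole OSD ~2GB
osd_memory_target = 2147483648
# floor below which the autotuner won't shrink the caches (128MB default)
osd_memory_cache_min = 134217728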

The fact that you see 6-7GB of mem usage with 12.2.8 vs 1GB with 12.2.10
sounds like a clue.  A bluestore OSD using 1GB of memory is going to
have very little space for cache and it's quite likely that it would be
performing reads from disk for a variety of reasons.  Getting to the
root of that might explain what's going on.  If you happen to still have
a 12.2.8 OSD up that's consuming 6-7GB of memory (with
bluestore_cache_size = 1073741824), can you dump the mempool stats and
running configuration for it?


This is one OSD from a different cluster using approximately 6.1 GB of
memory. This OSD and its cluster are still running version 12.2.8.

This OSD (and every other OSD running with 12.2.8) is still configured
with 'bluestore_cache_size = 1073741824'. Please see the following
pastebins:

ceph daemon osd.NNN dump_mempools
https://pastebin.com/Pdcrr4ut

And


ceph daemon osd.NNN show config
https://pastebin.com/nkKpNFU3
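
In case it's easier to scan for future dumps - assuming the admin
socket supports it in 12.2.8 (I believe it does), 'config diff' prints
only the values that differ from the defaults:

ceph daemon osd.NNN config diff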

Best Regards
Nils


Hi Nils,


Forgive me if you already said this, but is osd.32 backed by an SSD?  I believe what you are seeing is that the OSD is actually using 3GB of cache due to:


bluestore_cache_size_ssd = 3221225472


on line 132 of your show config paste.
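
A quick way to check which cache size variant wins on a given OSD is
to query the running values via the admin socket:

ceph daemon osd.NNN config get bluestore_cache_size
ceph daemon osd.NNN config get bluestore_cache_size_ssd
ceph daemon osd.NNN config get bluestore_cache_size_hdd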


That is backed up by the mempool data:


    "bluestore_cache_other": {
        "items": 62839413,
        "bytes": 2573767714
    },

    "total": {
        "items": 214595893,
        "bytes": 3087934707
    }


I.e., even though you guys set bluestore_cache_size to 1GB, it is being overridden by bluestore_cache_size_ssd.  Later when you compiled the tcmalloc version of 12.2.10 and set the osd_memory_target to 1GB, it was properly being applied and the autotuner desperately attempted to fit the entire OSD into 1GB of memory by shrinking all of the caches down to osd_memory_cache_min (128MB by default).  Ultimately that led to many reads from disk, as even the rocksdb bloom filters may not have properly fit into that small a cache.  Generally I think the absolute minimum osd_memory_target for bluestore is probably around 1.5-2GB (with potential performance penalties), but 3-4GB gives it a lot more breathing room.  If you are ok with the OSD taking up 6-7GB of memory you might set the osd_memory_target accordingly.
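
For example, something like this in ceph.conf (illustrative value,
matching the memory usage you observed under 12.2.8):

[osd]
osd_memory_target = 6442450944   # ~6GB

A restart picks that up; 'ceph tell osd.* injectargs' may also work at
runtime if the option is runtime-changeable in your build.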


The reason we wrote the autotuning code is to try to make all of this simpler and more explicit.  The idea is that a user shouldn't need to think about any of this beyond giving the OSD a target for how much memory it should consume, letting the OSD worry about figuring out how to use it.  We're still working on making it smarter, but the goal is for it to ultimately make better decisions about balancing caches than humans can.  It can keep track of what kind of data is currently hot and adjust the size of the caches accordingly.  E.g., for rgw bucket indexes it might be more important to cache OMAP data in the rocksdb block cache, while for small random writes to an rbd volume it might be far more important to have a larger bluestore onode cache.  The trick is that this adds overhead and also seems to make the memory allocator work harder, so there's some tradeoff involved.
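
If you want to watch it work, polling the mempool stats gives a rough
view of how the cache sizes move over time, e.g. something like:

watch -n 5 'ceph daemon osd.NNN dump_mempools'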


Mark



Thanks,

Mark


Greets,
Stefan

On 17.01.19 at 22:59, Stefan Priebe - Profihost AG wrote:
Hello Mark,

for whatever reason I didn't get your mails - most probably you kicked
me out of CC/TO and only sent to the ML? I had only subscribed to a
daily digest. (changed that for now)

So I'm very sorry for answering so late.

My messages might sound a bit confused, as the issue isn't easily
reproduced and we tried a lot of things to find out what's going on.

As 12.2.10 does not contain the pg hard limit, I don't suspect it is
related to that.

What I can tell right now is:

1.) Under 12.2.8 we've set bluestore_cache_size = 1073741824

2.) While upgrading to 12.2.10 we replaced it with osd_memory_target =
1073741824

3.) I also tried 12.2.10 without setting osd_memory_target or
bluestore_cache_size

4.) It's not kernel related - for some unknown reason it worked for a
few hours with a newer kernel, but the problems returned later

5.) A backfill of 6x 2TB SSDs took about 14 hours with 12.2.10, while
it took 2 hours with 12.2.8

6.) With 12.2.10 I see a constant 100% read I/O rate (400-500MB/s) on
most of my bluestore OSDs, while on 12.2.8 reads peak at 100KB - 2MB/s

7.) Upgrades on small clusters or fresh installs seem to work fine (no
idea why, or whether it is related to cluster size)

That's currently all I know.

Thanks a lot!

Greets,
Stefan
On 16.01.19 at 20:56, Stefan Priebe - Profihost AG wrote:
I reverted the whole cluster back to 12.2.8 - recovery speed had also
dropped from 300-400MB/s to 20MB/s on 12.2.10. So something is really
broken.

Greets,
Stefan
On 16.01.19 at 16:00, Stefan Priebe - Profihost AG wrote:
This is not the case with 12.2.8 - but it happens with 12.2.9 as well.
On 12.2.8, all pgs are instantly active after boot - no inactive pgs,
at least none noticeable in ceph -s.

With 12.2.9, 12.2.10, or even the current upstream/luminous branch, it
takes minutes until all pgs are active again.

Greets,
Stefan
On 16.01.19 at 15:22, Stefan Priebe - Profihost AG wrote:
Hello,

while digging into this further I saw that it takes ages until all pgs
are active. After starting the OSD, 3% of all pgs are inactive and it
takes minutes until they're active again.
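
For what it's worth, I'm watching this with ceph -s plus something
like the following to list the stuck pgs:

ceph pg dump_stuck inactive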

The log of the OSD is full of:


2019-01-16 15:19:13.568527 7fecbf7da700  0 osd.33 pg_epoch: 1318479 pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'61574721,1318474'61584855] local-lis/les=1318472/1318473 n=1912 ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 1318472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1 rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+recovering+degraded m=184 snaptrimq=[ec1a0~1,ec808~1] mbc={255={(2+0)=185,(3+0)=2}}] _update_calc_stats ml 185 upset size 3 up 2
2019-01-16 15:19:13.568637 7fecbf7da700  0 osd.33 pg_epoch: 1318479 pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'61574721,1318474'61584855] local-lis/les=1318472/1318473 n=1912 ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 1318472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1 rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+recovering+degraded m=184 snaptrimq=[ec1a0~1,ec808~1] mbc={255={(2+0)=185,(3+0)=2}}] _update_calc_stats ml 2 upset size 3 up 3
2019-01-16 15:19:15.909327 7fecbf7da700  0 osd.33 pg_epoch: 1318479 pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'61574721,1318474'61584855] local-lis/les=1318472/1318473 n=1912 ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 1318472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1 rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+recovering+degraded m=183 snaptrimq=[ec1a0~1,ec808~1] mbc={255={(2+0)=184,(3+0)=3}}] _update_calc_stats ml 184 upset size 3 up 2
2019-01-16 15:19:15.909446 7fecbf7da700  0 osd.33 pg_epoch: 1318479 pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'61574721,1318474'61584855] local-lis/les=1318472/1318473 n=1912 ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 1318472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1 rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+recovering+degraded m=183 snaptrimq=[ec1a0~1,ec808~1] mbc={255={(2+0)=184,(3+0)=3}}] _update_calc_stats ml 3 upset size 3 up 3
2019-01-16 15:19:23.503231 7fecb97ff700  0 osd.33 pg_epoch: 1318479 pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'61574721,1318474'61584855] local-lis/les=1318472/1318473 n=1912 ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 1318472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1 rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+recovering+degraded m=183 snaptrimq=[ec1a0~1,ec808~1] mbc={255={(2+0)=183,(3+0)=3}}] _update_calc_stats ml 183 upset size 3 up 2

Greets,
Stefan
On 16.01.19 at 09:12, Stefan Priebe - Profihost AG wrote:
Hi,

No, OK, it was not. The bug is still present. It only appeared to work
because the osdmap was so far behind that it had started a backfill
instead of a recovery.

So it happens only in the recovery case.

Greets,
Stefan

On 15.01.19 at 16:02, Stefan Priebe - Profihost AG wrote:
On 15.01.19 at 12:45, Marc Roos wrote:
   I upgraded this weekend from 12.2.8 to 12.2.10 without such issues
(OSDs are idle)
it turns out this was a kernel bug. Updating to a newer kernel has
solved this issue.

Greets,
Stefan


-----Original Message-----
From: Stefan Priebe - Profihost AG [mailto:s.priebe@xxxxxxxxxxxx]
Sent: 15 January 2019 10:26
To: ceph-users@xxxxxxxxxxxxxx
Cc: n.fahldieck@xxxxxxxxxxxx
Subject: Re:  slow requests and high i/o / read rate on bluestore osds after upgrade 12.2.8 -> 12.2.10

Hello list,

I also tested the current upstream/luminous branch and it happens there
as well. A clean install works fine. It only happens on upgraded
bluestore OSDs.

Greets,
Stefan

On 14.01.19 at 20:35, Stefan Priebe - Profihost AG wrote:
while trying to upgrade a cluster from 12.2.8 to 12.2.10 I'm
experiencing issues with bluestore OSDs - so I canceled the upgrade and
all bluestore OSDs are stopped now.

After starting a bluestore OSD I'm seeing a lot of slow requests caused
by very high read rates.


Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda              45,00   187,00  767,00   39,00 482040,00  8660,00  1217,62    58,16   74,60   73,85   89,23   1,24 100,00

The OSD reads constantly at ~500MB/s from the disk and can't service
client requests. The overall client read rate is at 10.9MiB/s rd.
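
For reference, the output above is iostat's extended view; assuming
sysstat is installed, something like the following shows it live while
an OSD is busy:

iostat -x 1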

I can't reproduce this with 12.2.8. Is this a known bug /
regression?

Greets,
Stefan

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



