Re: slow requests and high i/o / read rate on bluestore osds after upgrade 12.2.8 -> 12.2.10

Mark Nelson <mnelson@xxxxxxxxxx> · Thu, 17 Jan 2019 17:22:50 -0600

On 1/17/19 4:06 PM, Stefan Priebe - Profihost AG wrote:
Hello Mark,

after reading
http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/
again, i'm really confused about how exactly memory behaves under
12.2.8 versus 12.2.10.

Also i stumbled upon "When tcmalloc and cache autotuning is enabled," -
we're compiling against and using jemalloc. What happens in this case?

Hi Stefan,

The autotuner uses the existing in-tree perfglue code that grabs the 
tcmalloc heap and unmapped memory statistics to determine how to tune 
the caches.  Theoretically we might be able to do the same thing for 
jemalloc and maybe even glibc malloc, but there's no perfglue code for 
those yet.  If the autotuner can't get heap statistics it won't try to 
tune the caches and should instead revert to using the 
bluestore_cache_size and whatever the ratios are (the same as if you set 
bluestore_cache_autotune to false).

Also i saw now - that 12.2.10 uses 1GB mem max while 12.2.8 uses 6-7GB
Mem (with bluestore_cache_size = 1073741824).

If you are using the autotuner (but it sounds like maybe you are not if 
jemalloc is being used?) you'll want to set the osd_memory_target at 
least 1GB higher than what you previously had the bluestore_cache_size 
set to.  It's likely that trying to set the OSD to stay within 1GB of 
memory will cause the cache to sit at osd_memory_cache_min because the 
tuner simply can't shrink the cache enough to meet the target (too much 
other memory consumed by pglog, rocksdb WAL buffers, random other stuff).
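To make the arithmetic above concrete, here is a minimal Python sketch. It is not the autotuner's actual algorithm; it only models the idea that the cache gets whatever is left of osd_memory_target after everything else, floored at osd_memory_cache_min. The function names, the 128 MiB floor, and the 1.5 GiB figure for non-cache memory are assumptions chosen for illustration.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def cache_budget(memory_target, non_cache_usage, cache_min=128 * MIB):
    """Simplified model: the cache gets the remainder of the target,
    but is never shrunk below cache_min (assumed value)."""
    return max(cache_min, memory_target - non_cache_usage)


def suggested_memory_target(old_bluestore_cache_size, headroom=1 * GIB):
    """Lower bound from the advice above: previous cache size plus at least 1 GiB."""
    return old_bluestore_cache_size + headroom


old_cache = 1073741824  # bluestore_cache_size used under 12.2.8 in this thread

# With a 1 GiB target and a hypothetical 1.5 GiB of pglog, rocksdb buffers,
# etc., nothing is left for the cache, so it sits at the floor.
print(cache_budget(1 * GIB, int(1.5 * GIB)) // MIB)  # 128

# Minimum target that keeps roughly the old cache size available.
print(suggested_memory_target(old_cache) // GIB)  # 2
```

Under these assumptions a 1 GiB target pins the cache at its minimum, while a target of 2 GiB or more leaves room comparable to the old 1 GiB cache setting.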

The fact that you see 6-7GB of mem usage with 12.2.8 vs 1GB with 12.2.10 
sounds like a clue.  A bluestore OSD using 1GB of memory is going to 
have very little space for cache and it's quite likely that it would be 
performing reads from disk for a variety of reasons.  Getting to the 
root of that might explain what's going on.  If you happen to still have 
a 12.2.8 OSD up that's consuming 6-7GB of memory (with 
bluestore_cache_size = 1073741824), can you dump the mempool stats and 
running configuration for it?

ceph daemon osd.NNN dump_mempools

And

ceph daemon osd.NNN show config
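Once the mempool dump is saved as JSON, a small script can rank the pools by size. This is only a sketch: it assumes the output maps each pool name to an object with "items" and "bytes" fields, either at the top level next to a "total" entry or nested under "mempool" / "by_pool". The exact layout may differ between releases, and the sample numbers below are made up, not taken from this cluster.

```python
import json


def pools_from_dump(dump):
    """Return the pool-name -> stats mapping from a parsed dump.

    Assumption: stats are either nested under mempool/by_pool or sit at
    the top level alongside a 'total' entry.
    """
    if "mempool" in dump and "by_pool" in dump["mempool"]:
        return dump["mempool"]["by_pool"]
    return {name: stats for name, stats in dump.items() if name != "total"}


def largest_pools(dump, top=5):
    """List (pool name, bytes) pairs, largest first."""
    pools = pools_from_dump(dump)
    ranked = sorted(pools.items(), key=lambda kv: kv[1]["bytes"], reverse=True)
    return [(name, stats["bytes"]) for name, stats in ranked[:top]]


if __name__ == "__main__":
    # Illustrative sample only; real output has many more pools.
    sample = json.loads(
        '{"bluestore_cache_data": {"items": 10, "bytes": 4096},'
        ' "osd_pglog": {"items": 200, "bytes": 65536},'
        ' "total": {"items": 210, "bytes": 69632}}'
    )
    for name, nbytes in largest_pools(sample):
        print(f"{name:28s} {nbytes / 2**20:10.4f} MiB")
```

To use it on a real OSD, redirect the output of ceph daemon osd.NNN dump_mempools to a file and load that file with json.load instead of the sample.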

Thanks,

Mark

Greets,
Stefan

Am 17.01.19 um 22:59 schrieb Stefan Priebe - Profihost AG:
Hello Mark,

for whatever reason i didn't get your mails - most probably you kicked
me out of CC/TO and only sent to the ML? I've only subscribed to a daily
digest. (changed that for now)

So i'm very sorry to answer so late.

My messages might sound a bit confused, as the problem isn't easy to
reproduce and we tried a lot to find out what's going on.

As 12.2.10 does not contain the pg hard limit i don't suspect it is
related to it.

What i can tell right now is:

1.) Under 12.2.8 we've set bluestore_cache_size = 1073741824

2.) While upgrading to 12.2.10 we replaced it with
osd_memory_target = 1073741824

3.) i also tried 12.2.10 without setting osd_memory_target or
bluestore_cache_size

4.) it's not kernel related - for some unknown reason it worked for some
hours with a newer kernel but gave problems again later

5.) a backfill of 6x 2TB SSDs took about 14 hours with 12.2.10 while it
took 2 hours with 12.2.8

6.) with 12.2.10 i have a constant rate of 100% read i/o (400-500MB/s)
on most of my bluestore OSDs - while on 12.2.8 i see 100kB/s - 2MB/s
read at most.

7.) upgrades on small clusters or fresh installs seem to work fine. (no
idea why, or whether it is related to cluster size)

That's currently all i know.

Thanks a lot!

Greets,
Stefan
Am 16.01.19 um 20:56 schrieb Stefan Priebe - Profihost AG:
i reverted the whole cluster back to 12.2.8 - recovery speed also
dropped from 300-400MB/s to 20MB/s on 12.2.10. So something is really
broken.

Greets,
Stefan
Am 16.01.19 um 16:00 schrieb Stefan Priebe - Profihost AG:
This is not the case with 12.2.8, but it happens with 12.2.9 as well.
With 12.2.8 all pgs are instantly active after boot - no inactive pgs,
at least none noticeable in ceph -s.

With 12.2.9 or 12.2.10 or even current upstream/luminous it takes
minutes until all pgs are active again.

Greets,
Stefan
Am 16.01.19 um 15:22 schrieb Stefan Priebe - Profihost AG:
Hello,

while digging into this further i saw that it takes ages until all pgs
are active. After starting the OSD 3% of all pgs are inactive and it
takes minutes until they're active.

The log of the OSD is full of:

2019-01-16 15:19:13.568527 7fecbf7da700  0 osd.33 pg_epoch: 1318479 pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'61574721,1318474'61584855] local-lis/les=1318472/1318473 n=1912 ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 1318472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1 rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+recovering+degraded m=184 snaptrimq=[ec1a0~1,ec808~1] mbc={255={(2+0)=185,(3+0)=2}}] _update_calc_stats ml 185 upset size 3 up 2
2019-01-16 15:19:13.568637 7fecbf7da700  0 osd.33 pg_epoch: 1318479 pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'61574721,1318474'61584855] local-lis/les=1318472/1318473 n=1912 ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 1318472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1 rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+recovering+degraded m=184 snaptrimq=[ec1a0~1,ec808~1] mbc={255={(2+0)=185,(3+0)=2}}] _update_calc_stats ml 2 upset size 3 up 3
2019-01-16 15:19:15.909327 7fecbf7da700  0 osd.33 pg_epoch: 1318479 pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'61574721,1318474'61584855] local-lis/les=1318472/1318473 n=1912 ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 1318472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1 rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+recovering+degraded m=183 snaptrimq=[ec1a0~1,ec808~1] mbc={255={(2+0)=184,(3+0)=3}}] _update_calc_stats ml 184 upset size 3 up 2
2019-01-16 15:19:15.909446 7fecbf7da700  0 osd.33 pg_epoch: 1318479 pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'61574721,1318474'61584855] local-lis/les=1318472/1318473 n=1912 ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 1318472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1 rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+recovering+degraded m=183 snaptrimq=[ec1a0~1,ec808~1] mbc={255={(2+0)=184,(3+0)=3}}] _update_calc_stats ml 3 upset size 3 up 3
2019-01-16 15:19:23.503231 7fecb97ff700  0 osd.33 pg_epoch: 1318479 pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'61574721,1318474'61584855] local-lis/les=1318472/1318473 n=1912 ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 1318472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1 rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+recovering+degraded m=183 snaptrimq=[ec1a0~1,ec808~1] mbc={255={(2+0)=183,(3+0)=3}}] _update_calc_stats ml 183 upset size 3 up 2
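When a log is flooded with entries like these, it can help to pull out only the counters that change between them. The following Python sketch extracts the numbers after m=, ml, upset size and up from one entry. The regular expression and function name are my own, the sample is the tail of the first entry above with its wrapped lines joined, and no meaning is assigned to the fields here.

```python
import re

# Matches the trailing part of an entry: "m=<n> ... _update_calc_stats ml <n> upset size <n> up <n>"
CALC_STATS = re.compile(
    r"\bm=(?P<m>\d+)\s.*?_update_calc_stats ml (?P<ml>\d+) "
    r"upset size (?P<upset_size>\d+) up (?P<up>\d+)"
)


def parse_calc_stats(line):
    """Return the numeric fields of one log entry, or None if it does not match."""
    match = CALC_STATS.search(line)
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items()}


line = (
    "active+recovering+degraded m=184 snaptrimq=[ec1a0~1,ec808~1] "
    "mbc={255={(2+0)=185,(3+0)=2}}] _update_calc_stats ml 185 upset size 3 up 2"
)
print(parse_calc_stats(line))
# {'m': 184, 'ml': 185, 'upset_size': 3, 'up': 2}
```

Note that it expects each entry on a single line, so entries wrapped by a mail client have to be rejoined first.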

Greets,
Stefan
Am 16.01.19 um 09:12 schrieb Stefan Priebe - Profihost AG:
Hi,

no ok it was not. The bug is still present. It only worked because the
osdmap was so far away that it started backfill instead of recovery.

So it happens only in the recovery case.

Greets,
Stefan

Am 15.01.19 um 16:02 schrieb Stefan Priebe - Profihost AG:
Am 15.01.19 um 12:45 schrieb Marc Roos:

I upgraded this weekend from 12.2.8 to 12.2.10 without such issues
(osd's are idle)

it turns out this was a kernel bug. Updating to a newer kernel has
solved this issue.

Greets,
Stefan

-----Original Message-----
From: Stefan Priebe - Profihost AG [mailto:s.priebe@xxxxxxxxxxxx]
Sent: 15 January 2019 10:26
To: ceph-users@xxxxxxxxxxxxxx
Cc: n.fahldieck@xxxxxxxxxxxx
Subject: Re: slow requests and high i/o / read rate on bluestore osds after upgrade 12.2.8 -> 12.2.10

Hello list,

i also tested current upstream/luminous branch and it happens as well. A
clean install works fine. It only happens on upgraded bluestore osds.

Greets,
Stefan

Am 14.01.19 um 20:35 schrieb Stefan Priebe - Profihost AG:
while trying to upgrade a cluster from 12.2.8 to 12.2.10 i'm experiencing
issues with bluestore osds - so i canceled the upgrade and all bluestore
osds are stopped now.

After starting a bluestore osd i'm seeing a lot of slow requests caused
by very high read rates.

Device:   rrqm/s  wrqm/s     r/s     w/s     rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda        45,00  187,00  767,00   39,00 482040,00  8660,00  1217,62    58,16   74,60   73,85   89,23   1,24 100,00

It reads constantly at about 500MB/s from the disk and can't service
client requests. The overall client read rate is 10.9MiB/s.
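As a quick sanity check on these figures, the sketch below converts the rkB/s value from the iostat output into MiB/s and compares it with the reported client read rate. It assumes iostat's kB means 1024 bytes; with 1000-byte kilobytes the result would be about 482 MB/s instead. The helper names are hypothetical.

```python
def parse_decimal_comma(text):
    """Parse a number printed with a decimal comma, e.g. '482040,00'."""
    return float(text.replace(",", "."))


def kb_to_mib(kb_per_s):
    # Assumption: one iostat kB is 1024 bytes, so 1024 kB make one MiB.
    return kb_per_s / 1024


disk_read_mib = kb_to_mib(parse_decimal_comma("482040,00"))  # rkB/s for sda
client_read_mib = 10.9  # overall client read rate quoted in this message

print(f"disk read: {disk_read_mib:.1f} MiB/s")              # disk read: 470.7 MiB/s
print(f"ratio: {disk_read_mib / client_read_mib:.0f}x")     # ratio: 43x
```

So this one disk is reading roughly 43 times more data than all clients together are receiving.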

I can't reproduce this with 12.2.8. Is this a known bug / regression?

Greets,
Stefan

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
