Re: slow requests and high i/o / read rate on bluestore osds after upgrade 12.2.8 -> 12.2.10

Nils Fahldieck - Profihost AG <n.fahldieck@xxxxxxxxxxxx> · Fri, 18 Jan 2019 16:25:02 +0100

Hello Mark,

I'm answering on behalf of Stefan.
On 18.01.19 at 00:22, Mark Nelson wrote:
> 
> On 1/17/19 4:06 PM, Stefan Priebe - Profihost AG wrote:
>> Hello Mark,
>>
>> after reading
>> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/
>>
>> again, I'm really confused about how exactly memory is handled under
>> 12.2.8 compared with 12.2.10.
>>
>> I also stumbled upon "When tcmalloc and cache autotuning is enabled," -
>> we're compiling against and using jemalloc. What happens in this case?
> 
> 
> Hi Stefan,
> 
> 
> The autotuner uses the existing in-tree perfglue code that grabs the
> tcmalloc heap and unmapped memory statistics to determine how to tune
> the caches.  Theoretically we might be able to do the same thing for
> jemalloc and maybe even glibc malloc, but there's no perfglue code for
> those yet.  If the autotuner can't get heap statistics it won't try to
> tune the caches and should instead revert to using the
> bluestore_cache_size and whatever the ratios are (the same as if you set
> bluestore_cache_autotune to false).
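
To make the fallback described above concrete, here is a minimal Python
sketch. It is only a model of the behaviour as Mark describes it, not Ceph
code: the function name and the shape of the heap statistics are made up
for illustration, and only the option names come from this thread.

```python
from typing import Optional


def effective_cache_setting(autotune_enabled: bool,
                            heap_stats: Optional[dict]) -> str:
    """Return the option that governs the BlueStore cache size.

    heap_stats is None when the allocator offers no heap statistics
    (per the explanation above: jemalloc or glibc malloc, for which
    there is no perfglue code yet).
    """
    if autotune_enabled and heap_stats is not None:
        # Autotuner can see heap usage, so it tunes towards the target.
        return "osd_memory_target"
    # No statistics (or autotune off): static cache size plus ratios.
    return "bluestore_cache_size"


# tcmalloc build: statistics available (values are hypothetical).
print(effective_cache_setting(True, {"heap_bytes": 2 << 30, "unmapped_bytes": 0}))
# jemalloc build: no statistics, same result as autotune = false.
print(effective_cache_setting(True, None))
```

In this model a jemalloc build behaves exactly like a build with
bluestore_cache_autotune set to false, which is the case we were in.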

Thank you for the information on the difference between tcmalloc and
jemalloc. We compiled a new 12.2.10 build using tcmalloc and used it to
upgrade a cluster that had been running _our_ old 12.2.10 build (which
used jemalloc). This cluster has a very low load, so the jemalloc build
never triggered any performance problems there. Prior to the upgrade, a
single OSD never used more than 1 GB of RAM. After the upgrade, there
are OSDs using approx. 5.7 GB right now.

I also removed the 'osd_memory_target' option, which we had wrongly
believed to be the replacement for 'bluestore_cache_size'.

We still have to test this on a cluster generating more I/O load.

For now, this seems to be working fine. Thanks.

> 
>>
>> I also just noticed that 12.2.10 uses at most 1GB of memory, while 12.2.8
>> uses 6-7GB (with bluestore_cache_size = 1073741824).
> 
> 
> If you are using the autotuner (but it sounds like maybe you are not if
> jemalloc is being used?) you'll want to set the osd_memory_target at
> least 1GB higher than what you previously had the bluestore_cache_size
> set to.  It's likely that trying to set the OSD to stay within 1GB of
> memory will cause the cache to sit at osd_memory_cache_min because the
> tuner simply can't shrink the cache enough to meet the target (too much
> other memory consumed by pglog, rocksdb WAL buffers, random other stuff).
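
The point about a 1GB target can be illustrated with a bit of arithmetic.
This is a deliberately simplified model, not the actual autotuner (which
adjusts the cache iteratively): it assumes the cache gets whatever is left
of the target after the non-cache memory, but never less than
osd_memory_cache_min. The 128 MiB minimum and the 1 GiB of non-cache
memory are assumed values for illustration only.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def cache_budget(memory_target: int, non_cache_bytes: int, cache_min: int) -> int:
    """Bytes left for the cache under a memory target, clamped to a minimum."""
    return max(cache_min, memory_target - non_cache_bytes)


CACHE_MIN = 128 * MIB   # assumed value for osd_memory_cache_min
NON_CACHE = 1 * GIB     # hypothetical: pglog, rocksdb WAL buffers, etc.

# osd_memory_target = 1073741824, as we set it during the upgrade:
print(cache_budget(1 * GIB, NON_CACHE, CACHE_MIN) // MIB, "MiB of cache")

# Target 1GB above the old bluestore_cache_size, as suggested:
print(cache_budget(2 * GIB, NON_CACHE, CACHE_MIN) // MIB, "MiB of cache")
```

Under these assumptions the 1GB target leaves the cache pinned at the
minimum, while a 2GB target leaves room for roughly the 1GB of cache we
had configured on 12.2.8.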
> 
> The fact that you see 6-7GB of mem usage with 12.2.8 vs 1GB with 12.2.10
> sounds like a clue.  A bluestore OSD using 1GB of memory is going to
> have very little space for cache and it's quite likely that it would be
> performing reads from disk for a variety of reasons.  Getting to the
> root of that might explain what's going on.  If you happen to still have
> a 12.2.8 OSD up that's consuming 6-7GB of memory (with
> bluestore_cache_size = 1073741824), can you dump the mempool stats and
> running configuration for it?

This is one OSD from a different cluster using approximately 6.1 GB of
memory. This OSD and its cluster are still running version 12.2.8.

This OSD (and every other OSD running with 12.2.8) is still configured
with 'bluestore_cache_size = 1073741824'. Please see the following
pastebins:

> 
> 
> ceph daemon osd.NNN dump_mempools
https://pastebin.com/Pdcrr4ut
> 
> 
> And
> 
> 
> ceph daemon osd.NNN show config
https://pastebin.com/nkKpNFU3
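
In case it helps anyone comparing such dumps, here is a small Python
sketch for ranking the pools by size. It assumes the dump_mempools output
is a JSON object mapping pool names to {"items", "bytes"}, either at the
top level or nested under "mempool" / "by_pool"; the layout may differ
between releases. The sample numbers below are invented and are not taken
from the pastebin.

```python
import json


def top_mempools(dump: dict, n: int = 3):
    """Return the n largest pools as (name, bytes), largest first."""
    # Accept both a flat layout and one nested under mempool/by_pool.
    pools = dump.get("mempool", {}).get("by_pool", dump)
    rows = [
        (name, stats["bytes"])
        for name, stats in pools.items()
        if name != "total" and isinstance(stats, dict) and "bytes" in stats
    ]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:n]


# Hypothetical sample, e.g. saved via: ceph daemon osd.NNN dump_mempools
sample_json = """
{
  "bluestore_cache_onode": {"items": 1000, "bytes": 600000},
  "bluestore_cache_other": {"items": 5000, "bytes": 900000},
  "osd_pglog": {"items": 20000, "bytes": 3000000},
  "total": {"items": 26000, "bytes": 4500000}
}
"""

for name, size in top_mempools(json.loads(sample_json), 2):
    print(f"{name}: {size / 1024 ** 2:.1f} MiB")
```

Comparing the mempool total with the resident size of the OSD process
gives a rough idea of how much memory is held outside the tracked pools.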
> 
Best Regards
Nils
> 
> Thanks,
> 
> Mark
> 
> 
>>
>> Greets,
>> Stefan
>>
>> On 17.01.19 at 22:59, Stefan Priebe - Profihost AG wrote:
>>> Hello Mark,
>>>
>>> for whatever reason I didn't get your mails - most probably you dropped
>>> me from CC/TO and only sent to the ML? I had only subscribed to the daily
>>> digest (I've changed that now).
>>>
>>> So I'm very sorry for answering so late.
>>>
>>> My messages might sound a bit confused, as this isn't easy to reproduce
>>> and we tried a lot to find out what's going on.
>>>
>>> As 12.2.10 does not contain the pg hard limit, I don't suspect that the
>>> problem is related to it.
>>>
>>> What i can tell right now is:
>>>
>>> 1.) Under 12.2.8 we've set bluestore_cache_size = 1073741824
>>>
>>> 2.) While upgrading to 12.2.10 we replaced it with osd_memory_target =
>>> 1073741824
>>>
>>> 3.) i also tried 12.2.10 without setting osd_memory_target or
>>> bluestore_cache_size
>>>
>>> 4.) It's not kernel related - for some unknown reason it worked for some
>>> hours with a newer kernel but gave problems again later
>>>
>>> 5.) A backfill of 6x 2TB SSDs took about 14 hours with 12.2.10, while it
>>> took 2 hours with 12.2.8
>>>
>>> 6.) With 12.2.10 I have a constant rate of 100% read I/O (400-500MB/s)
>>> on most of my bluestore OSDs, while with 12.2.8 I see 100kB/s - 2MB/s
>>> read at most.
>>>
>>> 7.) Upgrades on small clusters or fresh installs seem to work fine (no
>>> idea why, or whether it is related to cluster size).
>>>
>>> That's currently all i know.
>>>
>>> Thanks a lot!
>>>
>>> Greets,
>>> Stefan
>>> On 16.01.19 at 20:56, Stefan Priebe - Profihost AG wrote:
>>>> I reverted the whole cluster back to 12.2.8 - on 12.2.10 the recovery
>>>> speed had also dropped from 300-400MB/s to 20MB/s. So something is
>>>> really broken.
>>>>
>>>> Greets,
>>>> Stefan
>>>> On 16.01.19 at 16:00, Stefan Priebe - Profihost AG wrote:
>>>>> This is not the case with 12.2.8 - but it happens with 12.2.9 as well.
>>>>> With 12.2.8, all pgs are instantly active after boot - no inactive pgs,
>>>>> at least none noticeable in ceph -s.
>>>>>
>>>>> With 12.2.9 or 12.2.10 or even current upstream/luminous it takes
>>>>> minutes until all pgs are active again.
>>>>>
>>>>> Greets,
>>>>> Stefan
>>>>> On 16.01.19 at 15:22, Stefan Priebe - Profihost AG wrote:
>>>>>> Hello,
>>>>>>
>>>>>> while digging into this further I saw that it takes ages until all pgs
>>>>>> are active. After starting the OSD, 3% of all pgs are inactive and it
>>>>>> takes minutes until they're active.
>>>>>>
>>>>>> The log of the OSD is full of:
>>>>>>
>>>>>>
>>>>>> 2019-01-16 15:19:13.568527 7fecbf7da700  0 osd.33 pg_epoch: 1318479 pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'61574721,1318474'61584855] local-lis/les=1318472/1318473 n=1912 ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 1318472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1 rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+recovering+degraded m=184 snaptrimq=[ec1a0~1,ec808~1] mbc={255={(2+0)=185,(3+0)=2}}] _update_calc_stats ml 185 upset size 3 up 2
>>>>>> 2019-01-16 15:19:13.568637 7fecbf7da700  0 osd.33 pg_epoch: 1318479 pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'61574721,1318474'61584855] local-lis/les=1318472/1318473 n=1912 ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 1318472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1 rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+recovering+degraded m=184 snaptrimq=[ec1a0~1,ec808~1] mbc={255={(2+0)=185,(3+0)=2}}] _update_calc_stats ml 2 upset size 3 up 3
>>>>>> 2019-01-16 15:19:15.909327 7fecbf7da700  0 osd.33 pg_epoch: 1318479 pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'61574721,1318474'61584855] local-lis/les=1318472/1318473 n=1912 ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 1318472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1 rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+recovering+degraded m=183 snaptrimq=[ec1a0~1,ec808~1] mbc={255={(2+0)=184,(3+0)=3}}] _update_calc_stats ml 184 upset size 3 up 2
>>>>>> 2019-01-16 15:19:15.909446 7fecbf7da700  0 osd.33 pg_epoch: 1318479 pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'61574721,1318474'61584855] local-lis/les=1318472/1318473 n=1912 ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 1318472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1 rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+recovering+degraded m=183 snaptrimq=[ec1a0~1,ec808~1] mbc={255={(2+0)=184,(3+0)=3}}] _update_calc_stats ml 3 upset size 3 up 3
>>>>>> 2019-01-16 15:19:23.503231 7fecb97ff700  0 osd.33 pg_epoch: 1318479 pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'61574721,1318474'61584855] local-lis/les=1318472/1318473 n=1912 ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 1318472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1 rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+recovering+degraded m=183 snaptrimq=[ec1a0~1,ec808~1] mbc={255={(2+0)=183,(3+0)=3}}] _update_calc_stats ml 183 upset size 3 up 2
>>>>>>
>>>>>> Greets,
>>>>>> Stefan
>>>>>> On 16.01.19 at 09:12, Stefan Priebe - Profihost AG wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> no, it was not OK after all. The bug is still present. It only appeared
>>>>>>> to work because the osdmap was so far behind that the OSD started a
>>>>>>> backfill instead of recovery.
>>>>>>>
>>>>>>> So it happens only in the recovery case.
>>>>>>>
>>>>>>> Greets,
>>>>>>> Stefan
>>>>>>>
>>>>>>> On 15.01.19 at 16:02, Stefan Priebe - Profihost AG wrote:
>>>>>>>> On 15.01.19 at 12:45, Marc Roos wrote:
>>>>>>>>>   I upgraded this weekend from 12.2.8 to 12.2.10 without such issues
>>>>>>>>> (osd's are idle)
>>>>>>>>
>>>>>>>> It turns out this was a kernel bug. Updating to a newer kernel has
>>>>>>>> solved this issue.
>>>>>>>>
>>>>>>>> Greets,
>>>>>>>> Stefan
>>>>>>>>
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Stefan Priebe - Profihost AG [mailto:s.priebe@xxxxxxxxxxxx]
>>>>>>>>> Sent: 15 January 2019 10:26
>>>>>>>>> To: ceph-users@xxxxxxxxxxxxxx
>>>>>>>>> Cc: n.fahldieck@xxxxxxxxxxxx
>>>>>>>>> Subject: Re:  slow requests and high i/o / read rate on
>>>>>>>>> bluestore osds after upgrade 12.2.8 -> 12.2.10
>>>>>>>>>
>>>>>>>>> Hello list,
>>>>>>>>>
>>>>>>>>> I also tested the current upstream/luminous branch and it happens
>>>>>>>>> there as well. A clean install works fine. It only happens on
>>>>>>>>> upgraded bluestore osds.
>>>>>>>>>
>>>>>>>>> Greets,
>>>>>>>>> Stefan
>>>>>>>>>
>>>>>>>>> On 14.01.19 at 20:35, Stefan Priebe - Profihost AG wrote:
>>>>>>>>>> While trying to upgrade a cluster from 12.2.8 to 12.2.10 I'm
>>>>>>>>>> experiencing issues with bluestore osds - so I canceled the upgrade
>>>>>>>>>> and all bluestore osds are stopped now.
>>>>>>>>>>
>>>>>>>>>> After starting a bluestore osd I'm seeing a lot of slow requests
>>>>>>>>>> caused by very high read rates.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Device:  rrqm/s wrqm/s    r/s   w/s     rkB/s   wkB/s avgrq-sz avgqu-sz await r_await w_await svctm  %util
>>>>>>>>>> sda       45,00 187,00 767,00 39,00 482040,00 8660,00  1217,62    58,16 74,60   73,85   89,23  1,24 100,00
>>>>>>>>>>
>>>>>>>>>> It reads constantly at about 500MB/s from the disk and can't service
>>>>>>>>>> client requests. The overall client read rate is only 10.9MiB/s.
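
To put the two numbers above side by side, here is a short Python sketch
that parses the iostat row (with its decimal commas) and compares the disk
read rate with the client read rate. The helper name is made up for
illustration. Note that the comparison is between a single disk and the
client rate of the whole cluster, so it understates the gap.

```python
# Header and row as printed by iostat above (decimal commas kept as-is).
header = ("rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz "
          "await r_await w_await svctm %util").split()
row = ("sda 45,00 187,00 767,00 39,00 482040,00 8660,00 "
       "1217,62 58,16 74,60 73,85 89,23 1,24 100,00").split()


def parse_iostat_row(header, row):
    """Map column names to floats, converting decimal commas to points."""
    device, values = row[0], row[1:]
    stats = {name: float(value.replace(",", "."))
             for name, value in zip(header, values)}
    return device, stats


device, stats = parse_iostat_row(header, row)
disk_read_mib = stats["rkB/s"] / 1024   # kB/s -> MiB/s
client_read_mib = 10.9                  # cluster-wide rate quoted above
ratio = disk_read_mib / client_read_mib

print(f"{device}: {disk_read_mib:.0f} MiB/s read from disk, "
      f"{ratio:.0f}x the client read rate")
```

Even this one disk reads dozens of times more data than all clients
together request, so most of the reads must come from the OSD itself
rather than from client I/O.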
>>>>>>>>>>
>>>>>>>>>> I can't reproduce this with 12.2.8. Is this a known bug /
>>>>>>>>>> regression?
>>>>>>>>>>
>>>>>>>>>> Greets,
>>>>>>>>>> Stefan
>>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> ceph-users mailing list
>>>>>>>>> ceph-users@xxxxxxxxxxxxxx
>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>>>>
>>>>>>>>>