Re: slow requests and high i/o / read rate on bluestore osds after upgrade 12.2.8 -> 12.2.10

Hello Mark,

after reading
http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/
again, I'm still confused about how memory handling actually differs
between 12.2.8 and 12.2.10.

I also stumbled upon "When tcmalloc and cache autotuning is enabled" -
we're compiling against and using jemalloc. What happens in that case?
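
In case it matters for the discussion: a quick way to double-check which
allocator a ceph-osd binary is actually linked against is to inspect its
shared libraries (the path below is just the usual install location, adjust
as needed):

  # does ceph-osd pull in tcmalloc or jemalloc?
  ldd /usr/bin/ceph-osd | grep -Ei 'tcmalloc|jemalloc'

This won't catch an allocator injected via LD_PRELOAD, so it's only a rough
check.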

I have also noticed that 12.2.10 uses at most 1GB of memory per OSD, while
12.2.8 uses 6-7GB (with bluestore_cache_size = 1073741824).
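
For reference, something like this should show what each daemon thinks its
settings are and how much memory its caches actually hold (osd.33 is just an
example id; on 12.2.8 the osd_memory_target option may not exist yet):

  # settings as seen by the running daemon
  ceph daemon osd.33 config get bluestore_cache_size
  ceph daemon osd.33 config get osd_memory_target
  # actual memory held by the bluestore caches and other pools
  ceph daemon osd.33 dump_mempools

I assume the mempool totals should roughly explain the 1GB vs. 6-7GB
difference, but I haven't verified that in detail.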

Greets,
Stefan

On 17.01.19 at 22:59, Stefan Priebe - Profihost AG wrote:
> Hello Mark,
> 
> for whatever reason I didn't get your mails - most probably you dropped
> me from CC/TO and only sent to the ML? I'm only subscribed to a daily
> digest. (I've changed that for now.)
> 
> So I'm very sorry for answering so late.
> 
> My messages might sound a bit confusing, as the problem isn't easily
> reproduced and we have tried a lot of things to find out what's going on.
> 
> As 12.2.10 does not contain the pg hard limit, I don't suspect it is
> related to that.
> 
> What I can tell right now is:
> 
> 1.) Under 12.2.8 we've set bluestore_cache_size = 1073741824
> 
> 2.) While upgrading to 12.2.10 we replaced it with osd_memory_target =
> 1073741824 (a config sketch follows below the list)
> 
> 3.) I also tried 12.2.10 without setting osd_memory_target or
> bluestore_cache_size
> 
> 4.) It's not kernel related - for some unknown reason it worked for a few
> hours with a newer kernel, but the problems came back later
> 
> 5.) A backfill of 6x 2TB SSDs took about 14 hours with 12.2.10, while it
> took 2 hours with 12.2.8
> 
> 6.) With 12.2.10 I see a constant 100% read I/O utilization (400-500MB/s)
> on most of my bluestore OSDs, while with 12.2.8 reads peak at
> 100kB - 2MB/s
> 
> 7.) Upgrades on small clusters or fresh installs seem to work fine (no
> idea why, or whether it is related to cluster size)
> 
> That's all I currently know.
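> 
> For clarity, the difference between 1.) and 2.) amounts to something like
> this in ceph.conf (section and comments added by me, values as above):
> 
> [osd]
> # 12.2.8: fixed bluestore cache budget per OSD
> #bluestore_cache_size = 1073741824
> # 12.2.10: memory target for the whole OSD process, cache is autotuned
> osd_memory_target = 1073741824
> 
> i.e. the same 1 GiB value, but as far as I understand it now caps the whole
> OSD process rather than just the bluestore cache.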
> 
> Thanks a lot!
> 
> Greets,
> Stefan
> On 16.01.19 at 20:56, Stefan Priebe - Profihost AG wrote:
>> I reverted the whole cluster back to 12.2.8 - recovery speed had also
>> dropped from 300-400MB/s to 20MB/s on 12.2.10. So something is really
>> broken.
>>
>> Greets,
>> Stefan
>> On 16.01.19 at 16:00, Stefan Priebe - Profihost AG wrote:
>>> This is not the case with 12.2.8, but it happens with 12.2.9 as well. With
>>> 12.2.8 all pgs are instantly active after boot - no inactive pgs, at least
>>> none noticeable in ceph -s.
>>>
>>> With 12.2.9, 12.2.10 or even the current upstream/luminous branch it takes
>>> minutes until all pgs are active again.
>>>
>>> Greets,
>>> Stefan
>>> On 16.01.19 at 15:22, Stefan Priebe - Profihost AG wrote:
>>>> Hello,
>>>>
>>>> While digging into this further I saw that it takes ages until all pgs
>>>> are active. After starting the OSD, 3% of all pgs are inactive and it
>>>> takes minutes until they are active again (commands to watch this are
>>>> sketched after the log excerpt below).
>>>>
>>>> The log of the OSD is full of:
>>>>
>>>>
>>>> 2019-01-16 15:19:13.568527 7fecbf7da700  0 osd.33 pg_epoch: 1318479 pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'61574721,1318474'61584855] local-lis/les=1318472/1318473 n=1912 ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 1318472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1 rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+recovering+degraded m=184 snaptrimq=[ec1a0~1,ec808~1] mbc={255={(2+0)=185,(3+0)=2}}] _update_calc_stats ml 185 upset size 3 up 2
>>>> 2019-01-16 15:19:13.568637 7fecbf7da700  0 osd.33 pg_epoch: 1318479 pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'61574721,1318474'61584855] local-lis/les=1318472/1318473 n=1912 ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 1318472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1 rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+recovering+degraded m=184 snaptrimq=[ec1a0~1,ec808~1] mbc={255={(2+0)=185,(3+0)=2}}] _update_calc_stats ml 2 upset size 3 up 3
>>>> 2019-01-16 15:19:15.909327 7fecbf7da700  0 osd.33 pg_epoch: 1318479 pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'61574721,1318474'61584855] local-lis/les=1318472/1318473 n=1912 ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 1318472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1 rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+recovering+degraded m=183 snaptrimq=[ec1a0~1,ec808~1] mbc={255={(2+0)=184,(3+0)=3}}] _update_calc_stats ml 184 upset size 3 up 2
>>>> 2019-01-16 15:19:15.909446 7fecbf7da700  0 osd.33 pg_epoch: 1318479 pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'61574721,1318474'61584855] local-lis/les=1318472/1318473 n=1912 ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 1318472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1 rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+recovering+degraded m=183 snaptrimq=[ec1a0~1,ec808~1] mbc={255={(2+0)=184,(3+0)=3}}] _update_calc_stats ml 3 upset size 3 up 3
>>>> 2019-01-16 15:19:23.503231 7fecb97ff700  0 osd.33 pg_epoch: 1318479 pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'61574721,1318474'61584855] local-lis/les=1318472/1318473 n=1912 ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 1318472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1 rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+recovering+degraded m=183 snaptrimq=[ec1a0~1,ec808~1] mbc={255={(2+0)=183,(3+0)=3}}] _update_calc_stats ml 183 upset size 3 up 2
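>>>>
>>>> In case someone wants to follow along, something like this is enough to
>>>> watch how long the pgs stay inactive (the interval is arbitrary):
>>>>
>>>> # overall pg states, refreshed every 2 seconds
>>>> watch -n 2 ceph -s
>>>> # pgs that are currently stuck inactive
>>>> ceph pg dump_stuck inactive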
>>>>
>>>> Greets,
>>>> Stefan
>>>> On 16.01.19 at 09:12, Stefan Priebe - Profihost AG wrote:
>>>>> Hi,
>>>>>
>>>>> No, it was not. The bug is still present. It only appeared to work because
>>>>> the osdmap was so far behind that it started a backfill instead of a
>>>>> recovery.
>>>>>
>>>>> So it happens only in the recovery case.
>>>>>
>>>>> Greets,
>>>>> Stefan
>>>>>
>>>>> On 15.01.19 at 16:02, Stefan Priebe - Profihost AG wrote:
>>>>>>
>>>>>> On 15.01.19 at 12:45, Marc Roos wrote:
>>>>>>>  
>>>>>>> I upgraded this weekend from 12.2.8 to 12.2.10 without such issues 
>>>>>>> (the OSDs are idle)
>>>>>>
>>>>>>
>>>>>> it turns out this was a kernel bug. Updating to a newer kernel has
>>>>>> solved this issue.
>>>>>>
>>>>>> Greets,
>>>>>> Stefan
>>>>>>
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Stefan Priebe - Profihost AG [mailto:s.priebe@xxxxxxxxxxxx] 
>>>>>>> Sent: 15 January 2019 10:26
>>>>>>> To: ceph-users@xxxxxxxxxxxxxx
>>>>>>> Cc: n.fahldieck@xxxxxxxxxxxx
>>>>>>> Subject: Re:  slow requests and high i/o / read rate on 
>>>>>>> bluestore osds after upgrade 12.2.8 -> 12.2.10
>>>>>>>
>>>>>>> Hello list,
>>>>>>>
>>>>>>> I also tested the current upstream/luminous branch and it happens there as
>>>>>>> well. A clean install works fine. It only happens on upgraded bluestore OSDs.
>>>>>>>
>>>>>>> Greets,
>>>>>>> Stefan
>>>>>>>
>>>>>>> On 14.01.19 at 20:35, Stefan Priebe - Profihost AG wrote:
>>>>>>>> While trying to upgrade a cluster from 12.2.8 to 12.2.10 I'm experiencing
>>>>>>>> issues with bluestore OSDs - so I canceled the upgrade and all bluestore
>>>>>>>> OSDs are stopped now.
>>>>>>>>
>>>>>>>> After starting a bluestore OSD I'm seeing a lot of slow requests caused
>>>>>>>> by very high read rates.
>>>>>>>>
>>>>>>>>
>>>>>>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>>>>>>> sda              45,00   187,00  767,00   39,00 482040,00  8660,00  1217,62    58,16   74,60   73,85   89,23   1,24 100,00
>>>>>>>>
>>>>>>>> It reads constantly at ~500MB/s from the disk and can't service client
>>>>>>>> requests. The overall client read rate is only 10.9MiB/s.
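>>>>>>>>
>>>>>>>> To narrow down what the OSD is actually reading, I'd compare this with the
>>>>>>>> daemon's own counters, e.g. (osd id and grep pattern are just examples):
>>>>>>>>
>>>>>>>> # raw perf counters of the daemon, filtered for read-related entries
>>>>>>>> ceph daemon osd.33 perf dump | grep -i read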
>>>>>>>>
>>>>>>>> I can't reproduce this with 12.2.8. Is this a known bug / regression?
>>>>>>>>
>>>>>>>> Greets,
>>>>>>>> Stefan
>>>>>>>>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


