deep-scrub / backfilling: large amount of SLOW_OPS after upgrade to 13.2.8

Stefan Kooman <stefan@xxxxxx> · Mon, 23 Dec 2019 15:55:18 +0100

Hi,

After the upgrade to 13.2.8, deep-scrub has a big impact on client IO:
loads of SLOW_OPS and high latency. We hardly ever had SLOW_OPS, but
since the upgrade the impact is so big that we even have OSDs marking
each other out (OSD op thread timeout) multiple times during the scrub
window. Plenty of CPU / RAM / IOPS left, hardly any load on these OSD
servers. Has anything changed in this release that could explain
this behaviour?

Besides this, the impact of rebalancing is very severe as well. Even with
only the balancer remapping a couple of PGs at a time, there are loads of
(MDS_)SLOW_OPS. This morning the cephfs metadata pool got rebalanced ...
and that triggered a lot of SLOW_OPS. One particular OSD was pegged at
1000% CPU for more than half an hour (not doing that much IO): that's 10
cores going full throttle! After a restart this issue was gone.

Thanks,

Stefan

-- 
| BIT BV  https://www.bit.nl/        Kamer van Koophandel 09090351
| GPG: 0xD14839C6                   +31 318 648 688 / info@xxxxxx