Re: deep-scrub / backfilling: large amount of SLOW_OPS after upgrade to 13.2.8

Stefan Kooman <stefan@xxxxxx> · Thu, 9 Jan 2020 13:26:39 +0100

Hi,

Quoting Stefan Kooman (stefan@xxxxxx):
> Hi,
> 
> After the upgrade to 13.2.8, deep-scrub has a big impact on client IO:
> loads of SLOW_OPS and high latency. We hardly ever had SLOW_OPS, but
> since the upgrade the impact is so big that we even have OSDs marking
> each other out (OSD op thread timeout) multiple times during the scrub
> window. There is plenty of CPU / RAM / IOPS left, and hardly any load on
> these OSD servers. Has anything changed in this release that could
> explain this behaviour?
> 
> Besides this, the impact of rebalancing is very severe as well. With only
> the balancer remapping a couple of PGs at a time, there are loads of
> (MDS_)SLOW_OPS. This morning the cephfs metadata pool got rebalanced,
> and that triggered a lot of SLOW_OPS. One particular OSD was pegged at
> 1000% CPU for more than half an hour (while not doing that much IO):
> that's 10 cores going full throttle! After a restart, this issue was gone.

We can now also trigger SLOW_OPS on a bunch of OSDs by running
"rbd du -p $POOL", something that was never an issue before. The images
in the rbd pools have the following features enabled: layering,
exclusive-lock, object-map, fast-diff, deep-flatten.

Has anything changed in 13.2.8 that affects these kinds of
operations?

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/        Kamer van Koophandel 09090351
| GPG: 0xD14839C6                   +31 318 648 688 / info@xxxxxx