Re: osd max scrubs not honored?

I often schedule the deep scrubs for a cluster so that none of them happen on their own; they are always run by my cron scripts.  For instance, set the deep scrub interval to 2 months and schedule a cron job that takes care of all of the deep scrubs within a month.  If for any reason the script stops working, the PGs will still be scrubbed at least every 2 months, but the script should ensure that they happen every month, and only during the times of day the cron runs.  That way I can ease up on deep scrubbing when the cluster needs a little more performance or is going through a big recovery, but also catch it back up.
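A minimal sketch of the selection step such a script might use (the function name is mine, and the parsing assumes "date time pgid" lines like the `ceph pg dump` output shown later in this thread):

```shell
# Hypothetical helper for a cron-driven deep-scrub script.
# Reads "date time pgid" lines on stdin (ISO timestamps sort correctly
# as text) and prints `ceph pg deep-scrub` commands for the N stalest PGs.
oldest_pg_scrub_cmds() {
  n="$1"
  sort | head -n "$n" | while read -r _d _t pgid; do
    echo "ceph pg deep-scrub $pgid"
  done
}
# usage (sketch): ceph pg dump | awk '/active/ {print $23" "$24" "$1}' \
#   | oldest_pg_scrub_cmds 5 | sh
```

It only prints the commands, so the cron job can log or review the plan before piping it to `sh`.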

There are also config settings to ensure that scrubs only happen during the hours of the day you want them to, so you can keep them away from peak client IO regardless of how you scrub your cluster.
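For example (a ceph.conf sketch; these option names exist in Luminous, and the hours are just placeholders), restricting scrubs to a 22:00-06:00 window:

```ini
[osd]
osd scrub begin hour = 22
osd scrub end hour = 6
```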

On Thu, Sep 28, 2017 at 6:36 PM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
Also, realize the deep scrub interval is a per-PG thing and (unfortunately) the OSD doesn't use a global view of its PG deep scrub ages to try and schedule them intelligently across that time. If you really want to try and force this out, I believe a few sites have written scripts to do it by turning off deep scrubs, forcing individual PGs to deep scrub at intervals, and then enabling deep scrubs again.
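A rough sketch of the approach Greg describes (the function name is mine; it only emits the command sequence so the plan can be reviewed before running it):

```shell
# Emit the command sequence for a forced deep-scrub pass:
# set the cluster-wide nodeep-scrub flag, deep-scrub each pgid read
# from stdin, then clear the flag again.
plan_forced_deep_scrub() {
  echo "ceph osd set nodeep-scrub"
  while read -r pgid; do
    echo "ceph pg deep-scrub $pgid"
  done
  echo "ceph osd unset nodeep-scrub"
}
# usage (sketch): plan_forced_deep_scrub < pgids.txt | sh
```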
-Greg


On Wed, Sep 27, 2017 at 6:34 AM David Turner <drakonstein@xxxxxxxxx> wrote:

This isn't an answer, but a suggestion to try and help track it down, as I'm not sure what the problem is.  Try querying the admin socket for your OSDs and look through all of their config options and settings for something that might explain why you have multiple deep scrubs happening on a single OSD at the same time.

However, if you misspoke and only have 1 deep scrub per OSD but multiple per node, then what you are seeing is expected behavior.  I believe that Luminous added a sleep setting for scrub IO that might also help.  Looking through the admin socket dump of settings for anything scrub-related should give you some ideas of things to try.
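For example, dumping an OSD's runtime settings through its admin socket and keeping only the scrub-related ones (osd.0 is a placeholder; `config show` prints one key per line, so a plain grep works):

```shell
# Keep only scrub-related lines from a config dump read on stdin.
scrub_settings() {
  grep -i scrub
}
# usage (sketch): ceph daemon osd.0 config show | scrub_settings
```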


On Tue, Sep 26, 2017, 2:04 PM J David <j.david.lists@xxxxxxxxx> wrote:
With “osd max scrubs” set to 1 in ceph.conf, which I believe is also
the default, at almost all times, there are 2-3 deep scrubs running.

3 simultaneous deep scrubs is enough to cause a constant stream of:

mon.ceph1 [WRN] Health check update: 69 slow requests are blocked > 32
sec (REQUEST_SLOW)

This seems to correspond with all three deep scrubs hitting the same
OSD at the same time, starving out all other I/O requests for that
OSD.  But it can happen less frequently and less severely with two or
even one deep scrub running.  Nonetheless, consumers of the cluster
are not thrilled with regular instances of 30-60 second disk I/Os.

The cluster is five nodes, 15 OSDs, and there is one pool with 512
placement groups.  The cluster is running:

ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)

All of the OSDs are bluestore, with HDD storage and SSD block.db.

Even setting “osd deep scrub interval = 1843200” hasn’t resolved this
issue, though it seems to get the number down from 3 to 2, which at
least cuts down on the frequency of requests stalling out.  With 512
pgs, that should mean that one pg gets deep-scrubbed per hour, and it
seems like a deep-scrub takes about 20 minutes.  So what should be
happening is that 1/3rd of the time there should be one deep scrub,
and 2/3rds of the time there shouldn’t be any.  Yet instead we have
2-3 deep scrubs running at all times.
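The arithmetic behind that expectation, for reference:

```shell
interval=1843200   # osd deep scrub interval, in seconds (~21.3 days)
pgs=512            # placement groups in the pool
# Evenly spread, each PG would be deep-scrubbed this many seconds apart:
echo $(( interval / pgs ))   # prints 3600, i.e. one PG per hour
```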

Looking at “ceph pg dump” shows that about 7 deep scrubs get launched per hour:

$ sudo ceph pg dump | fgrep active | awk '{print $23" "$24" "$1}' |
fgrep 2017-09-26 | sort -rn | head -22
dumped all
2017-09-26 16:42:46.781761 0.181
2017-09-26 16:41:40.056816 0.59
2017-09-26 16:39:26.216566 0.9e
2017-09-26 16:26:43.379806 0.19f
2017-09-26 16:24:16.321075 0.60
2017-09-26 16:08:36.095040 0.134
2017-09-26 16:03:33.478330 0.b5
2017-09-26 15:55:14.205885 0.1e2
2017-09-26 15:54:31.413481 0.98
2017-09-26 15:45:58.329782 0.71
2017-09-26 15:34:51.777681 0.1e5
2017-09-26 15:32:49.669298 0.c7
2017-09-26 15:01:48.590645 0.1f
2017-09-26 15:01:00.082014 0.199
2017-09-26 14:45:52.893951 0.d9
2017-09-26 14:43:39.870689 0.140
2017-09-26 14:28:56.217892 0.fc
2017-09-26 14:28:49.665678 0.e3
2017-09-26 14:11:04.718698 0.1d6
2017-09-26 14:09:44.975028 0.72
2017-09-26 14:06:17.945012 0.8a
2017-09-26 13:54:44.199792 0.ec

What’s going on here?

Why isn’t the limit on scrubs being honored?

It would also be great if scrub I/O were surfaced in “ceph status” the
way recovery I/O is, especially since it can have such a significant
impact on client operations.

Thanks!
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
