My write-heavy cluster periodically struggles under the additional load created by deep-scrub. As I have instrumented the cluster more, it has become clear that something I cannot explain is happening in the way PGs are scheduled for deep-scrub.

Please refer to these images [0][1] for two graphical representations of how deep-scrub goes awry in my cluster. They show two separate incidents. Both show a period of "happy" scrubs and deep-scrubs with stable writes/second across the cluster, followed by an approximately 5x jump in concurrent deep-scrubs during which client IO is cut by nearly 50%.

The first image (deep-scrub-issue1.jpg) shows a happy cluster with low numbers of scrubs and deep-scrubs running until about 10pm, when something triggers deep-scrubs to increase roughly 5x and remain high until I manually run 'ceph osd set nodeep-scrub' at approximately 10am. While the higher number of concurrent deep-scrubs persists, IOPS drop significantly due to OSD spindle contention, which prevents qemu/rbd clients from writing normally.

The second image (deep-scrub-issue2.jpg) shows a similar ~5x jump in concurrent deep-scrubs and the associated drop in writes/second. This image also includes a summary of 'dump historic ops' output, which shows the expected jump in the slowest ops in the cluster.

Does anyone have an idea of what is happening when the spike in concurrent deep-scrubs occurs, and how to prevent the adverse effects, short of disabling deep-scrub permanently?

0: http://www.mikedawson.com/deep-scrub-issue1.jpg
1: http://www.mikedawson.com/deep-scrub-issue2.jpg

Thanks,
Mike Dawson
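P.S. For reference, the manual workaround I'm applying during these incidents is roughly the following (run from an admin/monitor host; unset reverses it once the cluster settles):

    # temporarily prevent new deep-scrubs from starting cluster-wide
    ceph osd set nodeep-scrub

    # ... later, once client IO has recovered, allow deep-scrubs again
    ceph osd unset nodeep-scrub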