Mike,

You can find the last scrub info for a given PG with "ceph pg x.yy query".
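For example (2.1f is just a placeholder PG id here, and the exact field and
column names can vary a bit between releases, so check your own output):

    # Scrub timestamps for a single PG:
    ceph pg 2.1f query | grep -E 'last_scrub_stamp|last_deep_scrub_stamp'

    # Or dump per-PG stats for the whole cluster and look at the
    # scrub_stamp / deep_scrub_stamp columns:
    ceph pg dump | less

(If I remember the query output right, there are also last_scrub /
last_deep_scrub version fields alongside the timestamps.)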
-Aaron


On Wed, May 7, 2014 at 8:47 PM, Mike Dawson <mike.dawson at cloudapt.com> wrote:

> Perhaps, but if that were the case, would you expect the max concurrent
> number of deep-scrubs to approach the number of OSDs in the cluster?
>
> I have 72 OSDs in this cluster and concurrent deep-scrubs seem to peak at
> a max of 12. Do pools (two in use) and replication settings (3 copies in
> both pools) factor in?
>
> 72 OSDs / (2 pools * 3 copies) = 12 max concurrent deep-scrubs
>
> That seems plausible (without looking at the code).
>
> But if I 'ceph osd set nodeep-scrub' and then 'ceph osd unset
> nodeep-scrub', the count of concurrent deep-scrubs doesn't return to the
> high level, but rather stays low, seemingly for days at a time, until the
> next onslaught. If it were driven by the max scrub interval, shouldn't it
> jump quickly back up?
>
> Is there a way to find the last scrub time for a given PG via the CLI to
> know for sure?
>
> Thanks,
> Mike Dawson
>
>
> On 5/7/2014 10:59 PM, Gregory Farnum wrote:
>
>> Is it possible you're running into the max scrub intervals and jumping
>> up to one-per-OSD from a much lower normal rate?
>>
>> On Wednesday, May 7, 2014, Mike Dawson <mike.dawson at cloudapt.com> wrote:
>>
>> My write-heavy cluster struggles under the additional load created
>> by deep-scrub from time to time. As I have instrumented the cluster
>> more, it has become clear that there is something I cannot explain
>> happening in the scheduling of PGs to undergo deep-scrub.
>>
>> Please refer to these images [0][1] for two graphical representations
>> of how deep-scrub goes awry in my cluster. These were two separate
>> incidents. Both show a period of "happy" scrubs and deep-scrubs and
>> stable writes/second across the cluster, then an approximately 5x jump
>> in concurrent deep-scrubs during which client IO is cut by nearly 50%.
>>
>> The first image (deep-scrub-issue1.jpg) shows a happy cluster with low
>> numbers of scrubs and deep-scrubs running until about 10pm, then
>> something triggers deep-scrubs to increase about 5x and remain high
>> until I manually 'ceph osd set nodeep-scrub' at approx 10am. During
>> the period of higher concurrent deep-scrubs, IOPS drop significantly
>> because OSD spindle contention prevents qemu/rbd clients from writing
>> normally.
>>
>> The second image (deep-scrub-issue2.jpg) shows a similar approx 5x
>> jump in concurrent deep-scrubs and an associated drop in writes/second.
>> This image also adds a summary of 'dump historic ops', which shows the
>> expected jump in the slowest ops in the cluster.
>>
>> Does anyone have an idea of what is happening when the spike in
>> concurrent deep-scrubs occurs, and how to prevent the adverse effects
>> other than disabling deep-scrub permanently?
>>
>> 0: http://www.mikedawson.com/deep-scrub-issue1.jpg
>> 1: http://www.mikedawson.com/deep-scrub-issue2.jpg
>>
>> Thanks,
>> Mike Dawson
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users at lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>> --
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
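For reference, the scrub scheduling discussed in this thread is governed by a
handful of per-OSD options. Below is a ceph.conf sketch with the commonly
quoted defaults for this era of Ceph; verify the exact option names and values
against your running configuration (for example via the OSD admin socket's
'config show') before relying on them:

    [osd]
        ; Max concurrent scrub operations a single OSD will run
        osd max scrubs = 1
        ; Earliest a PG becomes eligible for a regular scrub (seconds, ~1 day)
        osd scrub min interval = 86400
        ; A regular scrub is forced regardless of load after this long (~1 week)
        osd scrub max interval = 604800
        ; Target interval between deep scrubs of each PG (~1 week)
        osd deep scrub interval = 604800
        ; Skip opportunistic scrubs when the host load average is above this
        osd scrub load threshold = 0.5

These can also be adjusted at runtime with injectargs, for example
"ceph tell osd.\* injectargs '--osd-deep-scrub-interval 1209600'" to stretch
deep scrubs out to roughly two weeks, though injected values do not survive an
OSD restart.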