During a recent snafu with a production cluster I disabled scrubbing and deep scrubbing in order to reduce load on the cluster while things backfilled and settled down. The PTSD caused by the incident meant I was not keen to re-enable them until I was confident we had fixed the root cause of the issues (driver problems with a new NIC type, introduced with new hardware, that did not show up until production load hit them). My cluster is running Jewel 10.2.1 and is a mix of SSD and SATA across 20 hosts, 352 OSDs in total.
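(For the record, scrubbing was disabled with the cluster-wide flags, i.e. something like:

    ceph osd set noscrub
    ceph osd set nodeep-scrub

and the corresponding "ceph osd unset ..." when turning them back on.)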
Fast forward a few weeks and I was ready to re-enable it. From some reading I was concerned the cluster might kick off excessive scrubbing once I unset the flags, so I tried increasing the deep scrub interval from 7 days to 60 days; with most of the last deep scrubs being over a month old, I was hoping that would distribute them over the next 30 days. Having unset the flags and watched the cluster carefully, it seems to have just run a steady catch-up without significant impact.
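(The interval change was just osd_deep_scrub_interval bumped to 60 days, i.e. something along the lines of:

    # 60 days in seconds: 60 * 86400 = 5184000
    ceph tell osd.* injectargs '--osd_deep_scrub_interval 5184000'

plus the same value in ceph.conf so it survives OSD restarts.)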
What I am noticing, though, is that the scrubbing seems to just run straight through the full set of PGs: it did some 2280 PGs last night over 6 hours, and another 4000-odd so far today in 12 hours. With 13408 PGs I am guessing all this will stop some time early tomorrow.

ceph-glb-fec-01[/var/log]$ sudo ceph pg dump|awk '{print $20}'|grep 2017|sort|uniq -c
dumped all in format plain
      5 2017-05-23
     18 2017-05-24
     33 2017-05-25
     52 2017-05-26
     89 2017-05-27
    114 2017-05-28
    144 2017-05-29
    172 2017-05-30
    256 2017-05-31
    191 2017-06-01
    230 2017-06-02
    369 2017-06-03
    606 2017-06-04
    680 2017-06-05
    919 2017-06-06
   1261 2017-06-07
   1876 2017-06-08
     15 2017-06-09
   2280 2017-07-05
   4098 2017-07-06

My concern is whether I am now set up to have all 13408 PGs deep scrub again in 60 days' time, in the same serial fashion over about 3 days. I would much rather they were distributed over that period. Will the OSDs do this distribution themselves now that they have caught up, or do I need to, say, create a script that triggers batches of PGs to deep scrub over time to push the distribution out again?
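If it comes to a script, what I have in mind is a rough sketch along these lines (untested; the awk column matches the deep-scrub stamp position used in the pg dump above, and the batch size is an arbitrary placeholder):

    #!/bin/bash
    # Untested sketch: kick off deep scrubs on the PGs with the oldest
    # deep-scrub stamps, a batch at a time.
    BATCH=200   # PGs per run -- arbitrary, would need tuning

    ceph pg dump 2>/dev/null \
      | awk '/^[0-9]+\./ {print $1, $20}' \
      | sort -k2 \
      | head -n "$BATCH" \
      | while read pgid stamp; do
          ceph pg deep-scrub "$pgid"
        done

Run from cron every few hours, the idea would be to gradually re-spread the deep-scrub stamps rather than letting them all bunch up again.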