Thank you Denis! We have made most of these changes and are waiting to see what happens.

Thank you,
Ray

-----Original Message-----
From: Denis Polom <denispolom@xxxxxxxxx>
Sent: Saturday, March 12, 2022 1:40 AM
To: ceph-users@xxxxxxx
Subject: Re: Scrubbing

Hi,

I had a similar problem on my large cluster. Here is what I found and what helped me solve it:

Because of bad drives, and because I was replacing drives too often due to scrub errors, there were always some recovery operations going on. I set this:

osd_scrub_during_recovery true

and it basically solved my issue. If it doesn't, you can try changing the interval. I also changed it from the default of once per week to two weeks:

osd_deep_scrub_interval 1209600

and if you want or need to speed things up to get rid of the "not scrubbed in time" PGs, take a look at

osd_max_scrubs

The default is 1; when I need to speed things up I set it to 3, and I didn't notice any performance impact.

dp

On 3/11/22 17:32, Ray Cunningham wrote:
> That's what I thought. We looked at the cluster storage nodes and found them all to be less than .2 normalized maximum load.
>
> Our 'normal' BW for client IO according to ceph -s is around 60MB/s-100MB/s. I don't usually look at the IOPs so I don't have that number right now. We have seen GB/s numbers during repairs, so the cluster can get up there when the workload requires.
>
> We discovered that this system never got the auto repair setting configured to true and since we turned that on, we have been repairing PGs for the past 24 hours. So, maybe we've been bottlenecked by those?
>
> Thank you,
> Ray
>
>
> -----Original Message-----
> From: norman.kern <norman.kern@xxxxxxx>
> Sent: Thursday, March 10, 2022 9:27
> To: Ray Cunningham <ray.cunningham@xxxxxxxxxxxxxx>
> Cc: ceph-users@xxxxxxx
> Subject: Re: Re: Scrubbing
>
> Ray,
>
> You can use node-exporter+prom+grafana to collect the CPU load statistics. You can use the uptime command to get the current values.
>
> On 3/10/22 10:51 PM, Ray Cunningham wrote:
>> From:
>>
>> osd_scrub_load_threshold
>> The normalized maximum load. Ceph will not scrub when the system load (as defined by getloadavg() / number of online CPUs) is higher than this number. Default is 0.5.
>>
>> Does anyone know how I can run getloadavg() / number of online CPUs so I can see what our load is? Is that a ceph command, or an OS command?
>>
>> Thank you,
>> Ray
>>
>>
>> -----Original Message-----
>> From: Ray Cunningham
>> Sent: Thursday, March 10, 2022 7:59 AM
>> To: norman.kern <norman.kern@xxxxxxx>
>> Cc: ceph-users@xxxxxxx
>> Subject: RE: Scrubbing
>>
>>
>> We have 16 Storage Servers each with 16TB HDDs and 2TB SSDs for DB/WAL, so we are using bluestore. The system is running Nautilus 14.2.19 at the moment, with an upgrade scheduled this month. I can't give you a complete ceph config dump as this is an offline customer system, but I can get answers for specific questions.
>>
>> Off the top of my head, we have set:
>>
>> osd_max_scrubs 20
>> osd_scrub_auto_repair true
>> osd_scrub_load_threshold 0.6
>> We do not limit scrub hours.
>>
>> Thank you,
>> Ray
>>
>>
>> -----Original Message-----
>> From: norman.kern <norman.kern@xxxxxxx>
>> Sent: Wednesday, March 9, 2022 7:28 PM
>> To: Ray Cunningham <ray.cunningham@xxxxxxxxxxxxxx>
>> Cc: ceph-users@xxxxxxx
>> Subject: Re: Scrubbing
>>
>> Ray,
>>
>> Can you provide more information about your cluster (hardware and software configs)?
>>
>> On 3/10/22 7:40 AM, Ray Cunningham wrote:
>>> make any difference.
Do
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
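
On the question quoted above about how to run "getloadavg() / number of online CPUs": that is an OS-level figure rather than a ceph command, and the first number reported by uptime divided by the CPU count gives roughly the same value. Below is a minimal Python sketch, assuming a Linux/Unix host with Python 3. The function names normalized_load and exceeds_threshold are made up for illustration, and picking the 1-minute average is an assumption about which figure the OSD compares against osd_scrub_load_threshold.

```python
import os


def normalized_load(index=0):
    """Return a load average divided by the CPU count.

    index 0, 1 or 2 selects the 1-, 5- or 15-minute average from
    os.getloadavg(). Defaulting to the 1-minute value is an assumption.
    """
    load = os.getloadavg()[index]
    # os.cpu_count() may return None; fall back to 1 in that case.
    cpus = os.cpu_count() or 1
    return load / cpus


def exceeds_threshold(load, threshold=0.5):
    """True when the normalized load is higher than the threshold.

    0.5 is the osd_scrub_load_threshold default quoted in the thread.
    """
    return load > threshold


if __name__ == "__main__":
    value = normalized_load()
    print(f"normalized load: {value:.2f}")
    print("higher than 0.5:", exceeds_threshold(value))
    print("higher than 0.6:", exceeds_threshold(value, 0.6))
```

Run it on each storage node. A value that stays well under the configured threshold (for example the .2 reported earlier against 0.6) suggests the load check is not what is holding scrubs back.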