Re: Scrubbing

Denis Polom <denispolom@xxxxxxxxx> · Sat, 12 Mar 2022 08:39:45 +0100

Hi,

I had a similar problem on my large cluster.

Here is what I found and what helped me solve it:

Because of bad drives, and because we were replacing drives so often 
after scrub errors, there were always some recovery operations going on.

I set this:

osd_scrub_during_recovery true

and it basically solved my issue.

If not, you can try changing the interval.

I also changed it, from the default of once per week to every two weeks:

osd_deep_scrub_interval 1209600

If you want or need to speed scrubbing up to get rid of the PGs not 
scrubbed in time, take a look at

osd_max_scrubs

The default is 1. When I need to speed things up I set it to 3, and I 
didn't notice any performance impact.
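
To make the three settings above concrete, here is a minimal sketch that converts the two-week interval to seconds and prints the matching commands. It assumes the cluster uses the centralized config store, so that `ceph config set osd <option> <value>` applies; the helper names `days_to_seconds` and `ceph_config_set` are made up for illustration, and the script only prints the commands rather than running them.

```python
from typing import List, Union

SECONDS_PER_DAY = 24 * 60 * 60


def days_to_seconds(days: int) -> int:
    """Convert a scrub interval in days to the seconds Ceph expects."""
    return days * SECONDS_PER_DAY


def ceph_config_set(option: str, value: Union[bool, int]) -> List[str]:
    """Build (but do not run) a 'ceph config set osd ...' command line.

    Hypothetical helper: it assumes the settings should apply to all OSDs.
    """
    if isinstance(value, bool):
        text = "true" if value else "false"
    else:
        text = str(value)
    return ["ceph", "config", "set", "osd", option, text]


# Values taken from the message above.
settings = {
    "osd_scrub_during_recovery": True,
    "osd_deep_scrub_interval": days_to_seconds(14),  # two weeks
    "osd_max_scrubs": 3,
}

commands = [ceph_config_set(opt, val) for opt, val in settings.items()]
for cmd in commands:
    print(" ".join(cmd))
```

Printing first makes it easy to review the values before applying them by hand, one at a time, so the effect of each change can be observed separately.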

dp

On 3/11/22 17:32, Ray Cunningham wrote:
That's what I thought. We looked at the cluster storage nodes and found them all to be less than .2 normalized maximum load.

Our 'normal' BW for client IO according to ceph -s is around 60MB/s-100MB/s. I don't usually look at the IOPs so I don't have that number right now. We have seen GB/s numbers during repairs, so the cluster can get up there when the workload requires.

We discovered that this system never got the auto repair setting configured to true and since we turned that on, we have been repairing PGs for the past 24 hours. So, maybe we've been bottlenecked by those?

Thank you,
Ray

-----Original Message-----
From: norman.kern <norman.kern@xxxxxxx>
Sent: Thursday, March 10, 2022 9:27
To: Ray Cunningham <ray.cunningham@xxxxxxxxxxxxxx>
Cc: ceph-users@xxxxxxx
Subject: Re:  Re: Scrubbing

Ray,

You can use node-exporter + Prometheus + Grafana to collect CPU load statistics over time. You can use the uptime command to get the current load averages.
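
For a quick check on a single storage node, the value that osd_scrub_load_threshold is compared against can be approximated with Python's standard library. This is a rough sketch, not Ceph's own code: it assumes the 1-minute load average is the relevant sample and that `os.cpu_count()` matches the number of online CPUs, and the function names are made up for illustration.

```python
import os


def normalize(load_average: float, cpus: int) -> float:
    """Load average divided by the CPU count."""
    return load_average / max(cpus, 1)


def exceeds_threshold(normalized: float, threshold: float = 0.5) -> bool:
    """True when the normalized load is higher than the threshold.

    Simplified reading of the option description: scrubs are not
    started while the normalized load is above the threshold.
    """
    return normalized > threshold


def current_normalized_load() -> float:
    """Normalized 1-minute load of the local host (Unix only)."""
    one_minute, _five, _fifteen = os.getloadavg()
    return normalize(one_minute, os.cpu_count() or 1)
```

Calling `current_normalized_load()` on each storage node and passing the result to `exceeds_threshold(value, 0.6)` would show whether that node is above a threshold of 0.6. The same number can be worked out by hand from the first load figure printed by uptime divided by the CPU count reported by nproc.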

On 3/10/22 10:51 PM, Ray Cunningham wrote:
From:

osd_scrub_load_threshold
The normalized maximum load. Ceph will not scrub when the system load (as defined by getloadavg() / number of online CPUs) is higher than this number. Default is 0.5.

Does anyone know how I can run getloadavg() / number of online CPUs so I can see what our load is? Is that a ceph command, or an OS command?

Thank you,
Ray

-----Original Message-----
From: Ray Cunningham
Sent: Thursday, March 10, 2022 7:59 AM
To: norman.kern <norman.kern@xxxxxxx>
Cc: ceph-users@xxxxxxx
Subject: RE:  Scrubbing

We have 16 Storage Servers each with 16TB HDDs and 2TB SSDs for DB/WAL, so we are using bluestore. The system is running Nautilus 14.2.19 at the moment, with an upgrade scheduled this month. I can't give you a complete ceph config dump as this is an offline customer system, but I can get answers for specific questions.

Off the top of my head, we have set:

osd_max_scrubs 20
osd_scrub_auto_repair true
osd_scrub_load_threshold 0.6
We do not limit scrub hours.

Thank you,
Ray

-----Original Message-----
From: norman.kern <norman.kern@xxxxxxx>
Sent: Wednesday, March 9, 2022 7:28 PM
To: Ray Cunningham <ray.cunningham@xxxxxxxxxxxxxx>
Cc: ceph-users@xxxxxxx
Subject: Re:  Scrubbing

Ray,

Can you provide more information about your cluster (hardware and software configs)?

On 3/10/22 7:40 AM, Ray Cunningham wrote:
    make any difference. Do
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx To unsubscribe send an
email to ceph-users-leave@xxxxxxx