Re: Criteria behind disk scrubbing policy

Sage Weil <sage@xxxxxxxxxxxx> · Wed, 9 Mar 2016 09:00:23 -0500 (EST)

Hi Veronica,

On Wed, 9 Mar 2016, Veronica Estrada Galinanes wrote:
> From ceph website: "Deep scrubbing (weekly) reads the data and uses
> checksums to ensure data integrity."
> 
> 1. Why do you use "weekly" deep scrubbing? How many times a year is one disk
> scrubbed on average in a Ceph cluster? Schwarz* used three times per
> year. "Interestingly, we did not observe a further decrease in data loss as
> we increased the scrubbing frequency from three to eleven times per year;
> rather, a slight increase in data loss was noticed. This phenomenon comes
> from the POH (power-on-hour) effect on drive reliability. Aggressive
> scrubbing requires more power cycles, adversely affecting drive reliability."

In Ceph clusters, disks are generally never powered down, so scrubbing
more often does not add power cycles.  It sounds like Schwarz et al. were
studying a different type of storage system.  In our case, it's all about
how quickly you discover a defect and repair around it.
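
As a rough illustration of that trade-off, here is a back-of-the-envelope
sketch in Python.  It is not anything in Ceph itself: the function names
are made up, and it assumes that every scrub completes on schedule and
that a defect appears at a uniformly random moment between two scrubs.

```python
DAYS_PER_YEAR = 365.0


def scrubs_per_year(interval_days):
    """Number of scrub passes per year for a fixed scrub interval."""
    return DAYS_PER_YEAR / interval_days


def mean_detection_delay_days(interval_days):
    """Average time a defect sits undetected.

    Assumption: the defect appears at a uniformly random moment between
    two consecutive scrubs, so on average it waits half an interval.
    """
    return interval_days / 2.0


# Compare the weekly policy with the three-scrubs-per-year policy
# mentioned in the question.
for label, interval in (("weekly", 7.0), ("three per year", DAYS_PER_YEAR / 3)):
    print(
        f"{label}: {scrubs_per_year(interval):.1f} scrubs/year, "
        f"mean detection delay {mean_detection_delay_days(interval):.1f} days"
    )
```

Under those assumptions a weekly interval works out to roughly 52 passes
per year, with a defect waiting a few days on average before it is
noticed, versus about two months when scrubbing three times per year.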

> 2. Do you have different scrubbing policies according to the redundancy
> method? For example, more scrubs when using replication rather than erasure
> coding?

We do let you change the scrub intervals on a per-pool basis (see
pool_opts in osd/osd_types.h), so you can adjust the policy based on
either the pool type (replicated vs erasure coded) or the importance of
the data it contains.
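
A minimal sketch of what a per-pool policy could look like, assuming a
release that exposes the deep_scrub_interval pool option through
"ceph osd pool set" (value in seconds).  The pool names and intervals
below are hypothetical, and the script only prints the commands rather
than running them.

```python
DAY = 24 * 60 * 60  # seconds

# Hypothetical policy: pool name -> deep scrub interval in seconds.
policy = {
    "rbd-replicated": 7 * DAY,   # important data, scrub weekly
    "archive-ec": 14 * DAY,      # erasure-coded archive, scrub less often
}


def deep_scrub_commands(policy):
    """Build one 'ceph osd pool set' command line per pool."""
    return [
        f"ceph osd pool set {pool} deep_scrub_interval {seconds}"
        for pool, seconds in policy.items()
    ]


for cmd in deep_scrub_commands(policy):
    print(cmd)
```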

sage