Scrub Error / How does ceph pg repair work?

Christian Eichelmann <christian.eichelmann@xxxxxxxx> · Mon, 11 May 2015 10:09:32 +0200

Hi all!

We are experiencing approximately 1 scrub error / inconsistent pg every
two days. As far as I know, to fix this you can issue a "ceph pg
repair", which works fine for us. I have a few qestions regarding the
behavior of the ceph cluster in such a case:

1. After ceph detects the scrub error, the pg is marked as inconsistent.
Does that mean that any IO to this pg is blocked until it is repaired?

2. Is this amount of scrub errors normal? We currently have only 150TB
in our cluster, distributed over 720 2TB disks.

3. As far as I know, a "ceph pg repair" just copies the content of the
primary pg to all replicas. Is this still the case? What if the primary
copy is the one having errors? We have a 4x replication level and it
would be cool if ceph would use one of the pg for recovery which has the
same checksum as the majority of pgs.

4. Some of this errors are happening at night. Since ceph reports this
as a critical error, our shift is called and wake up, just to issue a
single command. Do you see any problems in triggering this command
automatically via monitoring event? Is there a reason why ceph isn't
resolving these errors itself when it has enought replicas to do so?

Regards,
Christian
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com