Re: Scrub Error / How does ceph pg repair work?

Christian Eichelmann <christian.eichelmann@xxxxxxxx> · Tue, 12 May 2015 09:00:43 +0200

Hi Christian, Hi Robert,

thank you for your replies!
I was already expecting something like this. But I am seriously worried
about that!

Just assume that this happens at night. Our on-call shift does not
necessarily have enough knowledge to perform all the steps in Sebastien's
article. And if we have to do that every time a scrub error appears, we
will be spending several hours per week on fixing such problems.

It is also very misleading that a command called "ceph pg repair" might
do quite the opposite and overwrite the "good" data in your cluster with
corrupt data. I don't know much about the internals of Ceph, but if the
cluster can already recognize that checksums are not the same, why can't
it just build a quorum from the existing replicas if possible?
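
To make clear what I mean by a quorum, here is a minimal sketch in
Python. It is not how Ceph works today, as far as I know. The function
name `majority_checksum` and the input (a map from OSD id to the checksum
of that OSD's copy) are my own assumptions, purely for illustration.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def majority_checksum(checksums: Dict[int, str]) -> Optional[Tuple[str, List[int]]]:
    """Return (winning checksum, OSD ids that disagree with it).

    Returns None when no checksum is held by a strict majority of the
    replicas, because then there is nothing trustworthy to repair from.
    """
    if not checksums:
        return None
    winner, votes = Counter(checksums.values()).most_common(1)[0]
    # Require a strict majority: a 2-2 split or all-different replicas
    # cannot be decided automatically.
    if votes * 2 <= len(checksums):
        return None
    outliers = sorted(osd for osd, digest in checksums.items() if digest != winner)
    return winner, outliers
```

With our 4x replication, three matching copies and one deviating copy
would name the deviating OSD as the one to overwrite, while a 2-2 split
would return None and still need a human.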

And again the question:
Are these placement groups (scrub error, inconsistent) blocking
read/write requests? Because if so, we have a serious problem here...

Regards,
Christian

On 12.05.2015 at 08:20, Christian Balzer wrote:
> 
> Hello,
> 
> I can only nod emphatically to what Robert said: don't issue repairs
> unless you 
> a) don't care about the data or 
> b) have verified that your primary OSD is good.
> 
> See this for some details on how to establish which replica(s) are actually
> good or not:
> http://www.sebastien-han.fr/blog/2015/04/27/ceph-manually-repair-object/
> 
> Of course if you somehow wind up with more subtle data corruption and are
> faced with 3 slightly differing data sets, you may have to resort to
> rolling the dice after all.
> 
> A word from the devs about the state of checksums and automatic repairs we
> can trust would be appreciated.
> 
> Christian
> 
> On Mon, 11 May 2015 10:19:08 -0600 Robert LeBlanc wrote:
> 
>> Personally, I would not just run this command automatically because, as
>> you stated, it only copies the primary PGs to the replicas, and if the
>> primary is corrupt, you will corrupt your secondaries. I think the
>> monitor log shows which OSD has the problem, so if it is not your
>> primary, then just issue the repair command.
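
The rule above (only repair when the bad copy is not on the primary)
could be scripted roughly as follows. This is a sketch under
assumptions: it expects `ceph health detail` lines of the form
"pg 17.1c1 is active+clean+inconsistent, acting [21,25,30]", it treats
the first OSD in the acting set as the primary, and the id of the bad
OSD has to be found separately (for example from the logs). The names
`InconsistentPG`, `parse_inconsistent` and `safe_to_repair` are
hypothetical.

```python
import re
from typing import List, NamedTuple, Optional


class InconsistentPG(NamedTuple):
    pgid: str
    acting: List[int]

    @property
    def primary(self) -> int:
        # Assumption: the first OSD listed in the acting set is the primary.
        return self.acting[0]


# Assumed line format: "pg <pgid> is <state+state+...>, acting [a,b,c]"
_PG_LINE = re.compile(r"^pg (\S+) is (\S+), acting \[([\d,\s]+)\]")


def parse_inconsistent(health_detail: str) -> List[InconsistentPG]:
    """Extract the PGs whose state includes 'inconsistent' from the text."""
    found = []
    for line in health_detail.splitlines():
        match = _PG_LINE.match(line.strip())
        if match and "inconsistent" in match.group(2).split("+"):
            acting = [int(osd) for osd in match.group(3).split(",")]
            found.append(InconsistentPG(match.group(1), acting))
    return found


def safe_to_repair(pg: InconsistentPG, bad_osd: Optional[int]) -> bool:
    """True only if the bad OSD is known, holds this PG, and is not the primary."""
    return bad_osd is not None and bad_osd in pg.acting and bad_osd != pg.primary
```

Anything for which `safe_to_repair` returns False would still have to
go to a human rather than to an automatic "ceph pg repair".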
>>
>> There was talk, and I believe work towards, Ceph storing a hash of the
>> object so that it can be smarter about which replica has the correct data
>> and automatically replicate the good data no matter where it is. I think
>> the first part, creating the hash and storing it, has been included in
>> Hammer. I'm not an authority on this so take it with a grain of salt.
>>
>> Right now our procedure is to find the PG files on the OSDs, compute an
>> MD5 of each of them, and overwrite the one that doesn't match, either by
>> issuing the PG repair command or by removing the bad PG files, rsyncing
>> them with the -X argument, and then instructing a deep-scrub on the PG to
>> clear it up in Ceph.
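
The MD5 comparison step could look roughly like this in Python. It is
only a sketch: it assumes the copies of one object have already been
located on each OSD and are readable as ordinary files, and the helper
names `md5_of_file` and `group_by_md5` are made up for this example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large objects are not loaded into memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_md5(paths: Iterable[Path]) -> Dict[str, List[Path]]:
    """Map each MD5 digest to the replica files that produced it."""
    groups = defaultdict(list)
    for path in paths:
        groups[md5_of_file(path)].append(path)
    return dict(groups)
```

If the result has a single key, all copies agree; with two keys, the
smaller group is the likely bad copy. Note that this only compares file
contents, not extended attributes, which is presumably why the rsync
step uses -X.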
>>
>> I've only tested this on an idle cluster, so I don't know how well it
>> will work on an active cluster. Since we issue a deep-scrub, if the PGs
>> of the replicas change during the rsync, it should come up with an
>> error. The idea is to keep rsyncing until the deep-scrub is clean. Be
>> warned that you may be aiming your gun at your foot with this!
>>
>> ----------------
>> Robert LeBlanc
>> GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>> On Mon, May 11, 2015 at 2:09 AM, Christian Eichelmann <
>> christian.eichelmann@xxxxxxxx> wrote:
>>
>>> Hi all!
>>>
>>> We are experiencing approximately 1 scrub error / inconsistent pg every
>>> two days. As far as I know, to fix this you can issue a "ceph pg
>>> repair", which works fine for us. I have a few questions regarding the
>>> behavior of the ceph cluster in such a case:
>>>
>>> 1. After ceph detects the scrub error, the pg is marked as
>>> inconsistent. Does that mean that any IO to this pg is blocked until
>>> it is repaired?
>>>
>>> 2. Is this amount of scrub errors normal? We currently have only 150TB
>>> in our cluster, distributed over 720 2TB disks.
>>>
>>> 3. As far as I know, a "ceph pg repair" just copies the content of the
>>> primary pg to all replicas. Is this still the case? What if the primary
>>> copy is the one having errors? We have a 4x replication level and it
>>> would be cool if ceph used one of the pgs for recovery which has
>>> the same checksum as the majority of pgs.
>>>
>>> 4. Some of these errors are happening at night. Since ceph reports this
>>> as a critical error, our shift is called and woken up, just to issue a
>>> single command. Do you see any problems in triggering this command
>>> automatically via a monitoring event? Is there a reason why ceph isn't
>>> resolving these errors itself when it has enough replicas to do so?
>>>
>>> Regards,
>>> Christian
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@xxxxxxxxxxxxxx
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
> 
> 

-- 
Christian Eichelmann
System Administrator

1&1 Internet AG - IT Operations Mail & Media Advertising & Targeting
Brauerstraße 48 · DE-76135 Karlsruhe
Phone: +49 721 91374-8026
christian.eichelmann@xxxxxxxx

Amtsgericht Montabaur / HRB 6484
Management Board: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert
Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss, Jan Oetjen
Chairman of the Supervisory Board: Michael Scheeren