Re: PGs inconsistent, do I fear data loss?

Hello,

First of all, I would recommend that you use ceph pg repair wherever you can.
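For example (the pg id 2.7 below is just a placeholder, substitute your own from ceph health detail):

```shell
# List the pgs currently flagged inconsistent
ceph health detail | grep inconsistent

# Inspect which object/shard the scrub found bad (pg id is a placeholder)
rados list-inconsistent-obj 2.7 --format=json-pretty

# Ask the primary osd to repair the pg
ceph pg repair 2.7
```

The list-inconsistent-obj output shows per-shard errors (checksum mismatch, missing, size mismatch), which helps you judge whether the repair picked a sane copy.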


When you have size=3, the cluster can compare three instances, so it is easier for it to spot which two are good and which one is bad.

When you use size=2 the case is harder in oh-so-many ways:

- According to the documentation it is harder to determine which object is the faulty one.

- If an osd dies, the increased load (caused by the missing osd) plus the extra io from the recovery process hits the other osd much harder, increasing the chance that another osd dies too (a disk failure caused by the sudden spike of extra load), and then you lose your data.

- If there is bitrot in the one remaining replica, then you do not have any valid copy of your data.

So, to summarize: the experts say that it is MUCH safer to have size=3 min_size=2 (I'm far from an expert, I'm just quoting :))


So, back to the task at hand:

If you have repaired all the pgs that you could with ceph pg repair, there is a manual recovery process (written for filestore, unfortunately):

http://ceph.com/geen-categorie/ceph-manually-repair-object/

The good news is that there is a fuse client for bluestore too, so you can mount it by hand and repair it as per the linked document.
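A sketch of how that might look using ceph-objectstore-tool's fuse mode (I have not tested this myself; the osd id and paths below are examples, adjust them to your cluster):

```shell
# Stop the osd that holds the bad copy first
systemctl stop ceph-osd@12

# Mount the osd's object store (bluestore or filestore) via fuse
mkdir -p /mnt/osd12
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
    --op fuse --mountpoint /mnt/osd12

# ...examine / remove the bad object as per the linked document...

# Unmount and bring the osd back, then repair the pg again
umount /mnt/osd12
systemctl start ceph-osd@12
```

Once the osd is back up and in, ceph pg repair on the affected pg should copy the good replica over.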


I think you could use ceph osd pool set [pool] size 3 to increase the copy count, but before that you should be certain that you have enough free space and that you will not hit the osd pg count limits.
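Something like this (the pool name "mypool" is a placeholder; check capacity before raising the size):

```shell
# Check free space and per-osd utilization first
ceph df
ceph osd df

# Then raise the replica count on the pool
ceph osd pool set mypool size 3
ceph osd pool set mypool min_size 2
```

Raising size triggers backfill for every pg in the pool, so expect extra io until the third copies are in place.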


[DISCLAIMER]:
I have never done this, and I too have questions about this topic:

[Questions to the list]
How is it possible that the cluster cannot repair itself with ceph pg repair?
No good copies are remaining?
Can it not decide which copy is valid or up to date?
If so, why not, when there is a checksum and an mtime for everything?
In this inconsistent state, which object does the cluster serve when it doesn't know which one is valid?


Isn't there a way to do a more "online" repair?

A way to examine and remove objects while the osd is running?

Or better yet, a way to tell the cluster which copy should be used when repairing?

There is a command, ceph pg force-recovery, but I cannot find documentation for it.
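For the record, in Luminous it seems to take pg ids, e.g. (ids are placeholders):

```shell
# Move these pgs to the front of the recovery queue
ceph pg force-recovery 2.7 3.1a
```

As far as I can tell it only changes recovery priority; it does not resolve inconsistencies by itself.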


Kind regards,

Denes Dolhay.



On 10/28/2017 01:05 PM, Mario Giammarco wrote:
Hello,
we recently upgraded two clusters to Ceph Luminous with bluestore, and we discovered that we have many more pgs in the state active+clean+inconsistent. (Possible data damage, xx pgs inconsistent)
 
This is probably due to the checksums in bluestore, which uncover more errors.

We have some pools with replica 2 and some with replica 3.

I have read past forum threads and I have seen that Ceph does not repair inconsistent pgs automatically.

Even manual repair sometimes fails.

I would like to understand if I am losing my data:

- with replica 2 I hope that ceph chooses the right replica by looking at checksums
- with replica 3 I hope that there are no problems at all

How can I tell ceph to simply create the second replica in another place?

Because I suppose that with replica 2 and inconsistent pgs, I have only one copy of the data.

Thank you in advance for any help.

Mario







_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

