Scrub failing all the time, new inconsistencies keep appearing

Hello,

I've been using Ceph for a long time. A day ago I set the Jewel requirement for the OSDs and updated the CRUSH map.


Since then I have had all kinds of errors, maybe because disks are failing under the rebalance load, or maybe because of some other problem I'm not aware of.

I have some PGs in active+clean+inconsistent, from different volumes. When I try to repair them or run a scrub I get:

2017-09-14 15:24:32.139215  [ERR] 9.8b shard 2: soid 9:d1c72806:::rb.0.21dc.238e1f29.0000000125ae:head data_digest 0x903e1482 != data_digest 0x4d4e39be from auth oi 9:d1c72806:::rb.0.21dc.238e1f29.0000000125ae:head(3982'375882 osd.1.0:2494526 dirty|data_digest|omap_digest s 4194304 uv 375794 dd 4d4e39be od ffffffff)
2017-09-14 15:24:32.139220  [ERR] 9.8b shard 6: soid 9:d1c72806:::rb.0.21dc.238e1f29.0000000125ae:head data_digest 0x903e1482 != data_digest 0x4d4e39be from auth oi 9:d1c72806:::rb.0.21dc.238e1f29.0000000125ae:head(3982'375882 osd.1.0:2494526 dirty|data_digest|omap_digest s 4194304 uv 375794 dd 4d4e39be od ffffffff)
2017-09-14 15:24:32.139222  [ERR] 9.8b soid 9:d1c72806:::rb.0.21dc.238e1f29.0000000125ae:head: failed to pick suitable auth object
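I assume I can see exactly which copies disagree with something like this (not sure if it's the right way):

> rados list-inconsistent-obj 9.8b --format=json-pretty

which should show, for each object in the PG, the data_digest reported by every shard and the digest recorded in the object info.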

I removed one of the OSDs and added a bigger one to the cluster, but the old "authoritative" disk is still in the machine (I did remove it from the CRUSH map and so on, as the documentation says; the steps I used are sketched after the tree below). Mine is a small cluster, and I know that makes problems more critical, since there aren't enough replicas when something goes wrong:


ID WEIGHT  TYPE NAME                 UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 4.27299 root default                                               
-4 4.27299     rack rack-1                                            
-2 1.00000         host blue-compute                                  
 0 1.00000             osd.0              up  1.00000          1.00000
 2 1.00000             osd.2              up  1.00000          1.00000
-3 3.27299         host red-compute                                   
 4 1.00000             osd.4              up  1.00000          1.00000
 3 1.36380             osd.3              up  1.00000          1.00000
 6 0.90919             osd.6              up  1.00000          1.00000
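For reference, the removal followed roughly the standard sequence from the documentation; from memory, the commands were something like:

> ceph osd out 1
> systemctl stop ceph-osd@1      (on red-compute)
> ceph osd crush remove osd.1
> ceph auth del osd.1
> ceph osd rm 1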


The old osd.1 is still in the machine red-compute, but outside the cluster. My questions are:


First: with this kind of error, is there anything I can do to recover?

Second: if no authoritative copy can be found in the cluster (on osd.2 and osd.6), how can I fix it? Can I get it from the old osd.1, and if so, how?
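If it matters, I assume the old disk can still be read offline with ceph-objectstore-tool while osd.1 stays stopped, roughly like this (the data/journal paths are guesses for a default install):

> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 --journal-path /var/lib/ceph/osd/ceph-1/journal --pgid 9.8b --op list rb.0.21dc.238e1f29.0000000125ae > obj.json
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 --journal-path /var/lib/ceph/osd/ceph-1/journal "$(cat obj.json)" get-bytes > object.bin

But I don't know whether pushing that copy back into the current OSDs would be the right thing to do, or how.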

> ceph pg map 9.8b
  osdmap e7049 pg 9.8b (9.8b) -> up [6,2] acting [6,2]

> rados list-inconsistent-pg high_value
["9.8b"]
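I also wonder whether, if the data the current copies hold is actually good and only the digest recorded in the object info is stale, simply rewriting the object would refresh the digest so a repair can succeed, i.e. something like:

> rados -p high_value get rb.0.21dc.238e1f29.0000000125ae /tmp/obj.bin
> rados -p high_value put rb.0.21dc.238e1f29.0000000125ae /tmp/obj.bin
> ceph pg repair 9.8b

Is that safe here, or is there a better way?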

Any help on this would be appreciated.


Thank you in advance.

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
