Thanks, I was able to get things back into a good state.

I had to restart a few OSDs, and I also noticed at one point that all of the PGs preventing full recovery involved osd.8. I removed that OSD and things moved forward. I reviewed the RAID controller logs for that OSD, and although the disk was still listed as healthy, I found some errors in the controller log that must have been causing problems reading some amount of data.

Thanks again.

Shain

On 7/23/21, 3:35 PM, "DHilsbos@xxxxxxxxxxxxxx" <DHilsbos@xxxxxxxxxxxxxx> wrote:

    Shain;

    These lines look bad:
        14 scrub errors
        Reduced data availability: 2 pgs inactive
        Possible data damage: 8 pgs inconsistent
        osd.95 (root=default,host=hqosd8) is down

    I suspect you ran into a hardware issue with one or more drives in some of the servers that did not go offline.

    osd.95 is offline; you need to resolve this.

    You should fix your tunables when you can (probably not part of your current issues).

    Thank you,

    Dominic L. Hilsbos, MBA
    Vice President – Information Technology
    Perform Air International Inc.
    DHilsbos@xxxxxxxxxxxxxx
    https://urldefense.com/v3/__http://www.PerformAir.com__;!!Iwwt!FAQkxiDS80ZWksiJket210Oc_wLsRih_-WqhguEb44tq0_Ao7aqrgeIO_C8$
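Removing a down or suspect OSD such as osd.95 generally looks something like the rough sketch below. The OSD id is taken from this thread, the exact sequence depends on how the OSD was deployed, and switching tunables to a newer profile triggers a large rebalance, so treat this as an outline rather than a definitive procedure:

    # mark the suspect OSD out so its data re-replicates elsewhere
    ceph osd out 95

    # stop the daemon on the host that carries it
    systemctl stop ceph-osd@95

    # on Luminous and later, purge removes it from the CRUSH map, auth keys, and the OSD map in one step
    ceph osd purge 95 --yes-i-really-mean-it

    # bringing the legacy tunables up to a current profile clears the warning, but expect heavy data movement
    ceph osd crush tunables optimal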
-----Original Message-----
From: Shain Miley [mailto:SMiley@xxxxxxx]
Sent: Friday, July 23, 2021 10:48 AM
To: ceph-users@xxxxxxx
Subject: Luminous won't fully recover

We recently had a few Ceph nodes go offline which required a reboot. I have been able to get the cluster back to the state listed below; however, it does not seem like it will progress past the point of 23473/287823588 objects misplaced. Yesterday about 13% of the data was misplaced; this morning it has gotten down to 0.008%, but it has not moved past this point in about an hour.

Does anyone see anything in the output below that points to the problem, and/or are there any suggestions that I can follow in order to figure out why the cluster health is not moving beyond this point?

---------------------------------------------------
root@rbd1:~# ceph -s
  cluster:
    id:     504b5794-34bd-44e7-a8c3-0494cf800c23
    health: HEALTH_ERR
            crush map has legacy tunables (require argonaut, min is firefly)
            23473/287823588 objects misplaced (0.008%)
            14 scrub errors
            Reduced data availability: 2 pgs inactive
            Possible data damage: 8 pgs inconsistent

  services:
    mon: 3 daemons, quorum hqceph1,hqceph2,hqceph3
    mgr: hqceph2(active), standbys: hqceph3
    osd: 288 osds: 270 up, 270 in; 2 remapped pgs
    rgw: 1 daemon active

  data:
    pools:   17 pools, 9411 pgs
    objects: 95.95M objects, 309TiB
    usage:   936TiB used, 627TiB / 1.53PiB avail
    pgs:     0.021% pgs not active
             23473/287823588 objects misplaced (0.008%)
             9369 active+clean
             30   active+clean+scrubbing+deep
             8    active+clean+inconsistent
             2    activating+remapped
             2    active+clean+scrubbing

  io:
    client: 1000B/s rd, 0B/s wr, 0op/s rd, 0op/s wr

root@rbd1:~# ceph health detail
HEALTH_ERR crush map has legacy tunables (require argonaut, min is firefly); 1 osds down; 23473/287823588 objects misplaced (0.008%); 14 scrub errors; Reduced data availability: 3 pgs inactive, 13 pgs peering; Possible data damage: 8 pgs inconsistent; Degraded data redundancy: 408658/287823588 objects degraded (0.142%), 38 pgs degraded
OLD_CRUSH_TUNABLES crush map has legacy tunables (require argonaut, min is firefly)
    see https://urldefense.com/v3/__http://docs.ceph.com/docs/master/rados/operations/crush-map/*tunables__;Iw!!Iwwt!FAQkxiDS80ZWksiJket210Oc_wLsRih_-WqhguEb44tq0_Ao7aqrwpPnNRE$
OSD_DOWN 1 osds down
    osd.95 (root=default,host=hqosd8) is down
OBJECT_MISPLACED 23473/287823588 objects misplaced (0.008%)
OSD_SCRUB_ERRORS 14 scrub errors
PG_AVAILABILITY Reduced data availability: 3 pgs inactive, 13 pgs peering
    pg 3.b41 is stuck peering for 106.682058, current state peering, last acting [204,190]
    pg 3.c33 is stuck peering for 103.403643, current state peering, last acting [228,274]
    pg 3.d15 is stuck peering for 128.537454, current state peering, last acting [286,24]
    pg 3.fa9 is stuck peering for 106.526146, current state peering, last acting [286,47]
    pg 3.fb7 is stuck peering for 105.878878, current state peering, last acting [62,97]
    pg 3.13a2 is stuck peering for 106.491138, current state peering, last acting [270,219]
    pg 3.1521 is stuck inactive for 170180.165265, current state activating+remapped, last acting [94,186,188]
    pg 3.1565 is stuck peering for 106.782784, current state peering, last acting [121,60]
    pg 3.157c is stuck peering for 128.557448, current state peering, last acting [128,268]
    pg 3.1744 is stuck peering for 106.639603, current state peering, last acting [192,142]
    pg 3.1ac8 is stuck peering for 127.839550, current state peering, last acting [221,190]
    pg 3.1e24 is stuck peering for 128.201670, current state peering, last acting [118,158]
    pg 3.1e46 is stuck inactive for 169121.764376, current state activating+remapped, last acting [87,199,170]
    pg 18.36 is stuck peering for 128.554121, current state peering, last acting [204]
    pg 21.1ce is stuck peering for 106.582584, current state peering, last acting [266,192]
PG_DAMAGED Possible data damage: 8 pgs inconsistent
    pg 3.1ca is active+clean+inconsistent, acting [201,8,180]
    pg 3.56a is active+clean+inconsistent, acting [148,240,8]
    pg 3.b0f is active+clean+inconsistent, acting [148,260,8]
    pg 3.b56 is active+clean+inconsistent, acting [218,8,240]
    pg 3.10ff is active+clean+inconsistent, acting [262,8,211]
    pg 3.1192 is active+clean+inconsistent, acting [192,8,187]
    pg 3.124a is active+clean+inconsistent, acting [123,8,222]
    pg 3.1c55 is active+clean+inconsistent, acting [180,8,287]
PG_DEGRADED Degraded data redundancy: 408658/287823588 objects degraded (0.142%), 38 pgs degraded
    pg 3.8f is active+undersized+degraded, acting [163,149]
    pg 3.ba is active+undersized+degraded, acting [68,280]
    pg 3.1aa is active+undersized+degraded, acting [176,211]
    pg 3.29e is active+undersized+degraded, acting [241,194]
    pg 3.323 is active+undersized+degraded, acting [78,194]
    pg 3.343 is active+undersized+degraded, acting [242,144]
    pg 3.4ae is active+undersized+degraded, acting [153,237]
    pg 3.524 is active+undersized+degraded, acting [252,222]
    pg 3.5c9 is active+undersized+degraded, acting [272,252]
    pg 3.713 is active+undersized+degraded, acting [273,80]
    pg 3.730 is active+undersized+degraded, acting [235,212]
    pg 3.88f is active+undersized+degraded, acting [222,285]
    pg 3.8cb is active+undersized+degraded, acting [285,20]
    pg 3.9a0 is active+undersized+degraded, acting [240,200]
    pg 3.c19 is active+undersized+degraded, acting [165,276]
    pg 3.ec8 is active+undersized+degraded, acting [158,40]
    pg 3.1025 is active+undersized+degraded, acting [258,274]
    pg 3.1058 is active+undersized+degraded, acting [38,68]
    pg 3.14e4 is active+undersized+degraded, acting [185,39]
    pg 3.150c is active+undersized+degraded, acting [138,140]
    pg 3.1545 is active+undersized+degraded, acting [222,55]
    pg 3.15a6 is active+undersized+degraded, acting [242,272]
    pg 3.1620 is active+undersized+degraded, acting [200,164]
    pg 3.1710 is active+undersized+degraded, acting [176,285]
    pg 3.1792 is active+undersized+degraded, acting [190,11]
    pg 3.17bd is active+undersized+degraded, acting [207,15]
    pg 3.17da is active+undersized+degraded, acting [5,160]
    pg 3.183e is active+undersized+degraded, acting [273,136]
    pg 3.197d is active+undersized+degraded, acting [241,139]
    pg 3.1a3d is active+undersized+degraded, acting [184,121]
    pg 3.1ba6 is active+undersized+degraded, acting [47,249]
    pg 3.1c2b is active+undersized+degraded, acting [268,80]
    pg 3.1ca2 is active+undersized+degraded, acting [280,152]
    pg 3.1cd4 is active+undersized+degraded, acting [2,129]
    pg 3.1e13 is active+undersized+degraded, acting [247,114]
    pg 12.56 is active+undersized+degraded, acting [54]
    pg 18.8 is undersized+degraded+peered, acting [260]
    pg 21.9f is active+undersized+degraded, acting [215,201]
--------------------------------------------------------------------------------------------------

Thanks,
Shain

Shain Miley | Director of Platform and Infrastructure | Digital Media | smiley@xxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
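A rough sketch of how the common OSD and the inconsistent PGs above are usually tracked down; the PG and OSD ids come from the output in this thread, and a repair should only be requested once the inconsistency is understood, since it overwrites the copy the cluster considers bad:

    # every inconsistent PG above has osd.8 in its acting set; listing the PGs mapped to it confirms the pattern
    ceph pg ls-by-osd 8

    # show which objects and shards a scrub flagged in one of the inconsistent PGs
    rados list-inconsistent-obj 3.1ca --format=json-pretty

    # once the bad replica is understood, ask the cluster to repair the PG
    ceph pg repair 3.1ca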