PGs stuck peering (looping?) after upgrade to Luminous.

Magnus Grönlund <magnus@xxxxxxxxxxx> · Wed, 11 Jul 2018 20:30:41 +0200

Hi,

I started upgrading a Ceph cluster from Jewel (10.2.10) to Luminous (12.2.6).

After upgrading and restarting the mons, everything looked OK: the mons had quorum, all OSDs were up and in, and all the PGs were active+clean.
But before I had time to start upgrading the OSDs, it became obvious that something had gone terribly wrong.
All of a sudden 1600 out of 4100 PGs were inactive and 40% of the data was misplaced!

The mons appear OK and all OSDs are still up and in, but a few hours later there were still 1483 PGs stuck inactive, essentially all of them in peering!
One of the stuck PGs that I investigated appears to be looping between “inactive”, “remapped+peering” and “peering”, and its epoch number is rising fast; see the attached pg query outputs.
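
To put numbers on the looping described above, a PG can be sampled repeatedly and the state changes and epoch growth counted. The following is only a minimal sketch: it assumes the JSON printed by `ceph pg <pgid> query` has top-level "state" and "epoch" fields, and the helper names (query_pg, summarize, watch) are made up for illustration. Note that the query itself may block while the PG's primary OSD is not responding.

```python
import json
import subprocess
import time


def query_pg(pgid):
    """Run `ceph pg <pgid> query` and return the parsed JSON (needs a working ceph CLI)."""
    out = subprocess.check_output(["ceph", "pg", pgid, "query"])
    return json.loads(out)


def summarize(samples):
    """Return (number of state changes, epoch increase) over a list of query results.

    Assumes each sample is a dict with top-level "state" and "epoch" keys.
    """
    states = [s["state"] for s in samples]
    # Count how often two consecutive samples report a different state.
    changes = sum(1 for a, b in zip(states, states[1:]) if a != b)
    epoch_delta = samples[-1]["epoch"] - samples[0]["epoch"]
    return changes, epoch_delta


def watch(pgid, count=10, interval=5.0):
    """Sample a PG `count` times, `interval` seconds apart, and summarize."""
    samples = []
    for _ in range(count):
        samples.append(query_pg(pgid))
        time.sleep(interval)
    return summarize(samples)
```

For example, watch("3.3c6") would sample the PG from the attachment ten times; many state changes combined with a large epoch increase over a short window would be consistent with the looping behaviour.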

We really can’t afford to lose the cluster or the data, so any help or suggestions on how to debug or fix this issue would be very much appreciated!

    health: HEALTH_ERR
            1483 pgs are stuck inactive for more than 60 seconds
            542 pgs backfill_wait
            14 pgs backfilling
            11 pgs degraded
            1402 pgs peering
            3 pgs recovery_wait
            11 pgs stuck degraded
            1483 pgs stuck inactive
            2042 pgs stuck unclean
            7 pgs stuck undersized
            7 pgs undersized
            111 requests are blocked > 32 sec
            10586 requests are blocked > 4096 sec
            recovery 9472/11120724 objects degraded (0.085%)
            recovery 1181567/11120724 objects misplaced (10.625%)
            noout flag(s) set
            mon.eselde02u32 low disk space

  services:
    mon: 3 daemons, quorum eselde02u32,eselde02u33,eselde02u34
    mgr: eselde02u32(active), standbys: eselde02u33, eselde02u34
    osd: 111 osds: 111 up, 111 in; 800 remapped pgs
         flags noout

  data:
    pools:   18 pools, 4104 pgs
    objects: 3620k objects, 13875 GB
    usage:   42254 GB used, 160 TB / 201 TB avail
    pgs:     1.876% pgs unknown
             34.259% pgs not active
             9472/11120724 objects degraded (0.085%)
             1181567/11120724 objects misplaced (10.625%)
             2062 active+clean
             1221 peering
             535  active+remapped+backfill_wait
             181  remapped+peering
             77   unknown
             13   active+remapped+backfilling
             7    active+undersized+degraded+remapped+backfill_wait
             4    remapped
             3    active+recovery_wait+degraded+remapped
             1    active+degraded+remapped+backfilling

  io:
    recovery: 298 MB/s, 77 objects/s
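
As a cross-check of the status output above, the per-state PG counts can be tallied and compared with the reported percentages. This is a minimal sketch using only the numbers pasted above; the rule that a state counts as "not active" when it lacks the "active" flag and is not "unknown" is an assumption made here, not something taken from the Ceph source.

```python
# PG counts per state, copied from the "pgs:" section of the status output.
pg_states = {
    "active+clean": 2062,
    "peering": 1221,
    "active+remapped+backfill_wait": 535,
    "remapped+peering": 181,
    "unknown": 77,
    "active+remapped+backfilling": 13,
    "active+undersized+degraded+remapped+backfill_wait": 7,
    "remapped": 4,
    "active+recovery_wait+degraded+remapped": 3,
    "active+degraded+remapped+backfilling": 1,
}

total = sum(pg_states.values())
unknown = pg_states["unknown"]

# Assumption: "not active" means the state has no "active" flag and is not "unknown".
not_active = sum(
    n for state, n in pg_states.items()
    if state != "unknown" and "active" not in state.split("+")
)

pct_unknown = round(100.0 * unknown / total, 3)
pct_not_active = round(100.0 * not_active / total, 3)

print(f"total PGs:     {total}")
print(f"unknown:       {unknown} ({pct_unknown}%)")
print(f"not active:    {not_active} ({pct_not_active}%)")
print(f"unknown + not active: {unknown + not_active}")
```

Under that assumption the tally reproduces the figures in the output: 4104 PGs in total, 1.876% unknown, 34.259% not active, and 1483 PGs stuck inactive once unknown and not-active PGs are added together.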
Attachments: pg3.3c6.query, pg3.3c6-query2.zip