Hi Kevin,
Unfortunately restarting the OSDs doesn't appear to help; instead it seems to make things worse, with PGs getting stuck degraded.
Best regards
/Magnus
2018-07-11 20:46 GMT+02:00 Kevin Olbrich <ko@xxxxxxx>:
Sounds a little bit like the problem I had on my OSDs, see the earlier thread "Blocked requests activating+remapped after extending pg(p)_num".

I ended up restarting the OSDs which were stuck in that state and they immediately fixed themselves. It should also work to just "out" the problem OSDs and immediately bring them back in again to fix it.
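Roughly something like this, untested against your cluster and assuming a systemd-based deployment (substitute the id of a stuck OSD for <id>):

    # restart the OSD daemon that the stuck PGs map to
    systemctl restart ceph-osd@<id>

    # or mark it out and bring it straight back in
    ceph osd out <id>
    ceph osd in <id>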
- Kevin

2018-07-11 20:30 GMT+02:00 Magnus Grönlund <magnus@xxxxxxxxxxx>:

Hi,

Started to upgrade a ceph-cluster from Jewel (10.2.10) to Luminous (12.2.6). After upgrading and restarting the mons everything looked OK, the mons had quorum, all OSDs were up and in and all the PGs were active+clean. But before I had time to start upgrading the OSDs it became obvious that something had gone terribly wrong.

All of a sudden 1600 out of 4100 PGs were inactive and 40% of the data was misplaced!

The mons appear OK and all OSDs are still up and in, but a few hours later there were still 1483 PGs stuck inactive, essentially all of them in peering. Investigating one of the stuck PGs, it appears to be looping between "inactive", "remapped+peering" and "peering", and the epoch number is rising fast; see the attached pg query outputs (the commands used are sketched after the status output below).

We really can't afford to lose the cluster or the data, so any help or suggestions on how to debug or fix this issue would be very, very appreciated!

  health: HEALTH_ERR
          1483 pgs are stuck inactive for more than 60 seconds
          542 pgs backfill_wait
          14 pgs backfilling
          11 pgs degraded
          1402 pgs peering
          3 pgs recovery_wait
          11 pgs stuck degraded
          1483 pgs stuck inactive
          2042 pgs stuck unclean
          7 pgs stuck undersized
          7 pgs undersized
          111 requests are blocked > 32 sec
          10586 requests are blocked > 4096 sec
          recovery 9472/11120724 objects degraded (0.085%)
          recovery 1181567/11120724 objects misplaced (10.625%)
          noout flag(s) set
          mon.eselde02u32 low disk space

  services:
    mon: 3 daemons, quorum eselde02u32,eselde02u33,eselde02u34
    mgr: eselde02u32(active), standbys: eselde02u33, eselde02u34
    osd: 111 osds: 111 up, 111 in; 800 remapped pgs
         flags noout

  data:
    pools:   18 pools, 4104 pgs
    objects: 3620k objects, 13875 GB
    usage:   42254 GB used, 160 TB / 201 TB avail
    pgs:     1.876% pgs unknown
             34.259% pgs not active
             9472/11120724 objects degraded (0.085%)
             1181567/11120724 objects misplaced (10.625%)
             2062 active+clean
             1221 peering
             535  active+remapped+backfill_wait
             181  remapped+peering
             77   unknown
             13   active+remapped+backfilling
             7    active+undersized+degraded+remapped+backfill_wait
             4    remapped
             3    active+recovery_wait+degraded+remapped
             1    active+degraded+remapped+backfilling

  io:
    recovery: 298 MB/s, 77 objects/s
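In case it helps, this is roughly how I have been looking at the stuck PGs (the <pgid> below is just a placeholder, not one of our actual PG ids):

    # list the PGs that are stuck inactive
    ceph pg dump_stuck inactive

    # query one of them to watch which states it loops through
    ceph pg <pgid> query

    # overall detail, including the blocked requests
    ceph health detail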
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com