PGs stuck peering (looping?) after upgrade to Luminous.

Hi,

We started upgrading a Ceph cluster from Jewel (10.2.10) to Luminous (12.2.6).

After upgrading and restarting the mons everything looked OK: the mons had quorum, all OSDs were up and in, and all the PGs were active+clean.
But before I had time to start upgrading the OSDs it became obvious that something had gone terribly wrong.
All of a sudden 1600 out of 4100 PGs were inactive and 40% of the data was misplaced!

The mons appear OK and all OSDs are still up and in, but a few hours later there were still 1483 PGs stuck inactive, essentially all of them in peering!
Investigating one of the stuck PGs, it appears to be looping between “inactive”, “remapped+peering” and “peering”, and the epoch number is rising fast; see the attached pg query outputs.
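For anyone wanting to reproduce the check: this is roughly how I pulled the state and epoch out of the `ceph pg <pgid> query` JSON to watch the loop. A minimal sketch, assuming the Luminous JSON layout with top-level `state` and `epoch` keys (the epoch, OSD ids, and the trimmed sample below are made-up illustration values, not from the attached dumps):

```python
import json

def pg_summary(query_json):
    """Extract the PG state and osdmap epoch from `ceph pg <pgid> query` output."""
    q = json.loads(query_json)
    return q["state"], q["epoch"]

# Trimmed, hypothetical query output; run the summary repeatedly to see the
# state cycle between peering/remapped+peering while the epoch climbs.
sample = '{"state": "remapped+peering", "epoch": 12345, "up": [5, 17], "acting": [5, 17]}'
state, epoch = pg_summary(sample)
print(state, epoch)  # remapped+peering 12345
```

Feeding successive `ceph pg 3.3c6 query --format json` outputs through this shows the same states repeating while the epoch keeps increasing.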

We really can’t afford to lose the cluster or the data, so any help or suggestions on how to debug or fix this issue would be very, very appreciated!


    health: HEALTH_ERR
            1483 pgs are stuck inactive for more than 60 seconds
            542 pgs backfill_wait
            14 pgs backfilling
            11 pgs degraded
            1402 pgs peering
            3 pgs recovery_wait
            11 pgs stuck degraded
            1483 pgs stuck inactive
            2042 pgs stuck unclean
            7 pgs stuck undersized
            7 pgs undersized
            111 requests are blocked > 32 sec
            10586 requests are blocked > 4096 sec
            recovery 9472/11120724 objects degraded (0.085%)
            recovery 1181567/11120724 objects misplaced (10.625%)
            noout flag(s) set
            mon.eselde02u32 low disk space
 
  services:
    mon: 3 daemons, quorum eselde02u32,eselde02u33,eselde02u34
    mgr: eselde02u32(active), standbys: eselde02u33, eselde02u34
    osd: 111 osds: 111 up, 111 in; 800 remapped pgs
         flags noout
 
  data:
    pools:   18 pools, 4104 pgs
    objects: 3620k objects, 13875 GB
    usage:   42254 GB used, 160 TB / 201 TB avail
    pgs:     1.876% pgs unknown
             34.259% pgs not active
             9472/11120724 objects degraded (0.085%)
             1181567/11120724 objects misplaced (10.625%)
             2062 active+clean
             1221 peering
             535  active+remapped+backfill_wait
             181  remapped+peering
             77   unknown
             13   active+remapped+backfilling
             7    active+undersized+degraded+remapped+backfill_wait
             4    remapped
             3    active+recovery_wait+degraded+remapped
             1    active+degraded+remapped+backfilling
 
  io:
    recovery: 298 MB/s, 77 objects/s
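The degraded/misplaced percentages in the status output are just the object ratios; a quick sanity check of the numbers reported above:

```python
# Recompute the ratios shown by `ceph status`, using the object counts above.
degraded, misplaced, total = 9472, 1181567, 11120724

print(f"degraded:  {degraded / total:.3%}")   # 0.085%
print(f"misplaced: {misplaced / total:.3%}")  # 10.625%
```

Both match the `ceph status` output, so the accounting itself looks consistent; it's the peering loop that is the problem.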

Attachment: pg3.3c6.query
Description: Binary data

<<attachment: pg3.3c6-query2.zip>>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
