ceph mons de-synced from rest of cluster?

All,

We recently doubled the number of OSDs in our cluster, and towards the end of the rebalancing I noticed that recovery IO fell to nothing and that the mons eventually reported the following when I ran ceph -s:

      cluster:
        id:     6a65c3d0-b84e-4c89-bbf7-a38a1966d780
        health: HEALTH_WARN
                34922/4329975 objects misplaced (0.807%)
                Reduced data availability: 542 pgs inactive, 49 pgs peering, 13502 pgs stale
                Degraded data redundancy: 248778/4329975 objects degraded (5.745%), 7319 pgs unclean, 2224 pgs degraded, 1817 pgs undersized

      services:
        mon: 3 daemons, quorum cephmon-0,cephmon-1,cephmon-2
        mgr: cephmon-0(active), standbys: cephmon-1, cephmon-2
        osd: 376 osds: 376 up, 376 in

      data:
        pools:   9 pools, 13952 pgs
        objects: 1409k objects, 5992 GB
        usage:   31528 GB used, 1673 TB / 1704 TB avail
        pgs:     3.225% pgs unknown
                 0.659% pgs not active
                 248778/4329975 objects degraded (5.745%)
                 34922/4329975 objects misplaced (0.807%)
                 6141 stale+active+clean
                 4537 stale+active+remapped+backfilling
                 1575 stale+active+undersized+degraded
                 489  stale+active+clean+remapped
                 450  unknown
                 396  stale+active+recovery_wait+degraded
                 216  stale+active+undersized+degraded+remapped+backfilling
                 40   stale+peering
                 30   stale+activating
                 24   stale+active+undersized+remapped
                 22   stale+active+recovering+degraded
                 13   stale+activating+degraded
                 9    stale+remapped+peering
                 4    stale+active+remapped+backfill_wait
                 3    stale+active+clean+scrubbing+deep
                 2    stale+active+undersized+degraded+remapped+backfill_wait
                 1    stale+active+remapped

The problem is, everything works fine. If I run ceph health detail and then do a pg query against one of the 'degraded' placement groups, it reports back as active+clean. All clients in the cluster can read and write at normal speeds, but no client IO information is ever reported in ceph -s.
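
For reference, this is roughly what I am checking (the pg id below is just an example, not one of our actual pgs):

      # pick one of the pgs that ceph health detail claims is stuck/degraded
      ceph health detail | grep -m1 'is stuck'
      # query that pg directly -- it comes back active+clean
      ceph pg 1.2f query | grep '"state"'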

From what I can see, everything in the cluster is working properly except the reporting of cluster status itself. Has anyone seen this before, or does anyone know how to re-sync the mons with what the OSDs are actually reporting? I see no connectivity errors in the logs of the mons or the OSDs.
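
In case it is relevant, this is how I have been looking for errors (cephmon-0 is one of our mons, osd.0 is just an example, and paths assume the default log locations):

      # the mons still see every osd as up and in
      ceph osd stat
      # nothing obviously wrong in the mon or osd logs
      grep -i -e error -e fault /var/log/ceph/ceph-mon.cephmon-0.log | tail
      grep -i -e error -e fault /var/log/ceph/ceph-osd.0.log | tail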

Thanks,

---
v/r

Chris Apsey
bitskrieg@xxxxxxxxxxxxx
https://www.bitskrieg.net