Re: ceph mons de-synced from rest of cluster?

On Sun, Feb 11, 2018 at 8:19 PM Chris Apsey <bitskrieg@xxxxxxxxxxxxx> wrote:
All,

We recently doubled the number of OSDs in our cluster, and towards the
end of the rebalancing I noticed that recovery IO fell to nothing and
that the mons eventually reported the following when I ran ceph -s:

       cluster:
         id:     6a65c3d0-b84e-4c89-bbf7-a38a1966d780
         health: HEALTH_WARN
                 34922/4329975 objects misplaced (0.807%)
                 Reduced data availability: 542 pgs inactive, 49 pgs peering, 13502 pgs stale
                 Degraded data redundancy: 248778/4329975 objects degraded (5.745%), 7319 pgs unclean, 2224 pgs degraded, 1817 pgs undersized

       services:
         mon: 3 daemons, quorum cephmon-0,cephmon-1,cephmon-2
         mgr: cephmon-0(active), standbys: cephmon-1, cephmon-2
         osd: 376 osds: 376 up, 376 in

       data:
         pools:   9 pools, 13952 pgs
         objects: 1409k objects, 5992 GB
         usage:   31528 GB used, 1673 TB / 1704 TB avail
         pgs:     3.225% pgs unknown
                  0.659% pgs not active
                  248778/4329975 objects degraded (5.745%)
                  34922/4329975 objects misplaced (0.807%)
                  6141 stale+active+clean
                  4537 stale+active+remapped+backfilling
                  1575 stale+active+undersized+degraded
                  489  stale+active+clean+remapped
                  450  unknown
                  396  stale+active+recovery_wait+degraded
                   216  stale+active+undersized+degraded+remapped+backfilling
                  40   stale+peering
                  30   stale+activating
                  24   stale+active+undersized+remapped
                  22   stale+active+recovering+degraded
                  13   stale+activating+degraded
                  9    stale+remapped+peering
                  4    stale+active+remapped+backfill_wait
                  3    stale+active+clean+scrubbing+deep
                   2    stale+active+undersized+degraded+remapped+backfill_wait
                  1    stale+active+remapped

The problem is, everything works fine.  If I run ceph health detail and
do a pg query against one of the 'degraded' placement groups, it reports
back as active+clean.  All clients in the cluster can write and read at
normal speeds, but no IO information is ever reported in ceph -s.
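
For reference, that check is roughly the following (the pg id below is
just a placeholder; substitute one of the pgs flagged by ceph health
detail):

       ceph health detail | grep -i degraded | head
       # query one of the supposedly degraded pgs (pg id is an example)
       ceph pg 7.1a query | grep '"state"'

and the state comes back as active+clean rather than anything degraded
or stale.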

From what I can see, everything in the cluster is working properly
except the actual reporting of the cluster's status.  Has anyone seen
this before, or know how to sync the mons up to what the OSDs are
actually reporting?  I see no connectivity errors in the logs of the
mons or the osds.

It sounds like the manager has gone stale somehow. You can probably fix it by restarting it, though if you have logs it would be good to file a bug report at tracker.ceph.com.
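
For example, with the active mgr on cephmon-0 as in your status output,
something like the following (assuming the standard systemd unit names)
should do it:

       # on cephmon-0, restart the active mgr daemon
       systemctl restart ceph-mgr@cephmon-0
       # or, from any admin node, force a failover to a standby
       ceph mgr fail cephmon-0

Once a fresh mgr is active, the pg states and IO stats in ceph -s
should start updating again.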
-Greg
 

Thanks,

---
v/r

Chris Apsey
bitskrieg@xxxxxxxxxxxxx
https://www.bitskrieg.net
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
