Re: ceph mons de-synced from rest of cluster?

On Sun, Feb 11, 2018 at 8:19 PM Chris Apsey <bitskrieg@xxxxxxxxxxxxx> wrote:
All,

We recently doubled the number of OSDs in our cluster, and towards the
end of the rebalancing I noticed that recovery IO fell to nothing and
that the mons eventually reported the following when I ran ceph -s:

       cluster:
         id:     6a65c3d0-b84e-4c89-bbf7-a38a1966d780
         health: HEALTH_WARN
                 34922/4329975 objects misplaced (0.807%)
                 Reduced data availability: 542 pgs inactive, 49 pgs peering, 13502 pgs stale
                 Degraded data redundancy: 248778/4329975 objects degraded (5.745%), 7319 pgs unclean, 2224 pgs degraded, 1817 pgs undersized

       services:
         mon: 3 daemons, quorum cephmon-0,cephmon-1,cephmon-2
         mgr: cephmon-0(active), standbys: cephmon-1, cephmon-2
         osd: 376 osds: 376 up, 376 in

       data:
         pools:   9 pools, 13952 pgs
         objects: 1409k objects, 5992 GB
         usage:   31528 GB used, 1673 TB / 1704 TB avail
         pgs:     3.225% pgs unknown
                  0.659% pgs not active
                  248778/4329975 objects degraded (5.745%)
                  34922/4329975 objects misplaced (0.807%)
                  6141 stale+active+clean
                  4537 stale+active+remapped+backfilling
                  1575 stale+active+undersized+degraded
                  489  stale+active+clean+remapped
                  450  unknown
                  396  stale+active+recovery_wait+degraded
                   216  stale+active+undersized+degraded+remapped+backfilling
                  40   stale+peering
                  30   stale+activating
                  24   stale+active+undersized+remapped
                  22   stale+active+recovering+degraded
                  13   stale+activating+degraded
                  9    stale+remapped+peering
                  4    stale+active+remapped+backfill_wait
                  3    stale+active+clean+scrubbing+deep
                   2    stale+active+undersized+degraded+remapped+backfill_wait
                  1    stale+active+remapped

The problem is, everything works fine.  If I run ceph health detail and
do a pg query against one of the 'degraded' placement groups, it reports
back as active+clean.  All clients in the cluster can write and read at
normal speeds, but no IO information is ever reported in ceph -s.
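
For reference, that check is roughly the following (the pg id below is
just a placeholder; substitute one of the pgs flagged by ceph health
detail):

       ceph health detail | grep -i degraded | head
       # query one of the supposedly degraded pgs (pg id is an example)
       ceph pg 7.1a query | grep '"state"'

and the state comes back as active+clean rather than anything degraded
or stale.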

From what I can see, everything in the cluster is working properly
except the actual reporting of the cluster's status.  Has anyone seen
this before, or know how to sync the mons up to what the OSDs are
actually reporting?  I see no connectivity errors in the logs of the
mons or the osds.

It sounds like the manager has gone stale somehow. You can probably fix it by restarting it, though if you have logs it would be good to file a bug report at tracker.ceph.com.
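
For example, with the active mgr on cephmon-0 as in your status output,
something like the following (assuming the standard systemd unit names)
should do it:

       # on cephmon-0, restart the active mgr daemon
       systemctl restart ceph-mgr@cephmon-0
       # or, from any admin node, force a failover to a standby
       ceph mgr fail cephmon-0

Once a fresh mgr is active, the pg states and IO stats in ceph -s
should start updating again.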
-Greg
 

Thanks,

---
v/r

Chris Apsey
bitskrieg@xxxxxxxxxxxxx
https://www.bitskrieg.net
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
