On Mon, 10 Jul 2017, Ruben Kerkhof wrote:
> On Mon, Jul 10, 2017 at 7:44 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> > On Mon, 10 Jul 2017, Gregory Farnum wrote:
> >> On Mon, Jul 10, 2017 at 12:57 AM Marc Roos <M.Roos@xxxxxxxxxxxxxxxxx> wrote:
> >>
> >> I need a little help with fixing some errors I am having.
> >>
> >> After upgrading from Kraken I'm getting incorrect values reported on
> >> placement groups etc. At first I thought it was because I had been
> >> changing the public cluster IP address range and modifying the monmap
> >> directly, but after deleting and re-adding a monitor this ceph daemon
> >> dump is still incorrect:
> >>
> >> ceph daemon mon.a perf dump cluster
> >> {
> >>     "cluster": {
> >>         "num_mon": 3,
> >>         "num_mon_quorum": 3,
> >>         "num_osd": 6,
> >>         "num_osd_up": 6,
> >>         "num_osd_in": 6,
> >>         "osd_epoch": 3842,
> >>         "osd_bytes": 0,
> >>         "osd_bytes_used": 0,
> >>         "osd_bytes_avail": 0,
> >>         "num_pool": 0,
> >>         "num_pg": 0,
> >>         "num_pg_active_clean": 0,
> >>         "num_pg_active": 0,
> >>         "num_pg_peering": 0,
> >>         "num_object": 0,
> >>         "num_object_degraded": 0,
> >>         "num_object_misplaced": 0,
> >>         "num_object_unfound": 0,
> >>         "num_bytes": 0,
> >>         "num_mds_up": 1,
> >>         "num_mds_in": 1,
> >>         "num_mds_failed": 0,
> >>         "mds_epoch": 816
> >>     }
> >> }
> >>
> >>
> >> Huh, I didn't know that existed.
> >>
> >> So, yep, most of those values aren't updated any more. From a grep, you
> >> can still trust the following (a filtering sketch follows below):
> >> num_mon
> >> num_mon_quorum
> >> num_osd
> >> num_osd_up
> >> num_osd_in
> >> osd_epoch
> >> num_mds_up
> >> num_mds_in
> >> num_mds_failed
> >> mds_epoch
> >>
> >> We might be able to keep updating the others when we get reports from the
> >> manager, but it'd be simpler to just rip them out — I don't think the admin
> >> socket is really the right place to get cluster summary data like this.
> >> Sage, any thoughts?
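
In the meantime, anyone scraping this dump can just filter it down to the
list above. A minimal sketch, assuming the usual CLI and the example mon
name from this thread:

    import json
    import subprocess

    # Fields Greg identified as still maintained in this dump.
    TRUSTED = {
        "num_mon", "num_mon_quorum", "num_osd", "num_osd_up", "num_osd_in",
        "osd_epoch", "num_mds_up", "num_mds_in", "num_mds_failed",
        "mds_epoch",
    }

    # Run the same command as above; 'mon.a' is just the example daemon.
    raw = subprocess.check_output(
        ["ceph", "daemon", "mon.a", "perf", "dump", "cluster"])
    cluster = json.loads(raw)["cluster"]

    # Keep only the values that are still being updated post-Kraken.
    trusted = {k: v for k, v in cluster.items() if k in TRUSTED}
    print(json.dumps(trusted, indent=4))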
> >
> > These were added to fill a gap when operators are collecting everything
> > via collectd or similar.
>
> Indeed, this has been reported as
> https://github.com/collectd/collectd/issues/2345
>
> > Getting the same cluster-level data from
> > multiple mons is redundant but it avoids having to code up a separate
> > collector that polls the CLI or something.
> >
> > I suspect once we're funneling everything through a mgr module this
> > problem will go away and we can remove this.
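
For reference, the asok protocol such a collector speaks is tiny: write a
NUL-terminated JSON command, then read back a 4-byte big-endian length
followed by that many bytes of JSON. A rough sketch, assuming the stock
socket path for mon.a:

    import json
    import socket
    import struct

    def asok_command(path, cmd):
        # Admin socket framing: NUL-terminated JSON command out,
        # 4-byte big-endian length plus JSON payload back.
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        s.connect(path)
        s.sendall(json.dumps(cmd).encode() + b"\0")
        length = struct.unpack(">I", s.recv(4))[0]
        buf = b""
        while len(buf) < length:
            buf += s.recv(length - len(buf))
        s.close()
        return json.loads(buf)

    # Default mon socket path on most installs; adjust for your cluster.
    stats = asok_command("/var/run/ceph/ceph-mon.a.asok",
                         {"prefix": "perf dump"})
    print(stats["cluster"]["num_mon"])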
>
> That would be great; having collectd running on each monitor has always
> felt a bit weird. If anyone wants to contribute patches to the collectd
> Ceph plugin to support the mgr, we would really appreciate that.
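
For anyone else hitting the linked issue, the plugin setup that trips over
this is the usual per-daemon block in collectd.conf, something like (socket
path is the stock default):

    LoadPlugin ceph
    <Plugin ceph>
      <Daemon "mon.a">
        SocketPath "/var/run/ceph/ceph-mon.a.asok"
      </Daemon>
    </Plugin>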
To be clear, what we're currently working on right here is a *prometheus*
module/plugin for mgr that will funnel the metrics for *all* ceph daemons
through a single endpoint to prometheus. I suspect we can easily
include the cluster-level stats there.
I'm not sure what the situation looks like with collectd, or whether there
is any interest in (or work toward) making mgr behave as a proxy for all
of the cluster and daemon stats.
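
To make that concrete, the module side is not much code. A bare-bones
sketch (not the actual prometheus module: the metric selection mirrors the
perf dump above, everything else is illustrative):

    from mgr_module import MgrModule

    class Module(MgrModule):
        def serve(self):
            # serve() is the module's long-running entry point; a real
            # exporter would loop here and publish over HTTP.
            osd_map = self.get('osd_map')   # cluster snapshots exposed
            mon_map = self.get('mon_map')   # by the mgr to its modules
            metrics = {
                'num_mon': len(mon_map['mons']),
                'num_osd': len(osd_map['osds']),
                'num_osd_up': sum(o['up'] for o in osd_map['osds']),
                'num_osd_in': sum(o['in'] for o in osd_map['osds']),
                'osd_epoch': osd_map['epoch'],
            }
            self.log.info('cluster metrics: %s', metrics)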
> > Until then, these are easy
> > to fix by populating from PGMapDigest... my vote is we do that!
>
> Yes please :)
I've added a ticket for luminous:
http://tracker.ceph.com/issues/20563
sage