Re: Re: osd: fine-grained statistics for object space usage

On 12/5/2017 1:15 AM, Gregory Farnum wrote:
On Mon, Dec 4, 2017 at 6:24 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
It's pretty straightforward to maintain collection-level metadata in the
common case, but I don't see how we can *also* support an O(1) split
operation.
You're right we can't know the exact answer, but we already solve this
problem for PG object counts and things by doing a fuzzy estimate
(just dividing the PG values in two) until a scrub happens. I don't
think having to do the same here is a reason to avoid it entirely.
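
For illustration, a minimal sketch of that fuzzy estimate on PG split; the
struct and field names below are made up for the example, not the actual
pg_stat_t fields:

#include <cstdint>

// Hypothetical per-PG space counters; not the real pg_stat_t.
struct pg_space_stats {
  uint64_t num_objects = 0;
  uint64_t logical_bytes = 0;     // bytes as written by clients
  uint64_t allocated_bytes = 0;   // bytes actually allocated on disk
  bool stats_are_estimate = false;
};

// On PG split, hand half of the parent's counters to the child and mark
// both sides as estimates; the next scrub recomputes the exact values.
pg_space_stats split_child_stats(pg_space_stats &parent)
{
  pg_space_stats child;
  child.num_objects     = parent.num_objects / 2;
  child.logical_bytes   = parent.logical_bytes / 2;
  child.allocated_bytes = parent.allocated_bytes / 2;
  child.stats_are_estimate = true;

  parent.num_objects     -= child.num_objects;
  parent.logical_bytes   -= child.logical_bytes;
  parent.allocated_bytes -= child.allocated_bytes;
  parent.stats_are_estimate = true;
  return child;
}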


This is why I suggested per-pool metadata.  Pool-level
information will still let us roll things up into a 'ceph df' type summary
of how well data in a particular pool is compressing, how sparse it is,
and so on, which should be sufficient for capacity planning purposes.
We'll also have per-OSD (by pool) information, which will tell us how
efficient, e.g., FileStore vs BlueStore is for a given data set (pool).
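
As a rough sketch of that rollup, assuming each OSD publishes a per-pool
record and the mon/mgr sums them into the 'ceph df'-style view; the types
and names here are illustrative, not the real Ceph structures:

#include <cstdint>
#include <map>
#include <vector>

// Hypothetical per-pool record reported by one OSD.
struct pool_statfs_sketch {
  uint64_t stored = 0;      // logical bytes written by clients
  uint64_t allocated = 0;   // physical bytes allocated on disk
  uint64_t compressed = 0;  // bytes occupied by compressed data

  pool_statfs_sketch &operator+=(const pool_statfs_sketch &o) {
    stored += o.stored;
    allocated += o.allocated;
    compressed += o.compressed;
    return *this;
  }
};

// pool id -> stats for that pool on one OSD.
using osd_pool_report = std::map<int64_t, pool_statfs_sketch>;

// Sum the per-OSD reports into a cluster-wide, per-pool view suitable
// for a 'ceph df'-style summary.
std::map<int64_t, pool_statfs_sketch>
roll_up(const std::vector<osd_pool_report> &reports)
{
  std::map<int64_t, pool_statfs_sketch> per_pool;
  for (const auto &osd : reports)
    for (const auto &p : osd)
      per_pool[p.first] += p.second;
  return per_pool;
}

// Space efficiency for capacity planning: logical bytes stored per
// allocated byte (captures both compression and sparseness).
double space_efficiency(const pool_statfs_sketch &s)
{
  return s.allocated ? double(s.stored) / double(s.allocated) : 1.0;
}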

What we don't get is per-PG granularity.  I don't think this matters much,
since a user doesn't really care about individual PGs anyway.

We also don't get perfect accuracy when the cluster is degraded.  If
one or more PGs in a pool are undergoing backfill or whatever, the
OSD-level summations will be off.  We can *probably* figure out how to
correct for that by scaling the result based on what we know about the PG
recovery progress (e.g., how far along backfill on a PG is, and ignoring
log-based recovery as insignificant).
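
As an illustration of that kind of correction, a sketch under the
assumption that backfill progress is available as a fraction of objects
copied; all names here are hypothetical:

#include <cstdint>

// Hypothetical view of one PG's usage on this OSD.
struct pg_usage_sample {
  uint64_t logical_bytes = 0;
  uint64_t allocated_bytes = 0;
  bool backfilling = false;
  double backfill_progress = 1.0;  // fraction of objects copied, 0.0 .. 1.0
};

// Extrapolate what the PG will occupy once backfill completes by scaling
// the partial numbers by the progress; log-based recovery is ignored as
// insignificant, as suggested above.
pg_usage_sample corrected_usage(const pg_usage_sample &pg)
{
  if (!pg.backfilling || pg.backfill_progress <= 0.0)
    return pg;
  pg_usage_sample out = pg;
  out.logical_bytes =
      static_cast<uint64_t>(pg.logical_bytes / pg.backfill_progress);
  out.allocated_bytes =
      static_cast<uint64_t>(pg.allocated_bytes / pg.backfill_progress);
  return out;
}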
Users don't care much about per-PG granularity in general, but as you
note it breaks down in recovery. More than that, our *balancers* care
very much about exactly what's in each PG, don't they?

- The PG instance at each OSD node retrieves collection statistics from the
ObjectStore when needed, or tracks it in RAM only.
- Two kinds of statistics reports are to be distinguished (a rough sketch
follows below):
   a. Cluster-wide PG report - the processing OSD retrieves statistics from both
local and remote PGs and sums them on a per-PG basis. E.g. total per-PG physical
space usage can be obtained this way.
   b. OSD-wide PG report (or just a simple OSD summary report) - the OSD collects
PG statistics from local PGs only. E.g. logical/physical space usage at a specific
OSD can be examined this way.
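
A rough sketch of those two report shapes, with made-up types standing in
for the real PG/OSD structures:

#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// (pool, pg seed) pair standing in for a real PG id.
using pg_id = std::pair<int64_t, uint32_t>;

struct pg_usage {
  uint64_t logical_bytes = 0;
  uint64_t physical_bytes = 0;
};

// What a single OSD knows: usage of each PG instance it hosts.
using osd_local_stats = std::map<pg_id, pg_usage>;

// (a) Cluster-wide PG report: sum each PG's usage across every OSD that
// holds an instance of it, e.g. total per-PG physical space usage.
std::map<pg_id, pg_usage>
cluster_wide_report(const std::vector<osd_local_stats> &all_osds)
{
  std::map<pg_id, pg_usage> out;
  for (const auto &osd : all_osds)
    for (const auto &p : osd) {
      out[p.first].logical_bytes  += p.second.logical_bytes;
      out[p.first].physical_bytes += p.second.physical_bytes;
    }
  return out;
}

// (b) OSD-wide summary: logical/physical space used by the PGs local to
// one OSD only.
pg_usage osd_summary(const osd_local_stats &local)
{
  pg_usage sum;
  for (const auto &p : local) {
    sum.logical_bytes  += p.second.logical_bytes;
    sum.physical_bytes += p.second.physical_bytes;
  }
  return sum;
}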
...and if we're talking about OSD-level stats, then I don't think any
different update path is needed.  We would just have statfs() return a pool
summation for each pool that exists on the OSD, as well as the current
osd_stat_t (or whatever it is).

Does that seem reasonable?
I suppose you could call it a "replay" mechanism or a two-phase commit, but I
really don't think having delayed stat updates would take much doing.
We can modify our in-memory state as soon as the ObjectStore replies
back to us, and add a new "stats-persisted-thru" value to the pg_info.
On any subsequent writes, we update the pg stats according to what we
already know. Then on OSD boot, we compare that value to the last pg
write, and query any objects which changed in the unaccounted pg log
entries. It's a short, easy pass, right? And we're not talking new
blocking queues or anything.
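
For what it's worth, here is a minimal sketch of that boot-time pass, with
simplified stand-ins for eversion_t / pg_log_entry_t / pg_info_t; the
query_object_usage callback is a made-up placeholder for re-reading an
object's on-disk usage from the ObjectStore and computing the delta against
what the stats already account for:

#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Simplified stand-in for eversion_t.
struct version_t {
  uint64_t epoch = 0;
  uint64_t v = 0;
  bool operator<(const version_t &o) const {
    return epoch < o.epoch || (epoch == o.epoch && v < o.v);
  }
};

// Simplified stand-in for pg_log_entry_t: which object a write touched.
struct log_entry {
  version_t version;
  std::string oid;
};

struct pg_space_stats {
  int64_t logical_bytes = 0;
  int64_t allocated_bytes = 0;
};

// Simplified stand-in for pg_info_t, with the proposed new field.
struct pg_info {
  version_t last_update;           // last write applied to the PG
  version_t stats_persisted_thru;  // stats are accurate up to here
  pg_space_stats stats;
};

// On OSD boot: if the persisted stats lag behind the last write, walk only
// the unaccounted tail of the pg log and fold in the usage deltas for the
// objects those entries touched.
void catch_up_stats(
    pg_info &info,
    const std::vector<log_entry> &pg_log,
    const std::function<pg_space_stats(const std::string &)> &query_object_usage)
{
  if (!(info.stats_persisted_thru < info.last_update))
    return;  // stats already cover every write
  for (const auto &e : pg_log) {
    if (!(info.stats_persisted_thru < e.version))
      continue;  // this entry is already accounted for
    pg_space_stats delta = query_object_usage(e.oid);
    info.stats.logical_bytes += delta.logical_bytes;
    info.stats.allocated_bytes += delta.allocated_bytes;
  }
  info.stats_persisted_thru = info.last_update;
}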
That's what I was thinking about too. Here is a very immature POC for this approach; it seems doable so far:

https://github.com/ceph/ceph/pull/19350


-Greg



