Re: Re: osd: fine-grained statistics for object space usage

On Tue, Dec 5, 2017 at 1:35 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Tue, 5 Dec 2017, Gregory Farnum wrote:
>> On Tue, Dec 5, 2017 at 12:48 PM, Igor Fedotov <ifedotov@xxxxxxx> wrote:
>> >
>> >
>> > On 12/5/2017 1:15 AM, Gregory Farnum wrote:
>> >>
>> >> On Mon, Dec 4, 2017 at 6:24 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>> >>>
>> >>> It's pretty straightforward to maintain collection-level metadata in the
>> >>> common case, but I don't see how we can *also* support an O(1) split
>> >>> operation.
>> >>
>> >> You're right we can't know the exact answer, but we already solve this
>> >> problem for PG object counts and things by doing a fuzzy estimate
>> >> (just dividing the PG values in two) until a scrub happens. I don't
>> >> think having to do the same here is a reason to avoid it entirely.
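(For concreteness, the kind of halving I mean is roughly the sketch below; the
struct and field names are made up for illustration, not the real pg_stat_t:)

  // Sketch of the "fuzzy estimate" on PG split: halve each counter and flag
  // the result as approximate until the next scrub recomputes it.
  #include <cstdint>

  struct pg_space_stats {
    uint64_t logical_bytes = 0;    // sum of object sizes
    uint64_t allocated_bytes = 0;  // physical bytes after sparseness/compression
    bool stats_are_estimate = false;
  };

  void split_stats(pg_space_stats& parent, pg_space_stats& child) {
    child.logical_bytes   = parent.logical_bytes / 2;
    child.allocated_bytes = parent.allocated_bytes / 2;
    parent.logical_bytes   -= child.logical_bytes;
    parent.allocated_bytes -= child.allocated_bytes;
    // Both halves are now guesses; the next scrub rebuilds exact values.
    parent.stats_are_estimate = child.stats_are_estimate = true;
  }
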
>
> Oh, right, I forgot about that.
>
>> >>> This is why I suggested per-pool metadata.  Pool-level
>> >>> information will still let us roll things up into a 'ceph df' type
>> >>> summary
>> >>> of how well data in a particular pool is compressing, how sparse it is,
>> >>> and so on, which should be sufficient for capacity planning purposes.
>> >>> We'll also have per-OSD (by pool) information, which will tell us how
>> >>> efficient, e.g., FileStore vs BlueStore is for a given data set (pool).
>> >>>
>> >>> What we don't get is per-PG granularity.  I don't think this matters
>> >>> much, since a user doesn't really care about individual PGs anyway.
>> >>>
>> >>> We also don't get perfect accuracy when the cluster is degraded.  If
>> >>> one or more PGs in a pool is undergoing backfill or whatever, the
>> >>> OSD-level summations will be off.  We can *probably* figure out how to
>> >>> correct for that by scaling the result based on what we know about the PG
>> >>> recovery progress (e.g., how far along backfill on a PG is, and ignoring
>> >>> the log-based recovery as insignificant).
>> >>
>> >> Users don't care much about per-PG granularity in general, but as you
>> >> note it breaks down in recovery. More than that, our *balancers* care
>> >> very much about exactly what's in each PG, don't they?
>
> The balancer is hands-off if there is any recovery going on (and throttles
> itself to limit the amount of misplaced/rebalancing data).

Even if the cluster's clean, if it doesn't know the sizes of PGs, it
doesn't know which ones it should shift around, right? Right now I
think it's just going on the summed logical HEAD object sizes, but
there are obvious problems with that in some scenarios.
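(A toy illustration of the problem: two PGs with identical summed logical sizes
can occupy very different amounts of raw space, so a balancer keyed only on
logical size can't tell which move actually helps. The numbers here are
invented:)

  // Two PGs with the same logical size but very different raw usage; a
  // balancer that only sees logical bytes treats them as interchangeable,
  // even though moving A frees five times as much space as moving B.
  #include <cstdint>
  #include <cstdio>

  int main() {
    uint64_t a_logical = 10ull << 30, a_physical = 10ull << 30;  // thick data
    uint64_t b_logical = 10ull << 30, b_physical = 2ull << 30;   // sparse/compressed
    std::printf("logical  A=%llu B=%llu\n", (unsigned long long)a_logical,
                (unsigned long long)b_logical);
    std::printf("physical A=%llu B=%llu\n", (unsigned long long)a_physical,
                (unsigned long long)b_physical);
  }
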

>> >>>> - PG instance at each OSD node retrieves collection statistics from the
>> >>>> ObjectStore when needed, or tracks it in RAM only.
>> >>>> - Two statistics reports to be distinguished:
>> >>>>    a. Cluster-wide PG report - the processing OSD retrieves statistics
>> >>>> from both local and remote PGs and sums them on a per-PG basis. E.g.
>> >>>> total per-PG physical space usage can be obtained this way.
>> >>>>    b. OSD-wide PG report (or just a simple OSD summary report) - the
>> >>>> OSD collects PG statistics from local PGs only. E.g. logical/physical
>> >>>> space usage at a specific OSD can be examined this way.
>> >>>
>> >>> ...and if we're talking about OSD-level stats, then I don't think any
>> >>> different update path is needed.  We would just extend statfs() to return
>> >>> a pool summation for each pool that exists on the OSD, as well as the
>> >>> current osd_stat_t (or whatever it is).
>> >>>
>> >>> Does that seem reasonable?
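(If I'm reading the statfs idea right, it would look something like the sketch
below; the type and method names are placeholders, not the actual ObjectStore
interface:)

  // Hypothetical per-pool rollup alongside the existing whole-store statfs().
  #include <cstdint>
  #include <map>

  struct pool_space_stats {
    uint64_t stored = 0;      // logical bytes written by clients
    uint64_t allocated = 0;   // physical bytes after sparseness
    uint64_t compressed = 0;  // physical bytes after compression
  };

  struct ObjectStoreIfc {
    // the existing statfs() for the whole store stays as-is (elided here);
    // new call: one entry per pool that has data on this OSD
    virtual int pool_statfs(std::map<int64_t, pool_space_stats>* out) = 0;
    virtual ~ObjectStoreIfc() = default;
  };
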
>> >>
>> >> I'm saying it's a "replay" mechanism or a two-phase commit, but I
>> >> really don't think having delayed stat updates would take much doing.
>> >> We can modify our in-memory state as soon as the ObjectStore replies
>> >> back to us, and add a new "stats-persisted-thru" value to the pg_info.
>> >> On any subsequent writes, we update the pg stats according to what we
>> >> already know. Then on OSD boot, we compare that value to the last pg
>> >> write, and query any objects which changed in the unaccounted pg log
>> >> entries. It's a short, easy pass, right? And we're not talking new
>> >> blocking queues or anything.
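(Rough pseudocode of that boot-time catch-up pass; the types and callbacks are
stand-ins, not the real pg_info_t/PGLog interfaces:)

  #include <vector>

  struct eversion {
    unsigned epoch = 0, version = 0;
    bool operator<(const eversion& o) const {
      return epoch < o.epoch || (epoch == o.epoch && version < o.version);
    }
  };

  struct pg_log_entry { eversion version; /* oid, op type, ... */ };

  struct pg_catchup {
    eversion stats_persisted_thru;  // new field carried in pg_info
    eversion last_update;           // last write applied to this PG

    template <typename QueryFn, typename ApplyFn>
    void on_boot(const std::vector<pg_log_entry>& log,
                 QueryFn query_object_stats,   // re-stat one object on disk
                 ApplyFn apply_stat_delta) {   // fold the result into pg stats
      if (!(stats_persisted_thru < last_update))
        return;                                // stats are already caught up
      for (const auto& e : log)                // short pass over the log tail
        if (stats_persisted_thru < e.version)
          apply_stat_delta(query_object_stats(e));
      stats_persisted_thru = last_update;
    }
  };
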
>> >
>> > That's what I was thinking about too. Here is a very immature POC for this
>> > approach; it seems doable so far:
>> >
>> > https://github.com/ceph/ceph/pull/19350
>>
>> To evaluate this usefully I think we'd need to see how these updates
>> get committed if the OSD crashes before they're persisted? I expect
>> that requires some kind of query interface...which, hrm, is actually a
>> little more complicated if this is the model.
>> I was just thinking we'd compare the on-disk allocation info for an
>> object to what we've persisted, but we actually only keep per-PG
>> stats, right? That's not great. :/
>
> This direction makes me very nervous.
>
> Can we figure out what problem the complex approach solves that the simple
> approach doesn't?

Maybe you can explain more clearly how this would work. I'm not really
seeing how to implement it efficiently in FileStore. Maybe maintain
size summations for each collection, and update them whenever we do
clones or truncate/append to files? But I think that would work just
as well for exposing PG-level space stats. So maybe we should do that.
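(Something like the sketch below is what I have in mind; the names are
illustrative, and note that learning the old size may itself require a stat(),
which is exactly where the FileStore efficiency question bites:)

  // Per-collection byte summation updated on every size-changing op.
  #include <cstdint>
  #include <map>
  #include <string>

  struct coll_size_tracker {
    std::map<std::string, int64_t> bytes_per_coll;  // persisted with each txn

    // append, truncate and clone all reduce to "object went old -> new size"
    void on_resize(const std::string& coll, uint64_t old_size, uint64_t new_size) {
      bytes_per_coll[coll] += int64_t(new_size) - int64_t(old_size);
    }
    void on_remove(const std::string& coll, uint64_t old_size) {
      bytes_per_coll[coll] -= int64_t(old_size);
    }
  };
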


> I think the values are something like:
>
>                           master     pool-proposal    2pc-pg-update
>  per-object sparseness      x             ?                ?
>  per-pg sparseness                                         x
>  per-pool sparseness                      x                x
>
> I put ? because for a single object we can just query the backend with
> a stat equivalent (like the fiemap ObjectStore method).  This is what,
> say, rbd or cephfs would need to get an st_blocks value.
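(Deriving an st_blocks-style value is cheap once you have the extent map back
from the backend; a sketch, with the actual fiemap call elided:)

  // extents: offset -> length of allocated data; holes never appear in the map.
  // The 512-byte unit just mirrors what stat(2) reports for st_blocks.
  #include <cstdint>
  #include <map>

  uint64_t st_blocks_from_extents(const std::map<uint64_t, uint64_t>& extents) {
    uint64_t allocated = 0;
    for (const auto& [offset, length] : extents) {
      (void)offset;
      allocated += length;
    }
    return (allocated + 511) / 512;  // round up to 512-byte blocks
  }
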
>
> For a 'ceph df' column, the pool summation is what you need--not a pg
> value.
>
> Is there another user of this information I'm missing?
>
> AFAICS the only real benefit to the 2pc complexity is a value that remains
> perfectly accurate during backfill etc, whereas the pool-level summation
> will drift slightly in that case.  Doesn't seem worth it to me?

Hmm, it would also be great to resolve the "omaps don't count" thing,
which I don't think we have any other solutions for right now? Not
that this really helps much with that — we could add up the size of
input keys and values, but I don't see any way to efficiently support
omap deletes...
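(To make the asymmetry concrete, here is roughly how the two cases differ; this
is illustrative only, not the real transaction interface:)

  // A set carries its own key/value sizes, but a delete only names the keys,
  // so decrementing a counter means reading the old values back.
  #include <cstdint>
  #include <map>
  #include <set>
  #include <string>

  using omap_t = std::map<std::string, std::string>;

  uint64_t omap_set_delta(const omap_t& to_set) {
    uint64_t bytes = 0;
    for (const auto& [k, v] : to_set)
      bytes += k.size() + v.size();      // known at submit time, cheap
    return bytes;
  }

  uint64_t omap_rm_delta(const std::set<std::string>& keys, const omap_t& on_disk) {
    uint64_t bytes = 0;
    for (const auto& k : keys) {
      auto it = on_disk.find(k);         // this lookup is the extra read
      if (it != on_disk.end())           // we'd pay for on every rmkeys
        bytes += k.size() + it->second.size();
    }
    return bytes;
  }
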
-Greg