On Tue, Dec 5, 2017 at 12:48 PM, Igor Fedotov <ifedotov@xxxxxxx> wrote:
>
>
> On 12/5/2017 1:15 AM, Gregory Farnum wrote:
>>
>> On Mon, Dec 4, 2017 at 6:24 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>>>
>>> It's pretty straightforward to maintain collection-level metadata in the
>>> common case, but I don't see how we can *also* support an O(1) split
>>> operation.
>>
>> You're right we can't know the exact answer, but we already solve this
>> problem for PG object counts and things by doing a fuzzy estimate
>> (just dividing the PG values in two) until a scrub happens. I don't
>> think having to do the same here is a reason to avoid it entirely.
>>
>>
>>> This is why I suggested per-pool metadata. Pool-level
>>> information will still let us roll things up into a 'ceph df' type
>>> summary of how well data in a particular pool is compressing, how
>>> sparse it is, and so on, which should be sufficient for capacity
>>> planning purposes. We'll also have per-OSD (by pool) information,
>>> which will tell us how efficient, e.g., FileStore vs BlueStore is
>>> for a given data set (pool).
>>>
>>> What we don't get is per-PG granularity. I don't think this matters
>>> much, since a user doesn't really care about individual PGs anyway.
>>>
>>> We also don't get perfect accuracy when the cluster is degraded. If
>>> one or more PGs in a pool is undergoing backfill or whatever, the
>>> OSD-level summations will be off. We can *probably* figure out how to
>>> correct for that by scaling the result based on what we know about
>>> the PG recovery progress (e.g., how far along backfill on a PG is,
>>> and ignoring the log-based recovery as insignificant).
>>
>> Users don't care much about per-PG granularity in general, but as you
>> note it breaks down in recovery. More than that, our *balancers* care
>> very much about exactly what's in each PG, don't they?
>>
>>>> - PG instance at each OSD node retrieves collection statistics from
>>>> the ObjectStore when needed, or tracks them in RAM only.
>>>> - Two statistics reports to be distinguished:
>>>> a. Cluster-wide PG report - the processing OSD retrieves statistics
>>>> from both local and remote PGs and sums them on a per-PG basis. E.g.
>>>> total per-PG physical space usage can be obtained this way.
>>>> b. OSD-wide PG report (or just a simple OSD summary report) - the OSD
>>>> collects PG statistics from local PGs only. E.g. logical/physical
>>>> space usage at a specific OSD can be examined this way.
>>>
>>> ...and if we're talking about OSD-level stats, then I don't think any
>>> different update path is needed. We would just have statfs() return a
>>> pool summation for each pool that exists on the OSD as well as the
>>> current osd_stat_t (or whatever it is).
>>>
>>> Does that seem reasonable?
>>
>> I realize I'm describing a "replay" mechanism or a two-phase commit,
>> but I really don't think having delayed stat updates would take much
>> doing. We can modify our in-memory state as soon as the ObjectStore
>> replies back to us, and add a new "stats-persisted-thru" value to the
>> pg_info. On any subsequent writes, we update the pg stats according to
>> what we already know. Then on OSD boot, we compare that value to the
>> last pg write, and query any objects which changed in the unaccounted
>> pg log entries. It's a short, easy pass, right? And we're not talking
>> new blocking queues or anything.
>
> That's what I was thinking about too. Here is a very immature POC for
> this approach, seems doable so far:
>
> https://github.com/ceph/ceph/pull/19350
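For concreteness, here's the rough shape of what I had in mind, as a toy,
self-contained C++ sketch. To be clear, none of these types or names are the
real ones (pg_info_t, PGLog, ObjectStore and friends); they're invented
stand-ins just to show the write path and the boot-time catch-up pass, and
the sketch assumes we know each object's previously-accounted size, which is
part of what we'd still need to sort out:

// Toy model: delayed per-PG stat updates with a "stats persisted thru"
// watermark. All names here are invented for illustration.

#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct object_stat {
  uint64_t allocated_bytes = 0;   // physical bytes the store has allocated
};

struct pg_log_entry {
  uint64_t version = 0;           // monotonically increasing write version
  std::string oid;                // object modified by this write
};

struct pg_stats {
  uint64_t physical_bytes = 0;    // per-PG rollup we want to keep accurate
};

struct pg_info {
  pg_stats stats;
  uint64_t last_update = 0;           // version of the newest write in the PG
  uint64_t stats_persisted_thru = 0;  // newest version folded into stats
};

// Stand-in for asking the store what an object occupies on disk right now.
object_stat store_stat(const std::string& /*oid*/) {
  return object_stat{};  // a real store would report actual allocation
}

// Write path: once the store acks the transaction we know the object's new
// footprint, so update the in-memory rollup and advance the watermark; the
// watermark gets persisted with the next pg_info write, not synchronously.
void on_write_committed(pg_info& info, const pg_log_entry& e,
                        uint64_t previously_accounted_bytes) {
  object_stat now = store_stat(e.oid);
  info.stats.physical_bytes += now.allocated_bytes - previously_accounted_bytes;
  info.stats_persisted_thru = e.version;
}

// Boot path: log entries newer than the persisted watermark touched objects
// whose stats were never folded in, so re-stat just those objects and apply
// the deltas. A short pass over the tail of the pg log.
void catch_up_stats(pg_info& info,
                    const std::vector<pg_log_entry>& log,
                    std::map<std::string, uint64_t>& accounted_bytes) {
  for (const auto& e : log) {
    if (e.version <= info.stats_persisted_thru)
      continue;
    uint64_t& prev = accounted_bytes[e.oid];
    object_stat now = store_stat(e.oid);
    info.stats.physical_bytes += now.allocated_bytes - prev;
    prev = now.allocated_bytes;
  }
  info.stats_persisted_thru = info.last_update;
}

The only new persistent state is the watermark carried in the pg_info;
everything else gets recomputed from the tail of the pg log on boot.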
To evaluate this usefully I think we'd need to see how these updates get
committed if the OSD crashes before they're persisted? I expect that
requires some kind of query interface...which, hrm, is actually a little
more complicated if this is the model. I was just thinking we'd compare
the on-disk allocation info for an object to what we've persisted, but we
actually only keep per-PG stats, right? That's not great. :/
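If we did grow a query interface for that reconciliation, I'd imagine it
being shaped roughly like the following. This is purely hypothetical;
nothing like it exists in ObjectStore today and the names are invented:

// Hypothetical bulk "what does this set of objects occupy on disk" query.
// Invented names; not a real ObjectStore interface.

#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct object_space_info {
  uint64_t logical_bytes = 0;    // size as clients see it
  uint64_t allocated_bytes = 0;  // physical space after compression/sparseness
};

class StatQueryStore {
public:
  virtual ~StatQueryStore() = default;

  // Given the objects named by the unaccounted pg log entries, report their
  // current on-disk accounting so a PG can rebuild its per-PG rollup after a
  // crash without the OSD persisting per-object numbers itself.
  virtual std::map<std::string, object_space_info>
  query_object_space(const std::vector<std::string>& oids) = 0;
};

On boot the PG would feed in the object names from its unaccounted log
entries and fold the results into its per-PG rollup. It doesn't solve the
"compare against what we've persisted" problem by itself, though, since we
still don't keep per-object numbers anywhere.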