On Mon, Dec 4, 2017 at 6:24 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> It's pretty straightforward to maintain collection-level metadata in the
> common case, but I don't see how we can *also* support an O(1) split
> operation.

You're right that we can't know the exact answer, but we already solve
this problem for PG object counts and the like by using a fuzzy estimate
(just dividing the PG values in two) until a scrub happens. I don't
think having to do the same here is a reason to avoid it entirely.

> This is why I suggested per-pool metadata. Pool-level
> information will still let us roll things up into a 'ceph df' type summary
> of how well data in a particular pool is compressing, how sparse it is,
> and so on, which should be sufficient for capacity planning purposes.
> We'll also have per-OSD (by pool) information, which will tell us how
> efficient, e.g., FileStore vs BlueStore is for a given data set (pool).
>
> What we don't get is per-PG granularity. I don't think this matters much,
> since a user doesn't really care about individual PGs anyway.
>
> We also don't get perfect accuracy when the cluster is degraded. If
> one or more PGs in a pool is undergoing backfill or whatever, the
> OSD-level summations will be off. We can *probably* figure out how to
> correct for that by scaling the result based on what we know about the PG
> recovery progress (e.g., how far along backfill on a PG is, and ignoring
> the log-based recovery as insignificant).

Users don't care much about per-PG granularity in general, but as you
note it breaks down in recovery. More than that, our *balancers* care
very much about exactly what's in each PG, don't they?

>
>> - PG instance at each OSD node retrieves collection statistics from OS when
>> needed or tracks it in RAM only.
>> - Two statistics reports to be distinguished:
>> a. Cluster-wide PG report - processing OSD retrieves statistics from both
>> local and remote PGs and sums it on per-PG basis. E.g. total per-PG physical
>> space usage can be obtained this way.
>> b. OSD-wide PG report (or just simple OSD summary report) - OSD collects PG
>> statistics from local PGs only. E.g. logical/physical space usage at specific
>> OSD can be examined this way.
>
> ...and if we're talking about OSD-level stats, then I don't think any
> different update path is needed. We would just statfs() to return a pool
> summation for each pool that exists on the OSD as well as the current
> osd_stat_t (or whatever it is).
>
> Does that seem reasonable?

I'm saying it's a "replay" mechanism or a two-phase commit, but I
really don't think having delayed stat updates would take much doing.
We can modify our in-memory state as soon as the ObjectStore replies
back to us, and add a new "stats-persisted-thru" value to the pg_info.
On any subsequent write, we update the persisted pg stats according to
what we already know. Then on OSD boot, we compare that value to the
last pg write, and query any objects which changed in the unaccounted
pg log entries. It's a short, easy pass, right? And we're not talking
about new blocking queues or anything.
-Greg
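
A minimal sketch of the boot-time pass described above, in Python rather
than Ceph's C++. Everything named here is hypothetical: `PG`, `LogEntry`,
`replay_stats_on_boot`, and the `store_sizes` dict that stands in for
querying the ObjectStore. Keeping a per-object accounted size is a
simplifying assumption so that replay can overwrite a stale value; the
thread does not say how the delta would actually be derived.

```python
from dataclasses import dataclass, field


@dataclass
class LogEntry:
    version: int  # monotonically increasing PG log version
    oid: str      # object touched by this write


@dataclass
class PG:
    # Assumption: persisted stats remember the last accounted size per
    # object, so a stale entry can simply be replaced on replay.
    accounted: dict = field(default_factory=dict)
    stats_persisted_thru: int = 0  # last log version folded into the stats
    log: list = field(default_factory=list)

    def total_bytes(self) -> int:
        return sum(self.accounted.values())


def replay_stats_on_boot(pg, store_sizes):
    """Fold in writes logged after pg.stats_persisted_thru.

    store_sizes maps oid -> current physical size; a missing key means
    the object no longer exists. Returns how many objects were re-queried.
    """
    # Only objects touched by unaccounted log entries need a fresh query.
    stale = {e.oid for e in pg.log if e.version > pg.stats_persisted_thru}
    for oid in stale:
        size = store_sizes.get(oid)
        if size is None:
            pg.accounted.pop(oid, None)  # deleted since the last persist
        else:
            pg.accounted[oid] = size
    if pg.log:
        newest = max(e.version for e in pg.log)
        pg.stats_persisted_thru = max(pg.stats_persisted_thru, newest)
    return len(stale)
```

The cost of the pass is bounded by the number of distinct objects in the
unaccounted tail of the log, not by the size of the PG, which is the
property the argument above relies on.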