On Tue, 19 Mar 2019, Patrick Donnelly wrote:
> > > This would effectively be hacking an IOP bias in for omap-centric
> > > pools, on the data size-versus-heat tradeoff we've always had to deal
> > > with. A more generic solution would be for the balancer to explicitly
> > > account for both data size and IO activity. Right?
> >
> > Yeah, that's a better way to frame it!
> >
> > > Now, this is definitely not necessarily a better solution for the
> > > needs we have, but are we comfortable taking the quick hack instead?
> > > Especially since the omap data is often on a different set of drives,
> > > I'm not totally sure we need a size-based equalizer...
> >
> > So, instead of a configurable for an omap multiplier, perhaps we
> > have a per-pool property that is an IOPS bias (e.g., 10x in this case). I
> > think this is a situation where we don't/can't automagically determine
> > that bias by measuring workload, because workload and heat are ephemeral
> > while placement and rebalancing are hugely expensive. We wouldn't want to
> > adjust placement automatically.
> >
> > How about a pool property pg_autoscale_bias, and we have rgw and cephfs
> > set it automatically somewhere on the appropriate pools?
>
> I'm wondering if the bias is really necessary if we can just set
> pg_num_min at file system metadata pool / rgw index pool creation (or
> before turning on the autoscaler)? I would think that the difference
> between e.g. 32 PGs versus 64 PGs will not be significant for a
> metadata pool in terms of recovery or performance when we're looking
> at only a hundred or so gigabytes of omap data. The difference between
> 4 PGs and 32 PGs *is* significant, though. So maybe setting a
> reasonable min is enough?

Do you mean a min of 32 for *any* pool?  That would be a problem for
pools like device_health_metrics.  If we set a PG min on omap pools like
the rgw index and cephfs metadata, then it's the same amount of "work"
as setting a multiplier for those pools.
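(For concreteness, the min-based approach amounts to something like the
following before the autoscaler kicks in.  Pool names here are just
examples, and the pg_num_min property is the one under discussion, not
necessarily what ships:)

```shell
# Hypothetical sketch: give the omap-heavy pools a PG floor before
# enabling the autoscaler, so they start with decent distribution.
ceph osd pool set cephfs_metadata pg_num_min 32
ceph osd pool set default.rgw.buckets.index pg_num_min 32
```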
But I think you might be right that a min of 32 may make more sense in
those cases, since we don't tend to have a zillion of them and we want
good distribution out of the gate when they are empty.  I'm somewhat
inclined to still have a multiplier, though, so that they also continue
to scale up when they get big...

sage
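To make the tradeoff concrete, here is a rough back-of-the-envelope
sketch (not the actual autoscaler code; the formula, names, and
parameters are illustrative) of how a bias multiplier and a pg_num_min
floor would each move the target PG count for a ~100 GB metadata pool in
a 1 PB, 100-OSD cluster:

```python
# Hypothetical sketch of how a pg_autoscale_bias multiplier and a
# pg_num_min floor could combine in a size-based target calculation.

def next_power_of_two(n):
    # Smallest power of two >= n.
    p = 1
    while p < n:
        p *= 2
    return p

def target_pg_num(pool_bytes, total_bytes, osd_count,
                  target_pg_per_osd=100, bias=1.0, pg_num_min=1):
    # Size-based share of the cluster-wide PG budget, scaled by the bias,
    # rounded up to a power of two, then clamped to the floor.
    share = pool_bytes / total_bytes if total_bytes else 0.0
    raw = share * osd_count * target_pg_per_osd * bias
    return max(next_power_of_two(max(1, round(raw))), pg_num_min)

# ~100 GB metadata pool in a 1 PB cluster with 100 OSDs:
print(target_pg_num(100e9, 1e15, 100))              # size alone -> 1 PG
print(target_pg_num(100e9, 1e15, 100, bias=10))     # 10x bias   -> 16 PGs
print(target_pg_num(100e9, 1e15, 100, pg_num_min=32))  # floor    -> 32 PGs
```

The floor fixes the empty-pool case immediately, while the bias also
keeps scaling the pool up as it grows, which is the distinction drawn
above.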