On Tue, 19 Mar 2019, Gregory Farnum wrote:
> On Tue, Mar 19, 2019 at 7:07 AM Sage Weil <sweil@xxxxxxxxxx> wrote:
> >
> > Hi Patrick, Casey, everyone,
> >
> > The new PG autoscaler uses the 'USED' value you see in 'ceph df' or 'ceph
> > osd pool autoscale-status' to decide how many PGs out of the cluster total
> > each pool should get.  Basically we have a target pg count per OSD, each
> > pool has some replication/ec multiplier, and data is proportionally
> > distributed among pools (or the admin has fed in those ratios based on
> > expected usage).
> >
> > This is fine and good, except that I see on the lab cluster this:
> >
> > sage@reesi001:~$ sudo ceph osd pool autoscale-status
> >  POOL                          SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  PG_NUM  NEW PG_NUM  AUTOSCALE
> >  device_health_metrics            0               3.0         431.3T  0.0000                     1              warn
> >  default.rgw.buckets.non-ec       0               3.0         431.3T  0.0000                     8              warn
> >  default.rgw.meta              1336               3.0         431.3T  0.0000                     8              warn
> >  default.rgw.buckets.index        0               3.0         431.3T  0.0000                     8              warn
> >  default.rgw.control              0               3.0         431.3T  0.0000                     8              warn
> >  default.rgw.buckets.data    743.5G               3.0         431.3T  0.0050                    32              on
> >  .rgw.root                     1113               3.0         431.3T  0.0000                     8              warn
> >  djf_tmp                     879.1G               3.0         431.3T  0.0060                  4096          32  off
> >  libvirt-pool                 2328M               3.0         431.3T  0.0000                  3000           4  off
> >  data                        75679G               3.0         431.3T  0.5140                  4096              warn
> >  default.rgw.log              7713k               3.0         431.3T  0.0000                     8              warn
> >  metadata                    62481M               4.0         431.3T  0.0006                    64           4  off
> >
> > Notice 'metadata' (for cephfs) is ~64 GB, but the autoscaler thinks it
> > should only have 4 PGs (the default minimum; it probably thinks less than
> > that).  That's because it's 1/1000th the size of the data pool (75 TB).
> >
> > But... I think collapsing all of that metadata into so few PGs and OSDs
> > will be bad for performance, and since omap is more expensive to recover
> > than data, those PGs will be more sticky.
> >
> > My current thought is that we could have a configurable multiplier for
> > omap bytes when calculating the relative "size" of the pool, maybe
> > default to 10x or something.  In the above example, that would make
> > metadata look more like 1/100th the size of data, which would give it
> > more like ~40 PGs.  (Actually the autoscaler always picks a power of 2,
> > and only initiates a change if we're off from the 'optimum' value by
> > > 3x, so anything from 32 to 128 in this example would still result in
> > no change and a happy cluster.)
> >
> > I see the same thing on my home cluster:
> >
> >  POOL                     SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  PG_NUM  NEW PG_NUM  AUTOSCALE
> >  ar_metadata            40307M               4.0          84769G  0.0019                    64           4  off
> >  device_health_metrics  136.7M               3.0          84769G  0.0000                     4              on
> >  ar_data                 6753G                1.5         52162G  0.1942                   512              on
> >  foo                     1408M                3.0         84769G  0.0000                     4              on
> >  ar_data_cache           2275G                4.0         84769G  0.1074                     4         128  on
> >
> > I assume the rgw index pool will suffer from the same issue on big RGW
> > clusters, although we don't have much RGW data in the lab so I haven't
> > seen it.
> >
> > WDYT?
> > sage
>
> This would effectively be hacking an IOP bias in for omap-centric
> pools, on the data size-versus-heat tradeoff we've always had to deal
> with.  A more generic solution would be for the balancer to explicitly
> account for both data size and IO activity.  Right?

Yeah, that's a better way to frame it!

> Now, this is definitely not necessarily a better solution for the
> needs we have, but are we comfortable taking the quick hack instead?
> Especially since the omap data is often on a different set of drives,
> I'm not totally sure we need a size-based equalizer...
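(For readers following along: the heuristic described in the quoted mail works
out to roughly the sketch below.  The function and parameter names and the
100-PGs-per-OSD target are illustrative assumptions, not the actual
pg_autoscaler mgr module code.)

  import math

  def suggested_pg_num(ratio, replica_size, num_osds,
                       target_pgs_per_osd=100, min_pg_num=4):
      """Pick a power-of-two PG count from the pool's share of raw capacity.

      'ratio' is the RATIO column above: raw bytes used by the pool
      divided by the cluster's raw capacity.
      """
      pg_budget = num_osds * target_pgs_per_osd       # cluster-wide PG target
      ideal = max(ratio * pg_budget / replica_size, min_pg_num)
      return 2 ** round(math.log2(ideal))             # nearest power of two

  def needs_change(current_pg_num, suggested):
      """Only act when the pool is off from the suggestion by more than 3x."""
      return current_pg_num > suggested * 3 or current_pg_num * 3 < suggested

Plugging in the 'metadata' row above (RATIO 0.0006, RATE 4.0) lands at or near
the 4-PG floor for realistic OSD counts, which is the behaviour Sage is
describing.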
So, instead of a configurable omap multiplier, perhaps we have a per-pool
property that is an IOPS bias (e.g., 10x in this case).  I think this is a
situation where we don't/can't automagically determine that bias by measuring
workload, because workload and heat are ephemeral while placement and
rebalancing are hugely expensive; we wouldn't want to adjust placement
automatically.

How about a pool property pg_autoscale_bias, and we have rgw and cephfs set it
automatically on the appropriate pools?

sage
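(A minimal sketch of how such a pg_autoscale_bias property could plug into the
calculation above.  The property name is the one proposed in this mail; how
the value is stored and wired through is assumed here, not the eventual
implementation.)

  def biased_ratio(pool_raw_used, cluster_raw_capacity, pg_autoscale_bias=1.0):
      """Inflate an omap-heavy pool's apparent share of the cluster.

      cephfs/rgw would set pg_autoscale_bias (e.g. 10.0) on their metadata
      and index pools; everything else keeps the default of 1.0.  The result
      feeds into suggested_pg_num() from the earlier sketch.
      """
      return (pool_raw_used * pg_autoscale_bias) / cluster_raw_capacity

With a bias of 10, the cephfs metadata pool above goes from roughly 1/1000th
to 1/100th of the data pool's apparent size, pushing the suggestion off the
4-PG floor and into the 32-128 range that the 3x threshold would already
consider acceptable.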