On Tue, Mar 19, 2019 at 7:07 AM Sage Weil <sweil@xxxxxxxxxx> wrote:
>
> Hi Patrick, Casey, everyone,
>
> The new PG autoscaler uses the 'USED' value you see in 'ceph df' or
> 'ceph osd pool autoscale-status' to decide how many PGs out of the
> cluster total each pool should get.  Basically we have a target pg
> count per OSD, each pool has some replication/ec multiplier, and data
> is proportionally distributed among pools (or the admin has fed in
> those ratios based on expected usage).
>
> This is fine and good, except that I see this on the lab cluster:
>
> sage@reesi001:~$ sudo ceph osd pool autoscale-status
>  POOL                            SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  PG_NUM  NEW PG_NUM  AUTOSCALE
>  device_health_metrics              0                3.0        431.3T  0.0000                     1              warn
>  default.rgw.buckets.non-ec         0                3.0        431.3T  0.0000                     8              warn
>  default.rgw.meta                1336                3.0        431.3T  0.0000                     8              warn
>  default.rgw.buckets.index          0                3.0        431.3T  0.0000                     8              warn
>  default.rgw.control                0                3.0        431.3T  0.0000                     8              warn
>  default.rgw.buckets.data      743.5G                3.0        431.3T  0.0050                    32              on
>  .rgw.root                       1113                3.0        431.3T  0.0000                     8              warn
>  djf_tmp                       879.1G                3.0        431.3T  0.0060                  4096          32  off
>  libvirt-pool                   2328M                3.0        431.3T  0.0000                  3000           4  off
>  data                          75679G                3.0        431.3T  0.5140                  4096              warn
>  default.rgw.log                7713k                3.0        431.3T  0.0000                     8              warn
>  metadata                      62481M                4.0        431.3T  0.0006                    64           4  off
>
> Notice 'metadata' (for cephfs) is ~64 GB, but the autoscaler thinks it
> should only have 4 PGs (the default minimum; it probably computes even
> fewer than that).  That's because it's 1/1000th the size of the data
> pool (75 TB).
>
> But... I think collapsing all of that metadata into so few PGs and
> OSDs will be bad for performance, and since omap is more expensive to
> recover than data, those PGs will be more sticky.
>
> My current thought is that we could have a configurable multiplier for
> omap bytes when calculating the relative "size" of the pool, maybe
> defaulting to 10x or something.  In the above example, that would make
> metadata look more like 1/100th the size of data, which would give it
> more like ~40 PGs.  (Actually the autoscaler always picks a power of
> 2, and only initiates a change if we're off from the 'optimum' value
> by > 3x, so anything from 32 to 128 in this example would still result
> in no change and a happy cluster.)
>
> I see the same thing on my home cluster:
>
>  POOL                            SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  PG_NUM  NEW PG_NUM  AUTOSCALE
>  ar_metadata                   40307M                4.0        84769G  0.0019                    64           4  off
>  device_health_metrics         136.7M                3.0        84769G  0.0000                     4              on
>  ar_data                        6753G                1.5        52162G  0.1942                   512              on
>  foo                            1408M                3.0        84769G  0.0000                     4              on
>  ar_data_cache                  2275G                4.0        84769G  0.1074                     4         128  on
>
> I assume the rgw index pool will suffer from the same issue on big RGW
> clusters, although we don't have much RGW data in the lab so I haven't
> seen it.
>
> WDYT?
> sage

This would effectively be hacking in an IOPS bias for omap-centric
pools, on top of the data size-versus-heat tradeoff we've always had to
deal with. A more generic solution would be for the balancer to
explicitly account for both data size and IO activity. Right?

Now, the generic approach isn't necessarily a better solution for the
needs we have, but are we comfortable taking the quick hack instead?
Especially since the omap data is often on a different set of drives,
I'm not totally sure we need a size-based equalizer...
-Greg
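
For reference, here is a minimal Python sketch of the bias Sage is
describing (illustrative only, not the actual mgr/pg_autoscaler code).
The function names, the flat 4096-PG budget, and treating the two pools
as pure data versus pure omap are assumptions made just to show the
arithmetic; the 10x weight is the default floated above.

    import math

    # Illustrative sketch only; not the actual mgr/pg_autoscaler code.
    # biased_pool_bytes, suggested_pg_num, and the flat PG budget are
    # hypothetical names/inputs chosen to demonstrate the arithmetic.

    def biased_pool_bytes(data_bytes, omap_bytes, omap_weight=10.0):
        """Relative 'size' of a pool, counting omap bytes more heavily."""
        return data_bytes + omap_weight * omap_bytes

    def suggested_pg_num(pool_bytes, total_bytes, pg_budget, pg_min=4):
        """Give the pool its proportional share of the PG budget, rounded
        to the nearest power of two (per the mail, the autoscaler only
        acts if the current value is off from this by more than ~3x)."""
        share = pool_bytes / total_bytes if total_bytes else 0.0
        ideal = max(pg_min, share * pg_budget)
        return 2 ** round(math.log2(ideal))

    # Roughly the lab numbers above: ~75T of object data next to ~64G of
    # cephfs metadata that is almost entirely omap.
    data_pool = biased_pool_bytes(75679 * 2**30, 0)
    meta_pool = biased_pool_bytes(0, 62481 * 2**20)
    total = data_pool + meta_pool

    print(suggested_pg_num(meta_pool, total, pg_budget=4096))
    # prints 32; with omap_weight=1.0 the pool sits on the 4-PG floor

An rgw index pool, being almost entirely omap, would be pulled off the
PG floor in the same way, which is the case Sage expects to show up on
big RGW clusters.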