On Mon, Mar 18, 2019 at 11:21 PM Sage Weil <sweil@xxxxxxxxxx> wrote:
>
> On Tue, 19 Mar 2019, Gregory Farnum wrote:
> > On Tue, Mar 19, 2019 at 7:07 AM Sage Weil <sweil@xxxxxxxxxx> wrote:
> > >
> > > Hi Patrick, Casey, everyone,
> > >
> > > The new PG autoscaler uses the 'USED' value you see in 'ceph df' or 'ceph
> > > osd pool autoscale-status' to decide how many PGs out of the cluster total
> > > each pool should get. Basically we have a target pg count per OSD, each
> > > pool has some replication/ec multiplier, and data is proportionally
> > > distributed among pools (or the admin has fed in those ratios based on
> > > expected usage).
> > >
> > > This is fine and good, except that I see on the lab cluster this:
> > >
> > > sage@reesi001:~$ sudo ceph osd pool autoscale-status
> > > POOL                        SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  PG_NUM  NEW PG_NUM  AUTOSCALE
> > > device_health_metrics       0                    3.0   431.3T        0.0000                1                   warn
> > > default.rgw.buckets.non-ec  0                    3.0   431.3T        0.0000                8                   warn
> > > default.rgw.meta            1336                 3.0   431.3T        0.0000                8                   warn
> > > default.rgw.buckets.index   0                    3.0   431.3T        0.0000                8                   warn
> > > default.rgw.control         0                    3.0   431.3T        0.0000                8                   warn
> > > default.rgw.buckets.data    743.5G               3.0   431.3T        0.0050                32                  on
> > > .rgw.root                   1113                 3.0   431.3T        0.0000                8                   warn
> > > djf_tmp                     879.1G               3.0   431.3T        0.0060                4096    32          off
> > > libvirt-pool                2328M                3.0   431.3T        0.0000                3000    4           off
> > > data                        75679G               3.0   431.3T        0.5140                4096                warn
> > > default.rgw.log             7713k                3.0   431.3T        0.0000                8                   warn
> > > metadata                    62481M               4.0   431.3T        0.0006                64      4           off
> > >
> > > Notice 'metadata' (for cephfs) is ~64 GB, but the autoscaler thinks it
> > > should only have 4 PGs (the default minimum; it probably thinks less than
> > > that). That's because it's 1/1000th the size of the data pool (75 TB).
> > >
> > > But... I think collapsing all of that metadata into so few PGs and OSDs
> > > will be bad for performance, and since omap is more expensive to recover
> > > than data, those PGs will be more sticky.
> > >
> > > My current thought is that we could have a configurable multiplier for omap
> > > bytes when calculating the relative "size" of the pool, maybe default to
> > > 10x or something. In the above example, that would make metadata look
> > > more like 1/100th the size of data, which would give it more like ~40 PGs.
> > > (Actually the autoscaler always picks a power of 2, and only initiates a
> > > change if we're off from the 'optimum' value by > 3x, so anything from 32
> > > to 128 in this example would still result in no change and a happy
> > > cluster.)
> > >
> > > I see the same thing on my home cluster:
> > >
> > > POOL                    SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  PG_NUM  NEW PG_NUM  AUTOSCALE
> > > ar_metadata             40307M               4.0   84769G        0.0019                64      4           off
> > > device_health_metrics   136.7M               3.0   84769G        0.0000                4                   on
> > > ar_data                 6753G                1.5   52162G        0.1942                512                 on
> > > foo                     1408M                3.0   84769G        0.0000                4                   on
> > > ar_data_cache           2275G                4.0   84769G        0.1074                4       128         on
> > >
> > > I assume the rgw index pool will suffer from the same issue on big RGW
> > > clusters, although we don't have much RGW data in the lab so I haven't
> > > seen it.
> > >
> > > WDYT?
> > > sage
> >
> > This would effectively be hacking an IOP bias in for omap-centric
> > pools, on the data size-versus-heat tradeoff we've always had to deal
> > with. A more generic solution would be for the balancer to explicitly
> > account for both data size and IO activity. Right?
>
> Yeah, that's a better way to frame it!
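
(For illustration, here is a rough back-of-the-envelope sketch of the
proportional calculation described above, using the numbers from the lab
cluster output. This is not the actual mgr/pg_autoscaler code; the helper
names and the assumed cluster-wide PG budget are made up for the example.)

    import math

    # Toy version of the proportional share calculation: each pool gets a
    # share of the cluster's PG budget proportional to its (optionally
    # biased) raw usage, rounded to the nearest power of 2.
    def nearest_power_of_two(n):
        return 2 ** int(round(math.log2(max(n, 1))))

    def suggested_pg_num(effective_sizes, pool, root_pg_target, pg_min=4):
        share = effective_sizes[pool] / sum(effective_sizes.values())
        return max(nearest_power_of_two(share * root_pg_target), pg_min)

    GiB = 2 ** 30
    # Raw usage ~= SIZE * RATE from the table above; the other, tiny pools
    # are ignored here.
    sizes = {
        'data':     75679 * GiB * 3.0,
        'metadata': 62481 / 1024 * GiB * 4.0,
    }
    root_pg_target = 4000  # assumed: target PGs per OSD times OSD count

    print(suggested_pg_num(sizes, 'metadata', root_pg_target))   # -> 4 (the floor)

    # With the proposed 10x omap bias applied to the metadata pool:
    sizes['metadata'] *= 10
    print(suggested_pg_num(sizes, 'metadata', root_pg_target))   # -> 32

    # (The real autoscaler additionally only acts when the current pg_num
    # is off from this target by more than ~3x.)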
>
> > Now, this is definitely not necessarily a better solution for the
> > needs we have, but are we comfortable taking the quick hack instead?
> > Especially since the omap data is often on a different set of drives,
> > I'm not totally sure we need a size-based equalizer...
>
> So, instead of a configurable omap multiplier, perhaps we could
> have a per-pool property that is an IOPS bias (e.g., 10x in this case). I
> think this is a situation where we don't/can't automagically determine
> that bias by measuring workload, because workload and heat are ephemeral
> while placement and rebalancing are hugely expensive. We wouldn't want to
> adjust placement automatically.
>
> How about a pool property pg_autoscale_bias, and we have rgw and cephfs
> set it automatically somewhere on the appropriate pools?

I'm wondering if the bias is really necessary if we can just set
pg_num_min at file system metadata pool / rgw index pool creation (or
before turning on the autoscaler)? I would think that the difference
between e.g. 32 PGs versus 64 PGs will not be significant for a metadata
pool in terms of recovery or performance when we're looking at only a
hundred or so gigabytes of omap data. The difference between 4 PGs and
32 PGs *is* significant though. So, maybe setting a reasonable min is
enough?

--
Patrick Donnelly
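
(For illustration, the kind of thing suggested above might look like the
following, assuming the per-pool pg_num_min and pg_autoscale_mode
properties behave as in current releases; the pool name is hypothetical:)

    # Pin a floor of 32 PGs on the CephFS metadata pool, then let the
    # autoscaler manage it; it should not shrink the pool below pg_num_min.
    ceph osd pool set cephfs_metadata pg_num_min 32
    ceph osd pool set cephfs_metadata pg_autoscale_mode on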