On Tue, 19 Mar 2019, Gregory Farnum wrote:
> On Tue, Mar 19, 2019 at 11:51 AM Sage Weil <sweil@xxxxxxxxxx> wrote:
> >
> > On Tue, 19 Mar 2019, Gregory Farnum wrote:
> > > On Tue, Mar 19, 2019 at 7:07 AM Sage Weil <sweil@xxxxxxxxxx> wrote:
> > > >
> > > > Hi Patrick, Casey, everyone,
> > > >
> > > > The new PG autoscaler uses the 'USED' value you see in 'ceph df' or 'ceph
> > > > osd pool autoscale-status' to decide how many PGs out of the cluster total
> > > > each pool should get. Basically we have a target pg count per OSD, each
> > > > pool has some replication/ec multiplier, and data is proportionally
> > > > distributed among pools (or the admin has fed in those ratios based on
> > > > expected usage).
> > > >
> > > > This is fine and good, except that I see on the lab cluster this:
> > > >
> > > > sage@reesi001:~$ sudo ceph osd pool autoscale-status
> > > >  POOL                         SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  PG_NUM  NEW PG_NUM  AUTOSCALE
> > > >  device_health_metrics           0               3.0         431.3T  0.0000                     1              warn
> > > >  default.rgw.buckets.non-ec      0               3.0         431.3T  0.0000                     8              warn
> > > >  default.rgw.meta             1336               3.0         431.3T  0.0000                     8              warn
> > > >  default.rgw.buckets.index       0               3.0         431.3T  0.0000                     8              warn
> > > >  default.rgw.control             0               3.0         431.3T  0.0000                     8              warn
> > > >  default.rgw.buckets.data   743.5G               3.0         431.3T  0.0050                    32              on
> > > >  .rgw.root                    1113               3.0         431.3T  0.0000                     8              warn
> > > >  djf_tmp                    879.1G               3.0         431.3T  0.0060                  4096          32  off
> > > >  libvirt-pool                2328M               3.0         431.3T  0.0000                  3000           4  off
> > > >  data                       75679G               3.0         431.3T  0.5140                  4096              warn
> > > >  default.rgw.log             7713k               3.0         431.3T  0.0000                     8              warn
> > > >  metadata                   62481M               4.0         431.3T  0.0006                    64           4  off
> > > >
> > > > Notice 'metadata' (for cephfs) is ~64 GB, but the autoscaler thinks it
> > > > should only have 4 PGs (the default minimum; it probably thinks less than
> > > > that). That's because it's 1/1000th the size of the data pool (75 TB).
> > > >
> > > > But... I think collapsing all of that metadata into so few PGs and OSDs
> > > > will be bad for performance, and since omap is more expensive to recover
> > > > than data, those PGs will be more sticky.
> > > >
> > > > My current thought is that we could have a configurable multiplier for omap
> > > > bytes when calculating the relative "size" of the pool, maybe default to
> > > > 10x or something. In the above example, that would make metadata look
> > > > more like 1/100th the size of data, which would give it more like ~40 PGs.
> > > > (Actually the autoscaler always picks a power of 2, and only initiates a
> > > > change if we're off from the 'optimum' value by > 3x, so anything from 32
> > > > to 128 in this example would still result in no change and a happy
> > > > cluster.)
> > > >
> > > > I see the same thing on my home cluster:
> > > >
> > > >  POOL                    SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  PG_NUM  NEW PG_NUM  AUTOSCALE
> > > >  ar_metadata            40307M               4.0         84769G  0.0019                    64           4  off
> > > >  device_health_metrics  136.7M               3.0         84769G  0.0000                     4              on
> > > >  ar_data                 6753G                1.5         52162G  0.1942                   512              on
> > > >  foo                     1408M                3.0         84769G  0.0000                     4              on
> > > >  ar_data_cache           2275G                4.0         84769G  0.1074                     4         128  on
> > > >
> > > > I assume the rgw index pool will suffer from the same issue on big RGW
> > > > clusters, although we don't have much RGW data in the lab so I haven't
> > > > seen it.
> > > >
> > > > WDYT?
> > > > sage
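
For concreteness, here is a rough sketch of the size-proportional heuristic
described above (a sketch only, not the actual mgr autoscaler code; the 240
OSDs and the 100-PG-per-OSD target below are assumed purely for illustration):

# Hypothetical sketch of the size-based heuristic, not the pg_autoscaler module.
def pick_pg_num(pool_bytes, total_bytes, rate, num_osds,
                current_pg_num, target_pg_per_osd=100, min_pg_num=4):
    # Fraction of the cluster's data this pool is expected to hold.
    ratio = pool_bytes / total_bytes if total_bytes else 0.0
    # The pool's share of the cluster-wide PG budget, divided by its
    # replication/EC multiplier (the RATE column above).
    raw = ratio * num_osds * target_pg_per_osd / rate
    # Round down to a power of two, never going below the minimum.
    ideal = min_pg_num
    while ideal * 2 <= raw:
        ideal *= 2
    # Only suggest a change if the current value is off from the
    # 'optimum' by more than 3x.
    if max(ideal, current_pg_num) <= 3 * min(ideal, current_pg_num):
        return current_pg_num
    return ideal

MiB, TiB = 1 << 20, 1 << 40
# cephfs 'metadata' above: ~61 GiB against ~76 TiB of 'data', so the
# size-only heuristic wants to collapse it from 64 PGs to the minimum.
print(pick_pg_num(62481 * MiB, 76 * TiB, rate=4.0, num_osds=240,
                  current_pg_num=64))   # -> 4
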
> > > This would effectively be hacking an IOP bias in for omap-centric
> > > pools, on the data size-versus-heat tradeoff we've always had to deal
> > > with. A more generic solution would be for the balancer to explicitly
> > > account for both data size and IO activity. Right?
> >
> > Yeah, that's a better way to frame it!
> >
> > > Now, this is definitely not necessarily a better solution for the
> > > needs we have, but are we comfortable taking the quick hack instead?
> > > Especially since the omap data is often on a different set of drives,
> > > I'm not totally sure we need a size-based equalizer...
> >
> > So, instead of a configurable for an omap multiplier, perhaps instead we
> > have a per-pool property that is an IOPS bias (e.g., 10x in this case). I
> > think this is a situation where we don't/can't automagically determine
> > that bias by measuring workload because workload and heat are ephemeral
> > while placement and rebalancing are hugely expensive. We wouldn't want to
> > adjust placement automatically.
>
> I'm not sure that follows. We know the all-time num_rd and num_wr on
> PGs/pools and can set limits on how quickly we react to changes.
>
> I'm thinking about this specifically because for a big RGW install I'd
> expect users to want the balancer to set them up with a pretty large
> pg size on their RGW bucket index pool (a small multiple on their
> total involved OSDs, at least), whereas in some service environments
> you may see an omap pool being created for every tenant who wants an
> FS or an RGW install, and plenty of those will be very small and
> low-IO...
>
> > How about pool property pg_autoscale_bias, and we have rgw and cephfs set
> > those automatically somewhere on the appropriate pools?
>
> But if we don't want to try and account for IO explicitly, yeah, this
> seems perfectly reasonable to me!

Yeah, my concern is just that for every heuristic we propose, we can
easily construct a workload counter-example where it does the wrong
thing. So to start, let's just do the simple thing that's sufficient for
now and leave the magic for future work.

sage
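
To make the pg_autoscale_bias idea concrete, here is a minimal sketch of how
such a per-pool bias could fold into the size-proportional calculation sketched
earlier in the thread (it reuses the pick_pg_num sketch above; the 1.0 default
and the value of 10 for the cephfs metadata pool are assumptions, not a settled
interface):

# Hypothetical sketch. The bias would come from a per-pool property such as
# the proposed pg_autoscale_bias, with rgw/cephfs setting it on their
# index/metadata pools when those pools are created.
def pick_pg_num_biased(pool_bytes, total_bytes, rate, num_osds,
                       current_pg_num, bias=1.0, **kwargs):
    # The bias simply inflates the pool's apparent size before the same
    # size-proportional calculation runs.
    return pick_pg_num(pool_bytes * bias, total_bytes, rate, num_osds,
                       current_pg_num, **kwargs)

MiB, TiB = 1 << 20, 1 << 40
# With a 10x bias the 'optimum' for the cephfs metadata pool lands around
# 32 PGs, so the existing 64 is within the 3x band and the autoscaler
# leaves the pool alone instead of shrinking it to 4.
print(pick_pg_num_biased(62481 * MiB, 76 * TiB, rate=4.0, num_osds=240,
                         current_pg_num=64, bias=10))   # -> 64 (no change)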