Hi Patrick, Casey, everyone,

The new PG autoscaler uses the 'USED' value you see in 'ceph df' or
'ceph osd pool autoscale-status' to decide how many PGs out of the
cluster total each pool should get.  Basically we have a target pg
count per OSD, each pool has some replication/ec multiplier, and data
is proportionally distributed among pools (or the admin has fed in
those ratios based on expected usage).

This is fine and good, except that I see on the lab cluster this:

sage@reesi001:~$ sudo ceph osd pool autoscale-status
 POOL                        SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  PG_NUM  NEW PG_NUM  AUTOSCALE
 device_health_metrics       0                    3.0   431.3T        0.0000                1                   warn
 default.rgw.buckets.non-ec  0                    3.0   431.3T        0.0000                8                   warn
 default.rgw.meta            1336                 3.0   431.3T        0.0000                8                   warn
 default.rgw.buckets.index   0                    3.0   431.3T        0.0000                8                   warn
 default.rgw.control         0                    3.0   431.3T        0.0000                8                   warn
 default.rgw.buckets.data    743.5G               3.0   431.3T        0.0050                32                  on
 .rgw.root                   1113                 3.0   431.3T        0.0000                8                   warn
 djf_tmp                     879.1G               3.0   431.3T        0.0060                4096    32          off
 libvirt-pool                2328M                3.0   431.3T        0.0000                3000    4           off
 data                        75679G               3.0   431.3T        0.5140                4096                warn
 default.rgw.log             7713k                3.0   431.3T        0.0000                8                   warn
 metadata                    62481M               4.0   431.3T        0.0006                64      4           off

Notice 'metadata' (for cephfs) is ~64 GB, but the autoscaler thinks it
should only have 4 PGs (the default minimum; it probably thinks less
than that).  That's because it's 1/1000th the size of the data pool
(75 TB).  But... I think collapsing all of that metadata into so few
PGs and OSDs will be bad for performance, and since omap is more
expensive to recover than data, those PGs will be more sticky.

My current thought is that we could have a configurable multiplier for
omap bytes when calculating the relative "size" of the pool, maybe
default to 10x or something.  In the above example, that would make
metadata look more like 1/100th the size of data, which would give it
more like ~40 PGs.  (Actually the autoscaler always picks a power of
2, and only initiates a change if we're off from the 'optimum' value
by > 3x, so anything from 32 to 128 in this example would still result
in no change and a happy cluster.)

I see the same thing on my home cluster:

 POOL                   SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  PG_NUM  NEW PG_NUM  AUTOSCALE
 ar_metadata            40307M               4.0   84769G        0.0019                64      4           off
 device_health_metrics  136.7M               3.0   84769G        0.0000                4                   on
 ar_data                6753G                1.5   52162G        0.1942                512                 on
 foo                    1408M                3.0   84769G        0.0000                4                   on
 ar_data_cache          2275G                4.0   84769G        0.1074                4       128         on

I assume the rgw index pool will suffer from the same issue on big RGW
clusters, although we don't have much RGW data in the lab so I haven't
seen it.

WDYT?
sage
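
P.S. A rough sketch of what I mean, in the spirit of the mgr autoscaler
module but not a patch against it -- the weight constant, helper names,
and per-pool input fields are all made up, and it assumes we can get
omap bytes per pool separately from data bytes:

  # Hypothetical sketch: weight omap bytes when computing each pool's
  # share of the cluster-wide PG budget.

  OMAP_BYTES_WEIGHT = 10.0   # proposed config knob, default 10x

  def nearest_power_of_2(n):
      # the autoscaler always picks a power of two
      if n <= 1:
          return 1
      low = 1 << (int(n).bit_length() - 1)
      high = low * 2
      return low if n - low < high - n else high

  def effective_raw_size(pool):
      # count omap bytes 10x when sizing the pool; 'rate' is the
      # replication factor or EC overhead (e.g. 3.0 or 1.5)
      return (pool['data_bytes']
              + OMAP_BYTES_WEIGHT * pool['omap_bytes']) * pool['rate']

  def pg_targets(pools, num_osds, target_pg_per_osd=100):
      # split the PG budget proportionally by effective size
      total = sum(effective_raw_size(p) for p in pools.values()) or 1
      budget = num_osds * target_pg_per_osd
      out = {}
      for name, p in pools.items():
          ratio = effective_raw_size(p) / total
          # divide by the rate so we count PGs, not PG replicas/shards
          out[name] = max(4, nearest_power_of_2(ratio * budget / p['rate']))
      return out

The weight would only affect the autoscaler's sizing math, not what
'ceph df' reports; with the lab numbers above it pushes the cephfs
metadata pool into the 32-128 PG range instead of the minimum.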