pg autoscaler vs omap pools

Hi Patrick, Casey, everyone,

The new PG autoscaler uses the 'USED' value you see in 'ceph df' or 'ceph 
osd pool autoscale-status' to decide how many PGs out of the cluster total 
each pool should get.  Basically we have a target pg count per OSD, each 
pool has some replication/ec multiplier, and data is proportionally 
distributed among pools (or the admin has fed in those ratios based on 
expected usage).
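
To make that concrete, here is a stripped-down sketch of the proportional
model (not the actual pg_autoscaler code; it ignores per-pool
replication/EC divisors, crush roots, and target ratio overrides, and all
the names are made up):

def nearest_power_of_2(n):
    """Round n to the nearest power of two (minimum 1)."""
    if n <= 1:
        return 1
    lower = 1 << (n.bit_length() - 1)
    return lower if (n - lower) <= (2 * lower - n) else 2 * lower

def suggest_pg_counts(pool_raw_used, num_osds, target_pgs_per_osd=100,
                      min_pgs=4):
    """pool_raw_used: {pool: raw bytes used, i.e. stored * replication}."""
    total = sum(pool_raw_used.values()) or 1
    pg_budget = num_osds * target_pgs_per_osd
    return {pool: max(min_pgs,
                      nearest_power_of_2(int(round(used / total * pg_budget))))
            for pool, used in pool_raw_used.items()}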

This is fine and good, except that on the lab cluster I see this:

sage@reesi001:~$ sudo ceph osd pool autoscale-status
 POOL                          SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  PG_NUM  NEW PG_NUM  AUTOSCALE 
 device_health_metrics           0                 3.0        431.3T  0.0000                     1              warn      
 default.rgw.buckets.non-ec      0                 3.0        431.3T  0.0000                     8              warn      
 default.rgw.meta             1336                 3.0        431.3T  0.0000                     8              warn      
 default.rgw.buckets.index       0                 3.0        431.3T  0.0000                     8              warn      
 default.rgw.control             0                 3.0        431.3T  0.0000                     8              warn      
 default.rgw.buckets.data    743.5G                3.0        431.3T  0.0050                    32              on        
 .rgw.root                    1113                 3.0        431.3T  0.0000                     8              warn      
 djf_tmp                     879.1G                3.0        431.3T  0.0060                  4096          32  off       
 libvirt-pool                 2328M                3.0        431.3T  0.0000                  3000           4  off       
 data                        75679G                3.0        431.3T  0.5140                  4096              warn      
 default.rgw.log              7713k                3.0        431.3T  0.0000                     8              warn      
 metadata                    62481M                4.0        431.3T  0.0006                    64           4  off       

Notice 'metadata' (for cephfs) is ~64 GB, but the autoscaler thinks it 
should only have 4 PGs (the default minimum; it probably thinks it deserves fewer than 
that).  That's because it's 1/1000th the size of the data pool (75 TB).
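
Back-of-the-envelope with the numbers above, assuming the PG count just
scales linearly with the RATIO column (which already folds in the
replication rate); this is only an approximation of what the autoscaler
actually computes:

data_ratio, data_pgs = 0.5140, 4096
metadata_ratio = 0.0006
ideal = data_pgs * metadata_ratio / data_ratio   # ~4.8 PGs
# nearest power of 2 is 4, i.e. the floor it is already at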

But... I think collapsing all of that metadata into so few PGs and OSDs 
will be bad for performance, and since omap is more expensive to recover 
than data, those PGs will be more sticky.

My current thought is that we could have a configurable multiplier for omap 
bytes when calculating the relative "size" of the pool, maybe default to 
10x or something.  In the above example, that would make metadata look 
more like 1/100th the size of data, which would give it more like ~40 PGs.  
(Actually the autoscaler always picks a power of 2, and only initiates a 
change if we're off from the 'optimum' value by > 3x, so anything from 32
to 128 in this example would still result in no change and a happy 
cluster.)
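
In code, the weighting would be something like this (just a sketch; the
multiplier name, its default, and where the omap byte count comes from are
all hand-waved):

omap_bytes_multiplier = 10   # hypothetical tunable, default up for debate

def effective_size(data_bytes, omap_bytes):
    # logical "size" fed into the proportional PG split above
    return data_bytes + omap_bytes_multiplier * omap_bytes

# The ~64 GB cephfs metadata pool above (call it all omap) then weighs in
# around 640 GB, i.e. ~1/100th of the 75 TB data pool instead of ~1/1000th,
# which is how you land in the ~40 PG ballpark.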

I see the same thing on my home cluster:

 POOL                     SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  PG_NUM  NEW PG_NUM  AUTOSCALE 
 ar_metadata            40307M                4.0        84769G  0.0019                    64           4  off       
 device_health_metrics  136.7M                3.0        84769G  0.0000                     4              on        
 ar_data                 6753G                1.5        52162G  0.1942                   512              on        
 foo                     1408M                3.0        84769G  0.0000                     4              on        
 ar_data_cache           2275G                4.0        84769G  0.1074                     4         128  on        

I assume the rgw index pool will suffer from the same issue on big RGW 
clusters, although we don't have much RGW data in the lab so I haven't 
seen it.

WDYT?
sage


