On Tue, Mar 19, 2019 at 7:07 AM Sage Weil <sweil@xxxxxxxxxx> wrote:
>
> Hi Patrick, Casey, everyone,
>
> The new PG autoscaler uses the 'USED' value you see in 'ceph df' or
> 'ceph osd pool autoscale-status' to decide how many PGs out of the
> cluster total each pool should get.  Basically we have a target pg
> count per OSD, each pool has some replication/ec multiplier, and data
> is proportionally distributed among pools (or the admin has fed in
> those ratios based on expected usage).
>
> This is fine and good, except that I see this on the lab cluster:
>
> sage@reesi001:~$ sudo ceph osd pool autoscale-status
>  POOL                            SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  PG_NUM  NEW PG_NUM  AUTOSCALE
>  device_health_metrics              0                3.0        431.3T  0.0000                     1              warn
>  default.rgw.buckets.non-ec         0                3.0        431.3T  0.0000                     8              warn
>  default.rgw.meta                1336                3.0        431.3T  0.0000                     8              warn
>  default.rgw.buckets.index          0                3.0        431.3T  0.0000                     8              warn
>  default.rgw.control                0                3.0        431.3T  0.0000                     8              warn
>  default.rgw.buckets.data      743.5G                3.0        431.3T  0.0050                    32              on
>  .rgw.root                       1113                3.0        431.3T  0.0000                     8              warn
>  djf_tmp                       879.1G                3.0        431.3T  0.0060                  4096          32  off
>  libvirt-pool                   2328M                3.0        431.3T  0.0000                  3000           4  off
>  data                          75679G                3.0        431.3T  0.5140                  4096              warn
>  default.rgw.log                7713k                3.0        431.3T  0.0000                     8              warn
>  metadata                      62481M                4.0        431.3T  0.0006                    64           4  off
>
> Notice 'metadata' (for cephfs) is ~64 GB, but the autoscaler thinks it
> should only have 4 PGs (the default minimum; it probably computes even
> fewer than that).  That's because it's 1/1000th the size of the data
> pool (75 TB).
>
> But... I think collapsing all of that metadata into so few PGs and
> OSDs will be bad for performance, and since omap is more expensive to
> recover than data, those PGs will be more sticky.
>
> My current thought is that we could have a configurable multiplier for
> omap bytes when calculating the relative "size" of the pool, maybe
> defaulting to 10x or something.  In the above example, that would make
> metadata look more like 1/100th the size of data, which would give it
> more like ~40 PGs.  (Actually the autoscaler always picks a power of
> 2, and only initiates a change if we're off from the 'optimum' value
> by > 3x, so anything from 32 to 128 in this example would still result
> in no change and a happy cluster.)
>
> I see the same thing on my home cluster:
>
>  POOL                            SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  PG_NUM  NEW PG_NUM  AUTOSCALE
>  ar_metadata                   40307M                4.0        84769G  0.0019                    64           4  off
>  device_health_metrics         136.7M                3.0        84769G  0.0000                     4              on
>  ar_data                        6753G                1.5        52162G  0.1942                   512              on
>  foo                            1408M                3.0        84769G  0.0000                     4              on
>  ar_data_cache                  2275G                4.0        84769G  0.1074                     4         128  on
>
> I assume the rgw index pool will suffer from the same issue on big RGW
> clusters, although we don't have much RGW data in the lab so I haven't
> seen it.
>
> WDYT?
> sage

This would effectively be hacking in an IOPS bias for omap-centric
pools, on top of the data size-versus-heat tradeoff we've always had to
deal with. A more generic solution would be for the balancer to
explicitly account for both data size and IO activity. Right?

Now, the generic approach isn't necessarily a better solution for the
needs we have, but are we comfortable taking the quick hack instead?
Especially since the omap data is often on a different set of drives,
I'm not totally sure we need a size-based equalizer...
-Greg
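
For reference, here is a minimal Python sketch of the bias Sage is
describing (illustrative only, not the actual mgr/pg_autoscaler code).
The function names, the flat 4096-PG budget, and treating the two pools
as pure data versus pure omap are assumptions made just to show the
arithmetic; the 10x weight is the default floated above.

    import math

    # Illustrative sketch only; not the actual mgr/pg_autoscaler code.
    # biased_pool_bytes, suggested_pg_num, and the flat PG budget are
    # hypothetical names/inputs chosen to demonstrate the arithmetic.

    def biased_pool_bytes(data_bytes, omap_bytes, omap_weight=10.0):
        """Relative 'size' of a pool, counting omap bytes more heavily."""
        return data_bytes + omap_weight * omap_bytes

    def suggested_pg_num(pool_bytes, total_bytes, pg_budget, pg_min=4):
        """Give the pool its proportional share of the PG budget, rounded
        to the nearest power of two (per the mail, the autoscaler only
        acts if the current value is off from this by more than ~3x)."""
        share = pool_bytes / total_bytes if total_bytes else 0.0
        ideal = max(pg_min, share * pg_budget)
        return 2 ** round(math.log2(ideal))

    # Roughly the lab numbers above: ~75T of object data next to ~64G of
    # cephfs metadata that is almost entirely omap.
    data_pool = biased_pool_bytes(75679 * 2**30, 0)
    meta_pool = biased_pool_bytes(0, 62481 * 2**20)
    total = data_pool + meta_pool

    print(suggested_pg_num(meta_pool, total, pg_budget=4096))
    # prints 32; with omap_weight=1.0 the pool sits on the 4-PG floor

An rgw index pool, being almost entirely omap, would be pulled off the
PG floor in the same way, which is the case Sage expects to show up on
big RGW clusters.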