On Tue, 19 Mar 2019, Gregory Farnum wrote:
> On Tue, Mar 19, 2019 at 7:07 AM Sage Weil <sweil@xxxxxxxxxx> wrote:
> >
> > Hi Patrick, Casey, everyone,
> >
> > The new PG autoscaler uses the 'USED' value you see in 'ceph df' or 'ceph
> > osd pool autoscale-status' to decide how many PGs out of the cluster total
> > each pool should get.  Basically we have a target pg count per OSD, each
> > pool has some replication/ec multiplier, and data is proportionally
> > distributed among pools (or the admin has fed in those ratios based on
> > expected usage).
> >
> > This is fine and good, except that I see on the lab cluster this:
> >
> > sage@reesi001:~$ sudo ceph osd pool autoscale-status
> >  POOL                          SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  PG_NUM  NEW PG_NUM  AUTOSCALE
> >  device_health_metrics            0               3.0         431.3T  0.0000                     1              warn
> >  default.rgw.buckets.non-ec       0               3.0         431.3T  0.0000                     8              warn
> >  default.rgw.meta              1336               3.0         431.3T  0.0000                     8              warn
> >  default.rgw.buckets.index        0               3.0         431.3T  0.0000                     8              warn
> >  default.rgw.control              0               3.0         431.3T  0.0000                     8              warn
> >  default.rgw.buckets.data    743.5G               3.0         431.3T  0.0050                    32              on
> >  .rgw.root                     1113               3.0         431.3T  0.0000                     8              warn
> >  djf_tmp                     879.1G               3.0         431.3T  0.0060                  4096          32  off
> >  libvirt-pool                 2328M               3.0         431.3T  0.0000                  3000           4  off
> >  data                        75679G               3.0         431.3T  0.5140                  4096              warn
> >  default.rgw.log              7713k               3.0         431.3T  0.0000                     8              warn
> >  metadata                    62481M               4.0         431.3T  0.0006                    64           4  off
> >
> > Notice 'metadata' (for cephfs) is ~64 GB, but the autoscaler thinks it
> > should only have 4 PGs (the default minimum; it probably thinks less than
> > that).  That's because it's 1/1000th the size of the data pool (75 TB).
> >
> > But... I think collapsing all of that metadata into so few PGs and OSDs
> > will be bad for performance, and since omap is more expensive to recover
> > than data, those PGs will be more sticky.
> >
> > My current thought is that we could have a configurable multiplier for
> > omap bytes when calculating the relative "size" of the pool, maybe
> > default to 10x or something.  In the above example, that would make
> > metadata look more like 1/100th the size of data, which would give it
> > more like ~40 PGs.  (Actually the autoscaler always picks a power of 2,
> > and only initiates a change if we're off from the 'optimum' value by
> > > 3x, so anything from 32 to 128 in this example would still result in
> > no change and a happy cluster.)
> >
> > I see the same thing on my home cluster:
> >
> >  POOL                     SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  PG_NUM  NEW PG_NUM  AUTOSCALE
> >  ar_metadata            40307M               4.0          84769G  0.0019                    64           4  off
> >  device_health_metrics  136.7M               3.0          84769G  0.0000                     4              on
> >  ar_data                 6753G                1.5         52162G  0.1942                   512              on
> >  foo                     1408M                3.0         84769G  0.0000                     4              on
> >  ar_data_cache           2275G                4.0         84769G  0.1074                     4         128  on
> >
> > I assume the rgw index pool will suffer from the same issue on big RGW
> > clusters, although we don't have much RGW data in the lab so I haven't
> > seen it.
> >
> > WDYT?
> > sage
>
> This would effectively be hacking an IOP bias in for omap-centric
> pools, on the data size-versus-heat tradeoff we've always had to deal
> with.  A more generic solution would be for the balancer to explicitly
> account for both data size and IO activity.  Right?

Yeah, that's a better way to frame it!

> Now, this is definitely not necessarily a better solution for the
> needs we have, but are we comfortable taking the quick hack instead?
> Especially since the omap data is often on a different set of drives,
> I'm not totally sure we need a size-based equalizer...
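(For readers following along: the heuristic described in the quoted mail works
out to roughly the sketch below.  The function and parameter names and the
100-PGs-per-OSD target are illustrative assumptions, not the actual
pg_autoscaler mgr module code.)

  import math

  def suggested_pg_num(ratio, replica_size, num_osds,
                       target_pgs_per_osd=100, min_pg_num=4):
      """Pick a power-of-two PG count from the pool's share of raw capacity.

      'ratio' is the RATIO column above: raw bytes used by the pool
      divided by the cluster's raw capacity.
      """
      pg_budget = num_osds * target_pgs_per_osd       # cluster-wide PG target
      ideal = max(ratio * pg_budget / replica_size, min_pg_num)
      return 2 ** round(math.log2(ideal))             # nearest power of two

  def needs_change(current_pg_num, suggested):
      """Only act when the pool is off from the suggestion by more than 3x."""
      return current_pg_num > suggested * 3 or current_pg_num * 3 < suggested

Plugging in the 'metadata' row above (RATIO 0.0006, RATE 4.0) lands at or near
the 4-PG floor for realistic OSD counts, which is the behaviour Sage is
describing.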
So, instead of a configurable omap multiplier, perhaps we have a per-pool
property that is an IOPS bias (e.g., 10x in this case).  I think this is a
situation where we don't/can't automagically determine that bias by measuring
workload, because workload and heat are ephemeral while placement and
rebalancing are hugely expensive; we wouldn't want to adjust placement
automatically.

How about a pool property pg_autoscale_bias, and we have rgw and cephfs set it
automatically on the appropriate pools?

sage
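(A minimal sketch of how such a pg_autoscale_bias property could plug into the
calculation above.  The property name is the one proposed in this mail; how
the value is stored and wired through is assumed here, not the eventual
implementation.)

  def biased_ratio(pool_raw_used, cluster_raw_capacity, pg_autoscale_bias=1.0):
      """Inflate an omap-heavy pool's apparent share of the cluster.

      cephfs/rgw would set pg_autoscale_bias (e.g. 10.0) on their metadata
      and index pools; everything else keeps the default of 1.0.  The result
      feeds into suggested_pg_num() from the earlier sketch.
      """
      return (pool_raw_used * pg_autoscale_bias) / cluster_raw_capacity

With a bias of 10, the cephfs metadata pool above goes from roughly 1/1000th
to 1/100th of the data pool's apparent size, pushing the suggestion off the
4-PG floor and into the 32-128 range that the 3x threshold would already
consider acceptable.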