On Mon, Mar 18, 2019 at 11:21 PM Sage Weil <sweil@xxxxxxxxxx> wrote:
>
> On Tue, 19 Mar 2019, Gregory Farnum wrote:
> > On Tue, Mar 19, 2019 at 7:07 AM Sage Weil <sweil@xxxxxxxxxx> wrote:
> > >
> > > Hi Patrick, Casey, everyone,
> > >
> > > The new PG autoscaler uses the 'USED' value you see in 'ceph df' or 'ceph
> > > osd pool autoscale-status' to decide how many PGs out of the cluster total
> > > each pool should get. Basically we have a target pg count per OSD, each
> > > pool has some replication/ec multiplier, and data is proportionally
> > > distributed among pools (or the admin has fed in those ratios based on
> > > expected usage).
> > >
> > > This is fine and good, except that I see on the lab cluster this:
> > >
> > > sage@reesi001:~$ sudo ceph osd pool autoscale-status
> > > POOL                        SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  PG_NUM  NEW PG_NUM  AUTOSCALE
> > > device_health_metrics       0                    3.0   431.3T        0.0000                1                   warn
> > > default.rgw.buckets.non-ec  0                    3.0   431.3T        0.0000                8                   warn
> > > default.rgw.meta            1336                 3.0   431.3T        0.0000                8                   warn
> > > default.rgw.buckets.index   0                    3.0   431.3T        0.0000                8                   warn
> > > default.rgw.control         0                    3.0   431.3T        0.0000                8                   warn
> > > default.rgw.buckets.data    743.5G               3.0   431.3T        0.0050                32                  on
> > > .rgw.root                   1113                 3.0   431.3T        0.0000                8                   warn
> > > djf_tmp                     879.1G               3.0   431.3T        0.0060                4096    32          off
> > > libvirt-pool                2328M                3.0   431.3T        0.0000                3000    4           off
> > > data                        75679G               3.0   431.3T        0.5140                4096                warn
> > > default.rgw.log             7713k                3.0   431.3T        0.0000                8                   warn
> > > metadata                    62481M               4.0   431.3T        0.0006                64      4           off
> > >
> > > Notice 'metadata' (for cephfs) is ~64 GB, but the autoscaler thinks it
> > > should only have 4 PGs (the default minimum; it probably thinks less than
> > > that). That's because it's 1/1000th the size of the data pool (75 TB).
> > >
> > > But... I think collapsing all of that metadata into so few PGs and OSDs
> > > will be bad for performance, and since omap is more expensive to recover
> > > than data, those PGs will be more sticky.
> > >
> > > My current thought is that we could have a configurable multiplier for omap
> > > bytes when calculating the relative "size" of the pool, maybe default to
> > > 10x or something. In the above example, that would make metadata look
> > > more like 1/100th the size of data, which would give it more like ~40 PGs.
> > > (Actually the autoscaler always picks a power of 2, and only initiates a
> > > change if we're off from the 'optimum' value by > 3x, so anything from 32
> > > to 128 in this example would still result in no change and a happy
> > > cluster.)
> > >
> > > I see the same thing on my home cluster:
> > >
> > > POOL                    SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  PG_NUM  NEW PG_NUM  AUTOSCALE
> > > ar_metadata             40307M               4.0   84769G        0.0019                64      4           off
> > > device_health_metrics   136.7M               3.0   84769G        0.0000                4                   on
> > > ar_data                 6753G                1.5   52162G        0.1942                512                 on
> > > foo                     1408M                3.0   84769G        0.0000                4                   on
> > > ar_data_cache           2275G                4.0   84769G        0.1074                4       128         on
> > >
> > > I assume the rgw index pool will suffer from the same issue on big RGW
> > > clusters, although we don't have much RGW data in the lab so I haven't
> > > seen it.
> > >
> > > WDYT?
> > > sage
> >
> > This would effectively be hacking an IOP bias in for omap-centric
> > pools, on the data size-versus-heat tradeoff we've always had to deal
> > with. A more generic solution would be for the balancer to explicitly
> > account for both data size and IO activity. Right?
>
> Yeah, that's a better way to frame it!
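
(For illustration, here is a rough back-of-the-envelope sketch of the
proportional calculation described above, using the numbers from the lab
cluster output. This is not the actual mgr/pg_autoscaler code; the helper
names and the assumed cluster-wide PG budget are made up for the example.)

    import math

    # Toy version of the proportional share calculation: each pool gets a
    # share of the cluster's PG budget proportional to its (optionally
    # biased) raw usage, rounded to the nearest power of 2.
    def nearest_power_of_two(n):
        return 2 ** int(round(math.log2(max(n, 1))))

    def suggested_pg_num(effective_sizes, pool, root_pg_target, pg_min=4):
        share = effective_sizes[pool] / sum(effective_sizes.values())
        return max(nearest_power_of_two(share * root_pg_target), pg_min)

    GiB = 2 ** 30
    # Raw usage ~= SIZE * RATE from the table above; the other, tiny pools
    # are ignored here.
    sizes = {
        'data':     75679 * GiB * 3.0,
        'metadata': 62481 / 1024 * GiB * 4.0,
    }
    root_pg_target = 4000  # assumed: target PGs per OSD times OSD count

    print(suggested_pg_num(sizes, 'metadata', root_pg_target))   # -> 4 (the floor)

    # With the proposed 10x omap bias applied to the metadata pool:
    sizes['metadata'] *= 10
    print(suggested_pg_num(sizes, 'metadata', root_pg_target))   # -> 32

    # (The real autoscaler additionally only acts when the current pg_num
    # is off from this target by more than ~3x.)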
>
> > Now, this is definitely not necessarily a better solution for the
> > needs we have, but are we comfortable taking the quick hack instead?
> > Especially since the omap data is often on a different set of drives,
> > I'm not totally sure we need a size-based equalizer...
>
> So, instead of a configurable omap multiplier, perhaps we could
> have a per-pool property that is an IOPS bias (e.g., 10x in this case). I
> think this is a situation where we don't/can't automagically determine
> that bias by measuring workload, because workload and heat are ephemeral
> while placement and rebalancing are hugely expensive. We wouldn't want to
> adjust placement automatically.
>
> How about a pool property pg_autoscale_bias, and we have rgw and cephfs
> set it automatically somewhere on the appropriate pools?

I'm wondering if the bias is really necessary if we can just set
pg_num_min at file system metadata pool / rgw index pool creation (or
before turning on the autoscaler)? I would think that the difference
between e.g. 32 PGs versus 64 PGs will not be significant for a metadata
pool in terms of recovery or performance when we're looking at only a
hundred or so gigabytes of omap data. The difference between 4 PGs and
32 PGs *is* significant though. So, maybe setting a reasonable min is
enough?

--
Patrick Donnelly
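
(For illustration, the kind of thing suggested above might look like the
following, assuming the per-pool pg_num_min and pg_autoscale_mode
properties behave as in current releases; the pool name is hypothetical:)

    # Pin a floor of 32 PGs on the CephFS metadata pool, then let the
    # autoscaler manage it; it should not shrink the pool below pg_num_min.
    ceph osd pool set cephfs_metadata pg_num_min 32
    ceph osd pool set cephfs_metadata pg_autoscale_mode on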