On Tue, 19 Mar 2019, Gregory Farnum wrote:
> On Tue, Mar 19, 2019 at 11:51 AM Sage Weil <sweil@xxxxxxxxxx> wrote:
> >
> > On Tue, 19 Mar 2019, Gregory Farnum wrote:
> > > On Tue, Mar 19, 2019 at 7:07 AM Sage Weil <sweil@xxxxxxxxxx> wrote:
> > > >
> > > > Hi Patrick, Casey, everyone,
> > > >
> > > > The new PG autoscaler uses the 'USED' value you see in 'ceph df' or 'ceph
> > > > osd pool autoscale-status' to decide how many PGs out of the cluster total
> > > > each pool should get. Basically we have a target pg count per OSD, each
> > > > pool has some replication/ec multiplier, and data is proportionally
> > > > distributed among pools (or the admin has fed in those ratios based on
> > > > expected usage).
> > > >
> > > > This is fine and good, except that I see on the lab cluster this:
> > > >
> > > > sage@reesi001:~$ sudo ceph osd pool autoscale-status
> > > >  POOL                         SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  PG_NUM  NEW PG_NUM  AUTOSCALE
> > > >  device_health_metrics           0               3.0         431.3T  0.0000                     1              warn
> > > >  default.rgw.buckets.non-ec      0               3.0         431.3T  0.0000                     8              warn
> > > >  default.rgw.meta             1336               3.0         431.3T  0.0000                     8              warn
> > > >  default.rgw.buckets.index       0               3.0         431.3T  0.0000                     8              warn
> > > >  default.rgw.control             0               3.0         431.3T  0.0000                     8              warn
> > > >  default.rgw.buckets.data   743.5G               3.0         431.3T  0.0050                    32              on
> > > >  .rgw.root                    1113               3.0         431.3T  0.0000                     8              warn
> > > >  djf_tmp                    879.1G               3.0         431.3T  0.0060                  4096          32  off
> > > >  libvirt-pool                2328M               3.0         431.3T  0.0000                  3000           4  off
> > > >  data                       75679G               3.0         431.3T  0.5140                  4096              warn
> > > >  default.rgw.log             7713k               3.0         431.3T  0.0000                     8              warn
> > > >  metadata                   62481M               4.0         431.3T  0.0006                    64           4  off
> > > >
> > > > Notice 'metadata' (for cephfs) is ~64 GB, but the autoscaler thinks it
> > > > should only have 4 PGs (the default minimum; it probably thinks less than
> > > > that). That's because it's 1/1000th the size of the data pool (75 TB).
> > > >
> > > > But... I think collapsing all of that metadata into so few PGs and OSDs
> > > > will be bad for performance, and since omap is more expensive to recover
> > > > than data, those PGs will be more sticky.
> > > >
> > > > My current thought is that we could have a configurable multiplier for omap
> > > > bytes when calculating the relative "size" of the pool, maybe default to
> > > > 10x or something. In the above example, that would make metadata look
> > > > more like 1/100th the size of data, which would give it more like ~40 PGs.
> > > > (Actually the autoscaler always picks a power of 2, and only initiates a
> > > > change if we're off from the 'optimum' value by > 3x, so anything from 32
> > > > to 128 in this example would still result in no change and a happy
> > > > cluster.)
> > > >
> > > > I see the same thing on my home cluster:
> > > >
> > > >  POOL                    SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  PG_NUM  NEW PG_NUM  AUTOSCALE
> > > >  ar_metadata            40307M               4.0         84769G  0.0019                    64           4  off
> > > >  device_health_metrics  136.7M               3.0         84769G  0.0000                     4              on
> > > >  ar_data                 6753G                1.5         52162G  0.1942                   512              on
> > > >  foo                     1408M                3.0         84769G  0.0000                     4              on
> > > >  ar_data_cache           2275G                4.0         84769G  0.1074                     4         128  on
> > > >
> > > > I assume the rgw index pool will suffer from the same issue on big RGW
> > > > clusters, although we don't have much RGW data in the lab so I haven't
> > > > seen it.
> > > >
> > > > WDYT?
> > > > sage
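
For concreteness, here is a rough sketch of the size-proportional heuristic
described above (a sketch only, not the actual mgr autoscaler code; the 240
OSDs and the 100-PG-per-OSD target below are assumed purely for illustration):

# Hypothetical sketch of the size-based heuristic, not the pg_autoscaler module.
def pick_pg_num(pool_bytes, total_bytes, rate, num_osds,
                current_pg_num, target_pg_per_osd=100, min_pg_num=4):
    # Fraction of the cluster's data this pool is expected to hold.
    ratio = pool_bytes / total_bytes if total_bytes else 0.0
    # The pool's share of the cluster-wide PG budget, divided by its
    # replication/EC multiplier (the RATE column above).
    raw = ratio * num_osds * target_pg_per_osd / rate
    # Round down to a power of two, never going below the minimum.
    ideal = min_pg_num
    while ideal * 2 <= raw:
        ideal *= 2
    # Only suggest a change if the current value is off from the
    # 'optimum' by more than 3x.
    if max(ideal, current_pg_num) <= 3 * min(ideal, current_pg_num):
        return current_pg_num
    return ideal

MiB, TiB = 1 << 20, 1 << 40
# cephfs 'metadata' above: ~61 GiB against ~76 TiB of 'data', so the
# size-only heuristic wants to collapse it from 64 PGs to the minimum.
print(pick_pg_num(62481 * MiB, 76 * TiB, rate=4.0, num_osds=240,
                  current_pg_num=64))   # -> 4
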
> > > This would effectively be hacking an IOP bias in for omap-centric
> > > pools, on the data size-versus-heat tradeoff we've always had to deal
> > > with. A more generic solution would be for the balancer to explicitly
> > > account for both data size and IO activity. Right?
> >
> > Yeah, that's a better way to frame it!
> >
> > > Now, this is definitely not necessarily a better solution for the
> > > needs we have, but are we comfortable taking the quick hack instead?
> > > Especially since the omap data is often on a different set of drives,
> > > I'm not totally sure we need a size-based equalizer...
> >
> > So, instead of a configurable for an omap multiplier, perhaps instead we
> > have a per-pool property that is an IOPS bias (e.g., 10x in this case). I
> > think this is a situation where we don't/can't automagically determine
> > that bias by measuring workload because workload and heat are ephemeral
> > while placement and rebalancing are hugely expensive. We wouldn't want to
> > adjust placement automatically.
>
> I'm not sure that follows. We know the all-time num_rd and num_wr on
> PGs/pools and can set limits on how quickly we react to changes.
>
> I'm thinking about this specifically because for a big RGW install I'd
> expect users to want the balancer to set them up with a pretty large
> pg size on their RGW bucket index pool (a small multiple on their
> total involved OSDs, at least), whereas in some service environments
> you may see an omap pool being created for every tenant who wants an
> FS or an RGW install, and plenty of those will be very small and
> low-IO...
>
> > How about pool property pg_autoscale_bias, and we have rgw and cephfs set
> > those automatically somewhere on the appropriate pools?
>
> But if we don't want to try and account for IO explicitly, yeah, this
> seems perfectly reasonable to me!

Yeah, my concern is just that for every heuristic we propose, we can
easily construct a workload counter-example where it does the wrong
thing. So to start, let's just do the simple thing that's sufficient for
now and leave the magic for future work.

sage
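
To make the pg_autoscale_bias idea concrete, here is a minimal sketch of how
such a per-pool bias could fold into the size-proportional calculation sketched
earlier in the thread (it reuses the pick_pg_num sketch above; the 1.0 default
and the value of 10 for the cephfs metadata pool are assumptions, not a settled
interface):

# Hypothetical sketch. The bias would come from a per-pool property such as
# the proposed pg_autoscale_bias, with rgw/cephfs setting it on their
# index/metadata pools when those pools are created.
def pick_pg_num_biased(pool_bytes, total_bytes, rate, num_osds,
                       current_pg_num, bias=1.0, **kwargs):
    # The bias simply inflates the pool's apparent size before the same
    # size-proportional calculation runs.
    return pick_pg_num(pool_bytes * bias, total_bytes, rate, num_osds,
                       current_pg_num, **kwargs)

MiB, TiB = 1 << 20, 1 << 40
# With a 10x bias the 'optimum' for the cephfs metadata pool lands around
# 32 PGs, so the existing 64 is within the 3x band and the autoscaler
# leaves the pool alone instead of shrinking it to 4.
print(pick_pg_num_biased(62481 * MiB, 76 * TiB, rate=4.0, num_osds=240,
                         current_pg_num=64, bias=10))   # -> 64 (no change)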