Re: pg autoscaler vs omap pools

On Tue, Mar 19, 2019 at 10:17 PM Sage Weil <sweil@xxxxxxxxxx> wrote:
>
> On Tue, 19 Mar 2019, Patrick Donnelly wrote:
> > > > This would effectively be hacking an IOP bias in for omap-centric
> > > > pools, on the data size-versus-heat tradeoff we've always had to deal
> > > > with. A more generic solution would be for the balancer to explicitly
> > > > account for both data size and IO activity. Right?
> > >
> > > Yeah, that's a better way to frame it!
> > >
> > > > Now, this is not necessarily a better solution for the needs we
> > > > have, but are we comfortable taking the quick hack instead?
> > > > Especially since the omap data is often on a different set of drives,
> > > > I'm not totally sure we need a size-based equalizer...
> > >
> > > So, instead of a configurable omap multiplier, perhaps we could have a
> > > per-pool property that is an IOPS bias (e.g., 10x in this case).  I
> > > think this is a situation where we don't/can't automagically determine
> > > that bias by measuring workload, because workload and heat are ephemeral
> > > while placement and rebalancing are hugely expensive.  We wouldn't want
> > > to adjust placement automatically.
> > >
> > > How about pool property pg_autoscale_bias, and we have rgw and cephfs set
> > > those automatically somewhere on the appropriate pools?
> >
> > I'm wondering if the bias is really necessary if we can just set
> > pg_num_min at file system metadata pool / rgw index pool creation (or
> > before turning on the autoscaler)? I would think that the difference
> > between, e.g., 32 and 64 PGs will not be significant for a metadata
> > pool in terms of recovery or performance when we're looking at only a
> > hundred or so gigabytes of omap data. The difference between 4 PGs and
> > 32 PGs *is* significant, though. So, maybe setting a reasonable min is
> > enough?
>
> Do you mean a min of 32 for *any* pool?  That would be a problem for pools
> like device_health_metrics.

No, just omap-heavy pools.
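
For illustration, a sketch of what that would look like, assuming the
pg_num_min pool property lands as discussed (the pool names here are
just examples):

    import subprocess

    # Sketch only: floor the PG count on the omap-heavy pools before
    # the autoscaler is enabled; pg_num_min is the per-pool property
    # discussed above, set via the regular pool-set command.
    for pool in ("cephfs_metadata", "default.rgw.buckets.index"):
        subprocess.check_call(
            ["ceph", "osd", "pool", "set", pool, "pg_num_min", "32"])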

> If we set a PG min on omap pools like rgw index and cephfs metadata, then
> it's the same amount of "work" as setting a multiplier for those pools.
> But I think you might be right that a min of 32 may make more sense
> in those cases since we don't tend to have a zillion of them and we want
> good distribution out of the gate when they are empty.
>
> I'm somewhat inclined to still have a multiplier, though, so that they
> also continue to scale up when they get big...

The only question is whether it's really necessary for the multiplier
to cause a small omap-heavy pool to go from, e.g., 32 to 64 PGs after
reaching the next byte threshold. Would going from 32 to 64 PGs have
some real benefit? If so, then the multiplier should be chosen to
balance that benefit against the cost of adding more PGs to a small pool.
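
To make that tradeoff concrete, here is a rough sketch (not the real
mgr module; the function and parameter names are made up for
illustration) of an autoscaler-style target: the pool's share of used
capacity times a cluster-wide PG budget, scaled by a per-pool bias,
floored at pg_num_min, and rounded up to a power of two:

    import math

    def suggested_pg_num(pool_bytes, total_bytes, osd_count,
                         bias=1.0, pg_num_min=4, target_pgs_per_osd=100):
        # Sketch only: capacity share * PG budget * bias, floored at
        # pg_num_min, rounded up to a power of two.
        budget = osd_count * target_pgs_per_osd
        raw = (pool_bytes / total_bytes) * budget * bias if total_bytes else 0
        raw = max(raw, pg_num_min)
        return 1 << math.ceil(math.log2(raw))

    # ~100 GB of omap in a 1 PB cluster with 100 OSDs: the min
    # dominates and the bias changes nothing.
    print(suggested_pg_num(100e9, 1e15, 100, bias=1.0, pg_num_min=32))   # 32
    print(suggested_pg_num(100e9, 1e15, 100, bias=10.0, pg_num_min=32))  # 32

    # ~10 TB of omap in the same cluster: the bias is what pushes the
    # pool past the floor (128 -> 1024).
    print(suggested_pg_num(10e12, 1e15, 100, bias=1.0, pg_num_min=32))   # 128
    print(suggested_pg_num(10e12, 1e15, 100, bias=10.0, pg_num_min=32))  # 1024

In this toy model, with a reasonable min in place the bias only changes
the outcome once the pool holds a lot of omap data, i.e., the scale-up
case described above.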

-- 
Patrick Donnelly


