Re: pg autoscaler vs omap pools

On 3/19/19 1:17 PM, Patrick Donnelly wrote:
On Mon, Mar 18, 2019 at 11:21 PM Sage Weil <sweil@xxxxxxxxxx> wrote:

On Tue, 19 Mar 2019, Gregory Farnum wrote:
On Tue, Mar 19, 2019 at 7:07 AM Sage Weil <sweil@xxxxxxxxxx> wrote:

Hi Patrick, Casey, everyone,

The new PG autoscaler uses the 'USED' value you see in 'ceph df' or 'ceph
osd pool autoscale-status' to decide how many PGs out of the cluster total
each pool should get.  Basically, we have a target PG count per OSD, each
pool has some replication/EC multiplier, and data is distributed
proportionally among pools (or the admin has fed in those ratios based on
expected usage).
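
In rough pseudo-Python (purely illustrative; the names and the rounding
rule are simplified and this is not the actual mgr/pg_autoscaler code):

  import math

  def suggested_pg_num(pool_bytes, total_bytes, rate, osd_count,
                       target_pgs_per_osd=100, pg_num_min=4):
      # the pool's share of the cluster's data
      ratio = pool_bytes / total_bytes
      # cluster-wide PG budget, scaled by the pool's share and divided
      # by its replication/EC multiplier ('rate')
      raw = ratio * osd_count * target_pgs_per_osd / rate
      # round to a power of two, with a floor at the minimum
      return max(pg_num_min, 2 ** round(math.log2(max(raw, 1))))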

This is fine and good, except that I see on the lab cluster this:

sage@reesi001:~$ sudo ceph osd pool autoscale-status
  POOL                          SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  PG_NUM  NEW PG_NUM  AUTOSCALE
  device_health_metrics           0                 3.0        431.3T  0.0000                     1              warn
  default.rgw.buckets.non-ec      0                 3.0        431.3T  0.0000                     8              warn
  default.rgw.meta             1336                 3.0        431.3T  0.0000                     8              warn
  default.rgw.buckets.index       0                 3.0        431.3T  0.0000                     8              warn
  default.rgw.control             0                 3.0        431.3T  0.0000                     8              warn
  default.rgw.buckets.data    743.5G                3.0        431.3T  0.0050                    32              on
  .rgw.root                    1113                 3.0        431.3T  0.0000                     8              warn
  djf_tmp                     879.1G                3.0        431.3T  0.0060                  4096          32  off
  libvirt-pool                 2328M                3.0        431.3T  0.0000                  3000           4  off
  data                        75679G                3.0        431.3T  0.5140                  4096              warn
  default.rgw.log              7713k                3.0        431.3T  0.0000                     8              warn
  metadata                    62481M                4.0        431.3T  0.0006                    64           4  off

Notice 'metadata' (for cephfs) is ~64 GB, but the autoscaler thinks it
should have only 4 PGs (the default minimum; it probably computes even
fewer).  That's because it's 1/1000th the size of the data pool (75 TB).

But... I think collapsing all of that metadata into so few PGs and OSDs
will be bad for performance, and since omap is more expensive to recover
than data, those PGs will be stickier.

My current thought is that we could have a configurable multiplier for omap
bytes when calculating the relative "size" of the pool, maybe defaulting to
10x or something.  In the above example, that would make metadata look
more like 1/100th the size of data, which would give it more like ~40 PGs.
(Actually the autoscaler always picks a power of 2, and only initiates a
change if we're off from the 'optimum' value by more than 3x, so anything
from 32 to 128 in this example would still result in no change and a happy
cluster.)
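
As a sketch of that threshold check (again illustrative, not the actual
autoscaler code; assumes both inputs are >= 1):

  def would_adjust(current_pg_num, ideal_pg_num, threshold=3.0):
      # only propose a change when current is off from ideal by more
      # than 3x in either direction
      ratio = (max(current_pg_num, ideal_pg_num) /
               min(current_pg_num, ideal_pg_num))
      return ratio > threshold

So would_adjust(4, 40) is True, while would_adjust(64, 40) is False.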

I see the same thing on my home cluster:

  POOL                     SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  PG_NUM  NEW PG_NUM  AUTOSCALE
  ar_metadata            40307M                4.0        84769G  0.0019                    64           4  off
  device_health_metrics  136.7M                3.0        84769G  0.0000                     4              on
  ar_data                 6753G                1.5        52162G  0.1942                   512              on
  foo                     1408M                3.0        84769G  0.0000                     4              on
  ar_data_cache           2275G                4.0        84769G  0.1074                     4         128  on

I assume the rgw index pool will suffer from the same issue on big RGW
clusters, although we don't have much RGW data in the lab so I haven't
seen it.

WDYT?
sage

This would effectively be hacking in an IOPS bias for omap-centric
pools, on the data-size-versus-heat tradeoff we've always had to deal
with. A more generic solution would be for the balancer to explicitly
account for both data size and IO activity. Right?

Yeah, that's a better way to frame it!

Now, this is not necessarily a better solution for the
needs we have, but are we comfortable taking the quick hack instead?
Especially since the omap data is often on a different set of drives,
I'm not totally sure we need a size-based equalizer...

So, instead of a configurable omap multiplier, perhaps we have a
per-pool property that is an IOPS bias (e.g., 10x in this case).  I
think this is a situation where we can't automagically determine that
bias by measuring workload, because workload and heat are ephemeral
while placement and rebalancing are hugely expensive.  We wouldn't want
to adjust placement automatically.

How about a pool property pg_autoscale_bias, and we have rgw and cephfs
set it automatically on the appropriate pools?
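
If we went that route, usage might look something like this (hypothetical
until the property actually exists; the pool name is just an example):

  ceph osd pool set cephfs_metadata pg_autoscale_bias 10

with the bias simply multiplying the pool's usage when the autoscaler
computes its share.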

I'm wondering if the bias is really necessary if we can just set
pg_num_min at file system metadata pool / rgw index pool creation (or
before turning on the autoscaler)?  I would think that the difference
between, e.g., 32 and 64 PGs will not be significant for a
metadata pool in terms of recovery or performance when we're looking
at only a hundred or so gigabytes of omap data.  The difference between
4 PGs and 32 PGs *is* significant, though.  So maybe setting a
reasonable min is enough?
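
For example (assuming pg_num_min is settable as an ordinary pool
property before the autoscaler is turned on; pool name is illustrative):

  ceph osd pool set cephfs_metadata pg_num_min 32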


From a performance perspective, lock contention may play a significant role with 32 PGs.

Mark


