On 3/19/19 1:17 PM, Patrick Donnelly wrote:
On Mon, Mar 18, 2019 at 11:21 PM Sage Weil <sweil@xxxxxxxxxx> wrote:
On Tue, 19 Mar 2019, Gregory Farnum wrote:
On Tue, Mar 19, 2019 at 7:07 AM Sage Weil <sweil@xxxxxxxxxx> wrote:
Hi Patrick, Casey, everyone,
The new PG autoscaler uses the 'USED' value you see in 'ceph df' or 'ceph
osd pool autoscale-status' to decide how many PGs out of the cluster total
each pool should get. Basically we have a target pg count per OSD, each
pool has some replication/EC multiplier, and data is proportionally
distributed among pools (or the admin has fed in those ratios based on
expected usage).
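
The proportional split described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the autoscaler module's actual code: the function name, the OSD count, and the target of 100 PGs per OSD are hypothetical, while the RATIO and RATE inputs correspond to the columns of the same names in 'ceph osd pool autoscale-status'.

```python
def ideal_pg_count(capacity_ratio, rate, osd_count, target_pgs_per_osd=100):
    """Size-proportional PG count for one pool, before rounding or clamping.

    capacity_ratio: the pool's share of raw capacity (RATIO column)
    rate: the pool's replication/EC multiplier (RATE column)
    osd_count, target_pgs_per_osd: hypothetical cluster parameters
    """
    # Total number of PG replicas the cluster should carry.
    pg_budget = osd_count * target_pgs_per_osd
    # Each PG of this pool consumes `rate` replicas from that budget.
    return capacity_ratio * pg_budget / rate


# Assumed cluster of 100 OSDs; RATIO/RATE values are those reported for
# the 'data' and 'metadata' pools on the lab cluster.
OSDS = 100
print(ideal_pg_count(0.5140, 3.0, OSDS))
print(ideal_pg_count(0.0006, 4.0, OSDS))
```

With these assumed parameters the data pool comes out at roughly 1713 PGs and the metadata pool at roughly 1.5, which is why a purely size-based split leaves a small but omap-heavy pool at the minimum.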
This is fine and good, except that on the lab cluster I see this:
sage@reesi001:~$ sudo ceph osd pool autoscale-status
POOL                          SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  PG_NUM  NEW PG_NUM  AUTOSCALE
device_health_metrics            0                3.0        431.3T  0.0000                     1              warn
default.rgw.buckets.non-ec       0                3.0        431.3T  0.0000                     8              warn
default.rgw.meta              1336                3.0        431.3T  0.0000                     8              warn
default.rgw.buckets.index        0                3.0        431.3T  0.0000                     8              warn
default.rgw.control              0                3.0        431.3T  0.0000                     8              warn
default.rgw.buckets.data    743.5G                3.0        431.3T  0.0050                    32              on
.rgw.root                     1113                3.0        431.3T  0.0000                     8              warn
djf_tmp                     879.1G                3.0        431.3T  0.0060                  4096          32  off
libvirt-pool                 2328M                3.0        431.3T  0.0000                  3000           4  off
data                        75679G                3.0        431.3T  0.5140                  4096              warn
default.rgw.log              7713k                3.0        431.3T  0.0000                     8              warn
metadata                    62481M                4.0        431.3T  0.0006                    64           4  off
Notice 'metadata' (for cephfs) is ~64 GB, but the autoscaler thinks it
should only have 4 PGs (the default minimum; it probably thinks less than
that). That's because it's 1/1000th the size of the data pool (75 TB).
But... I think collapsing all of that metadata into so few PGs and OSDs
will be bad for performance, and since omap is more expensive to recover
than data, those PGs will be more sticky.
My current thought is that we could have a configurable multiplier for omap
bytes when calculating the relative "size" of the pool, maybe default to
10x or something. In the above example, that would make metadata look
more like 1/100th the size of data, which would give it more like ~40 PGs.
(Actually the autoscaler always picks a power of 2, and only initiates a
change if we're off from the 'optimum' value by > 3x, so anything from 32
to 128 in this example would still result in no change and a happy
cluster.)
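
The arithmetic in the two paragraphs above can be sketched as below. This is a minimal illustration, not the autoscaler's real logic: the function names are hypothetical, the metadata pool is assumed to consist entirely of omap data, and the symmetric "more than 3x apart" test is an assumption about how the threshold is applied (the real module's rounding and boundary handling may differ).

```python
import math


def effective_size(stored_bytes, omap_bytes, omap_multiplier=10.0):
    """Pool size with omap bytes weighted more heavily than plain data."""
    return (stored_bytes - omap_bytes) + omap_multiplier * omap_bytes


def nearest_power_of_two(n):
    """Closest power of two to n (ties round up), never below 1."""
    if n <= 1:
        return 1
    lower = 2 ** math.floor(math.log2(n))
    upper = lower * 2
    return lower if (n - lower) < (upper - n) else upper


def needs_change(current_pg_num, ideal_pgs, threshold=3.0):
    """True when the rounded target and the current pg_num differ by
    more than `threshold` in either direction (assumed rule)."""
    target = nearest_power_of_two(ideal_pgs)
    ratio = max(target, current_pg_num) / min(target, current_pg_num)
    return ratio > threshold


MiB, GiB = 2**20, 2**30
data = 75679 * GiB
metadata = 62481 * MiB  # assumed to be entirely omap for this illustration

print(round(data / metadata))                            # unweighted size ratio
print(round(data / effective_size(metadata, metadata)))  # with a 10x omap weight
print(nearest_power_of_two(40), needs_change(64, 40), needs_change(4, 40))
```

Under these assumptions the data pool is about 1240 times the size of metadata unweighted and about 124 times with the 10x weight. An ideal of 40 rounds to 32, so a pool already at 64 PGs would be left alone while one at 4 PGs would be flagged for a change.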
I see the same thing on my home cluster:
POOL                     SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  PG_NUM  NEW PG_NUM  AUTOSCALE
ar_metadata            40307M                4.0        84769G  0.0019                    64           4  off
device_health_metrics  136.7M                3.0        84769G  0.0000                     4              on
ar_data                 6753G                1.5        52162G  0.1942                   512              on
foo                     1408M                3.0        84769G  0.0000                     4              on
ar_data_cache           2275G                4.0        84769G  0.1074                     4         128  on
I assume the rgw index pool will suffer from the same issue on big RGW
clusters, although we don't have much RGW data in the lab so I haven't
seen it.
WDYT?
sage
This would effectively be hacking an IOP bias in for omap-centric
pools, on the data size-versus-heat tradeoff we've always had to deal
with. A more generic solution would be for the balancer to explicitly
account for both data size and IO activity. Right?
Yeah, that's a better way to frame it!
Now, that is not necessarily a better solution for the
needs we have, but are we comfortable taking the quick hack instead?
Especially since the omap data is often on a different set of drives,
I'm not totally sure we need a size-based equalizer...
So, instead of a configurable omap multiplier, perhaps we
have a per-pool property that is an IOPS bias (e.g., 10x in this case). I
think this is a situation where we don't/can't automagically determine
that bias by measuring workload because workload and heat are ephemeral
while placement and rebalancing are hugely expensive. We wouldn't want to
adjust placement automatically.
How about pool property pg_autoscale_bias, and we have rgw and cephfs set
those automatically somewhere on the appropriate pools?
I'm wondering if the bias is really necessary if we can just set the
pg_num_min at file system metadata pool / rgw index pool creation (or
before turning on the autoscaler)? I would think that the difference
between e.g. 32 PGs versus 64 PGs will not be significant for a
metadata pool in terms of recovery or performance when we're looking
at only a hundred or so gigabytes of omap data. The difference between
4 PGs and 32 PGs *is* significant though. So, maybe setting a
reasonable min is enough?
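
To compare the two proposals, here is a small sketch of how a per-pool bias and a per-pool minimum might each shape the final PG count. The names pg_autoscale_bias and pg_num_min come from this thread; the function names, the order in which bias, rounding, and floor are applied, and the ideal values of 4 and 400 PGs are assumptions made purely for illustration.

```python
def round_to_power_of_two(n):
    """Closest power of two to n (ties round up), never below 1."""
    power = 1
    while power * 2 <= n:
        power *= 2
    # `power` is now the largest power of two not exceeding n (or 1).
    return power if (n - power) < (power * 2 - n) else power * 2


def final_pg_num(ideal_pgs, bias=1.0, pg_num_min=4):
    """Scale the size-based ideal by the pool's bias, round to a power
    of two, then apply the pool's floor (assumed ordering)."""
    return max(pg_num_min, round_to_power_of_two(ideal_pgs * bias))


# Hypothetical size-based ideals for a small and a larger metadata pool.
for ideal in (4.0, 400.0):
    print(ideal,
          final_pg_num(ideal),                 # neither knob set
          final_pg_num(ideal, bias=10.0),      # bias only
          final_pg_num(ideal, pg_num_min=32))  # floor only
```

For the small pool both knobs land on 32 PGs, so a floor alone would be enough. For the larger pool the floor no longer matters (512 PGs either way), while the bias keeps scaling the pool up (4096 PGs), which is the practical difference between the two approaches in this sketch.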
From a performance perspective, lock contention may play a significant
role with 32 PGs.
Mark