We are in the middle of splitting 16k EC 8+3 PGs on 2600x 16TB OSDs with
NVMe-backed RocksDB, used exclusively for RGW, holding about 60 billion
objects. We are splitting for the same reason as you - improved balance.
We also thought long and hard before we began, concerned about impact,
stability etc.
We set target_max_misplaced_ratio to 0.1% initially, so we could retain
some control and stop it again fairly quickly if we weren't happy with
the behaviour. It also serves to limit the performance impact on the
cluster, but unfortunately it also makes the whole process slower.
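For reference, the knob is a mgr option, so the 0.1% we started with was
just:

    ceph config set mgr target_max_misplaced_ratio 0.001
    ceph config get mgr target_max_misplaced_ratio   # confirm the current value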
We now have the setting up to 1.5% and are seeing recovery of up to
10GB/s, with no issues on the cluster. We could go higher, but we are not
in a rush at this point. Sometimes the nearfull OSD warnings pile up and
MAX AVAIL on the data pool in `ceph df` drops low enough that we want to
interrupt the splitting.
When that happens, we set pg_num to whatever the current value is (see
ceph osd pool ls detail) and let things stabilise. Once the misplaced
objects drop below the ratio, the balancer gets to work and things even
out: the nearfull OSDs usually drop to zero and MAX AVAIL goes up again.
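Roughly, the pause/resume looks like this - pool name, current value and
final target are placeholders, adjust to your own:

    POOL=default.rgw.buckets.data          # placeholder pool name

    # see where the split currently is (pg_num, pg_num_target, pgp_num)
    ceph osd pool ls detail | grep "$POOL"

    # pause: pin pg_num at its current value so no further splits are queued
    ceph osd pool set "$POOL" pg_num <current_pg_num>

    # resume: set pg_num back to the final target once things have balanced
    ceph osd pool set "$POOL" pg_num <final_pg_num>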
That behaviour comes about because, while they share the same threshold
setting, the balancer only runs every minute and it won't run while the
misplaced ratio is over the threshold. Meanwhile, checks for the next PG
to split happen much more frequently, so the balancer never wins that
race.
We didn't know how long to expect it all to take, but decided that any
improvement in PG size was worth starting. We now estimate it will take
another 2-3 weeks to complete, for a total of 4-5 weeks.
We have lost a drive or two during the process, and of course degraded
objects went up and more backfill work kicked off. We paused splits for
at least one of those, to make sure the degraded objects were sorted out
as quickly as possible. We can't be sure it actually went any faster
though - there's always a long tail on that sort of thing.
Inconsistent objects are being found at least a couple of times a week.
To get them repairing, we disable scrubs, wait until they have stopped,
then set the repair going and re-enable scrubs. I don't know whether this
is particular to the current higher splitting load, but we hadn't noticed
it before.
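For completeness, that repair sequence looks roughly like this (the pg id
is whatever ceph health detail reports as inconsistent):

    # stop new scrubs and wait for any running ones to finish
    ceph osd set noscrub
    ceph osd set nodeep-scrub

    # find the inconsistent PG(s) and optionally inspect the objects
    ceph health detail | grep -i inconsistent
    rados list-inconsistent-obj <pgid> --format=json-pretty

    # kick off the repair, then allow scrubbing again
    ceph pg repair <pgid>
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub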
HTH,
Greg.
On 10/4/24 14:42, Eugen Block wrote:
Thank you, Janne.
I believe the default 5% target_max_misplaced_ratio would work as well;
we've had good experience with that in the past, without the autoscaler.
I just haven't dealt with such large PGs before. I've been warning them
for two years (when the PGs were barely half this size) and now they have
finally started to listen. Well, they would probably still ignore it if
it weren't impacting all kinds of things now. ;-)
Thanks,
Eugen
Quoting Janne Johansson <icepic.dz@xxxxxxxxx>:
On Tue 9 Apr 2024 at 10:39, Eugen Block <eblock@xxxxxx> wrote:
I'm trying to estimate the possible impact when large PGs are
split. Here's one example of such a PG:
PG_STAT  OBJECTS  BYTES         OMAP_BYTES*  OMAP_KEYS*  LOG   DISK_LOG  UP
86.3ff   277708   414403098409  0            0           3092  3092      [187,166,122,226,171,234,177,163,155,34,81,239,101,13,117,8,57,111]
If you ask for small increases of pg_num, it will only split that many
PGs at a time. There will still be a lot of data movement overall (50%,
because half of the data has to move to the newly made PGs; on top of
that the PGs per OSD will change, but the balancing can also work better
afterwards), but it will not affect the whole cluster at once if you
increase by, say, 8 pg_nums at a time. As per the other reply, if you
bump the number by a small amount, wait for HEALTH_OK, then bump some
more, it will take a lot of calendar time but have a rather small impact.
My view of it is basically that this will be far less impactful than
losing a whole OSD, which your cluster hopefully can survive, so it
should be able to handle a slow trickle of PG splits too.
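A minimal sketch of that loop, with the pool name and step size as
placeholders:

    POOL=default.rgw.buckets.data    # placeholder pool name, adjust to yours
    STEP=8

    # bump pg_num by a small step
    CUR=$(ceph osd pool get "$POOL" pg_num | awk '{print $2}')
    ceph osd pool set "$POOL" pg_num $((CUR + STEP))

    # wait for the cluster to settle before the next bump
    while ! ceph health | grep -q HEALTH_OK; do sleep 60; done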
Alternatively, you can set a target number for the pool and let the
autoscaler run a few splits at a time; there are some settings to look at
that control how aggressive the autoscaler will be. So it doesn't have to
be manual/scripted, but it's not very hard to script either if you are
unsure about the amount of work the autoscaler will start at any given
time.
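Something along these lines, where the pool name, ratio and pg_num floor
are only placeholders:

    POOL=default.rgw.buckets.data    # placeholder pool name

    # let the autoscaler manage the pool
    ceph osd pool set "$POOL" pg_autoscale_mode on

    # hint at the desired end state, e.g. a size ratio or a pg_num floor
    ceph osd pool set "$POOL" target_size_ratio 0.9
    ceph osd pool set "$POOL" pg_num_min 32768

    # its rate of change is still bounded by the mgr's target_max_misplaced_ratio
    ceph osd pool autoscale-status    # shows what the autoscaler intends to do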
--
May the most significant bit of your life be positive.
--
Gregory Orange
System Administrator, Scientific Platforms Team
Pawsey Supercomputing Centre, CSIRO
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx