Re: Impact of large PG splits

Eugen Block <eblock@xxxxxx> · Fri, 12 Apr 2024 12:57:46 +0000

Thanks for chiming in.
They are on version 16.2.13 (I was already made aware of the bug you
mentioned, thanks!) with wpq.
So far I haven't received an emergency call, so I assume everything is
calm (I hope). New hardware has been ordered, but it will take a couple
of weeks until it's delivered, installed and integrated; that's why we
decided to take action now.
I'll update the thread when I know more.

Thanks again!
Eugen

Quoting Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx>:

Oh, and yeah, considering "The fullest OSD is already at 85% usage",
the best move for now would be to add new hardware/OSDs first (to avoid
reaching the backfill too full limit) and only then start splitting
PGs, either before or after enabling the upmap balancer, depending on
how well the PGs got rebalanced after adding the new OSDs.

BTW, what ceph version is this? You should make sure you're running  
v16.2.11+ or v17.2.4+ before splitting PGs to avoid this nasty bug:  
https://tracker.ceph.com/issues/53729

Cheers,
Frédéric.

----- On 12 Apr 24, at 10:41, Frédéric Nass frederic.nass@xxxxxxxxxxxxxxxx wrote:

Hello Eugen,

Is this cluster using the WPQ or the mClock scheduler?
(cephadm shell ceph daemon osd.0 config show | grep osd_op_queue)

If WPQ, you might want to tune the osd_recovery_sleep* values as they do
have a real impact on the recovery/backfilling speed. Just lower
osd_max_backfills to 1 before doing that.
If mClock scheduler, then you might want to use a specific mClock profile
as suggested by Gregory (as osd_recovery_sleep* are not considered when
using mClock).
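
A minimal sketch of that decision in Python, for illustration only. It builds the `ceph config set` command strings rather than running them. The function name, the sleep value of 0.05 and the choice of the `high_recovery_ops` profile are assumptions made for the example, not recommendations from this thread; check the option names against your release before using them.

```python
def recovery_tuning_commands(osd_op_queue, recovery_sleep_hdd=0.05,
                             mclock_profile="high_recovery_ops"):
    """Return ceph CLI commands to tune recovery for the given scheduler.

    recovery_sleep_hdd and mclock_profile are hypothetical example values.
    """
    if osd_op_queue == "wpq":
        return [
            # Lower backfills first, then adjust the recovery sleep.
            "ceph config set osd osd_max_backfills 1",
            f"ceph config set osd osd_recovery_sleep_hdd {recovery_sleep_hdd}",
        ]
    if osd_op_queue == "mclock_scheduler":
        # osd_recovery_sleep* is not considered under mClock; use a profile.
        return [f"ceph config set osd osd_mclock_profile {mclock_profile}"]
    raise ValueError(f"unexpected osd_op_queue value: {osd_op_queue!r}")


for cmd in recovery_tuning_commands("wpq"):
    print(cmd)
```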

Since each PG involves reads/writes from/to apparently 18 OSDs (!) and this
cluster only has 240, increasing osd_max_backfills to any value higher than
2-3 will not help much with the recovery/backfilling speed.

Either way, you'll have to be patient. :-)

Cheers,
Frédéric.

----- On 10 Apr 24, at 12:54, Eugen Block eblock@xxxxxx wrote:

Thank you for your input!
We started the split with max_backfills = 1 and watched for a few
minutes, then gradually increased it to 8. Now it's backfilling at
around 180 MB/s, which is not much, but since client impact has to be
avoided if possible, we decided to let that run for a couple of hours.
Then we'll reevaluate the situation and maybe increase the backfills a
bit more.

Thanks!

Quoting Gregory Orange <gregory.orange@xxxxxxxxxxxxx>:

We are in the middle of splitting 16k EC 8+3 PGs on 2600x 16TB OSDs
with NVMe RocksDB, used exclusively for RGWs, holding about 60 billion
objects. We are splitting for the same reason as you - improved
balance. We also thought long and hard before we began, concerned
about impact, stability etc.

We set target_max_misplaced_ratio to 0.1% initially, so we could
retain some control and stop it again fairly quickly if we weren't
happy with the behaviour. It also serves to limit the performance
impact on the cluster, but unfortunately it also makes the whole
process slower.
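
For illustration, a small Python sketch of turning a percentage like the 0.1% above into the setting. `target_max_misplaced_ratio` is a mgr option expressed as a fraction; the helper name is made up for this example, and it only builds the command string.

```python
def misplaced_ratio_command(percent):
    """Build the mgr setting for a misplaced limit given in percent."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    # The option takes a fraction, so 0.1% becomes 0.001.
    return f"ceph config set mgr target_max_misplaced_ratio {percent / 100:g}"


print(misplaced_ratio_command(0.1))
print(misplaced_ratio_command(1.5))
```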

We now have the setting up to 1.5%, seeing recovery up to 10GB/s. No
issues with the cluster. We could go higher, but are not in a rush
at this point. Sometimes nearfull osd warnings get high and MAX
AVAIL on the data pool in `ceph df` gets low enough that we want to
interrupt it. So, we set pg_num to whatever the current value is
(ceph osd pool ls detail), and let it stabilise. Then the balancer
gets to work once the misplaced objects drop below the ratio, and
things balance out. Nearfull osds drop usually to zero, and MAX
AVAIL goes up again.
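
The "set pg_num to whatever the current value is" step could be sketched like this in Python. The sample JSON, the pool name `data` and the numbers are hypothetical, and the field names (`pool_name`, `pg_num`) are assumed to match the JSON output of `ceph osd pool ls detail`; verify them on a real cluster first.

```python
import json

# Hypothetical excerpt of `ceph osd pool ls detail --format json`;
# pool name, numbers and field names are assumptions for this example.
SAMPLE = '[{"pool_name": "data", "pg_num": 17408, "pg_num_target": 32768}]'


def pause_split_command(pool_detail_json, pool_name):
    """Build the command that pins pg_num at its current value."""
    for pool in json.loads(pool_detail_json):
        if pool["pool_name"] == pool_name:
            return f"ceph osd pool set {pool_name} pg_num {pool['pg_num']}"
    raise KeyError(f"pool not found: {pool_name}")


print(pause_split_command(SAMPLE, "data"))
```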

The above behaviour is because while they share the same threshold
setting, the balancer only runs every minute, and it won't run
when misplaced objects are over the threshold. Meanwhile, checks for
the next PG to split happen much more frequently, so the balancer never
wins that race.

We didn't know how long to expect it all to take, but decided that
any improvement in PG size was worth starting. We now estimate it
will take another 2-3 weeks to complete, for a total of 4-5 weeks.

We have lost a drive or two during the process, and of course
degraded objects went up, and more backfilling work got going. We
paused splits for at least one of those, to make sure the degraded
objects were sorted out as quickly as possible. We can't be sure it
went any faster though - there's always a long tail on that sort of
thing.

Inconsistent objects are found at least a couple of times a week,
and to get them repairing we disable scrubs, wait until they're
stopped, then set the repair going and reenable scrubs. I don't know
if this is special to the current higher splitting load, but we
haven't noticed it before.
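
As a rough sketch of that sequence in Python: the function only returns the commands in order, the waiting step is left to the operator, and the PG id is a placeholder. The `noscrub` and `nodeep-scrub` flags and `ceph pg repair` are standard Ceph commands; whether both flags were used here is an assumption.

```python
def repair_commands(pgid):
    """Ordered ceph CLI commands for repairing one inconsistent PG."""
    return [
        "ceph osd set noscrub",
        "ceph osd set nodeep-scrub",
        # Before the next step, wait until running scrubs have stopped.
        f"ceph pg repair {pgid}",
        "ceph osd unset noscrub",
        "ceph osd unset nodeep-scrub",
    ]


# "1.2a" is a placeholder PG id, not one from this cluster.
for cmd in repair_commands("1.2a"):
    print(cmd)
```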

HTH,
Greg.

On 10/4/24 14:42, Eugen Block wrote:
Thank you, Janne.
I believe the default 5% target_max_misplaced_ratio would work as
well, we've had good experience with that in the past, without the
autoscaler. I just haven't dealt with such large PGs; I've been
warning them for two years (when the PGs were only about half this
size) and now they have finally started to listen. Well, they would
still ignore it if it didn't impact all kinds of things now. ;-)

Thanks,
Eugen

Quoting Janne Johansson <icepic.dz@xxxxxxxxx>:

On Tue, 9 Apr 2024 at 10:39, Eugen Block <eblock@xxxxxx> wrote:
I'm trying to estimate the possible impact when large PGs are
split. Here's one example of such a PG:

PG_STAT  OBJECTS  BYTES         OMAP_BYTES*  OMAP_KEYS*  LOG   DISK_LOG  UP
86.3ff   277708   414403098409  0            0           3092  3092      [187,166,122,226,171,234,177,163,155,34,81,239,101,13,117,8,57,111]

If you ask for small increases of pg_num, it will only split that many
PGs at a time. There will still be a lot of data movement: about 50% of
each split PG's data has to go to the newly made PG, and on top of that
the number of PGs per OSD changes (though the balancing can now work
better). But it will not affect the whole cluster if you increase by,
say, 8 pg_nums at a time. As per the other reply, if you bump the number
by a small amount, wait for HEALTH_OK, then bump it some more, it will
take a lot of calendar time but have a rather small impact. My view is
basically that this will be far less impactful than losing a whole OSD,
and hopefully your cluster can survive that event, so it should be able
to handle a slow trickle of PG splits too.
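
A back-of-the-envelope version of that estimate in Python, using the example PG's BYTES value and the roughly 180 MB/s backfill rate reported elsewhere in this thread. It assumes every split PG is about as large as the example one and that half of its data moves; it ignores EC overhead and the extra movement from rebalancing, so treat the result as an order of magnitude only.

```python
def split_batch_estimate(pg_bytes, pgs_split, backfill_bytes_per_s,
                         moved_fraction=0.5):
    """Return (bytes moved, seconds) for one batch of PG splits.

    moved_fraction=0.5 reflects the "half of the data" rule of thumb.
    """
    moved = pg_bytes * pgs_split * moved_fraction
    return moved, moved / backfill_bytes_per_s


# 8 splits of a PG like 86.3ff at an assumed steady 180 MB/s.
moved, seconds = split_batch_estimate(414403098409, 8, 180e6)
print(f"{moved / 1e12:.2f} TB moved, about {seconds / 3600:.1f} hours")
```

With these inputs a batch of 8 splits comes out at a bit under 1.7 TB of logical data and somewhere around two and a half hours of backfill.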

You can set a target number for the pool and let the autoscaler run a
few splits at a time. There are some settings to look at for how
aggressive the autoscaler will be, so it doesn't have to be
manual/scripted, but it's not very hard to script it if you are unsure
about the amount of work the autoscaler will start at any given time.
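
A minimal sketch of such a script in Python, assuming a step of 8 as mentioned above. It only computes the successive pg_num values and prints the commands; the pool name `mypool` and the start and target numbers are placeholders, and waiting for HEALTH_OK between steps is left out.

```python
def pg_num_steps(current, target, step=8):
    """Yield successive pg_num values from current up to target."""
    if step <= 0:
        raise ValueError("step must be positive")
    value = current
    while value < target:
        # Never overshoot the target on the last step.
        value = min(value + step, target)
        yield value


# Placeholder pool name and numbers, for illustration only.
for pg_num in pg_num_steps(1024, 1048, step=8):
    print(f"ceph osd pool set mypool pg_num {pg_num}")
    # ...then wait for HEALTH_OK before the next bump.
```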

--
May the most significant bit of your life be positive.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx

--
Gregory Orange

System Administrator, Scientific Platforms Team
Pawsey Supercomputing Centre, CSIRO
