Re: Impact of large PG splits

Hello Eugen,

Is this cluster using the WPQ or the mClock scheduler? (cephadm shell ceph daemon osd.0 config show | grep osd_op_queue)

If WPQ, you might want to tune the osd_recovery_sleep* values, as they have a real impact on recovery/backfilling speed. Just lower osd_max_backfills to 1 before doing that.
If mClock, then you might want to use a specific mClock profile as suggested by Gregory (the osd_recovery_sleep* values are not considered when mClock is in use).
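
For example, roughly (the sleep values below are only illustrative, adjust to your cluster):

  # check which scheduler the OSDs are running
  cephadm shell ceph daemon osd.0 config show | grep osd_op_queue

  # WPQ: keep backfills low, then lower the recovery sleeps
  ceph config set osd osd_max_backfills 1
  ceph config set osd osd_recovery_sleep_hdd 0.05
  ceph config set osd osd_recovery_sleep_ssd 0

  # mClock: pick a recovery-oriented profile instead
  ceph config set osd osd_mclock_profile high_recovery_ops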

Since each PG apparently involves reads/writes from/to 18 OSDs (!) and this cluster only has 240, increasing osd_max_backfills to any value higher than 2-3 will not help much with the recovery/backfilling speed.

Either way, you'll have to be patient. :-)

Cheers,
Frédéric.

----- On 10 Apr 24, at 12:54, Eugen Block eblock@xxxxxx wrote:

> Thank you for the input!
> We started the split with max_backfills = 1 and watched for a few
> minutes, then gradually increased it to 8. Now it's backfilling at
> around 180 MB/s, which isn't much, but since client impact has to be
> avoided if possible, we decided to let that run for a couple of hours.
> Then we'll reevaluate the situation and maybe increase the backfills a
> bit more.
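> 
> (In case it helps anyone, raising it was just a matter of something like
> 
>    ceph config set osd osd_max_backfills 8
> 
> in small steps, while watching the recovery/backfill rate in ceph -s.)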
> 
> Thanks!
> 
> Quoting Gregory Orange <gregory.orange@xxxxxxxxxxxxx>:
> 
>> We are in the middle of splitting 16k EC 8+3 PGs on 2600x 16TB OSDs
>> with NVME RocksDB, used exclusively for RGWs, holding about 60b
>> objects. We are splitting for the same reason as you - improved
>> balance. We also thought long and hard before we began, concerned
>> about impact, stability etc.
>>
>> We set target_max_misplaced_ratio to 0.1% initially, so we could
>> retain some control and stop it again fairly quickly if we weren't
>> happy with the behaviour. It also serves to limit the performance
>> impact on the cluster, but unfortunately it also makes the whole
>> process slower.
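>>
>> For reference, that ratio is an mgr option, so 0.1% is roughly:
>>
>>   ceph config set mgr target_max_misplaced_ratio 0.001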
>>
>> We now have the setting up to 1.5% and are seeing recovery of up to
>> 10GB/s, with no issues on the cluster. We could go higher, but are not
>> in a rush at this point. Sometimes the nearfull osd warnings pile up
>> and MAX AVAIL on the data pool in `ceph df` gets low enough that we
>> want to interrupt the splitting. So we set pg_num to whatever the
>> current value is (ceph osd pool ls detail) and let it stabilise. Then
>> the balancer gets to work once the misplaced objects drop below the
>> ratio, and things balance out. The nearfull osds usually drop to zero,
>> and MAX AVAIL goes up again.
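>>
>> In command form that interruption is roughly (the pool name is a
>> placeholder):
>>
>>   ceph osd pool ls detail | grep <poolname>      # note the current pg_num
>>   ceph osd pool set <poolname> pg_num <current_pg_num>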
>>
>> The above behaviour is because while they share the same threshold
>> setting, the autoscaler only runs every minute, and it won't run
>> when misplaced objects are over the threshold. Meanwhile, checks for the
>> next PG to split happen much more frequently, so the balancer never
>> wins that race.
>>
>>
>> We didn't know how long to expect it all to take, but decided that
>> any improvement in PG size was worth starting. We now estimate it
>> will take another 2-3 weeks to complete, for a total of 4-5 weeks.
>>
>> We have lost a drive or two during the process, and of course
>> degraded objects went up, and more backfilling work got going. We
>> paused splits for at least one of those, to make sure the degraded
>> objects were sorted out as quickly as possible. We can't be sure it
>> went any faster though - there's always a long tail on that sort of
>> thing.
>>
>> Inconsistent objects are found at least a couple of times a week,
>> and to get them repairing we disable scrubs, wait until they're
>> stopped, then set the repair going and re-enable scrubs. I don't know
>> if this is specific to the current higher splitting load, but we
>> haven't noticed it before.
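>>
>> The repair dance is roughly (the pg id is a placeholder):
>>
>>   ceph osd set noscrub
>>   ceph osd set nodeep-scrub
>>   # wait for the running scrubs to stop, then:
>>   ceph pg repair <pgid>
>>   ceph osd unset noscrub
>>   ceph osd unset nodeep-scrub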
>>
>> HTH,
>> Greg.
>>
>>
>> On 10/4/24 14:42, Eugen Block wrote:
>>> Thank you, Janne.
>>> I believe the default 5% target_max_misplaced_ratio would work as
>>> well; we've had good experience with that in the past, without the
>>> autoscaler. I just haven't dealt with such large PGs. I've been
>>> warning them for two years (back when the PGs were only about half
>>> this size) and now they have finally started to listen. Well, they
>>> would still ignore it if it didn't impact all kinds of things now. ;-)
>>>
>>> Thanks,
>>> Eugen
>>>
>>> Quoting Janne Johansson <icepic.dz@xxxxxxxxx>:
>>>
>>>> On Tue 9 Apr 2024 at 10:39, Eugen Block <eblock@xxxxxx> wrote:
>>>>> I'm trying to estimate the possible impact when large PGs are
>>>>> split. Here's one example of such a PG:
>>>>>
>>>>> PG_STAT  OBJECTS  BYTES         OMAP_BYTES*  OMAP_KEYS*  LOG   DISK_LOG  UP
>>>>> 86.3ff   277708   414403098409  0            0           3092  3092      [187,166,122,226,171,234,177,163,155,34,81,239,101,13,117,8,57,111]
>>>>
>>>> If you ask for small increases of pg_num, it will only split that many
>>>> PGs at a time, so while there will be a lot of data movement (50%,
>>>> because half of the data needs to go to a newly made PG, and on top of
>>>> that the PGs per OSD will change, but the balancing can also now work
>>>> better), it will not affect the whole cluster if you increase pg_num
>>>> by, say, 8 at a time. As per the other reply, if you bump the number
>>>> by a small amount, wait for HEALTH_OK, then bump some more, it will
>>>> take a lot of calendar time but have a rather small impact. My view of
>>>> it is basically that this will be far less impactful than losing a
>>>> whole OSD, and hopefully your cluster can survive such an event, so it
>>>> should be able to handle a slow trickle of PG splits too.
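>>>>
>>>> In other words, something like (pool name and target are placeholders):
>>>>
>>>>   ceph osd pool set <poolname> pg_num <current_pg_num + 8>
>>>>   # wait for HEALTH_OK, then repeat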
>>>>
>>>> You can also set a target number for the pool and let the autoscaler
>>>> run a few splits at a time. There are some settings that control how
>>>> aggressive the autoscaler will be, so it doesn't have to be
>>>> manual/scripted, but it's not very hard to script it if you are unsure
>>>> about the amount of work the autoscaler will start at any given time.
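>>>>
>>>> With the autoscaler that would look something like (placeholders again;
>>>> target_max_misplaced_ratio is one of the knobs that limits how much it
>>>> moves at once):
>>>>
>>>>   ceph osd pool set <poolname> pg_autoscale_mode on
>>>>   ceph osd pool set <poolname> pg_num <target_pg_num>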
>>>>
>>>>
>>>>
>>>> --
>>>> May the most significant bit of your life be positive.
>>>
>>>
>>
>> --
>> Gregory Orange
>>
>> System Administrator, Scientific Platforms Team
>> Pawsey Supercomputing Centre, CSIRO
> 
> 
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



