Hello Eugen,

Thanks for sharing the good news. Did you have to raise
mon_osd_nearfull_ratio temporarily?

Frédéric.

----- On 25 Apr 24, at 12:35, Eugen Block eblock@xxxxxx wrote:

> For those interested, just a short update: the split process is
> approaching its end, two days ago there were around 230 PGs left
> (the target is 4096 PGs). So far there have been no complaints, and no
> cluster impact was reported (the cluster load is quite moderate, but
> still sensitive). Every now and then a single OSD (not always the same
> one) reaches the 85% nearfull ratio, but that was expected since the
> first nearfull OSD was the root cause of this operation. I expect the
> balancer to kick in as soon as the backfill has completed or when
> there are fewer than 5% misplaced objects.
>
> Quoting Anthony D'Atri <anthony.datri@xxxxxxxxx>:
>
>> One can raise the ratios temporarily, but it's all too easy to forget
>> to reduce them later, or to think that it's okay to run all the time
>> with reduced headroom.
>>
>> Until a host blows up and you don't have enough space to recover into.
>>
>>> On Apr 12, 2024, at 05:01, Frédéric Nass
>>> <frederic.nass@xxxxxxxxxxxxxxxx> wrote:
>>>
>>> Oh, and yeah, considering "The fullest OSD is already at 85% usage",
>>> the best move for now would be to add new hardware/OSDs (to avoid
>>> hitting the backfillfull limit) before starting the PG splits, and
>>> to enable the upmap balancer before or after that depending on how
>>> well the PGs get rebalanced after adding the new OSDs.
>>>
>>> BTW, what Ceph version is this? You should make sure you're running
>>> v16.2.11+ or v17.2.4+ before splitting PGs to avoid this nasty bug:
>>> https://tracker.ceph.com/issues/53729
>>>
>>> Cheers,
>>> Frédéric.
>>>
>>> ----- On 12 Apr 24, at 10:41, Frédéric Nass
>>> frederic.nass@xxxxxxxxxxxxxxxx wrote:
>>>
>>>> Hello Eugen,
>>>>
>>>> Is this cluster using the WPQ or mClock scheduler? (cephadm shell
>>>> ceph daemon osd.0 config show | grep osd_op_queue)
>>>>
>>>> If WPQ, you might want to tune the osd_recovery_sleep* values, as
>>>> they do have a real impact on the recovery/backfilling speed. Just
>>>> lower osd_max_backfills to 1 before doing that.
>>>> If mClock, then you might want to use a specific mClock profile as
>>>> suggested by Gregory (osd_recovery_sleep* is not considered when
>>>> using mClock).
>>>>
>>>> Since each PG involves reads/writes from/to apparently 18 OSDs (!)
>>>> and this cluster only has 240, increasing osd_max_backfills to any
>>>> value higher than 2-3 will not help much with the
>>>> recovery/backfilling speed.
>>>>
>>>> Either way, you'll have to be patient. :-)
>>>>
>>>> Cheers,
>>>> Frédéric.
>>>>
>>>> ----- On 10 Apr 24, at 12:54, Eugen Block eblock@xxxxxx wrote:
>>>>
>>>>> Thank you for the input!
>>>>> We started the split with max_backfills = 1 and watched for a few
>>>>> minutes, then gradually increased it to 8. Now it's backfilling at
>>>>> around 180 MB/s, which is not really much, but since client impact
>>>>> has to be avoided if possible, we decided to let that run for a
>>>>> couple of hours and then reevaluate the situation and maybe
>>>>> increase the backfills a bit more.
>>>>>
>>>>> Thanks!
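
For reference, the checks and throttles discussed above boil down to a
handful of commands. A minimal sketch, assuming a cephadm deployment;
the sleep value and the profile name are examples, not settings taken
from this cluster:

  # Which op queue scheduler are the OSDs running?
  cephadm shell ceph daemon osd.0 config show | grep osd_op_queue

  # WPQ: keep backfill throttled, then raise it gradually while
  # watching client latency (0.1 s is only an example value)
  ceph config set osd osd_max_backfills 1
  ceph config set osd osd_recovery_sleep_hdd 0.1

  # mClock: pick a built-in profile instead, e.g.
  ceph config set osd osd_mclock_profile high_recovery_ops
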
>>>>>
>>>>> Quoting Gregory Orange <gregory.orange@xxxxxxxxxxxxx>:
>>>>>
>>>>>> We are in the middle of splitting 16k EC 8+3 PGs on 2600x 16TB
>>>>>> OSDs with NVMe RocksDB, used exclusively for RGWs, holding about
>>>>>> 60 billion objects. We are splitting for the same reason as you -
>>>>>> improved balance. We also thought long and hard before we began,
>>>>>> concerned about impact, stability etc.
>>>>>>
>>>>>> We set target_max_misplaced_ratio to 0.1% initially, so we could
>>>>>> retain some control and stop it again fairly quickly if we
>>>>>> weren't happy with the behaviour. It also serves to limit the
>>>>>> performance impact on the cluster, but unfortunately it also
>>>>>> makes the whole process slower.
>>>>>>
>>>>>> We now have the setting up to 1.5% and are seeing recovery of up
>>>>>> to 10 GB/s, with no issues on the cluster. We could go higher,
>>>>>> but we are not in a rush at this point. Sometimes the nearfull
>>>>>> OSD warnings build up and MAX AVAIL on the data pool in `ceph df`
>>>>>> gets low enough that we want to interrupt the splitting. So we
>>>>>> set pg_num to whatever the current value is (ceph osd pool ls
>>>>>> detail) and let it stabilise. The balancer then gets to work once
>>>>>> the misplaced objects drop below the ratio, and things balance
>>>>>> out. The nearfull OSDs usually drop to zero, and MAX AVAIL goes
>>>>>> up again.
>>>>>>
>>>>>> The above behaviour is because, while they share the same
>>>>>> threshold setting, the balancer only runs every minute and won't
>>>>>> run while misplaced objects are over the threshold. Meanwhile,
>>>>>> checks for the next PG to split happen much more frequently, so
>>>>>> the balancer never wins that race.
>>>>>>
>>>>>> We didn't know how long to expect it all to take, but decided
>>>>>> that any improvement in PG size was worth starting. We now
>>>>>> estimate it will take another 2-3 weeks to complete, for a total
>>>>>> of 4-5 weeks.
>>>>>>
>>>>>> We have lost a drive or two during the process, and of course
>>>>>> degraded objects went up and more backfilling work got going. We
>>>>>> paused splits for at least one of those, to make sure the
>>>>>> degraded objects were sorted out as quickly as possible. We can't
>>>>>> be sure it went any faster though - there's always a long tail on
>>>>>> that sort of thing.
>>>>>>
>>>>>> Inconsistent objects are found at least a couple of times a week,
>>>>>> and to get them repairing we disable scrubs, wait until they're
>>>>>> stopped, then set the repair going and re-enable scrubs. I don't
>>>>>> know if this is specific to the current higher splitting load,
>>>>>> but we haven't noticed it before.
>>>>>>
>>>>>> HTH,
>>>>>> Greg.
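
For reference, the routine Greg describes (capping misplaced data,
pausing the split, repairing inconsistent PGs) maps to a few commands.
A rough sketch; <pool> and <pgid> are placeholders, and 0.015 is just
the 1.5% value mentioned above:

  # Cap how much misplaced data the splitting may generate at once
  ceph config set mgr target_max_misplaced_ratio 0.015

  # Pause further splitting by pinning pg_num at its current value
  ceph osd pool ls detail                    # note the current pg_num
  ceph osd pool set <pool> pg_num <current_pg_num>

  # Repair an inconsistent PG with scrubs quiesced
  ceph osd set noscrub
  ceph osd set nodeep-scrub
  # ...wait for running scrubs to finish, then:
  ceph pg repair <pgid>
  ceph osd unset noscrub
  ceph osd unset nodeep-scrub
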
>>>>>>
>>>>>> On 10/4/24 14:42, Eugen Block wrote:
>>>>>>> Thank you, Janne.
>>>>>>> I believe the default 5% target_max_misplaced_ratio would work
>>>>>>> as well; we've had good experience with that in the past,
>>>>>>> without the autoscaler. I just haven't dealt with such large
>>>>>>> PGs. I've been warning them for two years (when the PGs were
>>>>>>> only about half this size) and now they finally started to
>>>>>>> listen. Well, they would still ignore it if it didn't impact all
>>>>>>> kinds of things now. ;-)
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Eugen
>>>>>>>
>>>>>>> Quoting Janne Johansson <icepic.dz@xxxxxxxxx>:
>>>>>>>
>>>>>>>> On Tue, 9 Apr 2024 at 10:39, Eugen Block <eblock@xxxxxx> wrote:
>>>>>>>>> I'm trying to estimate the possible impact when large PGs are
>>>>>>>>> split. Here's one example of such a PG:
>>>>>>>>>
>>>>>>>>> PG_STAT  OBJECTS  BYTES         OMAP_BYTES*  OMAP_KEYS*  LOG   DISK_LOG  UP
>>>>>>>>> 86.3ff   277708   414403098409  0            0           3092  3092      [187,166,122,226,171,234,177,163,155,34,81,239,101,13,117,8,57,111]
>>>>>>>>
>>>>>>>> If you ask for small increases of pg_num, it will only split
>>>>>>>> that many PGs at a time, so while there will be a lot of data
>>>>>>>> movement overall (50% because half of the data needs to go to a
>>>>>>>> newly made PG, and on top of that the PGs per OSD will change,
>>>>>>>> but the balancing can also now work better), it will not affect
>>>>>>>> the whole cluster at once if you increase pg_num by, say, 8 at
>>>>>>>> a time. As per the other reply, if you bump the number by a
>>>>>>>> small amount, wait for HEALTH_OK, then bump some more, it will
>>>>>>>> take a lot of calendar time but have a rather small impact. My
>>>>>>>> view of it is basically that this will be far less impactful
>>>>>>>> than losing a whole OSD, and hopefully your cluster can survive
>>>>>>>> that event, so it should be able to handle a slow trickle of PG
>>>>>>>> splits too.
>>>>>>>>
>>>>>>>> You can set a target number for the pool and let the autoscaler
>>>>>>>> run a few splits at a time; there are some settings that
>>>>>>>> control how aggressive the autoscaler will be, so it doesn't
>>>>>>>> have to be manual/scripted. That said, it's not very hard to
>>>>>>>> script it yourself if you are unsure about the amount of work
>>>>>>>> the autoscaler will start at any given time.
>>>>>>>>
>>>>>>>> --
>>>>>>>> May the most significant bit of your life be positive.
>>>>>>
>>>>>> --
>>>>>> Gregory Orange
>>>>>> System Administrator, Scientific Platforms Team
>>>>>> Pawsey Supercomputing Centre, CSIRO
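
For reference, a minimal sketch of the two approaches Janne describes.
The pool name "data", the current pg_num of 2048 and the +8 step are
assumptions for illustration; 4096 is the target mentioned earlier in
the thread:

  # Manual approach: bump pg_num in small steps, wait for HEALTH_OK,
  # then repeat (intermediate values that are not powers of two will
  # raise a health notice until the final value is reached)
  ceph osd pool set data pg_num 2056        # e.g. 2048 + 8
  ceph status                               # wait, then bump again

  # Or set the final target and let the mgr ramp pg_num up gradually,
  # throttled by target_max_misplaced_ratio (5% by default)
  ceph osd pool set data pg_num 4096
  ceph config get mgr target_max_misplaced_ratio
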