Hi Lokendra,
Now it looks more like a good ceph deployment :)
The io: recovery line shows the TOTAL recovery rate at that very moment. At the moment 19 PGs (please familiarize yourself with the concept of a PG at https://docs.ceph.com/en/latest/rados/operations/placement-groups/ ) are backfilling - meaning there are 19 sets of OSDs actively exchanging data to recover. This process runs in parallel - so that's good!
The settings max backfills and recovery max active are per OSD - so with 15 OSDs you may have well over max backfills in progress. That's why you can see 19 backfills while your max backfills is set to 8: the per-OSD limits add up to far more than 19, so all 19 misplaced PGs simply move at once. I generally would not recommend increasing backfills in any form because they require CPU resources on each OSD, and CPU is a finite resource. Even with just the default max backfills of 1, 15 OSDs may still end up with 15 PGs moving.
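If you want to double-check what your OSDs are actually running with, something like the following should work on recent releases (Octopus/Pacific) - the first two read the config database, the third asks about a specific running OSD (osd.0 here is just an example):

ceph config get osd osd_max_backfills
ceph config get osd osd_recovery_max_active
ceph config show osd.0 osd_max_backfills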
In terms of changing settings on the fly: even though it says "change may require restart", these two settings are actually applied. Restarting an OSD will undo the temporary settings and the OSD will use whatever is persistent in the config. So if you elect to increase backfills for any reason (there is no good reason, actually), just leave them in place and when the OSD eventually restarts the settings will revert to normal.
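For completeness: on recent releases the persistent counterpart lives in the monitor config database, so a change meant to survive OSD restarts would look something like this (again, not recommended here; the values below are just illustrative):

ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 3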
And now I will reiterate: Ceph is NOT RAID. You simply cannot expect that you fail an OSD and suddenly every resource gets thrown at recovery so your rig starts to smoke. It just does not work this way. Please read https://docs.ceph.com/en/latest/dev/osd_internals/backfill_reservation/ . Ceph has a very finely tuned algorithm which deals with recoveries. Why plural? Because there is a huge difference between backfilling remapped PGs (i.e. the data is safe but not located where it is expected to be), backfilling degraded PGs (when a PG does not have all copies available but is still OK to use) and recovery (when the number of copies is critical). In your case it shows 19 active+remapped+backfilling - the data is safe and has all the expected copies, so there is simply no need to rush the move. If you start to increase recovery speed you will pay for it dearly in degraded performance - hardly something you want in production, especially when there is no benefit from a quick "recovery".
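If you want to see which of those categories your PGs actually fall into, ceph pg ls accepts a state filter, for example:

ceph pg ls backfilling
ceph pg ls degraded
ceph pg ls remapped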
I hazard a guess that your pools are replicated with the default size 3. That means losing one OSD will cause some PGs to go degraded, and recovery will start in 10 minutes if the OSD does not come back up. You can then expect recovery to run at a higher rate than you see during remapped+backfilling - you no longer have enough copies, so Ceph will put more resources into recovering from that situation. Then you lose another OSD, and some PGs (those which have both the first and the second failed OSD in their OSD set) will have just one copy left - obviously this situation cannot be tolerated, so Ceph will go into recovery (but only for the PGs which are at risk). It will throw a lot of resources at recovering those PGs as fast as it can, and you may notice performance degradation for users during this time. Again, which is better: users who complain about slow service or users who complain about lost data? The answer is clear: Ceph needs to attend to data recovery immediately. But at the same time, PGs which have lost only one OSD will not get the same treatment - they will be processed as degraded PGs at lower priority. PGs which merely need to move to other OSDs due to rebalancing but have all copies intact get even lower priority.
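To confirm those assumptions on your cluster: the pool replication size and the grace period before a down OSD is marked out (the "10 minutes" above comes from mon_osd_down_out_interval, 600 seconds by default) can be checked with something like this (the pool name is just an example):

ceph osd pool get cephfs_data size
ceph config get mon mon_osd_down_out_interval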
So the issue is that your expectations are oversimplified: you expect recovery to run at the maximum theoretical limit without any regard for the rest of the system. But you don't actually want to run your hardware at 100% of what it can achieve. You want to ensure that data durability is upheld, that your clients receive a good IO rate, and that recoveries take a back seat and do not interfere with the primary purpose of the storage system. So all you need to do is: a) stop changing these settings - they are there for a reason, tested and re-tested by real-life use; b) put some trust in the Ceph community and accept that its defaults are well tuned to provide good end-user performance and safety. End users really do not care about "speed of recovery". Neither should you.
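You can see that "back seat" in the default priorities themselves - client operations are weighted far above recovery operations. For example, on recent releases:

ceph config get osd osd_client_op_priority     # 63 by default
ceph config get osd osd_recovery_op_priority   # 3 by default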
Regards,
Vladimir
Hi Vladimir,
I have reconfigured the setup to 15 OSDs now.

Every 1.0s: sudo ceph -s                                Fri Oct 29 10:21:07 2021
  cluster:
    id:     1a8bfc8a-ad9d-4a06-9963-5e84e7ce80ee
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum storagenode1,storagenode2,storagenode3 (age 2h)
    mgr: storagenode3(active, since 16h), standbys: storagenode2, storagenode1
    mds: cephfs:1 {0=storagenode3=up:active} 2 up:standby
    osd: 15 osds: 15 up (since 5m), 15 in (since 16h); 19 remapped pgs
    rgw: 3 daemons active (storagenode1.rgw0, storagenode2.rgw0, storagenode3.rgw0)

  task status:
    scrub status:
      mds.storagenode3: idle

  data:
    pools:   7 pools, 265 pgs
    objects: 4.13M objects, 1.9 TiB
    usage:   6.1 TiB used, 7.0 TiB / 13 TiB avail
    pgs:     662670/12381873 objects misplaced (5.352%)
             246 active+clean
             19  active+remapped+backfilling
  io:
    recovery: 114 MiB/s, 173 objects/s

I see the recovery rate at around 140 MiB/s - is this per OSD or in total? From the message you sent I understood that it is per OSD.
Also, with the command "ceph tell 'osd.*' injectargs --osd-max-backfills=2 --osd-recovery-max-active=6" I do not see much visible difference. Do we have to restart the OSD service? Because after running this command I see:

[ansible@storagenode1 ~]$ sudo ceph tell 'osd.*' injectargs --osd-max-backfills=8 --osd-recovery-max-active=12
osd.0: osd_recovery_max_active = '12' (not observed, change may require restart)
osd.0: {}
osd.1: osd_recovery_max_active = '12' (not observed, change may require restart)
osd.1: {}
osd.2: osd_recovery_max_active = '12' (not observed, change may require restart)
osd.2: {}
osd.3: osd_recovery_max_active = '12' (not observed, change may require restart)
osd.3: {}
osd.4: osd_recovery_max_active = '12' (not observed, change may require restart)
osd.4: {}
osd.5: osd_recovery_max_active = '12' (not observed, change may require restart)
osd.5: {}
osd.6: osd_recovery_max_active = '12' (not observed, change may require restart)
osd.6: {}
osd.7: osd_recovery_max_active = '12' (not observed, change may require restart)
osd.7: {}
osd.8: osd_recovery_max_active = '12' (not observed, change may require restart)
osd.8: {}
osd.9: osd_recovery_max_active = '12' (not observed, change may require restart)
osd.9: {}
osd.10: osd_recovery_max_active = '12' (not observed, change may require restart)
osd.10: {}
osd.11: osd_recovery_max_active = '12' (not observed, change may require restart)
osd.11: {}
osd.12: osd_recovery_max_active = '12' (not observed, change may require restart)
osd.12: {}
osd.13: osd_recovery_max_active = '12' (not observed, change may require restart)
osd.13: {}
osd.14: osd_recovery_max_active = '12' (not observed, change may require restart)
osd.14: {}

It says "change may require restart", but even after a restart I see no impact on the recovery rate.
Thanks,
Lokendra
On Thu, Oct 28, 2021 at 1:53 PM Vladimir Bashkirtsev <vladimir@xxxxxxxxxxxxxxx> wrote:
1. You can do:
ceph tell 'osd.*' injectargs --osd-max-backfills=2 --osd-recovery-max-active=6
This will change these settings on the fly, but they will be reset on OSD restart (each OSD will receive the setting and remember it until its own restart - so you may have OSDs running with different settings).
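To verify what a given OSD actually picked up after injectargs, you can ask it directly over its admin socket on the host where it runs, for example:

ceph daemon osd.0 config get osd_max_backfills
ceph daemon osd.0 config get osd_recovery_max_active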
2. Nothing to do with threads: it is the scenario which I covered in my previous response. If you have more than 3 OSDs, OSDs can pair up for data transfers, so (theoretically) a 10 OSD cluster can have 5 pairs transferring data in parallel at 150MB/s, achieving a total recovery speed of 750MB/s.
Regards,
Vladimir
On 28 October 2021 7:11:31 pm AEDT, Lokendra Rathour <lokendrarathour@xxxxxxxxx> wrote:
Hey Johansson,
Thanks for the update here. Two things in line with your response:
- For now, I am able to change these values via ceph.conf and a restart of the OSD service; are there any runtime commands to do this as well? I am using the Ceph Pacific or Octopus version installed using ceph-ansible.
- What do you mean by "allow more parallelism"? Are you referring to modifying threads with the "osd recovery threads" config? Please help elaborate.
Thanks once again for your help.
On Thu, Oct 28, 2021 at 1:05 PM Janne Johansson <icepic.dz@xxxxxxxxx> wrote:
On Thu 28 Oct 2021 at 09:09, Lokendra Rathour <lokendrarathour@xxxxxxxxx> wrote:
Hi,
We have been trying to test a scenario on Ceph with the following configuration:

cluster:
id: cc0ba1e4-68b9-4237-bc81-40b38455f713
health: HEALTH_OK
services:
mon: 3 daemons, quorum storagenode1,storagenode2,storagenode3 (age 4h)
mgr: storagenode2(active, since 22h), standbys: storagenode1, storagenode3
mds: cephfs:1 {0=storagenode1=up:active} 2 up:standby
osd: 3 osds: 3 up (since 4m), 3 in (since 4h)
rgw: 3 daemons active (storagenode1.rgw0, storagenode2.rgw0, storagenode3.rgw0)
task status:
scrub status:
mds.storagenode1: idle
data:
pools: 7 pools, 169 pgs
objects: 1.06M objects, 1.3 TiB
usage: 3.9 TiB used, 9.2 TiB / 13 TiB avail
pgs: 169 active+clean
io:
client: 43 KiB/s wr, 0 op/s rd, 3 op/s wr
recovery: 154 MiB/s, 98 objects/s
We have 10 Gig network links for all the networks used in Ceph, and the MTU is configured as 9000. But the transfer rate, as can be seen above, is at most 154 MiB/s, which I feel is much lower than what should be possible.
Test Case: We removed one node and added it back to the Ceph cluster after reinstalling the OS. During this activity, Ceph had around 1.3 TB to rebalance onto the newly added node. The time taken in this case was approximately 4 hours.
Considering this is a production-grade setup with all production-grade infrastructure, this time is too long.
Query:
- Is there a way to optimize the recovery/rebalancing and I/O rate of Ceph?
- We found a few suggestions on the internet that we can modify the parameters below to achieve a better rate, but is this advisable?
- osd max backfills, osd recovery max active, osd recovery max single start
- We have dedicated 10 Gig network infrastructure, so is there an ideal value that would reach the maximum rate of recovery?
Any input would be helpful, we are really blocked here.
If this is one spinning drive receiving data, then those figures look ok. If you instead had a large cluster with more drives, the sum of the recovery traffic would be higher if you allow more parallelism. Looking at osd_max_backfills to see how many parallel backfills you will allow, and reading posts and guides on recovery tuning, might also help.
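As a rough sanity check of the numbers above: 1.3 TiB moved in about 4 hours works out to roughly 1.3 * 1024 * 1024 MiB / 14400 s, i.e. about 95 MiB/s sustained, which is in the same ballpark as the ~150 MiB/s peak shown in ceph -s and about what a single spinning drive can be expected to absorb.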
May the most significant bit of your life be positive.
--
~ Lokendra