I am going to take a shot in the dark here, but it appears you have 3x4TB drives and they are probably SATA spinners - if that is the case, then ~150MB/s is considered normal for them. Just try to copy 1.3TB from drive to drive: at a 150MB/s peak transfer rate it takes around 7 seconds per GB, and you need to transfer 1300 of them. In the best-case scenario, going full speed from one drive to another, it will take about 2.5 hours, during which nothing else would be able to run (you would completely saturate your interfaces with recovery traffic). Clearly Ceph is not in the business of denying service while it deals with a failure, hence Ceph recovery is slower than a direct drive-to-drive copy. But even then it is not dramatically slower - definitely not "too much".
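To spell out the back-of-the-envelope arithmetic (assuming roughly 1000 GB per TB and a sustained 150 MB/s, i.e. the drive's rough sequential rate):

    1 GB / 150 MB/s  ≈ 6.7 s per GB
    1300 GB x 6.7 s  ≈ 8,700 s ≈ 2.4 hours

and that is the theoretical floor, with the drives doing nothing but the copy.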
Now, to your questions:
1. It is already tuned to keep a good balance between recovery and client I/O given the hardware available. In your scenario it is unlikely that anything else would go noticeably faster while still allowing clients to use the storage as if nothing had happened. You are simply hitting hardware limits.
2. Max backfills and max recovery would only be relevant if multiple OSDs were involved. Just think: good OSD 0 reads at 150MB/s, good OSD 1 reads at 150MB/s too, together delivering 300MB/s to failed OSD 2. Do you think OSD 2 will be able to ingest data at 300MB/s? Certainly not (provided they are the same class of hardware). So increasing max recovery in the case of a single-OSD failure does not make any sense - you will start pouring data from two good OSDs at once, but their combined rate will be capped by the write speed of the failed OSD. If, say, you have 4 OSDs and are performing a rebalance, then OSD 0 may send data to OSD 1 and OSD 2 may send data to OSD 3, with each stream limited by hardware to 150MB/s, so the total recovery speed would be 300MB/s. To put it bluntly: no increase in the number of simultaneous data transfers will make your HDD any faster. (For reference, the knobs themselves are sketched after point 3 below.)
3. Having a 10Gbit network is good, but if you have slow drives you will be limited by the speed of the slowest element in the path.
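For completeness, these are the knobs people usually point at - a minimal sketch, assuming a release recent enough to have the "ceph config" command (Nautilus or later); on a three-spinner cluster with a single recovering OSD I would not expect them to change much:

    # inspect the current throttles
    ceph config get osd osd_max_backfills
    ceph config get osd osd_recovery_max_active

    # raise them at runtime (no OSD restart needed)
    ceph config set osd osd_max_backfills 2
    ceph config set osd osd_recovery_max_active 4

    # HDD-backed OSDs are additionally throttled by a per-op sleep
    ceph config set osd osd_recovery_sleep_hdd 0

Remember to put the values back once recovery finishes, otherwise client latency will suffer during the next failure.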
As was already mentioned: if you need RAID, you need to look elsewhere. If you need a solution you can grow without hassle and downtime, with configurable durability for the different classes of data you keep, the ability to trade speed for capacity (and vice versa), and no need to ever worry about repartitioning and data corruption, then Ceph is your choice.
Regards,
Vladimir
On 28 October 2021 6:09:24 pm AEDT, Lokendra Rathour <lokendrarathour@xxxxxxxxx> wrote:
Hi,

We have been trying to test a scenario on Ceph with the following configuration:

cluster:
id: cc0ba1e4-68b9-4237-bc81-40b38455f713
health: HEALTH_OK
services:
mon: 3 daemons, quorum storagenode1,storagenode2,storagenode3 (age 4h)
mgr: storagenode2(active, since 22h), standbys: storagenode1, storagenode3
mds: cephfs:1 {0=storagenode1=up:active} 2 up:standby
osd: 3 osds: 3 up (since 4m), 3 in (since 4h)
rgw: 3 daemons active (storagenode1.rgw0, storagenode2.rgw0, storagenode3.rgw0)
task status:
scrub status:
mds.storagenode1: idle
data:
pools: 7 pools, 169 pgs
objects: 1.06M objects, 1.3 TiB
usage: 3.9 TiB used, 9.2 TiB / 13 TiB avail
pgs: 169 active+clean
io:
client: 43 KiB/s wr, 0 op/s rd, 3 op/s wr
recovery: 154 MiB/s, 98 objects/s
We have 10Gig network links for all the networks used in Ceph. MTU is configured as 9000. But the transfer rate, as can be seen above, is at most 154 MiB/s, which I feel is much lower than what should be possible.

Test case: We removed one node and added it back to the Ceph cluster after reinstalling the OS. During this activity, Ceph had around 1.3 TB to rebalance onto the newly added node. The time taken in this case was approximately 4 hours. Considering this as a production-grade setup with all production-grade infra, this time is too much.

Query:
- Is there a way to optimize the recovery/rebalancing and i/o rate of Ceph?
- we found a few suggestions on the internet that we can modify the below parameters to achieve a better rate, but is this advisable?
- osd max backfills, osd recovery max active, osd recovery max single start
- we have dedicated 10Gig network infra, so is there an ideal value to reach the maximum recovery rate?
Any input would be helpful, we are really blocked here.
--
~ Lokendra
skype: lokendrarathour
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx