I am going to take a shot in the dark here, but it appears you have 3x4TB drives and they are probably SATA spinners - if that is the case, then ~150MB/s is considered normal for them. Just try to copy 1.3TB from drive to drive: at a 150MB/s peak transfer rate it takes around 7 seconds per GB, and you need to transfer 1300 of them. In the best-case scenario, going full speed from one drive to another, it will take about 2.5 hours, during which nothing else would be able to run (you would completely saturate your interfaces with recovery traffic). Clearly Ceph is not in the business of denying service while it deals with a failure, hence Ceph recovery is slower than a direct drive-to-drive copy. But even then it is not dramatically slower - definitely not "too much".
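To spell out the back-of-the-envelope arithmetic (assuming roughly 1000 GB per TB and a sustained 150 MB/s, i.e. the drive's rough sequential rate):

    1 GB / 150 MB/s  ≈ 6.7 s per GB
    1300 GB x 6.7 s  ≈ 8,700 s ≈ 2.4 hours

and that is the theoretical floor, with the drives doing nothing but the copy.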
Now, to your questions:
1. It is already tuned to keep a good balance between recovery and client I/O given the hardware available. In your scenario it is unlikely that anything else would go noticeably faster while still allowing clients to use the storage as if nothing had happened. You are simply hitting hardware limits.
2. Max backfills and max recovery would only be relevant if multiple OSDs were involved. Just think: good OSD 0 reads at 150MB/s, good OSD 1 reads at 150MB/s too, together delivering 300MB/s to failed OSD 2. Do you think OSD 2 will be able to ingest data at 300MB/s? Certainly not (provided they are the same class of hardware). So increasing max recovery in the case of a single-OSD failure does not make any sense - you will start pouring data from two good OSDs at once, but their combined rate will be capped by the write speed of the failed OSD. If, say, you have 4 OSDs and are performing a rebalance, then OSD 0 may send data to OSD 1 and OSD 2 may send data to OSD 3, with each stream limited by hardware to 150MB/s, so the total recovery speed would be 300MB/s. To put it bluntly: no increase in the number of simultaneous data transfers will make your HDD any faster. (For reference, the knobs themselves are sketched after point 3 below.)
3. Having a 10Gbit network is good, but if you have slow drives you will be limited by the speed of the slowest element in the path.
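For completeness, these are the knobs people usually point at - a minimal sketch, assuming a release recent enough to have the "ceph config" command (Nautilus or later); on a three-spinner cluster with a single recovering OSD I would not expect them to change much:

    # inspect the current throttles
    ceph config get osd osd_max_backfills
    ceph config get osd osd_recovery_max_active

    # raise them at runtime (no OSD restart needed)
    ceph config set osd osd_max_backfills 2
    ceph config set osd osd_recovery_max_active 4

    # HDD-backed OSDs are additionally throttled by a per-op sleep
    ceph config set osd osd_recovery_sleep_hdd 0

Remember to put the values back once recovery finishes, otherwise client latency will suffer during the next failure.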
As was already mentioned: if you need RAID, you need to look elsewhere. If you need a solution you can grow without hassle and downtime, with configurable durability for the different classes of data you keep, the ability to trade speed for capacity (and vice versa), and no need to ever worry about repartitioning and data corruption, then Ceph is your choice.
Regards,
Vladimir
On 28 October 2021 6:09:24 pm AEDT, Lokendra Rathour <lokendrarathour@xxxxxxxxx> wrote:
Hi,

We have been trying to test a scenario on Ceph with the following configuration:

cluster:
id: cc0ba1e4-68b9-4237-bc81-40b38455f713
health: HEALTH_OK
services:
mon: 3 daemons, quorum storagenode1,storagenode2,storagenode3 (age 4h)
mgr: storagenode2(active, since 22h), standbys: storagenode1, storagenode3
mds: cephfs:1 {0=storagenode1=up:active} 2 up:standby
osd: 3 osds: 3 up (since 4m), 3 in (since 4h)
rgw: 3 daemons active (storagenode1.rgw0, storagenode2.rgw0, storagenode3.rgw0)
task status:
scrub status:
mds.storagenode1: idle
data:
pools: 7 pools, 169 pgs
objects: 1.06M objects, 1.3 TiB
usage: 3.9 TiB used, 9.2 TiB / 13 TiB avail
pgs: 169 active+clean
io:
client: 43 KiB/s wr, 0 op/s rd, 3 op/s wr
recovery: 154 MiB/s, 98 objects/s
We have 10Gig network links for all the networks used in Ceph. MTU is configured as 9000. But the transfer rate, as can be seen above, is at most 154 MiB/s, which I feel is much lower than what should be possible.

Test case: We removed one node and added it back to the Ceph cluster after reinstalling the OS. During this activity, Ceph had around 1.3 TB to rebalance onto the newly added node. The time taken in this case was approximately 4 hours. Considering this as a production-grade setup with all production-grade infra, this time is too much.

Query:
- Is there a way to optimize the recovery/rebalancing and i/o rate of Ceph?
- we found a few suggestions on the internet that we can modify the below parameters to achieve a better rate, but is this advisable?
- osd max backfills, osd recovery max active, osd recovery max single start
- we have dedicated 10Gig network infra, so is there an ideal value to reach the maximum recovery rate?
Any input would be helpful, we are really blocked here.
--
~ Lokendra
skype: lokendrarathour
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx