Re: [Ceph] Recovery is very Slow

Hi Vladimir,
were you able to check my response? 
Thanks once again,
-Lokendra


On Fri, Nov 12, 2021 at 6:21 PM Lokendra Rathour <lokendrarathour@xxxxxxxxx> wrote:
Hi Vladimir,
please find the response inline. Apologies for too many questions and thank you for your detailed support.

On Fri, Nov 12, 2021 at 5:34 PM Vladimir Bashkirtsev <vladimir@xxxxxxxxxxxxxxx> wrote:

Hi Lokendra,


1. Would you please confirm: this host has 16 CPU cores? If it is the case it appears to be maxed out.

[Loke] : 48 Cores per Node.  

2. Which failure domain is used for the pool which keeps your data?

[Loke] : it is host
[root@storagenode1 ~]# sudo ceph osd crush rule dump
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "ruleset": 0,
        "type": 1,
        "min_size": 1,
        "max_size": 10,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    }
]

Something tells me it is OSD. Having 5 OSDs on the same host may mean that your recovery happens within the host - there is simply no need to transfer data for recovery across the network (the physical network, that is) and data is transferred over the local IP stack - iperf is not going to show it at all, as this data does not go through the physical interface of the host.

[Loke]: I think recovery is happening across nodes, as the 5 OSDs on storagenode1 were wiped and reintroduced, and the CRUSH rule also states that.

If you want to have a fair assessment then you need to have replicated pool with size 3 and failure domain host: this way copies of data will reside on different hosts, not just on different OSDs. Therefore during recovery the only way to get the lost copy recovered would be actually copying it across the network - and it will push your bandwidth use up (but don't expect 100% use of the network - it is way too fast for your hosts). On top recovery would involve other hosts and thus recovery CPU load will be shared amongst them (however CPU of the host containing failed OSD still will be a bottleneck).

[Loke]: We have a replicated pool with size 3 and failure domain host. We are expecting at least 50% link utilization instead of the 10% we see now; 10% is low considering our infrastructure.

3. At step three you have just 10 OSDs. Feels like you did not remove OSDs but shutdown whole host - 5 OSDs are out. However mons are still active - I assume that host actually is still up. Did you shut down 5 OSDs on a single host? It appears to be the case. You have two hosts exchanging data and both hosts now have OSDs which should receive data - thus both hosts should have their CPUs used and you should have decent recovery speed (225MB/s). Then apparently you start 5 OSDs and only these OSDs now need to receive data - leaving you with just one host doing recovery heavy lifting. So no surprise that recovery speed is halved. You have 3 hosts but only one actually is busy with recovery.

[Loke] We are actually killing the 5 OSDs on one host and then bringing them back into the cluster. MON/MGR/MDS are not touched during this activity. The initial recovery (225MB/s) was when the 5 OSDs were already out. Ceph tries to maintain the copies during a host failure if it finds spare OSDs to copy to.

4. Tweaking these settings up when your CPU is already maxed out will only make things worse (speed wise it will not do much damage, but it will require extra RAM, risking an OSD crash). It recovers with more PGs doing recovery simultaneously. But as I said: CPU is a finite resource and there is no difference between having 4 PGs each recovering at 25% of your CPU capacity and having 25 PGs each recovering at 4% of your CPU capacity. An increase in the number of PGs performing recovery will just slow the rate at which PGs complete recovery and will increase RAM use.


As I stated before: "Even with fastest available hardware you are probably not going to max out your network anyway - you will max out your CPU first." I cannot understand why you are still looking at data speed on your network interfaces when your hosts are just UNABLE to saturate them while the CPU is used to transfer data.

[Loke] - From the top output I see that we still have enough resources before maxing out, though we will revalidate this part again.


So to answer your questions:

1. Shall we preassume that the recovery rate is not impacted by changing these variables?

No, recovery rate IS impacted by these settings. However it appears you already were going at max rate before you started to tweak them. Changing these settings up will not magically get you extra CPUs running in your hosts - you will need to spend some $$$ for that.

[Loke]: As stated we do have enough CPU even after tweaking the values. and in the default state(i.e without tweaking) at the recovery time, we see CPU at around 15% or something.

2. We also understand that recovery is something that we should not worry about, but as we try to understand more on the recovery side of the Ceph, so what else can be tweaked to modify and increase the recovery rate. ?

You need to understand how Ceph is architected, in particular the concept of PG. Each PG has a set of OSDs it will use to store the data. In case of replicated size 3 it will have only 3 OSDs. It means that only 3 out of 15 OSDs will be involved in recovery of that PG. If it happens that these OSDs belong to the same host then you will see three things: 1. Data will not be traveling through the external network interface. 2. Your host CPU will be doing work for both receiving and sending OSDs - naturally it will be slower. 3. If your SSDs are connected to the same controller then of course data transfer will be slower because the controller will do more work as well.


[Loke]: This is very much true. Regarding point 3, we do have a setup with 5 OSDs / 1 MON / 1 MGR / 1 MDS on each host, so is that a problem or not recommended for production?
 

From what I see in your message, the ONLY tweak which will get your scenario going faster is more CPU cores in your hosts.


Regards,

Vladimir

On 12/11/21 22:04, Lokendra Rathour wrote:
Hi Vladimir,
Thanks again for the help and support.
 We have checked by tweaking the values as suggested. With respect to the same please note:
  1. Before starting the activity we first validated (using iperf) the max bandwidth possible with our n/w, and that was approx 10GiG (as tested).
  2. The normal load that we see because of Ceph is around 10% of CPU.
  3. Now as we go ahead and only remove OSD's Ceph health becomes as
    1. #### after the OSD is removed the ceph health##

      Every 2.0s: sudo ceph -s                                                                              storagenode1: Fri Nov 12 09:42:22 2021

        cluster:
          id:     51f15167-f3fa-4e04-b468-b11abc9838a6
          health: HEALTH_WARN
                  Degraded data redundancy: 724889/2821107 objects degraded (25.695%), 69 pgs degraded, 105 pgs undersized

        services:
          mon: 3 daemons, quorum storagenode1,storagenode2,storagenode3 (age 4h)
          mgr: storagenode3(active, since 15h), standbys: storagenode1, storagenode2
          mds: cephfs:1 {0=storagenode2=up:active} 2 up:standby
          osd: 10 osds: 10 up (since 3m), 10 in (since 4m); 78 remapped pgs
          rgw: 3 daemons active (storagenode1.rgw0, storagenode2.rgw0, storagenode3.rgw0)

        task status:
          scrub status:
              mds.storagenode2: idle

        data:
          pools:   7 pools, 169 pgs
          objects: 940.37k objects, 447 GiB
          usage:   943 GiB used, 7.8 TiB / 8.7 TiB avail
          pgs:     724889/2821107 objects degraded (25.695%)
                   476383/2821107 objects misplaced (16.886%)
                   59 active+clean+remapped
                   55 active+undersized+degraded
                   36 active+undersized
                   10 active+undersized+degraded+remapped+backfill_wait
                   5  active+clean
                   4  active+undersized+degraded+remapped+backfilling

        io:
          recovery: 225 MiB/s, 316 objects/s

    2. checking the load at the port at this point on other two nodes shows approx: 1.07 GBit/s
    3. Soon after adding back the OSD's the recovery starts as below:
      1. #after adding all nodes:
        Every 2.0s: sudo ceph -s                                                                              storagenode1: Fri Nov 12 10:05:42 2021

          cluster:
            id:     51f15167-f3fa-4e04-b468-b11abc9838a6
            health: HEALTH_WARN
                    Degraded data redundancy: 420477/2821107 objects degraded (14.905%), 22 pgs degraded, 22 pgs undersized

          services:
            mon: 3 daemons, quorum storagenode1,storagenode2,storagenode3 (age 4h)
            mgr: storagenode3(active, since 15h), standbys: storagenode1, storagenode2
            mds: cephfs:1 {0=storagenode2=up:active} 2 up:standby
            osd: 15 osds: 15 up (since 13m), 15 in (since 13m); 42 remapped pgs
            rgw: 3 daemons active (storagenode1.rgw0, storagenode2.rgw0, storagenode3.rgw0)

          task status:
            scrub status:
                mds.storagenode2: idle

          data:
            pools:   7 pools, 169 pgs
            objects: 940.37k objects, 447 GiB
            usage:   1.1 TiB used, 12 TiB / 13 TiB avail
            pgs:     420477/2821107 objects degraded (14.905%)
                     234438/2821107 objects misplaced (8.310%)
                     127 active+clean
                     20  active+remapped+backfill_wait
                     18  active+undersized+degraded+remapped+backfill_wait
                     4   active+undersized+degraded+remapped+backfilling

          io:
            recovery: 122 MiB/s, 170 objects/s

    4. Checking the rate above: 122 MiB/s x 8 = 976 Mibit/s, approx 1 Gbit/s, which is 10% of the capacity.
    5. Now as you suggested we have tried tweaking the parameters, but to my surprise, we did not see any change in the recovery rate; also the CPU became approx 500%.
      1. Parameters we changed:
        1. sudo ceph tell 'osd.*' injectargs  --osd-client-op-priority=1 --osd-recovery-op-priority=63 --osd-max-backfills=12 --osd-recovery-max-active=12 --osd-recovery-max-active-ssd=40 --osd-recovery-max-single-start=12 --osd-recovery-priority=10
      2. check the top load for the CPU
        MiB Swap:   4096.0 total,   4096.0 free,      0.0 used. 111654.3 avail Mem

            PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
         167848 ceph      20   0 2328160   1.2g  31800 S 341.5   1.0  20:52.75 ceph-osd
         166793 ceph      20   0 2139452   1.1g  32000 S 329.7   0.9  30:44.87 ceph-osd
         167321 ceph      20   0 2562100   1.5g  32436 S 229.1   1.2  24:36.25 ceph-osd
         168937 ceph      20   0 2469276   1.3g  32592 S 211.9   1.0  23:30.17 ceph-osd
         168388 ceph      20   0 2130008   1.1g  32312 S 202.0   0.9  27:38.68 ceph-osd
          72757 ceph      20   0 1536232 990284  27032 S 141.5   0.8  53:35.05 ceph-mon
      3. Recovery Rate:
        1. Every 2.0s: sudo ceph -s                                                                              storagenode1: Fri Nov 12 10:22:13 2021

            cluster:
              id:     51f15167-f3fa-4e04-b468-b11abc9838a6
              health: HEALTH_WARN
                      Degraded data redundancy: 272034/2821107 objects degraded (9.643%), 18 pgs degraded, 18 pgs undersized

            services:
              mon: 3 daemons, quorum storagenode1,storagenode2,storagenode3 (age 4h)
              mgr: storagenode3(active, since 16h), standbys: storagenode1, storagenode2
              mds: cephfs:1 {0=storagenode2=up:active} 2 up:standby
              osd: 15 osds: 15 up (since 2m), 15 in (since 29m); 25 remapped pgs
              rgw: 3 daemons active (storagenode1.rgw0, storagenode2.rgw0, storagenode3.rgw0)

            task status:
              scrub status:
                  mds.storagenode2: idle

            data:
              pools:   7 pools, 169 pgs
              objects: 940.37k objects, 447 GiB
              usage:   1.2 TiB used, 12 TiB / 13 TiB avail
              pgs:     272034/2821107 objects degraded (9.643%)
                       79826/2821107 objects misplaced (2.830%)
                       144 active+clean
                       18  active+undersized+degraded+remapped+backfilling
                       7   active+remapped+backfilling

            io:
              recovery: 125 MiB/s, 75 keys/s, 214 objects/s
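As a cross-check of the conversion in step 4, here is a minimal Python sketch (not from the original thread; the function names are illustrative) that turns the recovery rates reported by ceph -s into Gbit/s and into a share of a 10 Gbit/s link. It assumes MiB means 2^20 bytes and Gbit means 10^9 bits, and it ignores protocol overhead.

```python
MIB_BYTES = 1024 * 1024  # ceph -s reports binary mebibytes


def mib_s_to_gbit_s(rate_mib_s: float) -> float:
    """Convert a rate in MiB/s to decimal Gbit/s."""
    return rate_mib_s * MIB_BYTES * 8 / 1e9


def link_share(rate_mib_s: float, link_gbit_s: float = 10.0) -> float:
    """Fraction of the link consumed, ignoring protocol overhead."""
    return mib_s_to_gbit_s(rate_mib_s) / link_gbit_s


if __name__ == "__main__":
    # Rates taken from the three ceph -s snapshots above.
    for rate in (225, 122, 125):
        print(f"{rate} MiB/s = {mib_s_to_gbit_s(rate):.2f} Gbit/s "
              f"({link_share(rate):.1%} of a 10 Gbit/s link)")
```

Keep in mind that the ceph -s figure is a cluster-wide total, so it is not necessarily the traffic seen on any single interface.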
Questions:
  • Shall we preassume that the recovery rate is not impacted by changing these variables? 
  • We also understand that recovery is something that we should not worry about, but as we try to understand more on the recovery side of the Ceph, so what else can be tweaked to modify and increase the recovery rate. ?
  • Any conclusion on the observation as shared above?
Thanks once again for the support. 

-Lokendra


On Wed, Nov 3, 2021 at 8:49 AM Vladimir Bashkirtsev <vladimir@xxxxxxxxxxxxxxx> wrote:

Hi Lokendra,


In the list of operations you are about to undertake I fail to see "wiping out of Bluestore volume". If you keep this volume (which I believe is something you should do) then when you reinstall your OS and bring back your services, ceph-volume will regenerate your /var/lib/ceph/osd folder with data from the Bluestore volume and it will continue on normally. So it is really irrelevant how you are going to upgrade/update your system: as long as Bluestore remains intact (and there is no reason to destroy it) you will be able to set noout, do whatever you need with your system, restart it, unset noout and continue on without data backfill/rebalance (just catch-up recovery to perform). You only need backfill/recovery if your hard drive has failed and you need to replace it with a new one - and of course such failures happen without any regard to maintenance windows, and thus you will need to commence it ASAP when the new drive is in but still serve your clients during recovery. Rebalancing should not bother you at all - just let it run when it is needed - your data is safe anyway and there is no need to burn plenty of clock cycles for nothing.

Well... I see that you want to max out your system against all good advice. You are now warned that it is bad idea and you should not do it. So here we go:

1. You need to keep in mind that Ceph does not write data once: it does it twice. Once to WAL (write ahead log), once to actual persistent storage plus an additional record to DB. It is done for safety reasons in case you get a power failure during a write. So it means that your max write speed to BlueStore is just under half of what your SSD can do.

2. I strongly recommend you to use ceph tell 'osd.*' injectargs command to change these settings - this way they will be used temporarily and will revert back to normal on OSD restart. Also please note that you may easily exceed OSD host capabilities by turning these settings way up - it may run out of memory and will get OOMed. OSD will restart but will not be using the settings which have brought it down. So setting it up temporarily will act as your safety net.

3. Also please note that Ceph needs CPU for recovery (especially if you have any erasure coded pools or have any compression turned on). As I understand you have 5 OSDs running on a single host: every running OSD will need CPU and you may find that your host just does not cope with the computational load. So when you change settings to speed up recovery and it still does not run as fast as you think it should, please check top and iotop to see if your CPU or SSD is a bottleneck. To put it in numbers: a 10GBit network is capable of delivering about 1 gigabyte of data per second. That's not a small amount of data by anyone's standards. And while it is possible to transfer such an amount of data at such speed using DMA, when the CPU gets involved you cannot expect that sort of performance even from the fastest available CPUs.

4. You also should look at https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/ . Basically you are looking at increasing osd_op_num_shards_ssd (this will increase the number of processing queues), decreasing osd_client_op_priority to 3 and increasing osd_recovery_op_priority to 63 (it will not have any effect if there is no client io present - as you said you will do it while the client is not accessing the data), and increasing osd_max_backfills, osd_recovery_max_active, osd_recovery_max_active_ssd (if osd_recovery_max_active is higher than osd_recovery_max_active_ssd), osd_recovery_max_single_start (to allow more recovery operations) and osd_recovery_priority. Start to increase these settings and keep an eye on CPU/memory/IO of your host (the one which has the recovering OSD). Even with fastest available hardware you are probably not going to max out your network anyway - you will max out your CPU first.


Regards,

Vladimir

On 2/11/21 19:56, Lokendra Rathour wrote:
Hi Vladimir,
Thanks once again for the information. Just one update in my statement. we are not upgrading the Centos OS by any DNF upgrade command, instead, we would reinstall the centos 8 after removing the CentOS 7. A node that would be upgraded this way would have 1 MON, 1 MDS, 1 MGR, 5 OSD, which would eventually be deleted on shrinking. ( we are using ceph-ansible to deploy/scale ceph). 
So we follow ( Using Ceph-Ansible)
  • Shrinking/removing MON, MGR, MDS, OSDs
  • reinstall OS
  • add all services back MON, MGR, MDS, OSDs
Another Query:
we have 10Gig link and we have verified that from the hardware side we have enough bandwidth, so is there a way to configure in Ceph to use max possible bandwidth? In our case, this hardware is dedicated to Ceph only so assume around 10GiG Capacity usage is what we wish to use for all communication?
please advise.
thanks,
Lokendra

On Mon, Nov 1, 2021 at 7:38 PM Vladimir Bashkirtsev <vladimir@xxxxxxxxxxxxxxx> wrote:

Hi Lokendra,


First of all: Ceph is designed to allow production use even when it is undergoing recovery activities. It basically throws away the concept of a planned maintenance window. Ceph is an exabyte capable system: with such size you just cannot expect to have any sort of maintenance windows. So my suggestion is quite simple: don't worry about recoveries, allow Ceph to manage them at default settings - you will find that your customer will not notice any recovery/rebalance happening.


Second: I guess it is worth to note that Ceph also can do erasure coding which may allow you good redundancy levels while actually using way less of disk space. Put it simply: with your 15 OSDs of 1TB each with replication of size 3 you may store 5TB of data while your system may survive simultaneous loss of 2 OSDs (disks). Or you can have K8M2 erasure pool which would allow you to store 12TB of data, all while having ability to survive a loss of 2 OSDs (disks). As you can see you can store considerably more data with the same safety margin. However trade-off is that writing will be much slower and reading (generally) will be faster. So it depends on type of data customer stores. Just something for you to consider. And the beauty is that you may have both types of pools in the same Ceph cluster.
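The capacity figures above follow from simple ratios. Here is a minimal Python sketch of that arithmetic (not from the original message; the function names are illustrative). It deliberately ignores nearfull/full ratios, BlueStore overhead, and whether the failure domain has enough buckets for a k+m layout.

```python
def replicated_usable_tb(num_osds: int, osd_size_tb: float, size: int) -> float:
    """Usable capacity of a replicated pool: raw capacity / replica count."""
    return num_osds * osd_size_tb / size


def erasure_usable_tb(num_osds: int, osd_size_tb: float, k: int, m: int) -> float:
    """Usable capacity of an erasure coded pool: raw * k / (k + m)."""
    return num_osds * osd_size_tb * k / (k + m)


if __name__ == "__main__":
    # 15 OSDs of 1 TB each, as in the example above.
    print("replicated size 3:", replicated_usable_tb(15, 1.0, 3), "TB")
    print("erasure k=8, m=2: ", erasure_usable_tb(15, 1.0, 8, 2), "TB")
```

Both layouts in this example tolerate the loss of two OSDs; the difference is only in how much raw space is spent on redundancy.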


In terms of updates: you can update the system and Ceph itself independently from the data store. You don't really need to shrink the cluster in order to upgrade. Just set noout flag (so cluster will not start rebalancing/recovery during prolonged OSD outage), upgrade your system, boot it back with original backing store (I hope you use Bluestore?), unset noout flag, let updated OSD to catch up with changes which happened while it was offline. I also would like to note that generally there's a good order of Ceph upgrade: monitors first, followed by OSDs and then other daemons (MDS and RGW) - just keep it in mind.


Also it appears that you have 3 nodes with 5 SSDs each. In this case I would recommend setting up a CRUSH tree correctly before doing anything else: this way you would be able to switch off whole node with 5 OSDs on it without loss of the service. This will happen because Ceph will ensure that no two copies are kept in a single node.


So again: there is no need for super quick recovery. Quick recovery, especially with a large amount of data, is your enemy - it needs a lot of resources and thus you would want to spread it around thinly to reduce the impact on system performance. Your interest is to have 100% uptime while you can upgrade/maintain your systems independently. That's what a well designed Ceph deployment allows you to have.


While it is possible to reconfigure Ceph in a way which will max out your IO on "failed" OSD I would consider it as misconfiguration.


Regards,

Vladimir

On 1/11/21 23:57, Lokendra Rathour wrote:
Hi Vladimir,
Thanks for the rich inputs and learning details. 
The main idea behind this change that we could think of is to get back the Ceph to health ok before the Maintenance window is over. 
Let me tell you the complete use case: Ceph upgrade from Octopus to Pacific, which also needs an OS upgrade from CentOS 7 to CentOS 8.

So the customer would anyways plan this activity in a maintenance window  ( duration of which can be planned). 
Considering we have 30% Data of total possible in the system ( note we have replication factor as 3) i.e of 13TB we can have max user data of approx 4 TB.
Out of three nodes, we plan to take one node at a time and follow the below steps in each run
  • Considering the maintenance window, we can assume i/o to be minimum or near zero. 
  • we will first shrink one node from the cluster.
  • Install Centos 8 on that node.
  • Add the Node back to cluster. 
  • Post which recovery/backfilling should start. 
  • So with 33% of data of 30% available ( approx 1 TB) would be a size that will actually be backfilled/rebalanced. 
This is the main reason why we are looking for a quick recovery. Any advice/inputs would help us build this use case better.
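The sizing in the steps above can be written down as a short Python sketch (not from the original message; function names are illustrative). It assumes 13 TB of raw capacity, replication size 3, 3 hosts with failure domain host, and data spread evenly, so that each host holds one copy of everything.

```python
def max_user_data_tb(raw_tb: float, replica_size: int) -> float:
    """Upper bound on user data when every object is stored replica_size times."""
    return raw_tb / replica_size


def backfill_per_host_tb(raw_tb: float, used_fraction: float, num_hosts: int) -> float:
    """Raw data that lives on one host, assuming an even spread across hosts."""
    return raw_tb * used_fraction / num_hosts


if __name__ == "__main__":
    print(f"max user data:      {max_user_data_tb(13, 3):.1f} TB")
    print(f"backfill per host:  {backfill_per_host_tb(13, 0.30, 3):.1f} TB")
```

Under these assumptions a reinstalled node has a little over 1 TB to receive, which is in line with the estimate given above.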
-Lokendra

On Fri, Oct 29, 2021 at 2:56 PM Vladimir Bashkirtsev <vladimir@xxxxxxxxxxxxxxx> wrote:

Hi Lokendra,

Now it looks more like a good ceph deployment :)

io recovery shows the TOTAL recovery rate at that very moment. At the moment 19 PGs (please familiarize yourself with the concept of PG at https://docs.ceph.com/en/latest/rados/operations/placement-groups/ ) are backfilling - meaning there are 19 sets of OSDs actively exchanging data to recover. This process goes in parallel - so that's good!

The settings max backfills and recovery max active are per OSD - so with 15 OSDs you may well have more than max backfills in progress. That's why you can see 19 backfills while your max backfills is at 8. Simply not every single possible PG needs to be moved and thus it moves all 19 misplaced PGs at once. I generally would not recommend increasing backfills in any form because they require CPU resources on each OSD and CPU is a finite resource. If you have 15 OSDs and have just the default max backfills of 1 you still may end up with 15 PGs moving.

In terms of changing settings on the fly: even though it says "change may require restart" these two settings are actually applied. Restarting the OSD will undo these temporary settings and will use the settings which are persistent in config. So if you elect to increase backfills for any reason (there is no good reason actually) leave them, and when the OSD eventually restarts these settings will revert to normal.

And now I will reiterate: Ceph is NOT RAID. You simply cannot expect that you just fail an OSD and suddenly you are going to get all resources thrown at recovery so your rig will start to smoke. It just does not work this way. Please read https://docs.ceph.com/en/latest/dev/osd_internals/backfill_reservation/ . Ceph has a very finely tuned algorithm which deals with recoveries. Why plural? Because there's a huge difference between backfilling remapped PGs (ie data is safe but not located where it is expected), backfilling degraded PGs (when a PG does not have all copies available but is still OK to be used) and recovery (when the number of copies is critical). In your case it shows 19 active+remapped+backfilling - data is safe and has all copies expected, so there is just no need to rush the move of the data. If you start to increase recovery speed you will pay for it dearly with degraded performance - hardly something you want in production, especially if there is no benefit from quick "recovery".

I hazard a guess that your pools are replicated with default size 3. It means that losing one OSD will cause some PGs to go degraded and recovery will start in 10 minutes if the OSD does not boot back up. Then you may expect recovery to start at a higher rate than you see during remapped+backfilling - you don't have enough copies and so Ceph will start putting more resources into recovery of this situation. Then you lose another OSD and some PGs (those which have both the first and the second failed OSD in their OSD set) will have just one copy left - obviously this situation cannot be tolerated and thus Ceph will go into recovery (but only for PGs which are at risk). It will throw a lot of resources at recovering these PGs as fast as it can and you may notice performance degradation for users at this time. Again, what is better to have: users who complain about slow service or users who complain about lost data? The answer is clear: Ceph needs to attend to data recovery immediately. But at the same time PGs which have lost only one OSD will not get the same treatment - they will be processed as degraded PGs at lower priority. PGs which now need to move to other OSDs due to rebalance but have all copies intact will get even less priority.

So the issue you have is that your expectations are simplified: you expect recovery to run at maximum theoretical limit without any care to the rest of the system. But you don't want to run at 100% speed your hardware can achieve. You want to ensure that your data durability is upheld, your clients receive good IO rate, recoveries actually taking back seat and not interfere with primary purpose of the storage system. So all you need to do is: a) stop changing these settings - they are there for a reason - it was tested and re-tested by real-life use. b) put some trust to ceph community and accept that its performance is well tuned to provide good end user performance and safety. End users really do not give a care about "speed of recovery". Neither should you.


Regards,

Vladimir

On 29/10/21 16:07, Lokendra Rathour wrote:
Hi Vladimir,
i have reconfigured the setup to 15 OSD Now,
Every 1.0s: sudo ceph -s                                                                          Fri Oct 29 10:21:07 2021

  cluster:
    id:     1a8bfc8a-ad9d-4a06-9963-5e84e7ce80ee
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum storagenode1,storagenode2,storagenode3 (age 2h)
    mgr: storagenode3(active, since 16h), standbys: storagenode2, storagenode1
    mds: cephfs:1 {0=storagenode3=up:active} 2 up:standby
    osd: 15 osds: 15 up (since 5m), 15 in (since 16h); 19 remapped pgs
    rgw: 3 daemons active (storagenode1.rgw0, storagenode2.rgw0, storagenode3.rgw0)

  task status:
    scrub status:
        mds.storagenode3: idle

  data:
    pools:   7 pools, 265 pgs
    objects: 4.13M objects, 1.9 TiB
    usage:   6.1 TiB used, 7.0 TiB / 13 TiB avail
    pgs:     662670/12381873 objects misplaced (5.352%)
             246 active+clean
             19  active+remapped+backfilling

  io:
    recovery: 114 MiB/s, 173 objects/s

I see the recovery as around 140 MiB/s, so is this per OSD or is this in total? From the message you have sent I understood that it is per OSD.
Also from the command "ceph tell 'osd.*' injectargs --osd-max-backfills=2 --osd-recovery-max-active=6" I do not see much visible difference. Do we have to restart the OSD service? After running this command I see:
[ansible@storagenode1 ~]$ sudo ceph tell 'osd.*' injectargs --osd-max-backfills=8 --osd-recovery-max-active=12
osd.0: osd_recovery_max_active = '12' (not observed, change may require restart)
osd.0: {}
osd.1: osd_recovery_max_active = '12' (not observed, change may require restart)
osd.1: {}
osd.2: osd_recovery_max_active = '12' (not observed, change may require restart)
osd.2: {}
osd.3: osd_recovery_max_active = '12' (not observed, change may require restart)
osd.3: {}
osd.4: osd_recovery_max_active = '12' (not observed, change may require restart)
osd.4: {}
osd.5: osd_recovery_max_active = '12' (not observed, change may require restart)
osd.5: {}
osd.6: osd_recovery_max_active = '12' (not observed, change may require restart)
osd.6: {}
osd.7: osd_recovery_max_active = '12' (not observed, change may require restart)
osd.7: {}
osd.8: osd_recovery_max_active = '12' (not observed, change may require restart)
osd.8: {}
osd.9: osd_recovery_max_active = '12' (not observed, change may require restart)
osd.9: {}
osd.10: osd_recovery_max_active = '12' (not observed, change may require restart)
osd.10: {}
osd.11: osd_recovery_max_active = '12' (not observed, change may require restart)
osd.11: {}
osd.12: osd_recovery_max_active = '12' (not observed, change may require restart)
osd.12: {}
osd.13: osd_recovery_max_active = '12' (not observed, change may require restart)
osd.13: {}
osd.14: osd_recovery_max_active = '12' (not observed, change may require restart)
osd.14: {}
It says the change may require a restart, but even after a restart there is no impact on the recovery rate.

thanks,
Lokendra


On Thu, Oct 28, 2021 at 1:53 PM Vladimir Bashkirtsev <vladimir@xxxxxxxxxxxxxxx> wrote:
1. You can do:
ceph tell 'osd.*' injectargs --osd-max-backfills=2 --osd-recovery-max-active=6
This will change these settings on the fly but they will be reset on OSD restart (each OSD will get them and will remember them until its own restart - you may have OSDs running with different settings).
2. Nothing to do with threads: it is scenario which I have covered in my previous response. If you have more than 3 OSDs you can have OSDs to pair up for data transfers thus (theoretically) 10 OSD cluster can have 5 pairs to transfer data in parallel at 150MB/s achieving total 750MB/s recovery speed.
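The pairing argument in point 2 is a theoretical upper bound. A minimal Python sketch of it (not from the original message; the function name and the fixed per-pair rate are illustrative assumptions):

```python
def max_parallel_recovery_mb_s(num_osds: int, per_pair_mb_s: float) -> float:
    """Theoretical ceiling if OSDs pair up and every pair transfers at per_pair_mb_s.

    Real clusters will fall short of this: CPU, disks and PG placement
    all limit how many pairs can actually run at full speed at once.
    """
    pairs = num_osds // 2  # each transfer needs a sender and a receiver
    return pairs * per_pair_mb_s


if __name__ == "__main__":
    for osds in (3, 10, 15):
        print(f"{osds} OSDs -> up to {max_parallel_recovery_mb_s(osds, 150):.0f} MB/s")
```

With only 3 OSDs there is room for a single pair, which is consistent with a recovery rate of around 150 MB/s on the original 3-OSD setup.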

Regards,
Vladimir

On 28 October 2021 7:11:31 pm AEDT, Lokendra Rathour <lokendrarathour@xxxxxxxxx> wrote:
Hey Johansson,
thanks for the update here. two things in line with your response.
  1. for now, I am able to change these values via ceph.conf and restart the osd service, so are there any runtime commands as well to do so ? I am using Ceph Pacific or Octopus version installed using ceph-ansible.
  2. what do you mean by " allow more parallelism" - are you referring to modifying threads with this config  "osd recovery threads" or please help elaborate.
Thanks once again for your help.



On Thu, Oct 28, 2021 at 1:05 PM Janne Johansson <icepic.dz@xxxxxxxxx> wrote:


Den tors 28 okt. 2021 kl 09:09 skrev Lokendra Rathour <lokendrarathour@xxxxxxxxx>:
Hi,
we have been trying to test  a scenario on ceph with the following configuration:
 cluster:
    id:     cc0ba1e4-68b9-4237-bc81-40b38455f713
    health: HEALTH_OK
  services:
    mon: 3 daemons, quorum storagenode1,storagenode2,storagenode3 (age 4h)
    mgr: storagenode2(active, since 22h), standbys: storagenode1, storagenode3
    mds: cephfs:1 {0=storagenode1=up:active} 2 up:standby
    osd: 3 osds: 3 up (since 4m), 3 in (since 4h)
    rgw: 3 daemons active (storagenode1.rgw0, storagenode2.rgw0, storagenode3.rgw0)
  task status:
    scrub status:
        mds.storagenode1: idle
  data:
    pools:   7 pools, 169 pgs
    objects: 1.06M objects, 1.3 TiB
    usage:   3.9 TiB used, 9.2 TiB / 13 TiB avail
    pgs:     169 active+clean
  io:
    client:   43 KiB/s wr, 0 op/s rd, 3 op/s wr
    recovery: 154 MiB/s, 98 objects/s
 
We have network links of 10GiG for all the networks used in Ceph. MTU is configured as 9000. But the transfer rate, as can be seen above, is at most 154 MiB/s, which I feel is way lower than possible.

Test Case:
We removed one node and added it back to the Ceph cluster after reinstalling the OS. During this activity, Ceph has around 1.3 TB to rebalance onto the newly added node. The time taken in such a case is approximately 4 hours.

Considering this as the production-grade setup with all production-grade infra, this time is too much.
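To relate the rebalance size, the rate and the elapsed time in the test case, here is a minimal Python sketch (not from the original message; function names are illustrative). It assumes the 1.3 TB figure is binary (TiB, as ceph -s reports it) and that the rate is constant for the whole run.

```python
MIB_PER_TIB = 1024 * 1024


def transfer_hours(data_tib: float, rate_mib_s: float) -> float:
    """Hours needed to move data_tib at a constant rate_mib_s."""
    return data_tib * MIB_PER_TIB / rate_mib_s / 3600


def average_rate_mib_s(data_tib: float, hours: float) -> float:
    """Average rate implied by moving data_tib in the given number of hours."""
    return data_tib * MIB_PER_TIB / (hours * 3600)


if __name__ == "__main__":
    print(f"1.3 TiB at 154 MiB/s: {transfer_hours(1.3, 154):.1f} h")
    print(f"1.3 TiB in 4 h:       {average_rate_mib_s(1.3, 4):.0f} MiB/s on average")
```

Under these assumptions, a run of about 4 hours implies an average rate well below the 154 MiB/s peak shown in the snapshot.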

Query:
  • Is there a way to optimize the recovery/rebalancing and i/o rate of Ceph?
  • we found a few suggestions on the internet that we can modify the below parameters to achieve a good rate, but is this advisable
    •   osd max backfills, osd recovery max active, osd recovery max single start 
  • we have dedicated 10gig n/w infra so can we have any ideal value to reach max rate of recovery.

Any input would be helpful, we are really blocked here.


If this is one spinning drive receiving data, then those figures look ok. If you instead had a large cluster with more drives, the sum of the recovery traffic would be more if you allow more parallelism. Looking at osd_max_backfills to see how many parallel backfills you will allow and looking at posts and guides like this:
might also help.



--
May the most significant bit of your life be positive.


--
skype: lokendrarathour


-- Sent from my Android device with K-9 Mail. Please excuse my brevity.


--
~ Lokendra









