Re: [Ceph] Recovery is very Slow

Christian Wuerdig <christian.wuerdig@xxxxxxxxx> · Tue, 2 Nov 2021 07:13:54 +1300

I would add that with the suggested EC 8+2 setup you need at least 10 OSD hosts, or you have to use OSD as the failure domain. In the latter case, the best you can achieve by carefully crafting your CRUSH rules is 3 shards on each of two hosts and 4 shards on the third. Your cluster will then block I/O whenever a host is down, since you have lost at least 3 shards out of 10 and the remaining 7 are insufficient. In fact, if one of your hosts gets blown away completely for some reason, you have just lost all the data in your cluster.

If you need to stick to 3 hosts, you can do EC 4+2 and put 2 shards on each host. That would still give 10TB of usable space, twice as much as with 3x replication (well, since you shouldn't run your cluster more than 80% full, it is practically 8TB). That way you can survive a total host failure. Also keep in mind that EC can come with significant space amplification depending on your use case.
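
To make the shard arithmetic above concrete, here is a minimal sketch. It assumes shards are spread as evenly as possible over the hosts, and it only checks whether at least k shards remain after a host is lost (i.e. whether the data is still reconstructable). The function names are made up for illustration and are not part of any Ceph tooling.

```python
def shards_per_host(k, m, hosts):
    """Spread the k+m shards of one PG as evenly as possible over `hosts` hosts."""
    base, extra = divmod(k + m, hosts)
    # `extra` hosts have to carry one additional shard
    return [base + 1] * extra + [base] * (hosts - extra)


def recoverable_after_host_loss(k, m, hosts):
    """For each host, report whether at least k shards remain if that host is lost."""
    layout = shards_per_host(k, m, hosts)
    return [(k + m) - lost >= k for lost in layout]


for k, m in [(8, 2), (4, 2)]:
    print(f"EC {k}+{m} on 3 hosts: layout {shards_per_host(k, m, 3)}, "
          f"recoverable per lost host: {recoverable_after_host_loss(k, m, 3)}")
```

With 8+2 no single host can be lost without dropping below k shards, while 4+2 with two shards per host keeps exactly k shards after any one host failure.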

On Tue, 2 Nov 2021 at 03:08, Vladimir Bashkirtsev <vladimir@xxxxxxxxxxxxxxx> wrote:

    Hi Lokendra,

    First of all: Ceph is designed to allow production use even while
      it is undergoing recovery activities. It basically throws away the
      concept of a planned maintenance window. Ceph is an exabyte-capable
      system: at that scale you simply cannot expect to have any sort of
      maintenance window. So my suggestion is quite simple: don't worry
      about recoveries, let Ceph manage them at the default settings -
      you will find that your customer will not notice any
      recovery/rebalance happening.

    Second: I guess it is worth noting that Ceph can also do erasure
      coding, which may give you good redundancy levels while using
      far less disk space. Put simply: with your 15 OSDs of 1TB each
      and replication of size 3 you can store 5TB of data, and your
      system can survive the simultaneous loss of 2 OSDs (disks).
      Or you can have a k=8, m=2 erasure-coded pool, which would let
      you store 12TB of data while still being able to survive the loss
      of 2 OSDs (disks). As you can see, you can store considerably more
      data with the same safety margin. However, the trade-off is that
      writing will be much slower and reading (generally) will be
      faster. So it depends on the type of data the customer stores.
      Just something for you to consider. And the beauty is that you
      can have both types of pools in the same Ceph cluster.
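
    The capacity figures in the paragraph above follow from simple ratios. A minimal sketch, assuming 15 OSDs of 1 TB each and ignoring metadata overhead and fill-level headroom (the helper names are illustrative only):

```python
def usable_replicated(raw_tb, size):
    """Usable capacity of a replicated pool: every object is stored `size` times."""
    return raw_tb / size


def usable_erasure_coded(raw_tb, k, m):
    """Usable capacity of an EC pool: k data shards out of every k+m shards stored."""
    return raw_tb * k / (k + m)


raw_tb = 15 * 1.0  # 15 OSDs of 1 TB each
print("replication size 3:", usable_replicated(raw_tb, 3), "TB")
print("EC k=8, m=2:       ", usable_erasure_coded(raw_tb, 8, 2), "TB")
```

    These are upper bounds on paper; in practice you would keep some free space in reserve.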

    In terms of updates: you can update the system and Ceph itself
      independently of the data store. You don't really need to shrink
      the cluster in order to upgrade. Just set the noout flag (so the
      cluster will not start rebalancing/recovery during a prolonged OSD
      outage), upgrade your system, boot it back up with the original
      backing store (I hope you use BlueStore?), unset the noout flag,
      and let the updated OSDs catch up with the changes which happened
      while they were offline. I would also like to note that there is
      generally a good order for a Ceph upgrade: monitors first,
      followed by OSDs and then the other daemons (MDS and RGW) - just
      keep it in mind.
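
    The per-node sequence described above can be written down as a dry-run checklist. This is only a sketch: it prints the steps and runs nothing. The `ceph osd set noout`, `ceph osd unset noout` and `ceph -s` commands are the standard CLI calls; the function name and the host name are just examples.

```python
def maintenance_plan(host):
    """Return the ordered (command, purpose) steps for upgrading one OSD host."""
    return [
        ("ceph osd set noout",
         "keep the cluster from marking the host's OSDs out and rebalancing"),
        (None,
         f"upgrade and reboot {host}, leaving its OSD data devices untouched"),
        ("ceph osd unset noout",
         "return to normal behaviour once the OSDs are booted again"),
        ("ceph -s",
         "watch the updated OSDs catch up until all PGs are active+clean"),
    ]


for command, purpose in maintenance_plan("storagenode1"):
    print(f"{command or '(manual step)':<22} # {purpose}")
```

    Repeat the same plan for each node in turn, waiting for the cluster to settle before moving on to the next one.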

    Also, it appears that you have 3 nodes with 5 SSDs each. In this
      case I would recommend setting up the CRUSH tree correctly before
      doing anything else: this way you will be able to switch off a
      whole node with 5 OSDs on it without loss of service. This
      works because Ceph will ensure that no two copies are kept
      on a single node.

    So again: there is no need for super quick recovery. Quick recovery,
      especially with a large amount of data, is your enemy - it needs a
      lot of resources, and thus you want to spread it thinly
      to reduce the impact on system performance. Your interest is
      to have 100% uptime while being able to upgrade/maintain your
      systems independently. That's what a well designed Ceph
      deployment allows you to have.

    While it is possible to reconfigure Ceph in a way which will max
      out the IO on a "failed" OSD, I would consider that a
      misconfiguration.

    Regards,
    Vladimir

    On 1/11/21 23:57, Lokendra Rathour wrote:

      Hi Vladimir,
        Thanks for the rich inputs and learning details.
        The main idea behind this change is to get Ceph back to
          HEALTH_OK before the maintenance window is over.
        Let me tell you the complete use case: a Ceph upgrade
          from Octopus to Pacific, which also needs an OS upgrade
          from CentOS 7 to CentOS 8.

        So the customer would anyway plan this activity in
          a maintenance window (the duration of which can be planned).
        Assume the system is filled to 30% of its total capacity
          (note that we have a replication factor of 3), i.e. of 13TB
          we can have at most approx. 4 TB of user data.
        Out of the three nodes, we plan to take one node at a time and
          follow the steps below in each run:

            Given the maintenance window, we can assume I/O to
              be minimal or near zero.
            We will first shrink the cluster by one node.
            Install CentOS 8 on that node.
            Add the node back to the cluster.
            After this, recovery/backfilling should start.
            So 33% of the 30% of data present (approx. 1 TB) is
              the amount that will actually be
              backfilled/rebalanced.

          This is the main reason why we are looking for a quick
          recovery. Any advice/inputs would help us build this use case
          better.
        -Lokendra

        On Fri, Oct 29, 2021 at 2:56 PM Vladimir Bashkirtsev <vladimir@xxxxxxxxxxxxxxx> wrote:

            Hi Lokendra,
            Now it looks more like a good Ceph deployment :)
            The io recovery line shows the TOTAL recovery rate at that
              very moment. At the moment 19 PGs (please familiarize
              yourself with the concept of a PG at
              https://docs.ceph.com/en/latest/rados/operations/placement-groups/
              ) are backfilling - meaning there are 19 sets of OSDs
              actively exchanging data to recover. This process runs in
              parallel - so that's good!
            The max backfills and recovery max active settings are
              per OSD - so with 15 OSDs you may well have more than max
              backfills in progress. That's why you can see 19 backfills
              while your max backfills is at 8. Not every PG needs to be
              moved, so it simply moves all 19 misplaced PGs at once. I
              generally would not recommend increasing backfills in any
              form, because they require CPU resources on each OSD and
              CPU is a finite resource. If you have 15 OSDs and just the
              default max backfills of 1, you may still end up with 15
              PGs moving.
            In terms of changing settings on the fly: even though it
              says "change may require restart", these two settings are
              actually applied. Restarting an OSD will undo these
              temporary settings, and it will use the settings which are
              persistent in the config. So if you do elect to increase
              backfills for any reason (there is actually no good
              reason), just leave them, and when the OSD eventually
              restarts these settings will revert to normal.
            And now I will reiterate: Ceph is NOT RAID. You simply
              cannot expect that you just fail an OSD and suddenly
              all resources get thrown at recovery so that your rig
              starts to smoke. It just does not work this way.
              Please read https://docs.ceph.com/en/latest/dev/osd_internals/backfill_reservation/
              . Ceph has a very finely tuned algorithm which deals with
              recoveries. Why plural? Because there's a huge difference
              between backfilling remapped PGs (i.e. data is safe but
              not located where it is expected), backfilling degraded
              PGs (when a PG does not have all copies available but is
              still OK to be used) and recovery (when the number of
              copies is critical). In your case it shows 19
              active+remapped+backfilling - the data is safe and has all
              expected copies, so there is just no need to rush the move
              of the data. If you start to increase recovery speed you
              will pay for it dearly in degraded performance - hardly
              something you want in production, especially if there is
              no benefit from quick "recovery".
            I hazard a guess that your pools are replicated with the
              default size of 3. It means that losing one OSD will cause
              some PGs to go degraded, and recovery will start after 10
              minutes if the OSD does not come back up. Then you may
              expect recovery to run at a higher rate than you see
              during remapped+backfilling - you don't have enough copies
              and so Ceph will put more resources into recovering from
              this situation. Then you lose another OSD, and some PGs
              (those which have both the first and the second failed OSD
              in their OSD set) will have just one copy left - obviously
              this situation cannot be tolerated and thus Ceph will go
              into recovery (but only for the PGs which are at risk). It
              will throw a lot of resources at recovering these PGs as
              fast as it can, and you may notice performance degradation
              for users at this time. Again, what is better to have:
              users who complain about slow service or users who
              complain about lost data? The answer is clear: Ceph needs
              to attend to data recovery immediately. But at the same
              time, PGs which have lost only one OSD will not get the
              same treatment - they will be processed as degraded PGs at
              lower priority. PGs which now need to move to other OSDs
              due to rebalancing but have all copies intact will get
              even lower priority.
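
            The three tiers described above can be pictured as a simple ordering. The sketch below is purely illustrative: the rank numbers are invented to show the relative order and are NOT Ceph's internal priority values, and the PG ids are hypothetical.

```python
from enum import IntEnum


class Urgency(IntEnum):
    # Illustrative ranks only; these are NOT Ceph's internal priority values.
    REMAPPED = 1   # all copies intact, data just sits on the wrong OSDs
    DEGRADED = 2   # at least one copy missing, PG still usable
    CRITICAL = 3   # only a single copy left


def classify(pool_size, copies_available):
    """Map the number of surviving copies of a PG to an urgency class."""
    if copies_available >= pool_size:
        return Urgency.REMAPPED
    if copies_available > 1:
        return Urgency.DEGRADED
    return Urgency.CRITICAL


# Hypothetical PG ids mapped to the number of copies still available (pool size 3).
pgs = {"1.0": 3, "1.1": 1, "1.2": 2}
order = sorted(pgs, key=lambda pg: classify(3, pgs[pg]), reverse=True)
print(order)  # most urgent PG first
```

            The PG with a single remaining copy sorts first, the degraded one second, and the merely misplaced one last, which mirrors the order of attention described in the text.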
            So the issue you have is that your expectations are
              simplified: you expect recovery to run at the maximum
              theoretical limit without any care for the rest of the
              system. But you don't want to run at 100% of the speed
              your hardware can achieve. You want to ensure that your
              data durability is upheld, your clients receive a good IO
              rate, and recoveries take a back seat and do not interfere
              with the primary purpose of the storage system. So all you
              need to do is: a) stop changing these settings - they are
              there for a reason, and were tested and re-tested in
              real-life use; b) put some trust in the Ceph community and
              accept that it is well tuned to provide good end user
              performance and safety. End users really do not care
              about "speed of recovery". Neither should you.

            Regards,
            Vladimir
            On 29/10/21 16:07, Lokendra Rathour wrote:

                Hi Vladimir,

                  I have now reconfigured the setup to 15 OSDs:

                    Every 1.0s: sudo ceph -s                Fri Oct 29 10:21:07 2021

                      cluster:
                        id:     1a8bfc8a-ad9d-4a06-9963-5e84e7ce80ee
                        health: HEALTH_OK

                      services:
                        mon: 3 daemons, quorum storagenode1,storagenode2,storagenode3 (age 2h)
                        mgr: storagenode3(active, since 16h), standbys: storagenode2, storagenode1
                        mds: cephfs:1 {0=storagenode3=up:active} 2 up:standby
                        osd: 15 osds: 15 up (since 5m), 15 in (since 16h); 19 remapped pgs
                        rgw: 3 daemons active (storagenode1.rgw0, storagenode2.rgw0, storagenode3.rgw0)

                      task status:
                        scrub status:
                            mds.storagenode3: idle

                      data:
                        pools:   7 pools, 265 pgs
                        objects: 4.13M objects, 1.9 TiB
                        usage:   6.1 TiB used, 7.0 TiB / 13 TiB avail
                        pgs:     662670/12381873 objects misplaced (5.352%)
                                 246 active+clean
                                 19  active+remapped+backfilling

                      io:
                        recovery: 114 MiB/s, 173 objects/s

                  I see the recovery rate is around 140 MiB/s. Is this
                  per OSD or in total? From the message you sent, I
                  understood that it is per OSD.
                  Also, after running the command "ceph tell 'osd.*'
                    injectargs --osd-max-backfills=2
                    --osd-recovery-max-active=6" I do not see much
                    visible difference. Do we have to restart the OSD
                    service? I ask because after running this command I
                    see:

                    [ansible@storagenode1 ~]$ sudo ceph tell 'osd.*' injectargs --osd-max-backfills=8 --osd-recovery-max-active=12
                    osd.0: osd_recovery_max_active = '12' (not observed, change may require restart)
                    osd.0: {}
                    osd.1: osd_recovery_max_active = '12' (not observed, change may require restart)
                    osd.1: {}
                    osd.2: osd_recovery_max_active = '12' (not observed, change may require restart)
                    osd.2: {}
                    osd.3: osd_recovery_max_active = '12' (not observed, change may require restart)
                    osd.3: {}
                    osd.4: osd_recovery_max_active = '12' (not observed, change may require restart)
                    osd.4: {}
                    osd.5: osd_recovery_max_active = '12' (not observed, change may require restart)
                    osd.5: {}
                    osd.6: osd_recovery_max_active = '12' (not observed, change may require restart)
                    osd.6: {}
                    osd.7: osd_recovery_max_active = '12' (not observed, change may require restart)
                    osd.7: {}
                    osd.8: osd_recovery_max_active = '12' (not observed, change may require restart)
                    osd.8: {}
                    osd.9: osd_recovery_max_active = '12' (not observed, change may require restart)
                    osd.9: {}
                    osd.10: osd_recovery_max_active = '12' (not observed, change may require restart)
                    osd.10: {}
                    osd.11: osd_recovery_max_active = '12' (not observed, change may require restart)
                    osd.11: {}
                    osd.12: osd_recovery_max_active = '12' (not observed, change may require restart)
                    osd.12: {}
                    osd.13: osd_recovery_max_active = '12' (not observed, change may require restart)
                    osd.13: {}
                    osd.14: osd_recovery_max_active = '12' (not observed, change may require restart)
                    osd.14: {}

                  It says the change may require a restart, but even
                    after a restart there is no change in the recovery
                    rate.

                  thanks,
                  Lokendra

                  On Thu, Oct 28, 2021 at 1:53 PM Vladimir Bashkirtsev <vladimir@xxxxxxxxxxxxxxx> wrote:

                    1. You can do:

                      ceph tell 'osd.*' injectargs --osd-max-backfills=2 --osd-recovery-max-active=6

                      This will change these settings on the fly, but
                      they will be reset on OSD restart (each OSD will
                      get them and will remember them until its own
                      restart - you may end up with OSDs running with
                      different settings).

                    2. Nothing to do with threads: it is the scenario
                      which I covered in my previous response. If
                      you have more than 3 OSDs, OSDs can pair up for
                      data transfers, so (theoretically) a 10
                      OSD cluster can have 5 pairs transferring data in
                      parallel at 150MB/s each, achieving a total
                      recovery speed of 750MB/s.
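
                      The pairing arithmetic in point 2 can be sketched as below. It is a theoretical upper bound only, assuming every OSD takes part in exactly one transfer at a time and each pair sustains the same rate; the 1.3 TB figure is the amount of data mentioned in the original question, and the function names are illustrative.

```python
def aggregate_recovery_mb_s(osds, per_pair_mb_s):
    """Theoretical total rate when OSDs pair up and every pair transfers in parallel."""
    pairs = osds // 2
    return pairs * per_pair_mb_s


def hours_to_move(data_tb, rate_mb_s):
    """Hours needed to move `data_tb` terabytes at a sustained `rate_mb_s` MB/s."""
    return data_tb * 1_000_000 / rate_mb_s / 3600


total = aggregate_recovery_mb_s(10, 150)
print(f"10 OSDs, 5 pairs: {total} MB/s in total")
print(f"1.3 TB over one pair:   {hours_to_move(1.3, 150):.1f} h")
print(f"1.3 TB over five pairs: {hours_to_move(1.3, total):.1f} h")
```

                      Real recoveries will be slower than this, because client I/O, CPU and the reservation limits all compete with the transfers.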

                      Regards,

                      Vladimir

                      On 28 October 2021 7:11:31 pm AEDT, Lokendra Rathour <lokendrarathour@xxxxxxxxx> wrote:

                            Hey Johansson,
                              Thanks for the update here. Two
                                things in line with your response:

                                  For now, I am able to change these
                                    values via ceph.conf and a restart
                                    of the OSD service. Are there any
                                    runtime commands to do so as well?
                                    I am using Ceph Pacific or Octopus,
                                    installed using ceph-ansible.
                                  What do you mean by "allow more
                                    parallelism"? Are you referring to
                                    modifying threads with the config
                                    "osd recovery threads"? Please
                                    help elaborate.

                                Thanks once again for your help.

                              On Thu, Oct 28, 2021 at 1:05 PM Janne Johansson <icepic.dz@xxxxxxxxx> wrote:

                                    On Thu, 28 Oct 2021 at 09:09, Lokendra Rathour <lokendrarathour@xxxxxxxxx> wrote:

                                      Hi,
                                        We have been trying to
                                          test a scenario on Ceph with
                                          the following configuration:
                                          cluster:
                                            id:     cc0ba1e4-68b9-4237-bc81-40b38455f713
                                            health: HEALTH_OK

                                          services:
                                            mon: 3 daemons, quorum storagenode1,storagenode2,storagenode3 (age 4h)
                                            mgr: storagenode2(active, since 22h), standbys: storagenode1, storagenode3
                                            mds: cephfs:1 {0=storagenode1=up:active} 2 up:standby
                                            osd: 3 osds: 3 up (since 4m), 3 in (since 4h)
                                            rgw: 3 daemons active (storagenode1.rgw0, storagenode2.rgw0, storagenode3.rgw0)

                                          task status:
                                            scrub status:
                                                mds.storagenode1: idle

                                          data:
                                            pools:   7 pools, 169 pgs
                                            objects: 1.06M objects, 1.3 TiB
                                            usage:   3.9 TiB used, 9.2 TiB / 13 TiB avail
                                            pgs:     169 active+clean

                                          io:
                                            client:   43 KiB/s wr, 0 op/s rd, 3 op/s wr
                                            recovery: 154 MiB/s, 98 objects/s

                                          We have 10 Gigabit network
                                            links for all the networks
                                            used in Ceph. MTU is
                                            configured as 9000. But the
                                            transfer rate, as can be
                                            seen above, is at most 154
                                            MiB/s, which I feel is far
                                            lower than what is possible.

                                          Test Case:
                                          We removed one node and
                                            added it back to the Ceph
                                            cluster after reinstalling
                                            the OS. During this
                                            activity, Ceph has around
                                            1.3 TB to rebalance onto the
                                            newly added node. The time
                                            taken in such a case is
                                            approximately 4 hours.

                                          Considering this is a
                                            production-grade setup with
                                            all production-grade infra,
                                            this time is too long.

                                          Query:

                                              Is there a way to
                                                optimize the
                                                recovery/rebalancing and
                                                I/O rate of Ceph?
                                              We found a few
                                                suggestions on the
                                                internet that we can
                                                modify the parameters
                                                below to achieve a
                                                good rate, but is this
                                                advisable?

                                                  osd max backfills,
                                                  osd recovery max
                                                  active, osd recovery
                                                  max single start

                                              We have dedicated
                                                10 Gigabit network
                                                infra, so is there an
                                                ideal value to reach the
                                                max rate of recovery?

                                          Any input would be
                                            helpful, we are really
                                            blocked here.

                                    If this is one spinning drive
                                      receiving data, then those figures
                                      look ok. If you instead had a
                                      large cluster with more drives,
                                      the sum of the recovery traffic
                                      would be more if you allow more
                                      parallelism. Looking at
                                      osd_max_backfills to see how many
                                      parallel backfills you will allow
                                      and looking at posts and guides
                                      like this:
                                    https://www.suse.com/support/kb/doc/?id=000019693

                                    might also help.

                                  -- 

                                  May the most
                                    significant bit of your life be
                                    positive.


_______________________________________________

Dev mailing list -- dev@xxxxxxx

To unsubscribe send an email to dev-leave@xxxxxxx
