Re: OSDs busy reading from Bluestore partition while bringing up nodes.

Hi,

Your settings

osd_max_backfills = 10
osd_recovery_max_active = 10

are above the defaults and too high. These limits are per OSD, so a single disk may be doing 10 backfills at the same time (actually 20, counting both incoming and outgoing).

Try lowering them dynamically:

ceph tell osd.* injectargs '--osd_max_backfills 1' 
ceph tell osd.* injectargs '--osd-recovery-max-active 1'
ceph tell osd.* injectargs '--osd_recovery_sleep 0.2'

Even if the commands return a message saying a restart is required, they do take effect. You can double-check by reading the value back from a specific local OSD (for example, osd.1):
ceph daemon osd.1 config get osd_max_backfills
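
If you want to verify the new value on every OSD running on a host rather than just one, a minimal sketch (assuming the stock admin socket path under /var/run/ceph; containerized OSDs need the equivalent path inside each container):

# loop over the local OSD admin sockets and print the current value
for sock in /var/run/ceph/ceph-osd.*.asok; do
    ceph daemon "$sock" config get osd_max_backfills
done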

You can also assign lower crush weights to the newly added OSDs, for example half of their final value, to reduce data movement, then bump them back up once the cluster has balanced.
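
A rough sketch of what that looks like (the OSD id and weights below are placeholders, not values from your cluster; the full value should be the drive's normal crush weight):

# bring the new OSD in at roughly half its eventual crush weight
ceph osd crush reweight osd.540 4.5
# later, once the cluster has balanced, restore the full weight
ceph osd crush reweight osd.540 9.1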

/Maged


On 12/01/2019 04:56, Subhachandra Chandra wrote:
Hi,

    We have a cluster with 9 hosts and 540 HDDs using Bluestore and containerized OSDs running Luminous 12.2.4. While trying to add new nodes, the cluster collapsed because it could not keep up with establishing enough TCP connections. We adjusted sysctl to handle more connections and to recycle TIME_WAIT sockets faster. Currently, as we try to restart the cluster by bringing up a few OSDs at a time, some of the OSDs get very busy after around 360 of them come up. iostat shows that the busy OSDs are constantly reading from the Bluestore partition. The number of busy OSDs per node varies; norecover is set and there are no active clients. The OSD logs don't show anything other than cephx verify_authorizer errors, which happen on both busy and idle OSDs and don't seem to be related to the drive reads.

  How can we figure out why the OSDs are busy reading from the drives? If it is some kind of recovery, is there a way to track progress? The output of ceph -s and logs from a busy and an idle OSD are copied below.

Thanks
Chandra

Uptime stats with load averages show variance across the 9 older nodes.

 02:43:44 up 19:21,  0 users,  load average: 0.88, 1.03, 1.06

 02:43:44 up  7:58,  0 users,  load average: 16.91, 13.49, 12.43

 02:43:44 up 1 day, 14 min,  0 users,  load average: 7.67, 6.70, 6.35

 02:43:45 up  7:01,  0 users,  load average: 84.40, 84.20, 83.73

 02:43:45 up  6:40,  1 user,  load average: 17.08, 17.40, 20.05

 02:43:45 up 19:46,  0 users,  load average: 15.58, 11.93, 11.44

 02:43:45 up 20:39,  0 users,  load average: 7.88, 6.50, 5.69

 02:43:46 up 1 day,  1:20,  0 users,  load average: 5.03, 3.81, 3.49

 02:43:46 up 1 day, 58 min,  0 users,  load average: 0.62, 1.00, 1.38


Ceph Config

--------------

[global]

cluster network = 192.168.13.0/24

fsid = <>

mon host = 172.16.13.101,172.16.13.102,172.16.13.103

mon initial members = ctrl1,ctrl2,ctrl3

mon_max_pg_per_osd = 750

mon_osd_backfillfull_ratio = 0.92

mon_osd_down_out_interval = 900

mon_osd_full_ratio = 0.95

mon_osd_nearfull_ratio = 0.85

osd_crush_chooseleaf_type = 3

osd_heartbeat_grace = 900

mon_osd_laggy_max_interval = 900

osd_max_pg_per_osd_hard_ratio = 1.0

public network = 172.16.13.0/24


[mon]

mon_compact_on_start = true


[osd]

osd_deep_scrub_interval = 2419200

osd_deep_scrub_stride = 4194304

osd_max_backfills = 10

osd_max_object_size = 276824064

osd_max_scrubs = 1

osd_max_write_size = 264

osd_pool_erasure_code_stripe_unit = 2097152

osd_recovery_max_active = 10

osd_heartbeat_interval = 15


Data node sysctl params

-----------------------------

fs.aio-max-nr=1048576

kernel.pid_max=4194303

kernel.threads-max=2097152

net.core.netdev_max_backlog=65536

net.core.optmem_max=1048576

net.core.rmem_max=8388608

net.core.rmem_default=8388608

net.core.somaxconn=2048

net.core.wmem_max=8388608

net.core.wmem_default=8388608

vm.max_map_count=524288

vm.min_free_kbytes=262144


net.ipv4.tcp_tw_reuse=1

net.ipv4.tcp_max_syn_backlog=16384

net.ipv4.tcp_fin_timeout=10

net.ipv4.tcp_slow_start_after_idle=0
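
(For reference, runtime changes like these can be applied with sysctl -w and then persisted under /etc/sysctl.d/; the file name below is just an illustration:)

# apply one setting immediately
sysctl -w net.ipv4.tcp_tw_reuse=1
# persist the setting and reload all sysctl configuration
echo "net.ipv4.tcp_tw_reuse=1" >> /etc/sysctl.d/90-ceph-net-tuning.conf
sysctl --system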



Ceph -s output

---------------


root@ctrl1:/# ceph -s

  cluster:

    id:     06126476-6deb-4baa-b7ca-50f5ccfacb68

    health: HEALTH_ERR

            noout,nobackfill,norebalance,norecover,noscrub,nodeep-scrub flag(s) set

            704 osds down

            9 hosts (540 osds) down

            71 nearfull osd(s)

            2 pool(s) nearfull

            780664/74163111 objects misplaced (1.053%)

            7724/8242239 objects unfound (0.094%)

            396 PGs pending on creation

            Reduced data availability: 32597 pgs inactive, 29764 pgs down, 820 pgs peering, 74 pgs incomplete, 1 pg stale

            Degraded data redundancy: 679158/74163111 objects degraded (0.916%), 1250 pgs degraded, 1106 pgs undersized

            33 slow requests are blocked > 32 sec

            9 stuck requests are blocked > 4096 sec

            mons ctrl1,ctrl2,ctrl3 are using a lot of disk space

 

  services:

    mon: 3 daemons, quorum ctrl1,ctrl2,ctrl3

    mgr: ctrl1(active), standbys: ctrl2, ctrl3

    osd: 1080 osds: 376 up, 1080 in; 1963 remapped pgs

         flags noout,nobackfill,norebalance,norecover,noscrub,nodeep-scrub

 

  data:

    pools:   2 pools, 33280 pgs

    objects: 8049k objects, 2073 TB

    usage:   2277 TB used, 458 TB / 2736 TB avail

    pgs:     3.585% pgs unknown

             94.363% pgs not active

             679158/74163111 objects degraded (0.916%)

             780664/74163111 objects misplaced (1.053%)

             7724/8242239 objects unfound (0.094%)

             29754 down

             1193  unknown

             535   peering

             496   activating+undersized+degraded+remapped

             284   remapped+peering

             258   active+undersized+degraded+remapped

             161   activating+degraded+remapped

             143   active+recovering+undersized+degraded+remapped

             89    active+undersized+degraded

             76    active+clean+remapped

             71    incomplete

             48    active+undersized+remapped

             46    undersized+degraded+peered

             34    active+recovering+degraded+remapped

             26    active+clean

             21    activating+remapped

             13    activating+undersized+degraded

             10    down+remapped

             9     active+recovery_wait+undersized+degraded+remapped

             4     activating

             3     remapped+incomplete

             2     activating+undersized+remapped

             1     stale+peering

             1     undersized+peered

             1     undersized+remapped+peered

             1     activating+degraded

 

root@ctrl1:/# ceph version

ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)


osd.86 - busy
-------------------

2019-01-12 02:30:02.582363 7f00ea0f4700  0 osd.86 213711 No AuthAuthorizeHandler found for protocol 0

2019-01-12 02:30:02.582383 7f00ea0f4700  0 -- 192.168.13.5:6806/351612 >> 192.168.13.8:6868/770039 conn(0x55c4a8d89000 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: got bad authorizer

2019-01-12 02:30:04.005541 7f00ea8f5700  0 auth: could not find secret_id=7544

2019-01-12 02:30:04.005554 7f00ea8f5700  0 cephx: verify_authorizer could not get service secret for service osd secret_id=7544

2019-01-12 02:30:04.005557 7f00ea8f5700  0 -- 192.168.13.5:6806/351612 >> 192.168.13.3:6836/405613 conn(0x55c4f3617800 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: got bad authorizer

2019-01-12 02:30:07.910816 7f00e98f3700  0 auth: could not find secret_id=7550

2019-01-12 02:30:07.910864 7f00e98f3700  0 cephx: verify_authorizer could not get service secret for service osd secret_id=7550

2019-01-12 02:30:07.910884 7f00e98f3700  0 -- 192.168.13.5:6806/351612 >> 192.168.13.8:6824/767660 conn(0x55c4ec462800 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: got bad authorizer

2019-01-12 02:30:16.982636 7f00ea8f5700  0 osd.86 213711 No AuthAuthorizeHandler found for protocol 0

2019-01-12 02:30:16.982640 7f00ea8f5700  0 -- 192.168.13.5:6806/351612 >> 192.168.13.6:6816/349322 conn(0x55c4aee6f000 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: got bad authorizer


osd.132 - idle

---------------

2019-01-12 02:31:38.370478 7fa508450700  0 auth: could not find secret_id=7551

2019-01-12 02:31:38.370487 7fa508450700  0 cephx: verify_authorizer could not get service secret for service osd secret_id=7551

2019-01-12 02:31:38.370489 7fa508450700  0 -- 192.168.13.5:6854/356774 >> 192.168.13.8:6872/1201589 conn(0x563b3e46f000 :6854 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: got bad authorizer

2019-01-12 02:31:41.121603 7fa509452700  0 auth: could not find secret_id=7544

2019-01-12 02:31:41.121672 7fa509452700  0 cephx: verify_authorizer could not get service secret for service osd secret_id=7544

2019-01-12 02:31:41.121707 7fa509452700  0 -- 192.168.13.5:6854/356774 >> 192.168.13.9:6808/515991 conn(0x563b53997800 :6854 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: got bad authorizer




-- 
Maged Mokhtar
CEO PetaSAN
4 Emad El Deen Kamel
Cairo 11371, Egypt
www.petasan.org
+201006979931
skype: maged.mokhtar
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
