Hello,

I've run into an issue and would appreciate any assistance anyone can provide, as I haven't been able to solve this problem yet and am running out of ideas.

I ran a reweight-by-utilization on my cluster using conservative values so that it wouldn't cause a large rebalancing. The reweight ran and changed some OSD weights, but shortly afterward OSDs started getting marked down and out of the cluster. This continued until I set nodown and noout, but by that time the cluster had fewer than 200 OSDs up and in out of 540. The OSD processes started pegging the servers at 100% CPU usage. The problem seems similar to one or two others I've seen on the lists or in the bug tracker, where the OSDs fall behind on OSD maps and spin at 100% CPU trying to catch up.

The cluster has been down for about 30 hours now. After setting noout and nodown the cluster started to slowly mark OSDs up and in, averaging about 8 OSDs/hour. At around 490/540 up I unset nodown to see if these OSDs were really up, and the cluster immediately marked most of them down again, until I reset nodown at 161/540 up. Once again the cluster started marking OSDs up until it got to 245, but then it stopped increasing. The mon debug logs show that the cluster wants to mark the majority of OSDs down. Restarting OSD processes doesn't bring them up and in, mgr restarts didn't improve anything, and OSD node reboots don't seem to do anything positive. Some logs suggest there are authentication issues among the daemons and that daemons are simply waiting for new maps. I can often see "newest_map" incrementing on OSD daemons, but it is slow and some are behind by thousands of epochs.

Thanks,
Kevin

Cluster details:
CentOS 7.4
Kraken ceph-11.2.1-0.el7.x86_64
540 OSDs, 3 mon/mgr/mds
~3.6 PB, 72% raw used, ~40 million objects
24 OSDs per server
~25k PGs, mostly EC k=4 m=1, plus 2 small replicated pools

The command that broke everything; this should have resulted in less than ~20 TB migrating:

# ceph osd reweight-by-utilization 110 0.1 10
moved 200 / 112640 (0.177557%)
avg 208.593
stddev 60.2522 -> 60.1992 (expected baseline 14.4294)
min osd.351 with 98 -> 98 pgs (0.469815 -> 0.469815 * mean)
max osd.503 with 310 -> 310 pgs (1.48615 -> 1.48615 * mean)

oload 110
max_change 0.1
max_change_osds 10
average_utilization 0.7176
overload_utilization 0.7893
osd.244 weight 1.0000 -> 0.9000
osd.167 weight 1.0000 -> 0.9000
osd.318 weight 1.0000 -> 0.9000
osd.302 weight 0.8544 -> 0.7545
osd.264 weight 1.0000 -> 0.9000
osd.233 weight 1.0000 -> 0.9000
osd.18 weight 1.0000 -> 0.9000
osd.268 weight 0.8728 -> 0.7729
osd.14 weight 1.0000 -> 0.9000
osd.343 weight 1.0000 -> 0.9000

A lot of OSDs are stuck in the preboot state and are marked down in the map, but they seem to be behind the osdmap reported on the monitors.
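For what it's worth, here is roughly how I've been comparing each OSD's newest_map against the monitors' osdmap epoch on a given host. It's only a rough sketch: it assumes the default admin socket paths under /var/run/ceph, that the host can reach the mons with the usual admin key, and it just scrapes the epoch out of the Kraken-style "ceph osd stat" output shown further down.

# Run on an OSD host: how far behind is each local OSD on osdmaps?
cur=$(ceph osd stat | sed -n 's/.*osdmap e\([0-9]*\).*/\1/p')
echo "mon osdmap epoch: $cur"
for sock in /var/run/ceph/ceph-osd.*.asok; do
    id=$(basename "$sock" .asok | cut -d. -f2)
    newest=$(ceph --admin-daemon "$sock" status | sed -n 's/.*"newest_map": \([0-9]*\).*/\1/p')
    echo "osd.$id newest_map=$newest behind=$((cur - newest))"
done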
From osd.1:

    "whoami": 1,
    "state": "preboot",
    "oldest_map": 502767,
    "newest_map": 516124,

# ceph osd stat
osdmap e518386: 540 osds: 245 up, 530 in; 14626 remapped pgs
        flags nodown,noout,sortbitwise,require_jewel_osds,require_kraken_osds

Some OSDs are caught up to the map and are in the cluster while nodown and noout are set:

# ceph daemon osd.0 status
{
    "cluster_fsid": "bc2a1488-74f8-4d87-b2f6-615ae26bf7c9",
    "osd_fsid": "7d88f4d2-cebd-477c-a344-ffbca8c471a8",
    "whoami": 0,
    "state": "active",
    "oldest_map": 502767,
    "newest_map": 518391,
    "num_pgs": 228

But the mon debug log suggests it wants to mark it and others down:

2017-11-05 02:16:31.598083 7f42382d9700 5 mon.sephmon1@0(leader).osd e518321 can_mark_down NODOWN flag set, will not mark osd.0 down
2017-11-05 02:16:31.598095 7f42382d9700 5 mon.sephmon1@0(leader).osd e518321 can_mark_down NODOWN flag set, will not mark osd.3 down
2017-11-05 02:16:31.598100 7f42382d9700 5 mon.sephmon1@0(leader).osd e518321 can_mark_down NODOWN flag set, will not mark osd.4 down
2017-11-05 02:16:31.598104 7f42382d9700 5 mon.sephmon1@0(leader).osd e518321 can_mark_down NODOWN flag set, will not mark osd.5 down
2017-11-05 02:16:31.598107 7f42382d9700 5 mon.sephmon1@0(leader).osd e518321 can_mark_down NODOWN flag set, will not mark osd.6 down
2017-11-05 02:16:31.598111 7f42382d9700 5 mon.sephmon1@0(leader).osd e518321 can_mark_down NODOWN flag set, will not mark osd.8 down

The mon log is full of messages like:

2017-11-05 02:55:55.512643 7f423b3d9700 0 -- 10.1.9.201:6789/0 >> - conn(0x55cfe1ae6800 :6789 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half accept state just closed

The mgr log has a lot of:

2017-11-05 02:58:05.813420 7f76dcf90700 0 -- 10.1.9.201:6801/24033 >> - conn(0x55e93ed39000 :6801 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half accept state just closed
2017-11-05 02:58:05.998992 7f76dcf90700 0 -- 10.1.9.201:6801/24033 >> - conn(0x55e942ff9000 :6801 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half accept state just closed
2017-11-05 02:58:06.496972 7f76da816700 -1 mgr ms_dispatch fsmap(e 6691) v1
2017-11-05 02:58:08.042295 7f76d8812700 1 mgr send_beacon active
2017-11-05 02:58:08.429476 7f76da816700 -1 mgr ms_dispatch mgrdigest v1
2017-11-05 02:58:08.537725 7f76dcf90700 0 -- 10.1.9.201:6801/24033 >> 10.1.9.24:6816/5624 conn(0x55e947cde000 :6801 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 2 vs existing csq=2 existing_state=STATE_CONNECTING
2017-11-05 02:58:08.538037 7f76dcf90700 0 -- 10.1.9.201:6801/24033 >> 10.1.9.24:6816/5624 conn(0x55e947cde000 :6801 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0).fault with nothing to send and in the half accept state just closed
2017-11-05 02:58:08.613374 7f76dcf90700 0 -- 10.1.9.201:6801/24033 >> - conn(0x55e941843800 :6801 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half accept state just closed
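For reference, those can_mark_down lines only show up with the mon debug level turned up. Something along these lines on the lead mon reproduces it; I'm assuming the level-5 lines need debug_mon at 5 or higher in the file log, and that 1/5 is the default to put back afterwards.

# On the mon host: raise mon debug, watch the mark-down decisions, then restore it.
ceph daemon mon.sephmon1 config set debug_mon 5/5
tail -f /var/log/ceph/ceph-mon.sephmon1.log | grep can_mark_down
# (ctrl-c the tail when done, then drop the level back)
ceph daemon mon.sephmon1 config set debug_mon 1/5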
OSD logs for a 'down' OSD:

2017-11-05 02:59:44.256472 7f68fc0ad700 0 -- 0.0.0.0:6835/35706 >> - conn(0x55acb8c11800 :6835 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half accept state just closed
2017-11-05 02:59:44.751890 7f68fc8ae700 0 -- 10.1.9.1:6833/35706 >> - conn(0x55acc93ee000 :6833 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half accept state just closed
2017-11-05 02:59:44.934979 7f68fc0ad700 0 -- 0.0.0.0:6835/35706 >> - conn(0x55acbc5ef800 :6835 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half accept state just closed
2017-11-05 02:59:44.935007 7f68fc8ae700 0 -- 0.0.0.0:6834/35706 >> - conn(0x55accb28c800 :6834 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half accept state just closed

OSD logs for an 'up' OSD:

2017-11-05 03:00:24.266686 7f2459f41700 -1 osd.0 518417 heartbeat_check: no reply from 10.1.9.8:6851 osd.250 since back 2017-11-05 03:00:21.586832 front 2017-11-05 02:59:35.268399 (cutoff 2017-11-05 03:00:04.266664)
2017-11-05 03:00:24.922615 7f245de82700 0 -- 10.1.9.1:6813/1027111 >> 10.1.9.20:6825/2005105 conn(0x5580adf00000 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=0).handle_connect_reply connect got BADAUTHORIZER
2017-11-05 03:00:25.100982 7f245e683700 0 -- 10.1.9.1:6813/1027111 >> 10.1.9.8:6893/1015456 conn(0x5580acba9000 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=0).handle_connect_reply connect got BADAUTHORIZER
2017-11-05 03:00:25.266906 7f2459f41700 -1 osd.0 518417 heartbeat_check: no reply from 10.1.9.8:6851 osd.250 since back 2017-11-05 03:00:25.087925 front 2017-11-05 02:59:35.268399 (cutoff 2017-11-05 03:00:05.266894)

ceph status:

[root@sephmon1 ceph]# ceph status
    cluster bc2a1488-74f8-4d87-b2f6-615ae26bf7c9
     health HEALTH_ERR
            25701 pgs are stuck inactive for more than 300 seconds
            2 pgs backfill_wait
            2 pgs backfilling
            1151 pgs degraded
            19114 pgs down
            2496 pgs peering
            1 pgs recovering
            3973 pgs stale
            1150 pgs stuck degraded
            21728 pgs stuck inactive
            3973 pgs stuck stale
            22805 pgs stuck unclean
            1150 pgs stuck undersized
            1151 pgs undersized
            542105 requests are blocked > 32 sec
            recovery 3150732/243157671 objects degraded (1.296%)
            recovery 21562/243157671 objects misplaced (0.009%)
            recovery 1/51923123 unfound (0.000%)
            mds0: 4755 slow requests are blocked > 30 sec
            284/530 in osds are down
            nodown,noout flag(s) set
            mon.sephmon1 low disk space
            mon.sephmon2 low disk space
            mon.sephmon3 low disk space
     monmap e15: 3 mons at {sephmon1=10.1.9.201:6789/0,sephmon2=10.1.9.202:6789/0,sephmon3=10.1.9.203:6789/0}
            election epoch 916, quorum 0,1,2 sephmon1,sephmon2,sephmon3
      fsmap e6708: 1/1/1 up {0=sephmon1=up:active}, 1 up:standby
        mgr active: sephmon1 standbys: sephmon2, sephmon3
     osdmap e518431: 540 osds: 246 up, 530 in; 14672 remapped pgs
            flags nodown,noout,sortbitwise,require_jewel_osds,require_kraken_osds
      pgmap v52920051: 23552 pgs, 6 pools, 2030 TB data, 50706 kobjects
            2404 TB used, 942 TB / 3347 TB avail
            3150732/243157671 objects degraded (1.296%)
            21562/243157671 objects misplaced (0.009%)
            1/51923123 unfound (0.000%)
               10465 down+remapped
                6972 down
                1124 stale+remapped+peering
                1029 active+undersized+degraded
                 935 stale+down+remapped
                 742 stale+down
                 691 stale+peering
                 494 remapped+peering
                 397 active+clean
                 347 stale+active+clean
                 187 peering
                  58 stale+activating+undersized+degraded
                  36 stale+active+undersized+degraded
                  14 stale+remapped
                  12 activating+undersized+degraded
                  11 stale+activating+undersized+degraded+remapped
                   7 remapped
                   6 stale
                   5 activating
                   4 active+remapped
                   3 active+undersized+degraded+remapped
                   2 stale+activating+remapped
                   2 stale+active+remapped+backfill_wait
                   2 stale+activating
                   1 stale+active+clean+scrubbing
                   1 active+recovering+undersized+degraded
                   1 stale+active+remapped+backfilling
                   1 inactive
                   1 active+clean+scrubbing
                   1 stale+active+clean+scrubbing+deep
                   1 active+undersized+degraded+remapped+backfilling
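Going back to the BADAUTHORIZER messages above: since cephx tickets are time-sensitive, one thing I want to rule out is clock skew between hosts. A quick-and-dirty check, assuming passwordless ssh and a plain list of OSD hostnames in osd-hosts.txt (my own file, nothing Ceph provides):

# Compare each OSD host's clock against this node.
while read -r h; do
    remote=$(ssh "$h" date +%s)
    here=$(date +%s)
    echo "$h skew=$((remote - here))s"
done < osd-hosts.txt
# Any skew between the mons themselves would also show up here:
ceph health detail | grep -i clock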
--
Kevin Hrpcek
Linux Systems Administrator
Space Science & Engineering Center
University of Wisconsin-Madison