Hello,

I've had a Luminous -> Mimic upgrade go very poorly, and my cluster is stuck with almost all PGs down. One problem is that the mons have started re-electing a new quorum leader almost every minute. This makes it hard to monitor the cluster or even run commands against it, since at least half the time a ceph command times out or takes over a minute to return results.

Looking at the debug logs, there appears to be some paxos-related timeout of about a minute. The msg_dispatch thread of the mons often pegs a core at 100% for about a minute (user time, no iowait), and running strace on it shows the process reading through all of the mon DB files (about 6 GB in store.db/*.sst). Does anyone have an idea what this timeout is, or why my mons keep re-electing? One theory I have is that msg_dispatch can't process the SSTs fast enough, hits some health-check timeout, and the mon drops itself from the quorum because it thinks it isn't healthy. I've been thinking of introducing a new mon on hardware with a faster CPU to see whether it can get through the SSTs within the timeout.

My cluster has the mons, MDS, and mgr plus 30 of 41 OSD servers on Mimic, and the remaining 11 OSD servers still on Luminous. The original problem started when I restarted the OSDs on one of the hosts: the cluster reacted poorly to them going down and went into a frenzy of taking down other OSDs and remapping. I eventually got that stable, and the PGs were 90% good with the finish line in sight, when the mons started re-electing every minute. Now I can't keep any decent number of PGs up for more than a few hours. This all started on Wednesday.

In the meantime, here are a few things I'm considering trying, sketched below:
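One thing I'm planning to try first is compacting the mon stores to shrink those SSTs. A rough sketch of what I mean (mon name taken from my logs below; the offline variant assumes the mon daemon is stopped and the default data dir):

    # online: ask the mon to compact its backing store
    ceph tell mon.sephmon2 compact

    # offline (mon stopped): compact the rocksdb store directly
    ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-sephmon2/store.db compact

Given how often ceph commands are timing out for me right now, the offline variant may be the only realistic one.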
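I'm also tempted to stretch the paxos timeouts so the mons stop calling elections while they grind through the store. A sketch for ceph.conf, assuming the "accept timeout" in the log below is derived from mon_lease * mon_accept_timeout_factor; I haven't confirmed that for Mimic, and the values here are guesses:

    [mon]
    # defaults are mon_lease = 5 and mon_accept_timeout_factor = 2.0
    mon_lease = 30
    mon_accept_timeout_factor = 4.0
    # also compact the store every time the mon starts
    mon_compact_on_start = true

This wouldn't fix whatever is making msg_dispatch walk 6 GB of SSTs, but it might keep a quorum alive long enough to run commands.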
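And in case it changes any advice: while the mons are this unstable I'm considering setting the usual flags to stop the OSD flapping and the flood of osd_failure reports (the log below shows 1940 slow ops, the oldest being an osd_failure). Standard commands:

    ceph osd set noout        # don't auto-mark OSDs out while debugging
    ceph osd set nodown       # ignore failure reports marking OSDs down
    ceph osd set norebalance  # pause remapping churn until the mons are stable

(with the matching "ceph osd unset ..." once things settle).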
Any help would be greatly appreciated.

Thanks,
Kevin

--Debug snippet from a mon at reelection time:

2018-09-07 20:08:08.655 7f57b92cd700 20 mon.sephmon2@1(leader).mds e14242 maybe_resize_cluster in 1 max 1
2018-09-07 20:08:08.655 7f57b92cd700 4 mon.sephmon2@1(leader).mds e14242 tick: resetting beacon timeouts due to mon delay (slow election?) of 59.8106s seconds
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(mdsmap 13504..14242) maybe_trim trim_to 13742 would only trim 238 < paxos_service_trim_min 250
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth v120657 auth
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth v120657 check_rotate updated rotating
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(auth 120594..120657) propose_pending
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth v120657 encode_pending v 120658
2018-09-07 20:08:08.655 7f57b92cd700 5 mon.sephmon2@1(leader).paxos(paxos updating c 132917556..132918214) queue_pending_finisher 0x55dce8e5b370
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxos(paxos updating c 132917556..132918214) trigger_propose not active, will propose later
2018-09-07 20:08:08.655 7f57b92cd700 4 mon.sephmon2@1(leader).mgr e2234 tick: resetting beacon timeouts due to mon delay (slow election?) of 59.8844s seconds
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(mgr 1513..2234) maybe_trim trim_to 1734 would only trim 221 < paxos_service_trim_min 250
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).health tick
2018-09-07 20:08:08.655 7f57b92cd700 20 mon.sephmon2@1(leader).health check_member_health
2018-09-07 20:08:08.657 7f57bcdd0700 1 -- 10.1.9.202:6789/0 >> - conn(0x55dcee55be00 :6789 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=447 -
2018-09-07 20:08:08.657 7f57bcdd0700 10 mon.sephmon2@1(leader) e17 ms_verify_authorizer 10.1.9.32:6823/4007 osd protocol 0
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader).health check_member_health avail 79% total 40 GiB, used 8.4 GiB, avail 32 GiB
2018-09-07 20:08:08.662 7f57b92cd700 20 mon.sephmon2@1(leader).health check_leader_health
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(health 1534..1720) maybe_trim trim_to 1715 would only trim 181 < paxos_service_trim_min 250
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader).config tick
2018-09-07 20:08:08.662 7f57b92cd700 20 mon.sephmon2@1(leader) e17 sync_trim_providers
2018-09-07 20:08:08.662 7f57b92cd700 -1 mon.sephmon2@1(leader) e17 get_health_metrics reporting 1940 slow ops, oldest is osd_failure(failed timeout osd.72 10.1.9.9:6800/68904 for 317sec e987498 v987498)
2018-09-07 20:08:08.662 7f57b92cd700 1 mon.sephmon2@1(leader).paxos(paxos updating c 132917556..132918214) accept timeout, calling fresh election
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 bootstrap
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 sync_reset_requester
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 unregister_cluster_logger
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 cancel_probe_timeout (none scheduled)
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 _reset
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 cancel_probe_timeout (none scheduled)
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 timecheck_finish
2018-09-07 20:08:08.662 7f57b92cd700 15 mon.sephmon2@1(probing) e17 health_tick_stop
2018-09-07 20:08:08.662 7f57b92cd700 15 mon.sephmon2@1(probing) e17 health_interval_stop
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 scrub_event_cancel
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 scrub_reset
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing).paxos(paxos updating c 132917556..132918214) restart -- canceling timeouts
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing).paxosservice(mdsmap 13504..14242) restart
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing).paxosservice(osdmap 917471..987503) restart