Hi Kevin,

I can't think of any major luminous->mimic changes off the top of my head
that would impact CPU usage, but it's always possible there is something
subtle. Can you ceph-post-file the full log from one of your mons
(preferably the leader)?

You might try adjusting the rocksdb cache size... try setting

    rocksdb_cache_size = 1342177280   # 10x the default, ~1.3 GB

on the mons and restarting? (A concrete sketch of both steps follows below
the quoted message.)

Thanks!
sage

On Sat, 8 Sep 2018, Kevin Hrpcek wrote:

> Hello,
>
> I've had a Luminous -> Mimic upgrade go very poorly and my cluster is
> stuck with almost all PGs down. One problem is that the mons have
> started to re-elect a new quorum leader almost every minute. This is
> making it difficult to monitor the cluster and even to run any commands
> on it, since at least half the time a ceph command times out or takes
> over a minute to return results. I've looked at the debug logs and it
> appears there is some timeout of about a minute occurring with paxos.
> The msg_dispatch thread of the mons is often running a core at 100% for
> about a minute (user time, no iowait). Running strace on it shows the
> process going through all of the mon db files (about 6 GB in
> store.db/*.sst). Does anyone have an idea of what this timeout is or
> why my mons keep re-electing? One theory I have is that msg_dispatch
> can't process the SSTs fast enough, hits some timeout for a health
> check, and the mon drops itself from the quorum since it thinks it
> isn't healthy. I've been thinking of introducing a new mon to the
> cluster on hardware with a better CPU to see if that can process the
> SSTs within this timeout.
>
> My cluster has the mons, mds, mgr, and 30/41 osd servers on mimic, and
> 11/41 osd servers on luminous. The original problem started when I
> restarted the osds on one of the hosts. The cluster reacted poorly to
> them going down and went into a frenzy of taking down other osds and
> remapping. I eventually got that stable, and the PGs were 90% good with
> the finish line in sight, and then the mons started their issue of
> re-electing every minute. Now I can't keep any decent number of PGs up
> for more than a few hours. This started on Wednesday...
>
> Any help would be greatly appreciated.
>
> Thanks,
> Kevin
>
> --Debug snippet from a mon at re-election time:
> 2018-09-07 20:08:08.655 7f57b92cd700 20 mon.sephmon2@1(leader).mds e14242 maybe_resize_cluster in 1 max 1
> 2018-09-07 20:08:08.655 7f57b92cd700 4 mon.sephmon2@1(leader).mds e14242 tick: resetting beacon timeouts due to mon delay (slow election?) of 59.8106s seconds
> 2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(mdsmap 13504..14242) maybe_trim trim_to 13742 would only trim 238 < paxos_service_trim_min 250
> 2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth v120657 auth
> 2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth v120657 check_rotate updated rotating
> 2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(auth 120594..120657) propose_pending
> 2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth v120657 encode_pending v 120658
> 2018-09-07 20:08:08.655 7f57b92cd700 5 mon.sephmon2@1(leader).paxos(paxos updating c 132917556..132918214) queue_pending_finisher 0x55dce8e5b370
> 2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxos(paxos updating c 132917556..132918214) trigger_propose not active, will propose later
> 2018-09-07 20:08:08.655 7f57b92cd700 4 mon.sephmon2@1(leader).mgr e2234 tick: resetting beacon timeouts due to mon delay (slow election?) of 59.8844s seconds
> 2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(mgr 1513..2234) maybe_trim trim_to 1734 would only trim 221 < paxos_service_trim_min 250
> 2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).health tick
> 2018-09-07 20:08:08.655 7f57b92cd700 20 mon.sephmon2@1(leader).health check_member_health
> 2018-09-07 20:08:08.657 7f57bcdd0700 1 -- 10.1.9.202:6789/0 >> - conn(0x55dcee55be00 :6789 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=447 -
> 2018-09-07 20:08:08.657 7f57bcdd0700 10 mon.sephmon2@1(leader) e17 ms_verify_authorizer 10.1.9.32:6823/4007 osd protocol 0
> 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader).health check_member_health avail 79% total 40 GiB, used 8.4 GiB, avail 32 GiB
> 2018-09-07 20:08:08.662 7f57b92cd700 20 mon.sephmon2@1(leader).health check_leader_health
> 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(health 1534..1720) maybe_trim trim_to 1715 would only trim 181 < paxos_service_trim_min 250
> 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader).config tick
> 2018-09-07 20:08:08.662 7f57b92cd700 20 mon.sephmon2@1(leader) e17 sync_trim_providers
> 2018-09-07 20:08:08.662 7f57b92cd700 -1 mon.sephmon2@1(leader) e17 get_health_metrics reporting 1940 slow ops, oldest is osd_failure(failed timeout osd.72 10.1.9.9:6800/68904 for 317sec e987498 v987498)
> 2018-09-07 20:08:08.662 7f57b92cd700 1 mon.sephmon2@1(leader).paxos(paxos updating c 132917556..132918214) accept timeout, calling fresh election
> 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 bootstrap
> 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 sync_reset_requester
> 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 unregister_cluster_logger
> 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 cancel_probe_timeout (none scheduled)
> 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 _reset
> 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 cancel_probe_timeout (none scheduled)
> 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 timecheck_finish
> 2018-09-07 20:08:08.662 7f57b92cd700 15 mon.sephmon2@1(probing) e17 health_tick_stop
> 2018-09-07 20:08:08.662 7f57b92cd700 15 mon.sephmon2@1(probing) e17 health_interval_stop
> 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 scrub_event_cancel
> 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 scrub_reset
> 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing).paxos(paxos updating c 132917556..132918214) restart -- canceling timeouts
> 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing).paxosservice(mdsmap 13504..14242) restart
> 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing).paxosservice(osdmap 917471..987503) restart
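For concreteness, here is a sketch of the two suggestions above. The mon id
("sephmon2"), the log path, and a systemd-managed deployment are assumptions;
adjust them for your environment:

    # Upload the leader's full mon log for inspection (path is an assumption
    # based on the default log location):
    ceph-post-file -d "mon re-elections after luminous->mimic upgrade" \
        /var/log/ceph/ceph-mon.sephmon2.log

    # Enlarge the rocksdb cache in /etc/ceph/ceph.conf on each mon host:
    #
    #   [mon]
    #       rocksdb_cache_size = 1342177280   # 10x the default, ~1.3 GB
    #
    # then restart the daemon (unit name assumes a systemd deployment):
    systemctl restart ceph-mon@sephmon2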
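Kevin's strace observation can be reproduced with something along these lines
(the pid lookup and the store path are assumptions for a typical one-mon-per-
host deployment):

    # Watch which files the mon is touching; expect many store.db/*.sst reads:
    strace -f -tt -e trace=open,openat,read -p "$(pidof ceph-mon)" 2>&1 \
        | grep 'store.db' | head -n 20

    # Check how large the mon store has grown:
    du -sh /var/lib/ceph/mon/ceph-*/store.db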