Sage, Unfortunately the mon election problem came back yesterday and it makes it really hard to get a cluster to stay healthy. A brief unexpected network outage occurred and sent the cluster into a frenzy and when I had it 95% healthy the mons started their nonstop reelections. In the previous logs I sent were you able to identify why the mons are constantly electing? The elections seem to be triggered by the below paxos message but do you know which lease timeout is being reached or why the lease isn't renewed instead of calling for an election? One thing I tried was to shutdown the entire cluster and bring up only the mon and mgr. The mons weren't able to hold their quorum with no osds running and the ceph-mon ms_dispatch thread runs at 100% for > 60s at a time. 2018-09-19 03:56:21.729 7f4344ec1700 1 mon.sephmon2@1(peon).paxos(paxos active c 133382665..133383355) lease_timeout -- calling new election Thanks Kevin On 09/10/2018 07:06 AM, Sage Weil wrote:
I took a look at the mon log you sent. A few things I noticed: - The frequent mon elections seem to get only 2/3 mons about half of the time. - The messages coming in a mostly osd_failure, and half of those seem to be recoveries (cancellation of the failure message). It does smell a bit like a networking issue, or some tunable that relates to the messaging layer. It might be worth looking at an OSD log for an osd that reported a failure and seeing what error code it coming up on the failed ping connection? That might provide a useful hint (e.g., ECONNREFUSED vs EMFILE or something). I'd also confirm that with nodown set the mon quorum stabilizes... sage On Mon, 10 Sep 2018, Kevin Hrpcek wrote:Update for the list archive. I went ahead and finished the mimic upgrade with the osds in a fluctuating state of up and down. The cluster did start to normalize a lot easier after everything was on mimic since the random mass OSD heartbeat failures stopped and the constant mon election problem went away. I'm still battling with the cluster reacting poorly to host reboots or small map changes, but I feel like my current pg:osd ratio may be playing a factor in that since we are 2x normal pg count while migrating data to new EC pools. I'm not sure of the root cause but it seems like the mix of luminous and mimic did not play well together for some reason. Maybe it has to do with the scale of my cluster, 871 osd, or maybe I've missed some some tuning as my cluster has scaled to this size. Kevin On 09/09/2018 12:49 PM, Kevin Hrpcek wrote:Nothing too crazy for non default settings. Some of those osd settings were in place while I was testing recovery speeds and need to be brought back closer to defaults. I was setting nodown before but it seems to mask the problem. While its good to stop the osdmap changes, OSDs would come up, get marked up, but at some point go down again (but the process is still running) and still stay up in the map. Then when I'd unset nodown the cluster would immediately mark 250+ osd down again and i'd be back where I started. This morning I went ahead and finished the osd upgrades to mimic to remove that variable. I've looked for networking problems but haven't found any. 2 of the mons are on the same switch. I've also tried combinations of shutting down a mon to see if a single one was the problem, but they keep electing no matter the mix of them that are up. Part of it feels like a networking problem but I haven't been able to find a culprit yet as everything was working normally before starting the upgrade. Other than the constant mon elections, yesterday I had the cluster 95% healthy 3 or 4 times, but it doesn't last long since at some point the OSDs start trying to fail each other through their heartbeats. 2018-09-09 17:37:29.079 7eff774f5700 1 mon.sephmon1@0(leader).osd e991282 prepare_failure osd.39 10.1.9.2:6802/168438 from osd.49 10.1.9.3:6884/317908 is reporting failure:1 2018-09-09 17:37:29.079 7eff774f5700 0 log_channel(cluster) log [DBG] : osd.39 10.1.9.2:6802/168438 reported failed by osd.49 10.1.9.3:6884/317908 2018-09-09 17:37:29.083 7eff774f5700 1 mon.sephmon1@0(leader).osd e991282 prepare_failure osd.93 10.1.9.9:6853/287469 from osd.372 10.1.9.13:6801/275806 is reporting failure:1 I'm working on getting things mostly good again with everything on mimic and will see if it behaves better. Thanks for your input on this David. [global] mon_initial_members = sephmon1, sephmon2, sephmon3 mon_host = 10.1.9.201,10.1.9.202,10.1.9.203 auth_cluster_required = cephx auth_service_required = cephx auth_client_required = cephx filestore_xattr_use_omap = true public_network = 10.1.0.0/16 osd backfill full ratio = 0.92 osd failsafe nearfull ratio = 0.90 osd max object size = 21474836480 mon max pg per osd = 350 [mon] mon warn on legacy crush tunables = false mon pg warn max per osd = 300 mon osd down out subtree limit = host mon osd nearfull ratio = 0.90 mon osd full ratio = 0.97 mon health preluminous compat warning = false osd heartbeat grace = 60 rocksdb cache size = 1342177280 [mds] mds log max segments = 100 mds log max expiring = 40 mds bal fragment size max = 200000 mds cache memory limit = 4294967296 [osd] osd mkfs options xfs = -i size=2048 -d su=512k,sw=1 osd recovery delay start = 30 osd recovery max active = 5 osd max backfills = 3 osd recovery threads = 2 osd crush initial weight = 0 osd heartbeat interval = 30 osd heartbeat grace = 60 On 09/08/2018 11:24 PM, David Turner wrote:What osd/mon/etc config settings do you have that are not default? It might be worth utilizing nodown to stop osds from marking each other down and finish the upgrade to be able to set the minimum osd version to mimic. Stop the osds in a node, manually mark them down, start them back up in mimic. Depending on how bad things are, setting pause on the cluster to just finish the upgrade faster might not be a bad idea either. This should be a simple question, have you confirmed that there are no networking problems between the MONs while the elections are happening? On Sat, Sep 8, 2018, 7:52 PM Kevin Hrpcek <kevin.hrpcek@xxxxxxxxxxxxx <mailto:kevin.hrpcek@xxxxxxxxxxxxx>> wrote: Hey Sage, I've posted the file with my email address for the user. It is with debug_mon 20/20, debug_paxos 20/20, and debug ms 1/5. The mons are calling for elections about every minute so I let this run for a few elections and saw this node become the leader a couple times. Debug logs start around 23:27:30. I had managed to get about 850/857 osds up, but it seems that within the last 30 min it has all gone bad again due to the OSDs reporting each other as failed. We relaxed the osd_heartbeat_interval to 30 and osd_heartbeat_grace to 60 in an attempt to slow down how quickly OSDs are trying to fail each other. I'll put in the rocksdb_cache_size setting. Thanks for taking a look. Kevin On 09/08/2018 06:04 PM, Sage Weil wrote:Hi Kevin, I can't think of any major luminous->mimic changes off the top of my head that would impact CPU usage, but it's always possible there is something subtle. Can you ceph-post-file a the full log from one of your mons (preferbably the leader)? You might try adjusting the rocksdb cache size.. try setting rocksdb_cache_size = 1342177280 # 10x the default, ~1.3 GB on the mons and restarting? Thanks! sage On Sat, 8 Sep 2018, Kevin Hrpcek wrote:Hello, I've had a Luminous -> Mimic upgrade go very poorly and my cluster is stuck with almost all pgs down. One problem is that the mons have started to re-elect a new quorum leader almost every minute. This is making it difficult to monitor the cluster and even run any commands on it since at least half the time a ceph command times out or takes over a minute to return results. I've looked at the debug logs and it appears there is some timeout occurring with paxos of about a minute. The msg_dispatch thread of the mons is often running a core at 100% for about a minute(user time, no iowait). Running strace on it shows the process is going through all of the mon db files (about 6gb in store.db/*.sst). Does anyone have an idea of what this timeout is or why my mons are always reelecting? One theory I have is that the msg_dispatch can't process the SST's fast enough and hits some timeout for a health check and the mon drops itself from the quorum since it thinks it isn't healthy. I've been thinking of introducing a new mon to the cluster on hardware with a better cpu to see if that can process the SSTs within this timeout. My cluster has the mons,mds,mgr and 30/41 osd servers on mimic, and 11/41 osd servers on luminous. The original problem started when I restarted the osds on one of the hosts. The cluster reacted poorly to them going down and went into a frenzy of taking down other osds and remapping. I eventually got that stable and the PGs were 90% good with the finish line in sight and then the mons started their issue of releecting every minute. Now I can't keep any decent amount of PGs up for more than a few hours. This started on Wednesday..... Any help would be greatly appreciated. Thanks, Kevin --Debug snippet from a mon at reelection time 2018-09-07 20:08:08.655 7f57b92cd700 20 mon.sephmon2@1(leader).mds e14242 maybe_resize_cluster in 1 max 1 2018-09-07 20:08:08.655 7f57b92cd700 4 mon.sephmon2@1(leader).mds e14242 tick: resetting beacon timeouts due to mon delay (slow election?) of 59.8106s seconds 2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(mdsmap 13504..14242) maybe_trim trim_to 13742 would only trim 238 < paxos_service_trim_min 250 2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth v120657 auth 2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth v120657 check_rotate updated rotating 2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(auth 120594..120657) propose_pending 2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth v120657 encode_pending v 120658 2018-09-07 20:08:08.655 7f57b92cd700 5 mon.sephmon2@1(leader).paxos(paxos updating c 132917556..132918214) queue_pending_finisher 0x55dce8e5b370 2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxos(paxos updating c 132917556..132918214) trigger_propose not active, will propose later 2018-09-07 20:08:08.655 7f57b92cd700 4 mon.sephmon2@1(leader).mgr e2234 tick: resetting beacon timeouts due to mon delay (slow election?) of 59.8844s seconds 2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(mgr 1513..2234) maybe_trim trim_to 1734 would only trim 221 < paxos_service_trim_min 250 2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).health tick 2018-09-07 20:08:08.655 7f57b92cd700 20 mon.sephmon2@1(leader).health check_member_health 2018-09-07 20:08:08.657 7f57bcdd0700 1 --10.1.9.202:6789/0 <http://10.1.9.202:6789/0> >> - conn(0x55dcee55be00 :6789 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=447 - 2018-09-07 20:08:08.657 7f57bcdd0700 10 mon.sephmon2@1(leader) e17 ms_verify_authorizer10.1.9.32:6823/4007 <http://10.1.9.32:6823/4007> osd protocol 0 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader).health check_member_health avail 79% total 40 GiB, used 8.4 GiB, avail 32 GiB 2018-09-07 20:08:08.662 7f57b92cd700 20 mon.sephmon2@1(leader).health check_leader_health 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(health 1534..1720) maybe_trim trim_to 1715 would only trim 181 < paxos_service_trim_min 250 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader).config tick 2018-09-07 20:08:08.662 7f57b92cd700 20 mon.sephmon2@1(leader) e17 sync_trim_providers 2018-09-07 20:08:08.662 7f57b92cd700 -1 mon.sephmon2@1(leader) e17 get_health_metrics reporting 1940 slow ops, oldest is osd_failure(failed timeout osd.7210.1.9.9:6800/68904 <http://10.1.9.9:6800/68904> for 317sec e987498 v987498) 2018-09-07 20:08:08.662 7f57b92cd700 1 mon.sephmon2@1(leader).paxos(paxos updating c 132917556..132918214) accept timeout, calling fresh election 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 bootstrap 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 sync_reset_requester 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 unregister_cluster_logger 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 cancel_probe_timeout (none scheduled) 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 _reset 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 cancel_probe_timeout (none scheduled) 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 timecheck_finish 2018-09-07 20:08:08.662 7f57b92cd700 15 mon.sephmon2@1(probing) e17 health_tick_stop 2018-09-07 20:08:08.662 7f57b92cd700 15 mon.sephmon2@1(probing) e17 health_interval_stop 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 scrub_event_cancel 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 scrub_reset 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing).paxos(paxos updating c 132917556..132918214) restart -- canceling timeouts 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing).paxosservice(mdsmap 13504..14242) restart 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing).paxosservice(osdmap 917471..987503) restart_______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx <mailto:ceph-users@xxxxxxxxxxxxxx> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com |
_______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com