I took a look at the mon log you sent. A few things I noticed:

- The frequent mon elections seem to get only 2/3 mons about half of the time.
- The messages coming in are mostly osd_failure, and half of those seem to be recoveries (cancellations of the failure message).

It does smell a bit like a networking issue, or some tunable that relates to the messaging layer. It might be worth looking at the log of an OSD that reported a failure and seeing what error code comes up on the failed ping connection. That might provide a useful hint (e.g., ECONNREFUSED vs EMFILE or something).

I'd also confirm that with nodown set the mon quorum stabilizes...

sage
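
A minimal sketch of that check, assuming the standard ceph CLI and the mon/osd names that appear later in this thread:

    # keep the mons from marking OSDs down while watching quorum behavior
    ceph osd set nodown

    # watch whether elections stop and the quorum stays stable
    ceph quorum_status
    ceph mon stat
    ceph -s

    # look at a reporting OSD's log for the heartbeat error code
    # (osd.49 and the path are examples, assuming the default log location)
    grep -i heartbeat /var/log/ceph/ceph-osd.49.log | tail

    # restore normal down-marking afterwards
    ceph osd unset nodown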

On Mon, 10 Sep 2018, Kevin Hrpcek wrote:
> Update for the list archive.
>
> I went ahead and finished the mimic upgrade with the osds in a fluctuating state of up and down. The cluster did start to normalize a lot easier after everything was on mimic, since the random mass OSD heartbeat failures stopped and the constant mon election problem went away. I'm still battling with the cluster reacting poorly to host reboots or small map changes, but I feel like my current pg:osd ratio may be playing a factor in that, since we are at 2x the normal pg count while migrating data to new EC pools.
>
> I'm not sure of the root cause, but it seems like the mix of luminous and mimic did not play well together for some reason. Maybe it has to do with the scale of my cluster, 871 osds, or maybe I've missed some tuning as my cluster has scaled to this size.
>
> Kevin
>
> On 09/09/2018 12:49 PM, Kevin Hrpcek wrote:
> > Nothing too crazy for non-default settings. Some of those osd settings were in place while I was testing recovery speeds and need to be brought back closer to defaults. I was setting nodown before, but it seems to mask the problem. While it's good to stop the osdmap changes, OSDs would come up, get marked up, but at some point go down again (though the process is still running) and still stay up in the map. Then, when I'd unset nodown, the cluster would immediately mark 250+ osds down again and I'd be back where I started.
> >
> > This morning I went ahead and finished the osd upgrades to mimic to remove that variable. I've looked for networking problems but haven't found any. Two of the mons are on the same switch. I've also tried combinations of shutting down a mon to see if a single one was the problem, but they keep calling elections no matter which mix of them is up. Part of it feels like a networking problem, but I haven't been able to find a culprit yet, as everything was working normally before starting the upgrade. Other than the constant mon elections, yesterday I had the cluster 95% healthy 3 or 4 times, but it doesn't last long, since at some point the OSDs start trying to fail each other through their heartbeats.
> >
> > 2018-09-09 17:37:29.079 7eff774f5700 1 mon.sephmon1@0(leader).osd e991282 prepare_failure osd.39 10.1.9.2:6802/168438 from osd.49 10.1.9.3:6884/317908 is reporting failure:1
> > 2018-09-09 17:37:29.079 7eff774f5700 0 log_channel(cluster) log [DBG] : osd.39 10.1.9.2:6802/168438 reported failed by osd.49 10.1.9.3:6884/317908
> > 2018-09-09 17:37:29.083 7eff774f5700 1 mon.sephmon1@0(leader).osd e991282 prepare_failure osd.93 10.1.9.9:6853/287469 from osd.372 10.1.9.13:6801/275806 is reporting failure:1
> >
> > I'm working on getting things mostly good again with everything on mimic and will see if it behaves better.
> >
> > Thanks for your input on this David.
> >
> > [global]
> > mon_initial_members = sephmon1, sephmon2, sephmon3
> > mon_host = 10.1.9.201,10.1.9.202,10.1.9.203
> > auth_cluster_required = cephx
> > auth_service_required = cephx
> > auth_client_required = cephx
> > filestore_xattr_use_omap = true
> > public_network = 10.1.0.0/16
> > osd backfill full ratio = 0.92
> > osd failsafe nearfull ratio = 0.90
> > osd max object size = 21474836480
> > mon max pg per osd = 350
> >
> > [mon]
> > mon warn on legacy crush tunables = false
> > mon pg warn max per osd = 300
> > mon osd down out subtree limit = host
> > mon osd nearfull ratio = 0.90
> > mon osd full ratio = 0.97
> > mon health preluminous compat warning = false
> > osd heartbeat grace = 60
> > rocksdb cache size = 1342177280
> >
> > [mds]
> > mds log max segments = 100
> > mds log max expiring = 40
> > mds bal fragment size max = 200000
> > mds cache memory limit = 4294967296
> >
> > [osd]
> > osd mkfs options xfs = -i size=2048 -d su=512k,sw=1
> > osd recovery delay start = 30
> > osd recovery max active = 5
> > osd max backfills = 3
> > osd recovery threads = 2
> > osd crush initial weight = 0
> > osd heartbeat interval = 30
> > osd heartbeat grace = 60
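
These values can also be checked against what the running daemons actually have loaded; a rough sketch, assuming the default admin socket locations and that each command is run on the host where that daemon lives (osd.49 is just an example id from the log above):

    # show every setting that differs from the built-in default
    ceph daemon mon.sephmon1 config diff

    # or check individual values
    ceph daemon osd.49 config get osd_heartbeat_grace
    ceph daemon mon.sephmon1 config show | grep rocksdb_cache_size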

> > On 09/08/2018 11:24 PM, David Turner wrote:
> > > What osd/mon/etc config settings do you have that are not default? It might be worth utilizing nodown to stop osds from marking each other down and finish the upgrade, so you can set the minimum osd version to mimic. Stop the osds in a node, manually mark them down, and start them back up in mimic. Depending on how bad things are, setting pause on the cluster to just finish the upgrade faster might not be a bad idea either.
> > >
> > > This should be a simple question: have you confirmed that there are no networking problems between the MONs while the elections are happening?
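
One way that per-node sequence might look, as a sketch only: it assumes systemd-managed OSDs, placeholder osd ids, and that the flags get cleared once everything is back up on mimic:

    # stop failure reports and (optionally) client IO from churning the maps
    ceph osd set nodown
    ceph osd set pause

    # on the node being upgraded: stop the OSDs, mark them down, upgrade, restart
    systemctl stop ceph-osd.target
    ceph osd down 39 49 93        # placeholder ids for the OSDs on that node
    # ... upgrade the ceph packages on the node to mimic ...
    systemctl start ceph-osd.target

    # once every OSD runs mimic, pin the minimum release and clear the flags
    ceph osd require-osd-release mimic
    ceph osd unset pause
    ceph osd unset nodown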

> > > On Sat, Sep 8, 2018, 7:52 PM Kevin Hrpcek <kevin.hrpcek@xxxxxxxxxxxxx> wrote:
> > >
> > > Hey Sage,
> > >
> > > I've posted the file with my email address for the user. It is with debug_mon 20/20, debug_paxos 20/20, and debug ms 1/5. The mons are calling for elections about every minute, so I let this run for a few elections and saw this node become the leader a couple of times. Debug logs start around 23:27:30. I had managed to get about 850/857 osds up, but it seems that within the last 30 min it has all gone bad again due to the OSDs reporting each other as failed. We relaxed the osd_heartbeat_interval to 30 and osd_heartbeat_grace to 60 in an attempt to slow down how quickly OSDs are trying to fail each other. I'll put in the rocksdb_cache_size setting.
> > >
> > > Thanks for taking a look.
> > >
> > > Kevin
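
For reference, a rough sketch of how those debug levels can be raised on a running mon and the resulting log uploaded; sephmon1 is just the example host from this thread, and the log path assumes the default location:

    # raise mon debug levels without a restart (via the admin socket,
    # which keeps working even when "ceph tell" is slow during elections)
    ceph daemon mon.sephmon1 config set debug_mon 20/20
    ceph daemon mon.sephmon1 config set debug_paxos 20/20
    ceph daemon mon.sephmon1 config set debug_ms 1/5

    # relax the OSD heartbeat settings cluster-wide
    ceph tell osd.* injectargs '--osd_heartbeat_interval 30 --osd_heartbeat_grace 60'

    # upload the log for the developers to look at
    ceph-post-file /var/log/ceph/ceph-mon.sephmon1.log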

> > > On 09/08/2018 06:04 PM, Sage Weil wrote:
> > > > Hi Kevin,
> > > >
> > > > I can't think of any major luminous->mimic changes off the top of my head that would impact CPU usage, but it's always possible there is something subtle. Can you ceph-post-file the full log from one of your mons (preferably the leader)?
> > > >
> > > > You might try adjusting the rocksdb cache size.. try setting
> > > >
> > > > rocksdb_cache_size = 1342177280  # 10x the default, ~1.3 GB
> > > >
> > > > on the mons and restarting?
> > > >
> > > > Thanks!
> > > > sage
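
A minimal sketch of applying that suggestion, assuming ceph.conf is edited by hand on the monitor hosts and the mons run under systemd units named ceph-mon@<hostname>:

    # in ceph.conf on each monitor host
    [mon]
    rocksdb cache size = 1342177280

    # restart the mons one at a time, waiting for quorum to re-form in between
    systemctl restart ceph-mon@sephmon1
    ceph quorum_status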

> > > > On Sat, 8 Sep 2018, Kevin Hrpcek wrote:
> > > > > Hello,
> > > > >
> > > > > I've had a Luminous -> Mimic upgrade go very poorly, and my cluster is stuck with almost all pgs down. One problem is that the mons have started to re-elect a new quorum leader almost every minute. This is making it difficult to monitor the cluster and even run any commands on it, since at least half the time a ceph command times out or takes over a minute to return results. I've looked at the debug logs and it appears there is some timeout of about a minute occurring with paxos. The msg_dispatch thread of the mons is often running a core at 100% for about a minute (user time, no iowait). Running strace on it shows the process is going through all of the mon db files (about 6gb in store.db/*.sst). Does anyone have an idea of what this timeout is or why my mons are always re-electing? One theory I have is that the msg_dispatch thread can't process the SSTs fast enough, hits some timeout for a health check, and the mon drops itself from the quorum since it thinks it isn't healthy. I've been thinking of introducing a new mon to the cluster on hardware with a better cpu to see if that can process the SSTs within this timeout.
> > > > >
> > > > > My cluster has the mons, mds, mgr and 30/41 osd servers on mimic, and 11/41 osd servers on luminous. The original problem started when I restarted the osds on one of the hosts. The cluster reacted poorly to them going down and went into a frenzy of taking down other osds and remapping. I eventually got that stable and the PGs were 90% good with the finish line in sight, and then the mons started their issue of re-electing every minute. Now I can't keep any decent amount of PGs up for more than a few hours. This started on Wednesday.....
> > > > >
> > > > > Any help would be greatly appreciated.
> > > > >
> > > > > Thanks,
> > > > > Kevin
> > > > >
> > > > > --Debug snippet from a mon at reelection time
> > > > > 2018-09-07 20:08:08.655 7f57b92cd700 20 mon.sephmon2@1(leader).mds e14242 maybe_resize_cluster in 1 max 1
> > > > > 2018-09-07 20:08:08.655 7f57b92cd700 4 mon.sephmon2@1(leader).mds e14242 tick: resetting beacon timeouts due to mon delay (slow election?) of 59.8106s seconds
> > > > > 2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(mdsmap 13504..14242) maybe_trim trim_to 13742 would only trim 238 < paxos_service_trim_min 250
> > > > > 2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth v120657 auth
> > > > > 2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth v120657 check_rotate updated rotating
> > > > > 2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(auth 120594..120657) propose_pending
> > > > > 2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth v120657 encode_pending v 120658
> > > > > 2018-09-07 20:08:08.655 7f57b92cd700 5 mon.sephmon2@1(leader).paxos(paxos updating c 132917556..132918214) queue_pending_finisher 0x55dce8e5b370
> > > > > 2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxos(paxos updating c 132917556..132918214) trigger_propose not active, will propose later
> > > > > 2018-09-07 20:08:08.655 7f57b92cd700 4 mon.sephmon2@1(leader).mgr e2234 tick: resetting beacon timeouts due to mon delay (slow election?) of 59.8844s seconds
> > > > > 2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(mgr 1513..2234) maybe_trim trim_to 1734 would only trim 221 < paxos_service_trim_min 250
> > > > > 2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).health tick
> > > > > 2018-09-07 20:08:08.655 7f57b92cd700 20 mon.sephmon2@1(leader).health check_member_health
> > > > > 2018-09-07 20:08:08.657 7f57bcdd0700 1 -- 10.1.9.202:6789/0 >> - conn(0x55dcee55be00 :6789 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=447 -
> > > > > 2018-09-07 20:08:08.657 7f57bcdd0700 10 mon.sephmon2@1(leader) e17 ms_verify_authorizer 10.1.9.32:6823/4007 osd protocol 0
> > > > > 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader).health check_member_health avail 79% total 40 GiB, used 8.4 GiB, avail 32 GiB
> > > > > 2018-09-07 20:08:08.662 7f57b92cd700 20 mon.sephmon2@1(leader).health check_leader_health
> > > > > 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(health 1534..1720) maybe_trim trim_to 1715 would only trim 181 < paxos_service_trim_min 250
> > > > > 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader).config tick
> > > > > 2018-09-07 20:08:08.662 7f57b92cd700 20 mon.sephmon2@1(leader) e17 sync_trim_providers
> > > > > 2018-09-07 20:08:08.662 7f57b92cd700 -1 mon.sephmon2@1(leader) e17 get_health_metrics reporting 1940 slow ops, oldest is osd_failure(failed timeout osd.72 10.1.9.9:6800/68904 for 317sec e987498 v987498)
> > > > > 2018-09-07 20:08:08.662 7f57b92cd700 1 mon.sephmon2@1(leader).paxos(paxos updating c 132917556..132918214) accept timeout, calling fresh election
> > > > > 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 bootstrap
> > > > > 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 sync_reset_requester
> > > > > 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 unregister_cluster_logger
> > > > > 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 cancel_probe_timeout (none scheduled)
> > > > > 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 _reset
> > > > > 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 cancel_probe_timeout (none scheduled)
> > > > > 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 timecheck_finish
> > > > > 2018-09-07 20:08:08.662 7f57b92cd700 15 mon.sephmon2@1(probing) e17 health_tick_stop
> > > > > 2018-09-07 20:08:08.662 7f57b92cd700 15 mon.sephmon2@1(probing) e17 health_interval_stop
> > > > > 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 scrub_event_cancel
> > > > > 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 scrub_reset
> > > > > 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing).paxos(paxos updating c 132917556..132918214) restart -- canceling timeouts
> > > > > 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing).paxosservice(mdsmap 13504..14242) restart
> > > > > 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing).paxosservice(osdmap 917471..987503) restart

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com