I took a look at the mon log you sent. A few things I noticed:

- The frequent mon elections seem to get only 2/3 mons about half of the time.
- The messages coming in are mostly osd_failure, and half of those seem to be recoveries (cancellations of the failure message).

It does smell a bit like a networking issue, or some tunable that relates to the messaging layer. It might be worth looking at the log of an OSD that reported a failure and seeing what error code comes up on the failed ping connection. That might provide a useful hint (e.g., ECONNREFUSED vs EMFILE or something).

I'd also confirm that with nodown set the mon quorum stabilizes...

sage
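
A minimal sketch of that check, assuming the standard ceph CLI and the mon/osd names that appear later in this thread:

    # keep the mons from marking OSDs down while watching quorum behavior
    ceph osd set nodown

    # watch whether elections stop and the quorum stays stable
    ceph quorum_status
    ceph mon stat
    ceph -s

    # look at a reporting OSD's log for the heartbeat error code
    # (osd.49 and the path are examples, assuming the default log location)
    grep -i heartbeat /var/log/ceph/ceph-osd.49.log | tail

    # restore normal down-marking afterwards
    ceph osd unset nodown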

On Mon, 10 Sep 2018, Kevin Hrpcek wrote:
> Update for the list archive.
>
> I went ahead and finished the mimic upgrade with the osds in a fluctuating state of up and down. The cluster did start to normalize a lot easier after everything was on mimic, since the random mass OSD heartbeat failures stopped and the constant mon election problem went away. I'm still battling with the cluster reacting poorly to host reboots or small map changes, but I feel like my current pg:osd ratio may be playing a factor in that, since we are at 2x the normal pg count while migrating data to new EC pools.
>
> I'm not sure of the root cause, but it seems like the mix of luminous and mimic did not play well together for some reason. Maybe it has to do with the scale of my cluster, 871 osds, or maybe I've missed some tuning as my cluster has scaled to this size.
>
> Kevin
>
> On 09/09/2018 12:49 PM, Kevin Hrpcek wrote:
> > Nothing too crazy for non-default settings. Some of those osd settings were in place while I was testing recovery speeds and need to be brought back closer to defaults. I was setting nodown before, but it seems to mask the problem. While it's good to stop the osdmap changes, OSDs would come up, get marked up, but at some point go down again (though the process is still running) and still stay up in the map. Then, when I'd unset nodown, the cluster would immediately mark 250+ osds down again and I'd be back where I started.
> >
> > This morning I went ahead and finished the osd upgrades to mimic to remove that variable. I've looked for networking problems but haven't found any. Two of the mons are on the same switch. I've also tried combinations of shutting down a mon to see if a single one was the problem, but they keep calling elections no matter which mix of them is up. Part of it feels like a networking problem, but I haven't been able to find a culprit yet, as everything was working normally before starting the upgrade. Other than the constant mon elections, yesterday I had the cluster 95% healthy 3 or 4 times, but it doesn't last long, since at some point the OSDs start trying to fail each other through their heartbeats.
> >
> > 2018-09-09 17:37:29.079 7eff774f5700 1 mon.sephmon1@0(leader).osd e991282 prepare_failure osd.39 10.1.9.2:6802/168438 from osd.49 10.1.9.3:6884/317908 is reporting failure:1
> > 2018-09-09 17:37:29.079 7eff774f5700 0 log_channel(cluster) log [DBG] : osd.39 10.1.9.2:6802/168438 reported failed by osd.49 10.1.9.3:6884/317908
> > 2018-09-09 17:37:29.083 7eff774f5700 1 mon.sephmon1@0(leader).osd e991282 prepare_failure osd.93 10.1.9.9:6853/287469 from osd.372 10.1.9.13:6801/275806 is reporting failure:1
> >
> > I'm working on getting things mostly good again with everything on mimic and will see if it behaves better.
> >
> > Thanks for your input on this David.
> >
> > [global]
> > mon_initial_members = sephmon1, sephmon2, sephmon3
> > mon_host = 10.1.9.201,10.1.9.202,10.1.9.203
> > auth_cluster_required = cephx
> > auth_service_required = cephx
> > auth_client_required = cephx
> > filestore_xattr_use_omap = true
> > public_network = 10.1.0.0/16
> > osd backfill full ratio = 0.92
> > osd failsafe nearfull ratio = 0.90
> > osd max object size = 21474836480
> > mon max pg per osd = 350
> >
> > [mon]
> > mon warn on legacy crush tunables = false
> > mon pg warn max per osd = 300
> > mon osd down out subtree limit = host
> > mon osd nearfull ratio = 0.90
> > mon osd full ratio = 0.97
> > mon health preluminous compat warning = false
> > osd heartbeat grace = 60
> > rocksdb cache size = 1342177280
> >
> > [mds]
> > mds log max segments = 100
> > mds log max expiring = 40
> > mds bal fragment size max = 200000
> > mds cache memory limit = 4294967296
> >
> > [osd]
> > osd mkfs options xfs = -i size=2048 -d su=512k,sw=1
> > osd recovery delay start = 30
> > osd recovery max active = 5
> > osd max backfills = 3
> > osd recovery threads = 2
> > osd crush initial weight = 0
> > osd heartbeat interval = 30
> > osd heartbeat grace = 60
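
These values can also be checked against what the running daemons actually have loaded; a rough sketch, assuming the default admin socket locations and that each command is run on the host where that daemon lives (osd.49 is just an example id from the log above):

    # show every setting that differs from the built-in default
    ceph daemon mon.sephmon1 config diff

    # or check individual values
    ceph daemon osd.49 config get osd_heartbeat_grace
    ceph daemon mon.sephmon1 config show | grep rocksdb_cache_size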

> > On 09/08/2018 11:24 PM, David Turner wrote:
> > > What osd/mon/etc config settings do you have that are not default? It might be worth utilizing nodown to stop osds from marking each other down and finish the upgrade, so you can set the minimum osd version to mimic. Stop the osds in a node, manually mark them down, and start them back up in mimic. Depending on how bad things are, setting pause on the cluster to just finish the upgrade faster might not be a bad idea either.
> > >
> > > This should be a simple question: have you confirmed that there are no networking problems between the MONs while the elections are happening?
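
One way that per-node sequence might look, as a sketch only: it assumes systemd-managed OSDs, placeholder osd ids, and that the flags get cleared once everything is back up on mimic:

    # stop failure reports and (optionally) client IO from churning the maps
    ceph osd set nodown
    ceph osd set pause

    # on the node being upgraded: stop the OSDs, mark them down, upgrade, restart
    systemctl stop ceph-osd.target
    ceph osd down 39 49 93        # placeholder ids for the OSDs on that node
    # ... upgrade the ceph packages on the node to mimic ...
    systemctl start ceph-osd.target

    # once every OSD runs mimic, pin the minimum release and clear the flags
    ceph osd require-osd-release mimic
    ceph osd unset pause
    ceph osd unset nodown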

> > > On Sat, Sep 8, 2018, 7:52 PM Kevin Hrpcek <kevin.hrpcek@xxxxxxxxxxxxx> wrote:
> > >
> > > Hey Sage,
> > >
> > > I've posted the file with my email address for the user. It is with debug_mon 20/20, debug_paxos 20/20, and debug ms 1/5. The mons are calling for elections about every minute, so I let this run for a few elections and saw this node become the leader a couple of times. Debug logs start around 23:27:30. I had managed to get about 850/857 osds up, but it seems that within the last 30 min it has all gone bad again due to the OSDs reporting each other as failed. We relaxed the osd_heartbeat_interval to 30 and osd_heartbeat_grace to 60 in an attempt to slow down how quickly OSDs are trying to fail each other. I'll put in the rocksdb_cache_size setting.
> > >
> > > Thanks for taking a look.
> > >
> > > Kevin
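
For reference, a rough sketch of how those debug levels can be raised on a running mon and the resulting log uploaded; sephmon1 is just the example host from this thread, and the log path assumes the default location:

    # raise mon debug levels without a restart (via the admin socket,
    # which keeps working even when "ceph tell" is slow during elections)
    ceph daemon mon.sephmon1 config set debug_mon 20/20
    ceph daemon mon.sephmon1 config set debug_paxos 20/20
    ceph daemon mon.sephmon1 config set debug_ms 1/5

    # relax the OSD heartbeat settings cluster-wide
    ceph tell osd.* injectargs '--osd_heartbeat_interval 30 --osd_heartbeat_grace 60'

    # upload the log for the developers to look at
    ceph-post-file /var/log/ceph/ceph-mon.sephmon1.log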

> > > On 09/08/2018 06:04 PM, Sage Weil wrote:
> > > > Hi Kevin,
> > > >
> > > > I can't think of any major luminous->mimic changes off the top of my head that would impact CPU usage, but it's always possible there is something subtle. Can you ceph-post-file the full log from one of your mons (preferably the leader)?
> > > >
> > > > You might try adjusting the rocksdb cache size.. try setting
> > > >
> > > > rocksdb_cache_size = 1342177280  # 10x the default, ~1.3 GB
> > > >
> > > > on the mons and restarting?
> > > >
> > > > Thanks!
> > > > sage
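
A minimal sketch of applying that suggestion, assuming ceph.conf is edited by hand on the monitor hosts and the mons run under systemd units named ceph-mon@<hostname>:

    # in ceph.conf on each monitor host
    [mon]
    rocksdb cache size = 1342177280

    # restart the mons one at a time, waiting for quorum to re-form in between
    systemctl restart ceph-mon@sephmon1
    ceph quorum_status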

> > > > On Sat, 8 Sep 2018, Kevin Hrpcek wrote:
> > > > > Hello,
> > > > >
> > > > > I've had a Luminous -> Mimic upgrade go very poorly, and my cluster is stuck with almost all pgs down. One problem is that the mons have started to re-elect a new quorum leader almost every minute. This is making it difficult to monitor the cluster and even run any commands on it, since at least half the time a ceph command times out or takes over a minute to return results. I've looked at the debug logs and it appears there is some timeout of about a minute occurring with paxos. The msg_dispatch thread of the mons is often running a core at 100% for about a minute (user time, no iowait). Running strace on it shows the process is going through all of the mon db files (about 6gb in store.db/*.sst). Does anyone have an idea of what this timeout is or why my mons are always re-electing? One theory I have is that the msg_dispatch thread can't process the SSTs fast enough, hits some timeout for a health check, and the mon drops itself from the quorum since it thinks it isn't healthy. I've been thinking of introducing a new mon to the cluster on hardware with a better cpu to see if that can process the SSTs within this timeout.
> > > > >
> > > > > My cluster has the mons, mds, mgr and 30/41 osd servers on mimic, and 11/41 osd servers on luminous. The original problem started when I restarted the osds on one of the hosts. The cluster reacted poorly to them going down and went into a frenzy of taking down other osds and remapping. I eventually got that stable and the PGs were 90% good with the finish line in sight, and then the mons started their issue of re-electing every minute. Now I can't keep any decent amount of PGs up for more than a few hours. This started on Wednesday.....
> > > > >
> > > > > Any help would be greatly appreciated.
> > > > >
> > > > > Thanks,
> > > > > Kevin
> > > > >
> > > > > --Debug snippet from a mon at reelection time
> > > > > 2018-09-07 20:08:08.655 7f57b92cd700 20 mon.sephmon2@1(leader).mds e14242 maybe_resize_cluster in 1 max 1
> > > > > 2018-09-07 20:08:08.655 7f57b92cd700 4 mon.sephmon2@1(leader).mds e14242 tick: resetting beacon timeouts due to mon delay (slow election?) of 59.8106s seconds
> > > > > 2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(mdsmap 13504..14242) maybe_trim trim_to 13742 would only trim 238 < paxos_service_trim_min 250
> > > > > 2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth v120657 auth
> > > > > 2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth v120657 check_rotate updated rotating
> > > > > 2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(auth 120594..120657) propose_pending
> > > > > 2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth v120657 encode_pending v 120658
> > > > > 2018-09-07 20:08:08.655 7f57b92cd700 5 mon.sephmon2@1(leader).paxos(paxos updating c 132917556..132918214) queue_pending_finisher 0x55dce8e5b370
> > > > > 2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxos(paxos updating c 132917556..132918214) trigger_propose not active, will propose later
> > > > > 2018-09-07 20:08:08.655 7f57b92cd700 4 mon.sephmon2@1(leader).mgr e2234 tick: resetting beacon timeouts due to mon delay (slow election?) of 59.8844s seconds
> > > > > 2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(mgr 1513..2234) maybe_trim trim_to 1734 would only trim 221 < paxos_service_trim_min 250
> > > > > 2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).health tick
> > > > > 2018-09-07 20:08:08.655 7f57b92cd700 20 mon.sephmon2@1(leader).health check_member_health
> > > > > 2018-09-07 20:08:08.657 7f57bcdd0700 1 -- 10.1.9.202:6789/0 >> - conn(0x55dcee55be00 :6789 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=447 -
> > > > > 2018-09-07 20:08:08.657 7f57bcdd0700 10 mon.sephmon2@1(leader) e17 ms_verify_authorizer 10.1.9.32:6823/4007 osd protocol 0
> > > > > 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader).health check_member_health avail 79% total 40 GiB, used 8.4 GiB, avail 32 GiB
> > > > > 2018-09-07 20:08:08.662 7f57b92cd700 20 mon.sephmon2@1(leader).health check_leader_health
> > > > > 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(health 1534..1720) maybe_trim trim_to 1715 would only trim 181 < paxos_service_trim_min 250
> > > > > 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader).config tick
> > > > > 2018-09-07 20:08:08.662 7f57b92cd700 20 mon.sephmon2@1(leader) e17 sync_trim_providers
> > > > > 2018-09-07 20:08:08.662 7f57b92cd700 -1 mon.sephmon2@1(leader) e17 get_health_metrics reporting 1940 slow ops, oldest is osd_failure(failed timeout osd.72 10.1.9.9:6800/68904 for 317sec e987498 v987498)
> > > > > 2018-09-07 20:08:08.662 7f57b92cd700 1 mon.sephmon2@1(leader).paxos(paxos updating c 132917556..132918214) accept timeout, calling fresh election
> > > > > 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 bootstrap
> > > > > 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 sync_reset_requester
> > > > > 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 unregister_cluster_logger
> > > > > 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 cancel_probe_timeout (none scheduled)
> > > > > 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 _reset
> > > > > 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 cancel_probe_timeout (none scheduled)
> > > > > 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 timecheck_finish
> > > > > 2018-09-07 20:08:08.662 7f57b92cd700 15 mon.sephmon2@1(probing) e17 health_tick_stop
> > > > > 2018-09-07 20:08:08.662 7f57b92cd700 15 mon.sephmon2@1(probing) e17 health_interval_stop
> > > > > 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 scrub_event_cancel
> > > > > 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 scrub_reset
> > > > > 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing).paxos(paxos updating c 132917556..132918214) restart -- canceling timeouts
> > > > > 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing).paxosservice(mdsmap 13504..14242) restart
> > > > > 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing).paxosservice(osdmap 917471..987503) restart

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com