Nothing too crazy for non-default settings. Some of those osd settings were in place while I was testing recovery speeds and need to be brought back closer to defaults (I've sketched that after the config dump below). I was setting nodown before, but it seems to mask the problem. While it's good to stop the osdmap churn, OSDs would come up, get marked up, and then at some point go down again (with the process still running) yet stay up in the map. Then when I'd unset nodown the cluster would immediately mark 250+ OSDs down again and I'd be back where I started.
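For reference, the flag juggling was just the stock CLI, roughly:

ceph osd set nodown          # keep the mons from marking OSDs down during restarts
ceph osd dump | grep flags   # confirm which cluster flags are currently set
ceph osd unset nodown        # this is the point where 250+ OSDs get marked down at once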
This morning I went ahead and finished the osd upgrades to mimic to remove that variable. I've looked for networking problems but haven't found any. Two of the mons are on the same switch. I've also tried combinations of shutting down a mon to see if a single one was the problem, but they keep calling elections no matter which mix of them is up. Part of it feels like a networking problem, but I haven't been able to find a culprit yet, as everything was working normally before I started the upgrade. Other than the constant mon elections, yesterday I had the cluster 95% healthy three or four times, but it doesn't last long since at some point the OSDs start trying to fail each other through their heartbeats, e.g.:
2018-09-09 17:37:29.079 7eff774f5700 1 mon.sephmon1@0(leader).osd e991282 prepare_failure osd.39 10.1.9.2:6802/168438 from osd.49 10.1.9.3:6884/317908 is reporting failure:1
2018-09-09 17:37:29.079 7eff774f5700 0 log_channel(cluster) log [DBG] : osd.39 10.1.9.2:6802/168438 reported failed by osd.49 10.1.9.3:6884/317908
2018-09-09 17:37:29.083 7eff774f5700 1 mon.sephmon1@0(leader).osd e991282 prepare_failure osd.93 10.1.9.9:6853/287469 from osd.372 10.1.9.13:6801/275806 is reporting failure:1
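To watch this I've mostly been tailing the cluster log and the leader's mon log, and double-checking that the relaxed heartbeat settings actually took effect on the daemons; roughly (the log path assumes a default install):

ceph -w | grep 'reported failed'                                  # failure reports as they come in
grep prepare_failure /var/log/ceph/ceph-mon.sephmon1.log | tail   # same thing from the leader's log
ceph daemon osd.49 config get osd_heartbeat_grace                 # run on the host that owns osd.49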
I'm working on getting things mostly good again with everything on
mimic and will see if it behaves better.
Thanks for your input on this, David.
[global]
mon_initial_members = sephmon1, sephmon2, sephmon3
mon_host = 10.1.9.201,10.1.9.202,10.1.9.203
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
public_network = 10.1.0.0/16
osd backfill full ratio = 0.92
osd failsafe nearfull ratio = 0.90
osd max object size = 21474836480
mon max pg per osd = 350
[mon]
mon warn on legacy crush tunables = false
mon pg warn max per osd = 300
mon osd down out subtree limit = host
mon osd nearfull ratio = 0.90
mon osd full ratio = 0.97
mon health preluminous compat warning = false
osd heartbeat grace = 60
rocksdb cache size = 1342177280
[mds]
mds log max segments = 100
mds log max expiring = 40
mds bal fragment size max = 200000
mds cache memory limit = 4294967296
[osd]
osd mkfs options xfs = -i size=2048 -d su=512k,sw=1
osd recovery delay start = 30
osd recovery max active = 5
osd max backfills = 3
osd recovery threads = 2
osd crush initial weight = 0
osd heartbeat interval = 30
osd heartbeat grace = 60
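The recovery/backfill knobs in [osd] above are the ones I plan to walk back at runtime; roughly this, assuming the stock defaults for these options are still 1/3/0 (worth double-checking):

ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 3 --osd_recovery_delay_start 0'
# and drop the matching lines from ceph.conf so restarts don't reapply them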
On 09/08/2018 11:24 PM, David Turner wrote:
What osd/mon/etc config settings do you have that are not default? It might be worth utilizing nodown to stop osds from marking each other down, and finishing the upgrade so you can set the minimum osd version to mimic. Stop the osds in a node, manually mark them down, then start them back up on mimic. Depending on how bad things are, setting pause on the cluster just to finish the upgrade faster might not be a bad idea either.
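Roughly this, per node (a sketch only; the package upgrade step and unit names depend on your distro/deployment, and the OSD IDs are just examples):

ceph osd set nodown                 # stop OSDs being marked down while daemons bounce
ceph osd set pause                  # optional: halt client IO to finish the upgrade faster
systemctl stop ceph-osd.target      # on the node being upgraded
ceph osd down 39 49 93              # manually mark that node's OSDs down
# ...upgrade the ceph packages on the node to mimic...
systemctl start ceph-osd.target
# once every OSD is running mimic:
ceph osd require-osd-release mimic
ceph osd unset pause
ceph osd unset nodown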
This should be a simple question: have you confirmed that there are no networking problems between the MONs while the elections are happening?
Hey Sage,
I've posted the file with my email address for the user. It is with debug_mon 20/20, debug_paxos 20/20, and debug_ms 1/5. The mons are calling for elections about every minute, so I let this run for a few elections and saw this node become the leader a couple of times. Debug logs start around 23:27:30. I had managed to get about 850/857 osds up, but it seems that within the last 30 min it has all gone bad again due to the OSDs reporting each other as failed. We relaxed osd_heartbeat_interval to 30 and osd_heartbeat_grace to 60 in an attempt to slow down how quickly OSDs try to fail each other. I'll put in the rocksdb_cache_size setting.
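For the record, the heartbeat relaxation was applied roughly like this (plus the matching lines in ceph.conf so it survives restarts; the mons also get the grace value since they act on the failure reports):

ceph tell osd.* injectargs '--osd_heartbeat_interval 30 --osd_heartbeat_grace 60'
ceph tell mon.* injectargs '--osd_heartbeat_grace 60'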
Thanks for taking a look.
Kevin
On 09/08/2018 06:04 PM, Sage Weil wrote:
Hi Kevin,
I can't think of any major luminous->mimic changes off the top of my head
that would impact CPU usage, but it's always possible there is something
subtle. Can you ceph-post-file the full log from one of your mons
(preferably the leader)?
You might try adjusting the rocksdb cache size... try setting
rocksdb_cache_size = 1342177280 # 10x the default, ~1.3 GB
on the mons and restarting?
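Something along these lines, with the log path and mon unit name adjusted to your setup:

ceph-post-file /var/log/ceph/ceph-mon.sephmon1.log   # upload the leader's log for us to look at
# then add under [mon] in ceph.conf and restart each mon in turn:
#   rocksdb_cache_size = 1342177280
systemctl restart ceph-mon@sephmon1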
Thanks!
sage
On Sat, 8 Sep 2018, Kevin Hrpcek wrote:
Hello,
I've had a Luminous -> Mimic upgrade go very poorly, and my cluster is stuck
with almost all PGs down. One problem is that the mons have started to
elect a new quorum leader almost every minute. This is making it difficult
to monitor the cluster and even run any commands on it, since at least half the
time a ceph command times out or takes over a minute to return results. I've
looked at the debug logs and it appears there is some timeout occurring with
paxos of about a minute. The msg_dispatch thread of the mons is often running
a core at 100% for about a minute (user time, no iowait). Running strace on it
shows the process is going through all of the mon db files (about 6 GB in
store.db/*.sst). Does anyone have an idea of what this timeout is or why my
mons are always re-electing? One theory I have is that msg_dispatch can't
process the SSTs fast enough, hits some timeout for a health check, and the
mon drops itself from the quorum since it thinks it isn't healthy. I've been
thinking of introducing a new mon to the cluster on hardware with a better CPU
to see if that can process the SSTs within this timeout.
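For reference, this is roughly how I've been observing it (the mon data path is the package default; adjust if yours differs):

top -H -p $(pidof ceph-mon)                       # per-thread view; msg_dispatch pins a core
strace -f -p $(pidof ceph-mon) 2>&1 | grep sst    # shows it walking store.db/*.sst
du -sh /var/lib/ceph/mon/*/store.db               # mon store size (~6 GB here)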
My cluster has the mons, mds, mgr, and 30/41 osd servers on mimic, and 11/41 osd
servers on luminous. The original problem started when I restarted the osds on
one of the hosts. The cluster reacted poorly to them going down and went into
a frenzy of taking down other osds and remapping. I eventually got that stable
and the PGs were 90% good with the finish line in sight, and then the mons
started their issue of re-electing every minute. Now I can't keep any decent
amount of PGs up for more than a few hours. This started on Wednesday...
Any help would be greatly appreciated.
Thanks,
Kevin
-- Debug snippet from a mon at re-election time
2018-09-07 20:08:08.655 7f57b92cd700 20 mon.sephmon2@1(leader).mds e14242 maybe_resize_cluster in 1 max 1
2018-09-07 20:08:08.655 7f57b92cd700 4 mon.sephmon2@1(leader).mds e14242 tick: resetting beacon timeouts due to mon delay (slow election?) of 59.8106s seconds
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(mdsmap 13504..14242) maybe_trim trim_to 13742 would only trim 238 < paxos_service_trim_min 250
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth v120657 auth
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth v120657 check_rotate updated rotating
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(auth 120594..120657) propose_pending
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth v120657 encode_pending v 120658
2018-09-07 20:08:08.655 7f57b92cd700 5 mon.sephmon2@1(leader).paxos(paxos updating c 132917556..132918214) queue_pending_finisher 0x55dce8e5b370
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxos(paxos updating c 132917556..132918214) trigger_propose not active, will propose later
2018-09-07 20:08:08.655 7f57b92cd700 4 mon.sephmon2@1(leader).mgr e2234 tick: resetting beacon timeouts due to mon delay (slow election?) of 59.8844s seconds
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(mgr 1513..2234) maybe_trim trim_to 1734 would only trim 221 < paxos_service_trim_min 250
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).health tick
2018-09-07 20:08:08.655 7f57b92cd700 20 mon.sephmon2@1(leader).health check_member_health
2018-09-07 20:08:08.657 7f57bcdd0700 1 -- 10.1.9.202:6789/0 >> - conn(0x55dcee55be00 :6789 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=447 -
2018-09-07 20:08:08.657 7f57bcdd0700 10 mon.sephmon2@1(leader) e17 ms_verify_authorizer 10.1.9.32:6823/4007 osd protocol 0
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader).health check_member_health avail 79% total 40 GiB, used 8.4 GiB, avail 32 GiB
2018-09-07 20:08:08.662 7f57b92cd700 20 mon.sephmon2@1(leader).health check_leader_health
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(health 1534..1720) maybe_trim trim_to 1715 would only trim 181 < paxos_service_trim_min 250
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader).config tick
2018-09-07 20:08:08.662 7f57b92cd700 20 mon.sephmon2@1(leader) e17 sync_trim_providers
2018-09-07 20:08:08.662 7f57b92cd700 -1 mon.sephmon2@1(leader) e17 get_health_metrics reporting 1940 slow ops, oldest is osd_failure(failed timeout osd.72 10.1.9.9:6800/68904 for 317sec e987498 v987498)
2018-09-07 20:08:08.662 7f57b92cd700 1 mon.sephmon2@1(leader).paxos(paxos updating c 132917556..132918214) accept timeout, calling fresh election
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 bootstrap
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 sync_reset_requester
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 unregister_cluster_logger
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 cancel_probe_timeout (none scheduled)
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 _reset
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 cancel_probe_timeout (none scheduled)
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 timecheck_finish
2018-09-07 20:08:08.662 7f57b92cd700 15 mon.sephmon2@1(probing) e17 health_tick_stop
2018-09-07 20:08:08.662 7f57b92cd700 15 mon.sephmon2@1(probing) e17 health_interval_stop
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 scrub_event_cancel
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 scrub_reset
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing).paxos(paxos updating c 132917556..132918214) restart -- canceling timeouts
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing).paxosservice(mdsmap 13504..14242) restart
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing).paxosservice(osdmap 917471..987503) restart
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com