I couldn't find any sign of a networking issue at the OS or switch level, and no changes were made at either layer to get the cluster stable again. I looked through a couple of OSD logs; below is a selection of the most frequent errors they were getting.
Maybe something below is more obvious to you.

2018-09-09 18:17:33.245 7feb92079700 2 osd.84 991324 ms_handle_refused con 0x560e428b9800 session 0x560eb26b0060
2018-09-09 18:17:33.245 7feb9307b700 2 osd.84 991324 ms_handle_refused con 0x560ea639f000 session 0x560eb26b0060
2018-09-09 18:18:55.919 7feb9307b700 10 osd.84 991337 heartbeat_reset failed hb con 0x560e424a3600 for osd.20, reopening
2018-09-09 18:18:55.919 7feb9307b700 2 osd.84 991337 ms_handle_refused con 0x560e447df600 session 0x560e9ec37680
2018-09-09 18:18:55.919 7feb92079700 2 osd.84 991337 ms_handle_refused con 0x560e427a5600 session 0x560e9ec37680
2018-09-09 18:18:55.935 7feb92079700 10 osd.84 991337 heartbeat_reset failed hb con 0x560e40afcc00 for osd.18, reopening
2018-09-09 18:18:55.935 7feb92079700 2 osd.84 991337 ms_handle_refused con 0x560e44398c00 session 0x560e6a3a0620
2018-09-09 18:18:55.935 7feb9307b700 2 osd.84 991337 ms_handle_refused con 0x560e42f4ea00 session 0x560e6a3a0620
2018-09-09 18:18:55.939 7feb9307b700 10 osd.84 991337 heartbeat_reset failed hb con 0x560e424c1e00 for osd.9, reopening
2018-09-09 18:18:55.940 7feb9307b700 2 osd.84 991337 ms_handle_refused con 0x560ea4d09600 session 0x560e115e8120
2018-09-09 18:18:55.940 7feb92079700 2 osd.84 991337 ms_handle_refused con 0x560e424a3600 session 0x560e115e8120
2018-09-09 18:18:55.956 7febadf54700 20 osd.84 991337 share_map_peer 0x560e411ca600 already has epoch 991337
2018-09-09 18:24:59.595 7febae755700 10 osd.84 991362 new session 0x560e40b5ce00 con=0x560e42471800 addr=10.1.9.13:6836/2276068
2018-09-09 18:24:59.595 7febae755700 10 osd.84 991362 session 0x560e40b5ce00 osd.376 has caps osdcap[grant(*)] 'allow *'
2018-09-09 18:24:59.596 7feb9407d700 2 osd.84 991362 ms_handle_reset con 0x560e42471800 session 0x560e40b5ce00
2018-09-09 18:24:59.606 7feb9407d700 2 osd.84 991362 ms_handle_refused con 0x560e42d04600 session 0x560e10dfd000
2018-09-09 18:24:59.633 7febad753700 10 osd.84 991362 OSD::ms_get_authorizer type=osd
2018-09-09 18:24:59.633 7febad753700 10 osd.84 991362 ms_get_authorizer bailing, we are shutting down
2018-09-09 18:24:59.633 7febad753700 0 -- 10.1.9.9:6848/4287624 >> 10.1.9.12:6801/2269104 conn(0x560e42326a00 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=18630 cs=1 l=0).handle_connect_reply connect got BADAUTHORIZER
2018-09-09 18:22:56.434 7febadf54700 0 cephx: verify_authorizer could not decrypt ticket info: error: bad magic in decode_decrypt, 3995972256093848467 != 18374858748799134293
2018-09-09 18:22:56.434 7febadf54700 0 -- 10.1.9.9:6848/4287624 >> 10.1.9.12:6801/2269104 conn(0x560e41fad600 :6848 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: got bad authorizer
2018-09-10 03:30:17.324 7ff0ab678700 -1 osd.84 992286 heartbeat_check: no reply from 10.1.9.28:6843 osd.578 since back 2018-09-10 03:15:35.358240 front 2018-09-10 03:15:47.879015 (cutoff 2018-09-10 03:29:17.326329)
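A quick way to tally which of these dominate, as a sketch assuming the default log path:

# count the most frequent error strings in one OSD's log
grep -oE 'ms_handle_refused|heartbeat_reset|ms_handle_reset|BADAUTHORIZER|bad authorizer|heartbeat_check' \
    /var/log/ceph/ceph-osd.84.log | sort | uniq -c | sort -rn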
Kevin

On 09/10/2018 07:06 AM, Sage Weil wrote:
I took a look at the mon log you sent. A few things I noticed:

- The frequent mon elections seem to get only 2/3 mons about half of the time.
- The messages coming in are mostly osd_failure, and half of those seem to be recoveries (cancellations of the failure messages).

It does smell a bit like a networking issue, or some tunable that relates to the messaging layer. It might be worth looking at the log of an OSD that reported a failure and seeing what error code is coming up on the failed ping connection? That might provide a useful hint (e.g., ECONNREFUSED vs EMFILE or something). I'd also confirm that with nodown set the mon quorum stabilizes...

sage

On Mon, 10 Sep 2018, Kevin Hrpcek wrote:

Update for the list archive. I went ahead and finished the mimic upgrade with the osds in a fluctuating state of up and down. The cluster did start to normalize a lot more easily after everything was on mimic, since the random mass OSD heartbeat failures stopped and the constant mon election problem went away. I'm still battling with the cluster reacting poorly to host reboots or small map changes, but I feel like my current pg:osd ratio may be a factor in that, since we are at 2x the normal pg count while migrating data to new EC pools. I'm not sure of the root cause, but it seems like the mix of luminous and mimic did not play well together for some reason. Maybe it has to do with the scale of my cluster, 871 osds, or maybe I've missed some tuning as my cluster has scaled to this size.

Kevin

On 09/09/2018 12:49 PM, Kevin Hrpcek wrote:

Nothing too crazy for non-default settings. Some of those osd settings were in place while I was testing recovery speeds and need to be brought back closer to defaults. I was setting nodown before, but it seems to mask the problem. While it's good to stop the osdmap changes, OSDs would come up, get marked up, but at some point go down again (though the process is still running) and still stay up in the map. Then when I'd unset nodown the cluster would immediately mark 250+ osds down again and I'd be back where I started.

This morning I went ahead and finished the osd upgrades to mimic to remove that variable. I've looked for networking problems but haven't found any. 2 of the mons are on the same switch. I've also tried combinations of shutting down a mon to see if a single one was the problem, but they keep holding elections no matter the mix of them that are up. Part of it feels like a networking problem, but I haven't been able to find a culprit yet, as everything was working normally before starting the upgrade.

Other than the constant mon elections, yesterday I had the cluster 95% healthy 3 or 4 times, but it doesn't last long, since at some point the OSDs start trying to fail each other through their heartbeats:

2018-09-09 17:37:29.079 7eff774f5700 1 mon.sephmon1@0(leader).osd e991282 prepare_failure osd.39 10.1.9.2:6802/168438 from osd.49 10.1.9.3:6884/317908 is reporting failure:1
2018-09-09 17:37:29.079 7eff774f5700 0 log_channel(cluster) log [DBG] : osd.39 10.1.9.2:6802/168438 reported failed by osd.49 10.1.9.3:6884/317908
2018-09-09 17:37:29.083 7eff774f5700 1 mon.sephmon1@0(leader).osd e991282 prepare_failure osd.93 10.1.9.9:6853/287469 from osd.372 10.1.9.13:6801/275806 is reporting failure:1

I'm working on getting things mostly good again with everything on mimic and will see if it behaves better. Thanks for your input on this David.
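For reference, the per-host nodown sequence David describes below would look roughly like this; a sketch, where the OSD IDs are placeholders and systemd-managed daemons are assumed:

ceph osd set nodown                  # stop OSDs from marking each other down
ceph osd set pause                   # optionally pause client IO during the upgrade
systemctl stop ceph-osd.target       # stop all OSDs on one host
ceph osd down 12 13 14               # manually mark that host's OSDs down
# upgrade this host's ceph packages to mimic, then:
systemctl start ceph-osd.target
# once every host is on mimic:
ceph osd unset pause
ceph osd unset nodown
ceph osd require-osd-release mimic   # raise the minimum OSD release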
[global]
mon_initial_members = sephmon1, sephmon2, sephmon3
mon_host = 10.1.9.201,10.1.9.202,10.1.9.203
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
public_network = 10.1.0.0/16
osd backfill full ratio = 0.92
osd failsafe nearfull ratio = 0.90
osd max object size = 21474836480
mon max pg per osd = 350

[mon]
mon warn on legacy crush tunables = false
mon pg warn max per osd = 300
mon osd down out subtree limit = host
mon osd nearfull ratio = 0.90
mon osd full ratio = 0.97
mon health preluminous compat warning = false
osd heartbeat grace = 60
rocksdb cache size = 1342177280

[mds]
mds log max segments = 100
mds log max expiring = 40
mds bal fragment size max = 200000
mds cache memory limit = 4294967296

[osd]
osd mkfs options xfs = -i size=2048 -d su=512k,sw=1
osd recovery delay start = 30
osd recovery max active = 5
osd max backfills = 3
osd recovery threads = 2
osd crush initial weight = 0
osd heartbeat interval = 30
osd heartbeat grace = 60

On 09/08/2018 11:24 PM, David Turner wrote:

What osd/mon/etc config settings do you have that are not default? It might be worth utilizing nodown to stop osds from marking each other down so you can finish the upgrade and set the minimum osd version to mimic. Stop the osds in a node, manually mark them down, start them back up in mimic. Depending on how bad things are, setting pause on the cluster just to finish the upgrade faster might not be a bad idea either. This should be a simple question: have you confirmed that there are no networking problems between the MONs while the elections are happening?

On Sat, Sep 8, 2018, 7:52 PM Kevin Hrpcek <kevin.hrpcek@xxxxxxxxxxxxx> wrote:

Hey Sage,

I've posted the file with my email address for the user. It is with debug_mon 20/20, debug_paxos 20/20, and debug_ms 1/5. The mons are calling for elections about every minute, so I let this run for a few elections and saw this node become the leader a couple of times. Debug logs start around 23:27:30. I had managed to get about 850/857 osds up, but it seems that within the last 30 min it has all gone bad again due to the OSDs reporting each other as failed. We relaxed the osd_heartbeat_interval to 30 and osd_heartbeat_grace to 60 in an attempt to slow down how quickly OSDs are trying to fail each other. I'll put in the rocksdb_cache_size setting. Thanks for taking a look.

Kevin

On 09/08/2018 06:04 PM, Sage Weil wrote:

Hi Kevin,

I can't think of any major luminous->mimic changes off the top of my head that would impact CPU usage, but it's always possible there is something subtle. Can you ceph-post-file the full log from one of your mons (preferably the leader)? You might also try adjusting the rocksdb cache size... try setting

rocksdb_cache_size = 1342177280  # 10x the default, ~1.3 GB

on the mons and restarting?

Thanks!
sage
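Concretely, that would look something like the following; a sketch, assuming systemd-managed mons and using sephmon1 as the mon id:

# in ceph.conf on each mon host
[mon]
rocksdb cache size = 1342177280   # 10x the default, ~1.3 GB

# then restart the daemon
systemctl restart ceph-mon@sephmon1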
On Sat, 8 Sep 2018, Kevin Hrpcek wrote:

Hello,

I've had a Luminous -> Mimic upgrade go very poorly, and my cluster is stuck with almost all pgs down. One problem is that the mons have started to re-elect a new quorum leader almost every minute. This is making it difficult to monitor the cluster and even run any commands on it, since at least half the time a ceph command times out or takes over a minute to return results. I've looked at the debug logs and it appears there is some timeout occurring with paxos of about a minute. The msg_dispatch thread of the mons is often running a core at 100% for about a minute (user time, no iowait). Running strace on it shows the process is going through all of the mon db files (about 6 GB in store.db/*.sst). Does anyone have an idea of what this timeout is or why my mons are always re-electing? One theory I have is that msg_dispatch can't process the SSTs fast enough and hits some timeout for a health check, and the mon drops itself from the quorum since it thinks it isn't healthy. I've been thinking of introducing a new mon to the cluster on hardware with a better cpu to see if that can process the SSTs within this timeout.

My cluster has the mons, mds, mgr and 30/41 osd servers on mimic, and 11/41 osd servers on luminous. The original problem started when I restarted the osds on one of the hosts. The cluster reacted poorly to them going down and went into a frenzy of taking down other osds and remapping. I eventually got that stable and the PGs were 90% good with the finish line in sight, and then the mons started their issue of re-electing every minute. Now I can't keep any decent amount of PGs up for more than a few hours. This started on Wednesday.....

Any help would be greatly appreciated.

Thanks,
Kevin

-- Debug snippet from a mon at re-election time

2018-09-07 20:08:08.655 7f57b92cd700 20 mon.sephmon2@1(leader).mds e14242 maybe_resize_cluster in 1 max 1
2018-09-07 20:08:08.655 7f57b92cd700 4 mon.sephmon2@1(leader).mds e14242 tick: resetting beacon timeouts due to mon delay (slow election?) of 59.8106s seconds
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(mdsmap 13504..14242) maybe_trim trim_to 13742 would only trim 238 < paxos_service_trim_min 250
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth v120657 auth
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth v120657 check_rotate updated rotating
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(auth 120594..120657) propose_pending
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth v120657 encode_pending v 120658
2018-09-07 20:08:08.655 7f57b92cd700 5 mon.sephmon2@1(leader).paxos(paxos updating c 132917556..132918214) queue_pending_finisher 0x55dce8e5b370
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxos(paxos updating c 132917556..132918214) trigger_propose not active, will propose later
2018-09-07 20:08:08.655 7f57b92cd700 4 mon.sephmon2@1(leader).mgr e2234 tick: resetting beacon timeouts due to mon delay (slow election?) of 59.8844s seconds
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(mgr 1513..2234) maybe_trim trim_to 1734 would only trim 221 < paxos_service_trim_min 250
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).health tick
2018-09-07 20:08:08.655 7f57b92cd700 20 mon.sephmon2@1(leader).health check_member_health
2018-09-07 20:08:08.657 7f57bcdd0700 1 -- 10.1.9.202:6789/0 >> - conn(0x55dcee55be00 :6789 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=447 -
2018-09-07 20:08:08.657 7f57bcdd0700 10 mon.sephmon2@1(leader) e17 ms_verify_authorizer 10.1.9.32:6823/4007 osd protocol 0
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader).health check_member_health avail 79% total 40 GiB, used 8.4 GiB, avail 32 GiB
2018-09-07 20:08:08.662 7f57b92cd700 20 mon.sephmon2@1(leader).health check_leader_health
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(health 1534..1720) maybe_trim trim_to 1715 would only trim 181 < paxos_service_trim_min 250
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader).config tick
2018-09-07 20:08:08.662 7f57b92cd700 20 mon.sephmon2@1(leader) e17 sync_trim_providers
2018-09-07 20:08:08.662 7f57b92cd700 -1 mon.sephmon2@1(leader) e17 get_health_metrics reporting 1940 slow ops, oldest is osd_failure(failed timeout osd.72 10.1.9.9:6800/68904 for 317sec e987498 v987498)
2018-09-07 20:08:08.662 7f57b92cd700 1 mon.sephmon2@1(leader).paxos(paxos updating c 132917556..132918214) accept timeout, calling fresh election
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 bootstrap
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 sync_reset_requester
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 unregister_cluster_logger
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 cancel_probe_timeout (none scheduled)
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 _reset
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 cancel_probe_timeout (none scheduled)
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 timecheck_finish
2018-09-07 20:08:08.662 7f57b92cd700 15 mon.sephmon2@1(probing) e17 health_tick_stop
2018-09-07 20:08:08.662 7f57b92cd700 15 mon.sephmon2@1(probing) e17 health_interval_stop
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 scrub_event_cancel
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 scrub_reset
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing).paxos(paxos updating c 132917556..132918214) restart -- canceling timeouts
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing).paxosservice(mdsmap 13504..14242) restart
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing).paxosservice(osdmap 917471..987503) restart
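If the 6 GB store.db theory above is right, one mitigation that might be worth trying (not suggested in this thread, but a standard technique) is compacting the mon store; a sketch, with sephmon2 as the mon id:

# ask a running mon to compact its store
ceph tell mon.sephmon2 compact

# or compact automatically at daemon startup, in ceph.conf:
[mon]
mon compact on start = true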
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com