Re: Mimic upgrade failure

It's hard to tell exactly from the below, but it looks to me like there is 
still a lot of OSDMap reencoding going on.  Take a look at 'ceph features' 
output and see who in the cluster is using pre-luminous features.. I'm 
guessing all of the clients?  For any of those sessions, fetching OSDMaps 
from the cluster will require reencoding.
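
For example (a sketch; the exact JSON layout of 'ceph features' varies a bit
between releases, and the mon id below assumes your mons are named after the
short hostname):

   # summarize connected mons/osds/clients by feature bits and release
   ceph features

   # quick scan: is anything reporting a release older than "luminous"?
   ceph features | grep '"release"'

   # per-session detail on a mon, to see which clients those sessions are
   ceph daemon mon.$(hostname -s) sessions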

If it's all clients (well, non-OSDs), I think we could work around it by 
avoiding the reencode entirely (it is only really there for OSDs, which 
want a perfect OSDMap copy that will match the monitor's CRC).

sage

On Wed, 19 Sep 2018, KEVIN MICHAEL HRPCEK wrote:

> I set mon lease = 30 yesterday and it had no effect on the quorum elections. To give you an idea of how much cpu ms_dispatch is using: since the last mon restart about 7.5 hours ago, the ms_dispatch thread has accumulated 5h 40m of cpu time. Below are 2 snippets from perf top. I took them while ms_dispatch was using 100% of a core; the first uses the pid of the ceph-mon process, the second the pid of the ms_dispatch thread. The last thing is a snippet from stracing the ms_dispatch pid. It is running through all of the sst files.
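> 
> Roughly, the commands used to target that one thread were along these lines (a sketch; the <tid> placeholder is whatever ps prints for the ms_dispatch thread):
> 
> MON_PID=$(pidof ceph-mon)
> ps -L -o tid,comm -p "$MON_PID" | grep ms_dispatch   # find the dispatch thread's TID
> perf top -t <tid>                                    # profile just that thread
> strace -p <tid>                                      # trace its syscalls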
> 
> perf top ceph-mon
> Overhead  Shared Object         Symbol
>   17.71%  libceph-common.so.0   [.] ceph::decode<pg_t, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> >, std::less<pg_t>, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > >
>    6.67%  libceph-common.so.0   [.] ceph::buffer::ptr::release
>    5.35%  libceph-common.so.0   [.] std::_Rb_tree<pg_t, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > >, std::_Select1st<std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > > >, std::less<pg_t>, mempoo
>    5.20%  libceph-common.so.0   [.] ceph::buffer::ptr::append
>    5.12%  libceph-common.so.0   [.] ceph::buffer::list::iterator_impl<false>::copy
>    4.66%  libceph-common.so.0   [.] ceph::buffer::list::append
>    4.33%  libstdc++.so.6.0.19   [.] std::_Rb_tree_increment
>    4.27%  ceph-mon              [.] std::_Rb_tree<pg_t, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > >, std::_Select1st<std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > > >, std::less<pg_t>, mempoo
>    4.18%  libceph-common.so.0   [.] ceph::buffer::list::append
>    3.10%  libceph-common.so.0   [.] ceph::decode<std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> >, denc_traits<std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> >, void> >
>    2.90%  libceph-common.so.0   [.] ceph::encode<pg_t, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> >, std::less<pg_t>, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > >
>    2.56%  libceph-common.so.0   [.] ceph::buffer::ptr::ptr
>    2.50%  libstdc++.so.6.0.19   [.] std::_Rb_tree_insert_and_rebalance
>    2.39%  libceph-common.so.0   [.] ceph::buffer::ptr::copy_out
>    2.33%  libceph-common.so.0   [.] ceph::buffer::list::iterator_impl<false>::advance
>    2.21%  libtcmalloc.so.4.4.5  [.] tcmalloc::CentralFreeList::FetchFromOneSpans
>    1.97%  libtcmalloc.so.4.4.5  [.] tcmalloc::CentralFreeList::ReleaseToSpans
>    1.60%  libceph-common.so.0   [.] crc32_iscsi_00
>    1.42%  libtcmalloc.so.4.4.5  [.] operator new[]
>    1.29%  libceph-common.so.0   [.] ceph::buffer::ptr::unused_tail_length
>    1.28%  libceph-common.so.0   [.] ceph::buffer::list::iterator_impl<false>::copy_shallow
>    1.25%  libceph-common.so.0   [.] ceph::buffer::ptr::raw_length@plt
>    1.06%  libceph-common.so.0   [.] ceph::buffer::ptr::end_c_str
>    1.06%  libceph-common.so.0   [.] ceph::buffer::list::iterator::copy
>    0.99%  libc-2.17.so          [.] __memcpy_ssse3_back
>    0.94%  libc-2.17.so          [.] _IO_default_xsputn
>    0.89%  libceph-common.so.0   [.] ceph::buffer::list::iterator_impl<false>::advance@plt
>    0.87%  libtcmalloc.so.4.4.5  [.] tcmalloc::ThreadCache::ReleaseToCentralCache
>    0.76%  libleveldb.so.1.0.7   [.] leveldb::FindFile
>    0.72%  [vdso]                [.] __vdso_clock_gettime
>    0.67%  ceph-mon              [.] std::_Rb_tree<pg_t, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > >, std::_Select1st<std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > > >, std::less<pg_t>, mempoo
>    0.63%  libtcmalloc.so.4.4.5  [.] tc_deletearray_nothrow
>    0.59%  libceph-common.so.0   [.] ceph::buffer::list::iterator::advance
>    0.52%  libceph-common.so.0   [.] ceph::buffer::list::iterator::get_current_ptr
> 
> 
> perf top ms_dispatch
>   11.88%  libceph-common.so.0   [.] ceph::decode<pg_t, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> >, std::less<pg_t>, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > >
>   11.23%  [kernel]              [k] system_call_after_swapgs
>    9.36%  libceph-common.so.0   [.] crush_hash32_3
>    6.55%  libceph-common.so.0   [.] crush_choose_indep
>    4.39%  [kernel]              [k] smp_call_function_many
>    3.17%  libceph-common.so.0   [.] ceph::buffer::list::append
>    3.03%  libceph-common.so.0   [.] ceph::buffer::list::append
>    3.02%  libceph-common.so.0   [.] std::_Rb_tree<pg_t, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > >, std::_Select1st<std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > > >, std::less<pg_t>, mempoo
>    2.92%  libceph-common.so.0   [.] ceph::buffer::ptr::release
>    2.65%  libceph-common.so.0   [.] ceph::buffer::list::iterator_impl<false>::advance
>    2.57%  ceph-mon              [.] std::_Rb_tree<pg_t, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > >, std::_Select1st<std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > > >, std::less<pg_t>, mempoo
>    2.27%  libceph-common.so.0   [.] ceph::buffer::ptr::ptr
>    1.99%  libstdc++.so.6.0.19   [.] std::_Rb_tree_increment
>    1.93%  libc-2.17.so          [.] __memcpy_ssse3_back
>    1.91%  libceph-common.so.0   [.] ceph::buffer::ptr::append
>    1.87%  libceph-common.so.0   [.] crush_hash32_3@plt
>    1.84%  libceph-common.so.0   [.] ceph::buffer::list::iterator_impl<false>::copy
>    1.75%  libtcmalloc.so.4.4.5  [.] tcmalloc::CentralFreeList::FetchFromOneSpans
>    1.63%  libceph-common.so.0   [.] ceph::encode<pg_t, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> >, std::less<pg_t>, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > >
>    1.57%  libceph-common.so.0   [.] ceph::buffer::ptr::copy_out
>    1.55%  libstdc++.so.6.0.19   [.] std::_Rb_tree_insert_and_rebalance
>    1.47%  libceph-common.so.0   [.] ceph::buffer::ptr::raw_length
>    1.33%  libtcmalloc.so.4.4.5  [.] tc_deletearray_nothrow
>    1.09%  libceph-common.so.0   [.] ceph::decode<std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> >, denc_traits<std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> >, void> >
>    1.07%  libtcmalloc.so.4.4.5  [.] operator new[]
>    1.02%  libceph-common.so.0   [.] ceph::buffer::list::iterator::copy
>    1.01%  libtcmalloc.so.4.4.5  [.] tc_posix_memalign
>    0.85%  ceph-mon              [.] ceph::buffer::ptr::release@plt
>    0.76%  libceph-common.so.0   [.] ceph::buffer::ptr::copy_out@plt
>    0.74%  libceph-common.so.0   [.] crc32_iscsi_00
> 
> strace
> munmap(0x7f2eda736000, 2463941)         = 0
> open("/var/lib/ceph/mon/ceph-sephmon1/store.db/26299339.sst", O_RDONLY) = 429
> stat("/var/lib/ceph/mon/ceph-sephmon1/store.db/26299339.sst", {st_mode=S_IFREG|0644, st_size=1658656, ...}) = 0
> mmap(NULL, 1658656, PROT_READ, MAP_SHARED, 429, 0) = 0x7f2eea87e000
> close(429)                              = 0
> munmap(0x7f2ea8c97000, 2468005)         = 0
> open("/var/lib/ceph/mon/ceph-sephmon1/store.db/26299338.sst", O_RDONLY) = 429
> stat("/var/lib/ceph/mon/ceph-sephmon1/store.db/26299338.sst", {st_mode=S_IFREG|0644, st_size=2484001, ...}) = 0
> mmap(NULL, 2484001, PROT_READ, MAP_SHARED, 429, 0) = 0x7f2eda74b000
> close(429)                              = 0
> munmap(0x7f2ee21dc000, 2472343)         = 0
> 
> Kevin
> 
> 
> On 09/19/2018 06:50 AM, Sage Weil wrote:
> 
> On Wed, 19 Sep 2018, KEVIN MICHAEL HRPCEK wrote:
> 
> 
> Sage,
> 
> Unfortunately the mon election problem came back yesterday and it makes
> it really hard to get a cluster to stay healthy. A brief unexpected
> network outage occurred and sent the cluster into a frenzy, and when I
> had it 95% healthy the mons started their nonstop re-elections. In the
> previous logs I sent were you able to identify why the mons are
> constantly electing? The elections seem to be triggered by the below
> paxos message but do you know which lease timeout is being reached or
> why the lease isn't renewed instead of calling for an election?
> 
> One thing I tried was to shutdown the entire cluster and bring up only
> the mon and mgr. The mons weren't able to hold their quorum with no osds
> running and the ceph-mon ms_dispatch thread runs at 100% for > 60s at a
> time.
> 
> 
> 
> This is odd... with no other daemons running I'm not sure what would be
> eating up the CPU.  Can you run a 'perf top -p `pidof ceph-mon`' (or
> similar) on the machine to see what the process is doing?  You might need
> to install ceph-mon-dbg or ceph-debuginfo to get better symbols.
> 
> 
> 
> 2018-09-19 03:56:21.729 7f4344ec1700 1 mon.sephmon2@1(peon).paxos(paxos
> active c 133382665..133383355) lease_timeout -- calling new election
> 
> 
> 
> A workaround is probably to increase the lease timeout.  Try setting
> mon_lease = 15 (default is 5... could also go higher than 15) in the
> ceph.conf for all of the mons.  This is a bit of a band-aid but should
> help you keep the mons in quorum until we sort out what is going on.
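> 
> For example, in ceph.conf on each mon (restart them one at a time so quorum
> survives), or via injectargs if you want to try it without a restart (it may
> report that a restart is still needed; this is just a sketch of one way to
> apply it):
> 
>   [mon]
>       mon lease = 15
> 
>   ceph tell mon.* injectargs '--mon_lease 15'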
> 
> sage
> 
> 
> 
> 
> 
> Thanks
> Kevin
> 
> On 09/10/2018 07:06 AM, Sage Weil wrote:
> 
> I took a look at the mon log you sent.  A few things I noticed:
> 
> - The frequent mon elections seem to get only 2/3 mons about half of the
> time.
> - The messages coming in are mostly osd_failure, and half of those seem to
> be recoveries (cancellation of the failure message).
> 
> It does smell a bit like a networking issue, or some tunable that relates
> to the messaging layer.  It might be worth looking at an OSD log for an
> osd that reported a failure and seeing what error code it is hitting on the
> failed ping connection?  That might provide a useful hint (e.g.,
> ECONNREFUSED vs EMFILE or something).
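> 
> A rough way to dig that out of the osd logs (the log wording in the patterns
> below is from memory, so treat it as a sketch and adjust the grep as needed):
> 
>   grep -E 'heartbeat_check|ECONNREFUSED|EMFILE' /var/log/ceph/ceph-osd.*.log | tail -n 50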
> 
> I'd also confirm that with nodown set the mon quorum stabilizes...
> 
> sage
> 
> 
> 
> 
> On Mon, 10 Sep 2018, Kevin Hrpcek wrote:
> 
> 
> 
> Update for the list archive.
> 
> I went ahead and finished the mimic upgrade with the osds in a fluctuating
> state of up and down. The cluster did start to normalize a lot easier after
> everything was on mimic since the random mass OSD heartbeat failures stopped
> and the constant mon election problem went away. I'm still battling with the
> cluster reacting poorly to host reboots or small map changes, but I feel like
> my current pg:osd ratio may be a factor in that, since we are at 2x the normal
> pg count while migrating data to new EC pools.
> 
> I'm not sure of the root cause but it seems like the mix of luminous and mimic
> did not play well together for some reason. Maybe it has to do with the scale
> of my cluster, 871 osds, or maybe I've missed some tuning as my cluster
> has scaled to this size.
> 
> Kevin
> 
> 
> On 09/09/2018 12:49 PM, Kevin Hrpcek wrote:
> 
> 
> Nothing too crazy for non default settings. Some of those osd settings were
> in place while I was testing recovery speeds and need to be brought back
> closer to defaults. I was setting nodown before but it seems to mask the
> problem. While its good to stop the osdmap changes, OSDs would come up, get
> marked up, but at some point go down again (but the process is still
> running) and still stay up in the map. Then when I'd unset nodown the
> cluster would immediately mark 250+ osd down again and i'd be back where I
> started.
> 
> This morning I went ahead and finished the osd upgrades to mimic to remove
> that variable. I've looked for networking problems but haven't found any. 2
> of the mons are on the same switch. I've also tried combinations of shutting
> down a mon to see if a single one was the problem, but they keep calling
> elections no matter which mix of them is up. Part of it feels like a networking
> problem but I haven't been able to find a culprit yet as everything was
> working normally before starting the upgrade. Other than the constant mon
> elections, yesterday I had the cluster 95% healthy 3 or 4 times, but it
> doesn't last long since at some point the OSDs start trying to fail each
> other through their heartbeats.
> 2018-09-09 17:37:29.079 7eff774f5700  1 mon.sephmon1@0(leader).osd e991282
> prepare_failure osd.39 10.1.9.2:6802/168438 from osd.49 10.1.9.3:6884/317908
> is reporting failure:1
> 2018-09-09 17:37:29.079 7eff774f5700  0 log_channel(cluster) log [DBG] :
> osd.39 10.1.9.2:6802/168438 reported failed by osd.49 10.1.9.3:6884/317908
> 2018-09-09 17:37:29.083 7eff774f5700  1 mon.sephmon1@0(leader).osd e991282
> prepare_failure osd.93 10.1.9.9:6853/287469 from osd.372
> 10.1.9.13:6801/275806 is reporting failure:1
> 
> I'm working on getting things mostly good again with everything on mimic and
> will see if it behaves better.
> 
> Thanks for your input on this David.
> 
> 
> [global]
> mon_initial_members = sephmon1, sephmon2, sephmon3
> mon_host = 10.1.9.201,10.1.9.202,10.1.9.203
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> filestore_xattr_use_omap = true
> public_network = 10.1.0.0/16
> osd backfill full ratio = 0.92
> osd failsafe nearfull ratio = 0.90
> osd max object size = 21474836480
> mon max pg per osd = 350
> 
> [mon]
> mon warn on legacy crush tunables = false
> mon pg warn max per osd = 300
> mon osd down out subtree limit = host
> mon osd nearfull ratio = 0.90
> mon osd full ratio = 0.97
> mon health preluminous compat warning = false
> osd heartbeat grace = 60
> rocksdb cache size = 1342177280
> 
> [mds]
> mds log max segments = 100
> mds log max expiring = 40
> mds bal fragment size max = 200000
> mds cache memory limit = 4294967296
> 
> [osd]
> osd mkfs options xfs = -i size=2048 -d su=512k,sw=1
> osd recovery delay start = 30
> osd recovery max active = 5
> osd max backfills = 3
> osd recovery threads = 2
> osd crush initial weight = 0
> osd heartbeat interval = 30
> osd heartbeat grace = 60
> 
> 
> On 09/08/2018 11:24 PM, David Turner wrote:
> 
> 
> What osd/mon/etc config settings do you have that are not default? It
> might be worth utilizing nodown to stop osds from marking each other down
> and finish the upgrade to be able to set the minimum osd version to mimic.
> Stop the osds in a node, manually mark them down, start them back up in
> mimic. Depending on how bad things are, setting pause on the cluster to
> just finish the upgrade faster might not be a bad idea either.
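> 
> Per node, that could look something like the following sketch (the osd ids
> are placeholders for whatever is running on the node being upgraded):
> 
>   ceph osd set nodown
>   ceph osd set pause                 # optional: stop client io while finishing
>   systemctl stop ceph-osd.target     # on the node being upgraded
>   ceph osd down 1 2 3                # the ids of the osds just stopped
>   # upgrade the packages on that node to mimic, then:
>   systemctl start ceph-osd.target
>   # once every osd in the cluster is on mimic:
>   ceph osd require-osd-release mimic
>   ceph osd unset pause
>   ceph osd unset nodown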
> 
> This should be a simple question, have you confirmed that there are no
> networking problems between the MONs while the elections are happening?
> 
> On Sat, Sep 8, 2018, 7:52 PM Kevin Hrpcek <kevin.hrpcek@xxxxxxxxxxxxx> wrote:
> 
>     Hey Sage,
> 
>     I've posted the file with my email address for the user. It is
>     with debug_mon 20/20, debug_paxos 20/20, and debug ms 1/5. The
>     mons are calling for elections about every minute so I let this
>     run for a few elections and saw this node become the leader a
>     couple times. Debug logs start around 23:27:30. I had managed to
>     get about 850/857 osds up, but it seems that within the last 30
>     min it has all gone bad again due to the OSDs reporting each
>     other as failed. We relaxed the osd_heartbeat_interval to 30 and
>     osd_heartbeat_grace to 60 in an attempt to slow down how quickly
>     OSDs are trying to fail each other. I'll put in the
>     rocksdb_cache_size setting.
> 
>     Thanks for taking a look.
> 
>     Kevin
> 
>     On 09/08/2018 06:04 PM, Sage Weil wrote:
> 
> 
>     Hi Kevin,
> 
>     I can't think of any major luminous->mimic changes off the top of my
> head
>     that would impact CPU usage, but it's always possible there is
> something
>     subtle.  Can you ceph-post-file the full log from one of your mons
>     (preferably the leader)?
> 
>     You might try adjusting the rocksdb cache size.. try setting
> 
>       rocksdb_cache_size = 1342177280   # 10x the default, ~1.3 GB
> 
>     on the mons and restarting?
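> 
>     e.g. something like this in ceph.conf on each mon, restarting them one
>     at a time so quorum is kept (just one way to apply it):
> 
>       [mon]
>           rocksdb_cache_size = 1342177280
> 
>       systemctl restart ceph-mon.target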
> 
>     Thanks!
>     sage
> 
>     On Sat, 8 Sep 2018, Kevin Hrpcek wrote:
> 
> 
> 
>     Hello,
> 
>     I've had a Luminous -> Mimic upgrade go very poorly and my cluster
> is stuck
>     with almost all pgs down. One problem is that the mons have
> started to
>     re-elect a new quorum leader almost every minute. This is making
> it difficult
>     to monitor the cluster and even run any commands on it since at
> least half the
>     time a ceph command times out or takes over a minute to return
> results. I've
>     looked at the debug logs and it appears there is some timeout
> occurring with
>     paxos of about a minute. The msg_dispatch thread of the mons is
> often running
>     a core at 100% for about a minute(user time, no iowait). Running
> strace on it
>     shows the process is going through all of the mon db files (about
> 6gb in
>     store.db/*.sst). Does anyone have an idea of what this timeout is
> or why my
>     mons are always reelecting? One theory I have is that the
> msg_dispatch can't
>     process the SSTs fast enough and hits some timeout for a health
> check and the
>     mon drops itself from the quorum since it thinks it isn't healthy.
> I've been
>     thinking of introducing a new mon to the cluster on hardware with
> a better cpu
>     to see if that can process the SSTs within this timeout.
> 
>     My cluster has the mons, mds, mgr and 30/41 osd servers on mimic,
> and 11/41 osd
>     servers on luminous. The original problem started when I restarted
> the osds on
>     one of the hosts. The cluster reacted poorly to them going down
> and went into
>     a frenzy of taking down other osds and remapping. I eventually got
> that stable
>     and the PGs were 90% good with the finish line in sight and then
> the mons
>     started their issue of re-electing every minute. Now I can't keep
> any decent
>     amount of PGs up for more than a few hours. This started on
> Wednesday.....
> 
>     Any help would be greatly appreciated.
> 
>     Thanks,
>     Kevin
> 
>     --Debug snippet from a mon at reelection time
>     2018-09-07 20:08:08.655 7f57b92cd700 20 mon.sephmon2@1(leader).mds
> e14242
>     maybe_resize_cluster in 1 max 1
>     2018-09-07 20:08:08.655 7f57b92cd700  4 mon.sephmon2@1(leader).mds
> e14242
>     tick: resetting beacon timeouts due to mon delay (slow election?)
> of 59.8106s
>     seconds
>     2018-09-07 20:08:08.655 7f57b92cd700 10
>     mon.sephmon2@1(leader).paxosservice(mdsmap 13504..14242)
> maybe_trim trim_to
>     13742 would only trim 238 < paxos_service_trim_min 250
>     2018-09-07 20:08:08.655 7f57b92cd700 10
> mon.sephmon2@1(leader).auth v120657
>     auth
>     2018-09-07 20:08:08.655 7f57b92cd700 10
> mon.sephmon2@1(leader).auth v120657
>     check_rotate updated rotating
>     2018-09-07 20:08:08.655 7f57b92cd700 10
>     mon.sephmon2@1(leader).paxosservice(auth 120594..120657)
> propose_pending
>     2018-09-07 20:08:08.655 7f57b92cd700 10
> mon.sephmon2@1(leader).auth v120657
>     encode_pending v 120658
>     2018-09-07 20:08:08.655 7f57b92cd700  5
> mon.sephmon2@1(leader).paxos(paxos
>     updating c 132917556..132918214) queue_pending_finisher
> 0x55dce8e5b370
>     2018-09-07 20:08:08.655 7f57b92cd700 10
> mon.sephmon2@1(leader).paxos(paxos
>     updating c 132917556..132918214) trigger_propose not active, will
> propose
>     later
>     2018-09-07 20:08:08.655 7f57b92cd700  4 mon.sephmon2@1(leader).mgr
> e2234 tick:
>     resetting beacon timeouts due to mon delay (slow election?) of
> 59.8844s
>     seconds
>     2018-09-07 20:08:08.655 7f57b92cd700 10
>     mon.sephmon2@1(leader).paxosservice(mgr 1513..2234) maybe_trim
> trim_to 1734
>     would only trim 221 < paxos_service_trim_min 250
>     2018-09-07 20:08:08.655 7f57b92cd700 10
> mon.sephmon2@1(leader).health tick
>     2018-09-07 20:08:08.655 7f57b92cd700 20
> mon.sephmon2@1(leader).health
>     check_member_health
>     2018-09-07 20:08:08.657 7f57bcdd0700  1 -- 10.1.9.202:6789/0 >> -
>     conn(0x55dcee55be00 :6789 s=STATE_ACCEPTING pgs=0 cs=0
>     l=0)._process_connection sd=447 -
>     2018-09-07 20:08:08.657 7f57bcdd0700 10 mon.sephmon2@1(leader) e17
>     ms_verify_authorizer 10.1.9.32:6823/4007 osd protocol 0
>     2018-09-07 20:08:08.662 7f57b92cd700 10
> mon.sephmon2@1(leader).health
>     check_member_health avail 79% total 40 GiB, used 8.4 GiB, avail 32
> GiB
>     2018-09-07 20:08:08.662 7f57b92cd700 20
> mon.sephmon2@1(leader).health
>     check_leader_health
>     2018-09-07 20:08:08.662 7f57b92cd700 10
>     mon.sephmon2@1(leader).paxosservice(health 1534..1720) maybe_trim
> trim_to 1715
>     would only trim 181 < paxos_service_trim_min 250
>     2018-09-07 20:08:08.662 7f57b92cd700 10
> mon.sephmon2@1(leader).config tick
>     2018-09-07 20:08:08.662 7f57b92cd700 20 mon.sephmon2@1(leader) e17
>     sync_trim_providers
>     2018-09-07 20:08:08.662 7f57b92cd700 -1 mon.sephmon2@1(leader) e17
>     get_health_metrics reporting 1940 slow ops, oldest is
> osd_failure(failed
>     timeout osd.72 10.1.9.9:6800/68904
> for 317sec e987498 v987498)
>     2018-09-07 20:08:08.662 7f57b92cd700  1
> mon.sephmon2@1(leader).paxos(paxos
>     updating c 132917556..132918214) accept timeout, calling fresh
> election
>     2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17
> bootstrap
>     2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17
>     sync_reset_requester
>     2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17
>     unregister_cluster_logger
>     2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17
>     cancel_probe_timeout (none scheduled)
>     2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing)
> e17 _reset
>     2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing)
> e17
>     cancel_probe_timeout (none scheduled)
>     2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing)
> e17
>     timecheck_finish
>     2018-09-07 20:08:08.662 7f57b92cd700 15 mon.sephmon2@1(probing)
> e17
>     health_tick_stop
>     2018-09-07 20:08:08.662 7f57b92cd700 15 mon.sephmon2@1(probing)
> e17
>     health_interval_stop
>     2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing)
> e17
>     scrub_event_cancel
>     2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing)
> e17
>     scrub_reset
>     2018-09-07 20:08:08.662 7f57b92cd700 10
> mon.sephmon2@1(probing).paxos(paxos
>     updating c 132917556..132918214) restart -- canceling timeouts
>     2018-09-07 20:08:08.662 7f57b92cd700 10
>     mon.sephmon2@1(probing).paxosservice(mdsmap 13504..14242) restart
>     2018-09-07 20:08:08.662 7f57b92cd700 10
>     mon.sephmon2@1(probing).paxosservice(osdmap 917471..987503)
> restart
> 
> 
> 
> 
> 
>     _______________________________________________
>     ceph-users mailing list
>     ceph-users@xxxxxxxxxxxxxx
>     http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


