The cluster is healthy and stable. I'll leave a summary for the archive in case anyone else hits a similar problem.

CentOS 7.5
Ceph Mimic 13.2.1
3 mon/mgr/mds hosts, 862 OSDs (41 hosts)

This was all triggered by an unexpected ~1 min network blip on our 10Gbit switch. The Ceph daemons lost connectivity to each other, and once connectivity returned the cluster tried to remap everything while tons of OSDs were being marked down. This was made worse by the OSDs using large amounts of memory while recovering, which led to swapping, hung hosts, and me IPMI-resetting hosts. All of this generated a lot of osdmap changes, and the mons stored all of them without trimming because of the unhealthy PGs.

I was able to get almost all PGs active+clean on a few occasions, but the cluster would fall over again after about 2 hours with cephx auth errors or OSDs trying to mark each other down (the mons appeared to not be rotating the cephx auth keys). Setting 'osd_heartbeat_interval = 30' helped a bit, but I eventually disabled cluster-internal cephx auth with 'auth_cluster_required = none'. That stopped the OSDs from falling over at the 2 hour mark.

From the beginning of this the mons were running 100% on the ms_dispatch thread, constantly reelecting a leader every minute, and not holding a consistent quorum, with paxos lease_timeouts in the logs. ms_dispatch was constantly reading through /var/lib/ceph/mon/mon-$hostname/store.db/*.sst, and strace showed each pass taking anywhere from 60 seconds to a couple of minutes. This was almost all CPU user time with very little iowait. My theory is that the mons failed health checks because they spent so much time rereading the db, which held up other mon tasks and caused the constant reelections.

We eventually reduced the mon reelections by measuring that the average ms_dispatch sst read pass on the rank 0 mon took 65 seconds and setting 'mon_lease = 75' so the paxos lease would outlast a full pass of ms_dispatch running at 100%. I also greatly increased rocksdb_cache_size and leveldb_cache_size on the mons to be big enough to cache the entire db, but that didn't seem to make much difference initially. After working with Sage, he set mon_osd_cache_size = 200000 (default 10). The huge mon_osd_cache_size let the mons cache all osdmaps on the first read, and the ms_dispatch thread was able to use that cache instead of spinning at 100% rereading them every minute. This stopped the constant elections because the mon no longer failed health checks and could complete its other tasks.

Lastly, there were some self-inflicted OSD corruptions from the IPMI resets that needed to be dealt with to get all PGs active+clean, and the cephx change was rolled back so the cluster operates normally again.

Sage, thanks again for your assistance with this.

Kevin

tl;dr: Cache as much as you can.
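For anyone searching the archive later, this is roughly what the mon-side workaround settings described above looked like in ceph.conf. It is a sketch reconstructed from the values mentioned in this thread, not a recommendation for healthy clusters; the cache sizes marked as examples were not given exactly in the summary, and the auth change was a temporary band-aid that was reverted once the cluster was clean.

    [global]
    # temporary band-aid only; reverted once the cluster was healthy again
    # auth_cluster_required = none

    [mon]
    # paxos lease longer than one full ms_dispatch pass over store.db (~65s here)
    mon_lease = 75
    # cache all osdmaps so ms_dispatch stops rereading the sst files every minute
    mon_osd_cache_size = 200000        # default is 10
    # example values only -- pick something larger than your store.db (~6 GB here)
    rocksdb_cache_size = 8589934592
    leveldb_cache_size = 8589934592

    [osd]
    osd_heartbeat_interval = 30

On 09/24/2018 09:24 AM, Sage Weil wrote: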
Hi Kevin, Do you have an update on the state of the cluster? I've opened a ticket http://tracker.ceph.com/issues/36163 to track the likely root cause we identified, and have a PR open at https://github.com/ceph/ceph/pull/24247 Thanks! sage On Thu, 20 Sep 2018, Sage Weil wrote:On Thu, 20 Sep 2018, KEVIN MICHAEL HRPCEK wrote:Top results when both were taken with ms_dispatch at 100%. The mon one changes alot so I've included 3 snapshots of those. I'll update mon_osd_cache_size. After disabling auth_cluster_required and a cluster reboot I am having less problems keeping OSDs in the cluster since they seem to not be having auth problems around the 2 hour uptime mark. The mons still have their problems but 859/861 OSDs are up with 2 crashing. I found a brief mention on a forum or somewhere that the mons will only trim their storedb when the cluster is healthy. If that's true do you think it is likely that once all osds are healthy and unset some no* cluster flags the mons will be able to trim their db and the result will be that ms_dispatch no longer takes to long to churn through the db? Our primary theory here is that ms_dispatch is taking too long and the mons reach a timeout and then reelect in a nonstop cycle.It's the PGs that need to all get healthy (active+clean) before the osdmaps get trimmed. Other health warnigns (e.g. about noout being set) aren't related.ceph-mon 34.24% 34.24% libpthread-2.17.so [.] pthread_rwlock_rdlock + 34.00% 34.00% libceph-common.so.0 [.] crush_hash32_3If this is the -g output you need to hit enter on lines like this to see the call graph... Or you can do 'perf record -g -p <pid>' and then 'perf report --stdio' (or similar) to dump it all to a file, fully expanded. Thanks! sage+ 5.01% 5.01% libceph-common.so.0 [.] ceph::decode<pg_t, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> >, std::less<pg_t>, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t) + 4.66% 0.00% libceph-common.so.0 [.] ceph::buffer::raw_combined::~raw_combined + 2.37% 0.00% [unknown] [.] 0000000000000000 + 2.21% 2.21% libceph-common.so.0 [.] ceph::buffer::ptr::release + 2.14% 2.14% libceph-common.so.0 [.] std::_Rb_tree<pg_t, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > >, std::_Select1st<std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > > >, std::less<p + 1.88% 1.08% libc-2.17.so [.] __memcpy_ssse3_back + 1.70% 1.70% ceph-mon [.] std::_Rb_tree<pg_t, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > >, std::_Select1st<std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > > >, std::less<p + 1.65% 1.65% libceph-common.so.0 [.] ceph::buffer::list::append + 1.54% 1.54% libstdc++.so.6.0.19 [.] std::_Rb_tree_increment 1.44% 1.44% libstdc++.so.6.0.19 [.] std::_Rb_tree_insert_and_rebalance + 1.28% 1.28% libceph-common.so.0 [.] ceph::buffer::ptr::append + 1.23% 1.23% libceph-common.so.0 [.] ceph::buffer::list::append + 0.98% 0.98% libceph-common.so.0 [.] ceph::buffer::ptr::ptr + 0.79% 0.79% libceph-common.so.0 [.] ceph::buffer::list::iterator_impl<false>::copy + 0.79% 0.79% libceph-common.so.0 [.] ceph::buffer::list::iterator_impl<false>::advance + 0.66% 0.66% libceph-common.so.0 [.] ceph::buffer::ptr::unused_tail_length + 0.64% 0.64% libceph-common.so.0 [.] 
ceph::encode<pg_t, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> >, std::less<pg_t>, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t) + 0.60% 0.60% libceph-common.so.0 [.] ceph::buffer::ptr::operator= 0.53% 0.53% libtcmalloc.so.4.4.5 [.] tcmalloc::CentralFreeList::FetchFromOneSpans 0.52% 0.52% libceph-common.so.0 [.] ceph::buffer::ptr::copy_out + 0.51% 0.51% libtcmalloc.so.4.4.5 [.] operator new[] + 0.49% 0.49% libceph-common.so.0 [.] ceph::decode<std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> >, denc_traits<std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> >, void> > 0.49% 0.49% ceph-mon [.] std::_Rb_tree<pg_t, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > >, std::_Select1st<std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > > >, std::less<p + 0.29% 0.29% ceph-mon [.] ceph::buffer::list::contiguous_appender::~contiguous_appender 0.27% 0.27% libceph-common.so.0 [.] ceph::buffer::list::iterator_impl<false>::advance@plt + 0.27% 0.27% [kernel] [k] retint_userspace_restore_args ceph-mon Children Self Shared Object Symbol + 12.36% 12.36% libceph-common.so.0 [.] ceph::buffer::list::append + 11.58% 11.58% libceph-common.so.0 [.] ceph::decode<pg_t, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> >, std::less<pg_t>, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t) + 8.76% 0.00% libceph-common.so.0 [.] ceph::buffer::raw_combined::~raw_combined + 7.10% 7.10% libceph-common.so.0 [.] ceph::buffer::list::append + 6.51% 6.51% libceph-common.so.0 [.] ceph::buffer::ptr::ptr + 6.29% 6.29% libceph-common.so.0 [.] std::_Rb_tree<pg_t, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > >, std::_Select1st<std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > > >, std::less<p + 5.73% 5.73% ceph-mon [.] std::_Rb_tree<pg_t, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > >, std::_Select1st<std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > > >, std::less<p + 5.06% 5.06% libceph-common.so.0 [.] ceph::encode<pg_t, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> >, std::less<pg_t>, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t) + 4.35% 4.35% libceph-common.so.0 [.] ceph::buffer::ptr::release + 3.10% 0.00% libc-2.17.so [.] __memcpy_ssse3_back 2.94% 2.94% libstdc++.so.6.0.19 [.] std::_Rb_tree_insert_and_rebalance + 2.80% 0.00% [unknown] [.] 0000000000000000 + 2.72% 2.72% libceph-common.so.0 [.] ceph::decode<std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> >, denc_traits<std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> >, void> > + 2.35% 2.35% libstdc++.so.6.0.19 [.] std::_Rb_tree_increment 2.22% 2.22% libceph-common.so.0 [.] ceph::buffer::ptr::copy_out 2.09% 2.09% libceph-common.so.0 [.] ceph::buffer::list::iterator_impl<false>::advance@plt + 2.01% 2.01% [kernel] [k] retint_userspace_restore_args + 1.95% 1.95% ceph-mon [.] 
ceph::buffer::list::contiguous_appender::~contiguous_appender + 1.89% 1.89% libceph-common.so.0 [.] crush_hash32_3 1.83% 1.83% libceph-common.so.0 [.] crc32_iscsi_00 + 1.60% 1.60% libceph-common.so.0 [.] ceph::buffer::list::iterator_impl<false>::copy + 1.57% 1.57% libtcmalloc.so.4.4.5 [.] operator new[] + 1.40% 1.40% libceph-common.so.0 [.] ceph::buffer::list::iterator_impl<false>::advance + 1.32% 1.32% libceph-common.so.0 [.] ceph::buffer::ptr::end_c_str + 1.23% 1.23% libceph-common.so.0 [.] ceph::buffer::ptr::append ceph-mon Children Self Shared Object Symbol + 32.46% 0.00% libceph-common.so.0 [.] OSDMap::_pg_to_up_acting_osds + 32.46% 32.46% libceph-common.so.0 [.] crush_init_workspace + 28.13% 28.13% libceph-common.so.0 [.] crush_hash32_3 + 18.91% 0.00% [kernel] [k] system_call_fastpath + 16.23% 0.00% [kernel] [k] unmap_region + 16.23% 16.23% [kernel] [k] unmap_page_range + 10.55% 10.55% [kernel] [k] system_call_after_swapgs + 7.25% 0.00% [unknown] [.] 0000000000000000 + 7.25% 7.25% ceph-mon [.] OSDMap::deepish_copy_from + 1.74% 1.74% libceph-common.so.0 [.] AsyncConnection::read_bulk + 1.45% 1.45% [kernel] [k] ___sys_sendmsg + 1.23% 1.23% [kernel] [k] native_sched_clock + 1.23% 0.00% [kernel] [k] sched_clock + 1.23% 0.00% [kernel] [k] __schedule + 1.23% 0.00% [kernel] [k] __perf_event_task_sched_out + 1.23% 0.00% [kernel] [k] ctx_sched_out + 0.95% 0.95% libceph-common.so.0 [.] ceph::buffer::list::append ms_dispatch thread Children Self Shared Object Symbol + 15.07% 15.07% libceph-common.so.0 [.] ceph::decode<pg_t, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> >, std::less<pg_t>, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t) + 10.24% 0.00% libceph-common.so.0 [.] ceph::buffer::raw_combined::~raw_combined + 7.27% 7.27% ceph-mon [.] std::_Rb_tree<pg_t, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > >, std::_Select1st<std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > > >, std::less<p + 6.19% 6.19% libceph-common.so.0 [.] ceph::buffer::ptr::release + 4.84% 4.84% libstdc++.so.6.0.19 [.] std::_Rb_tree_increment + 4.59% 4.59% libceph-common.so.0 [.] ceph::buffer::list::append + 4.20% 4.20% ceph-mon [.] std::_Rb_tree<std::string, std::string, std::_Identity<std::string>, std::less<std::string>, std::allocator<std::string> >::find + 4.04% 4.04% libceph-common.so.0 [.] std::_Rb_tree<pg_t, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > >, std::_Select1st<std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > > >, std::less<p 3.94% 3.94% libtcmalloc.so.4.4.5 [.] tcmalloc::CentralFreeList::FetchFromOneSpans + 3.71% 3.71% libceph-common.so.0 [.] ceph::buffer::ptr::ptr + 3.43% 3.43% libceph-common.so.0 [.] ceph::buffer::list::append + 3.03% 1.18% libc-2.17.so [.] __memcpy_ssse3_back + 2.66% 0.00% [kernel] [k] do_futex + 2.66% 0.00% [kernel] [k] futex_wait_queue_me + 2.66% 1.15% [kernel] [k] timerqueue_add + 2.49% 2.49% libceph-common.so.0 [.] ceph::decode<std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> >, denc_traits<std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> >, void> > + 2.02% 2.02% libceph-common.so.0 [.] ceph::buffer::ptr::unused_tail_length + 1.99% 1.99% libceph-common.so.0 [.] 
ceph::buffer::list::iterator_impl<false>::copy + 1.67% 0.00% [unknown] [k] 0000000000000000 1.64% 1.64% libstdc++.so.6.0.19 [.] std::_Rb_tree_insert_and_rebalance + 1.57% 1.57% libtcmalloc.so.4.4.5 [.] operator new[] + 1.56% 1.56% libceph-common.so.0 [.] ceph::buffer::ptr::copy_out + 1.55% 1.55% libceph-common.so.0 [.] ceph::buffer::ptr::append 1.53% 1.53% libceph-common.so.0 [.] ceph::buffer::list::iterator_impl<false>::copy@plt + 1.51% 1.51% [kernel] [k] rb_insert_color + 1.36% 1.36% libceph-common.so.0 [.] ceph::buffer::list::iterator_impl<false>::advance + 1.27% 1.27% libceph-common.so.0 [.] ceph::encode<pg_t, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> >, std::less<pg_t>, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t) 1.13% 1.13% ceph-mon [.] std::_Rb_tree<pg_t, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > >, std::_Select1st<std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > > >, std::less<p + 1.10% 1.10% ceph-mon [.] entity_addr_t::encode 1.08% 1.08% libceph-common.so.0 [.] ceph::buffer::ptr::unused_tail_length@plt 1.04% 1.04% libceph-common.so.0 [.] ceph::buffer::list::contiguous_appender::~contiguous_appender@plt 1.03% 1.03% libtcmalloc.so.4.4.5 [.] tcmalloc::ThreadCache::ReleaseToCentralCache + 1.03% 1.03% libtcmalloc.so.4.4.5 [.] tcmalloc::CentralFreeList::ReleaseToSpans 0.92% 0.92% libceph-common.so.0 [.] ceph::buffer::list::iterator::advance On 09/20/2018 08:21 AM, Sage Weil wrote: Can you try 'perf top -g -p <pid>' to see where all of the encoding activity is coming from? I see two possibilities (the mon attempts to cache encoded maps, and the MOSDMap message itself will also reencode if/when that fails). Also: mon_osd_cache_size = 10 by default... try making that 500 or something. sage On Wed, 19 Sep 2018, Kevin Hrpcek wrote: Majority of the clients are luminous with a few kraken stragglers. I looked at ceph features and 'ceph daemon mon.sephmon1 sessions'. Nothing is reporting as having mimic features, all mon,mgr,osd are running 13.2.1 but are reporting luminous features, and majority of the luminous clients are reporting jewel features. I shut down my compute cluster to get rid of majority of the clients that are reporting jewel features, and there is still a lot of time spent by ms_dispatch in ceph::decode<pg_t....... My small mimic test cluster actually shows similar in it's features, mon,mgr,mds,osd all report luminous features yet have 13.2.1 installed, so maybe that is normal. Kevin On 09/19/2018 09:35 AM, Sage Weil wrote: It's hard to tell exactly from the below, but it looks to me like there is still a lot of OSDMap reencoding going on. Take a look at 'ceph features' output and see who in the cluster is using pre-luminous features.. I'm guessing all of the clients? For any of those sessions, fetching OSDMaps from the cluster will require reencoding. If it's all clients (well, non-OSDs), I think we could work around it by avoiding the reencode entirely (it is only really there for OSDs, which want a perfect OSDMap copy that will match the monitor's CRC). sage On Wed, 19 Sep 2018, KEVIN MICHAEL HRPCEK wrote: I set mon lease = 30 yesterday and it had no effect on the quorum election. To give you an idea of how much cpu ms_dispatch is using, from the last mon restart about 7.5 hours ago, the ms_dispatch thread has 5h 40m of cpu time. 
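(For the archive: the thread-level numbers below can be gathered roughly as follows. The exact commands weren't in the original mails, so treat this as a sketch; <ms_dispatch_tid> is the thread id found in the first step.)

    # find the ms_dispatch thread id (TID) and its accumulated CPU time
    MON_PID=$(pidof ceph-mon)
    ps -T -o spid,comm,time -p "$MON_PID" | grep ms_dispatch

    # profile just that thread while it is spinning at 100%
    perf top -g -p <ms_dispatch_tid>

    # watch which store.db sst files it is walking
    strace -p <ms_dispatch_tid> -e trace=open,stat,mmap,munmap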
Below are 2 snippets from perf top. I took them while ms_dispatch was 100% of a core, the first is using the pid of the ceph-mon, the second is the pid of the ms_dispatch thread. The last thing is a snippet from stracing the ms_dispatch pid. It is running through all of the sst files. perf top ceph-mon Overhead Shared Object Symbol 17.71% libceph-common.so.0 [.] ceph::decode<pg_t, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> >, std::less<pg_t>, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > > 6.67% libceph-common.so.0 [.] ceph::buffer::ptr::release 5.35% libceph-common.so.0 [.] std::_Rb_tree<pg_t, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > >, std::_Select1st<std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > > >, std::less<pg_t>, mempoo 5.20% libceph-common.so.0 [.] ceph::buffer::ptr::append 5.12% libceph-common.so.0 [.] ceph::buffer::list::iterator_impl<false>::copy 4.66% libceph-common.so.0 [.] ceph::buffer::list::append 4.33% libstdc++.so.6.0.19 [.] std::_Rb_tree_increment 4.27% ceph-mon [.] std::_Rb_tree<pg_t, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > >, std::_Select1st<std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > > >, std::less<pg_t>, mempoo 4.18% libceph-common.so.0 [.] ceph::buffer::list::append 3.10% libceph-common.so.0 [.] ceph::decode<std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> >, denc_traits<std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> >, void> > 2.90% libceph-common.so.0 [.] ceph::encode<pg_t, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> >, std::less<pg_t>, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > > 2.56% libceph-common.so.0 [.] ceph::buffer::ptr::ptr 2.50% libstdc++.so.6.0.19 [.] std::_Rb_tree_insert_and_rebalance 2.39% libceph-common.so.0 [.] ceph::buffer::ptr::copy_out 2.33% libceph-common.so.0 [.] ceph::buffer::list::iterator_impl<false>::advance 2.21% libtcmalloc.so.4.4.5 [.] tcmalloc::CentralFreeList::FetchFromOneSpans 1.97% libtcmalloc.so.4.4.5 [.] tcmalloc::CentralFreeList::ReleaseToSpans 1.60% libceph-common.so.0 [.] crc32_iscsi_00 1.42% libtcmalloc.so.4.4.5 [.] operator new[] 1.29% libceph-common.so.0 [.] ceph::buffer::ptr::unused_tail_length 1.28% libceph-common.so.0 [.] ceph::buffer::list::iterator_impl<false>::copy_shallow 1.25% libceph-common.so.0 [.] ceph::buffer::ptr::raw_length@plt 1.06% libceph-common.so.0 [.] ceph::buffer::ptr::end_c_str 1.06% libceph-common.so.0 [.] ceph::buffer::list::iterator::copy 0.99% libc-2.17.so [.] __memcpy_ssse3_back 0.94% libc-2.17.so [.] _IO_default_xsputn 0.89% libceph-common.so.0 [.] ceph::buffer::list::iterator_impl<false>::advance@plt 0.87% libtcmalloc.so.4.4.5 [.] tcmalloc::ThreadCache::ReleaseToCentralCache 0.76% libleveldb.so.1.0.7 [.] leveldb::FindFile 0.72% [vdso] [.] __vdso_clock_gettime 0.67% ceph-mon [.] std::_Rb_tree<pg_t, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > >, std::_Select1st<std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > > >, std::less<pg_t>, mempoo 0.63% libtcmalloc.so.4.4.5 [.] 
tc_deletearray_nothrow 0.59% libceph-common.so.0 [.] ceph::buffer::list::iterator::advance 0.52% libceph-common.so.0 [.] ceph::buffer::list::iterator::get_current_ptr perf top ms_dispatch 11.88% libceph-common.so.0 [.] ceph::decode<pg_t, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> >, std::less<pg_t>, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > > 11.23% [kernel] [k] system_call_after_swapgs 9.36% libceph-common.so.0 [.] crush_hash32_3 6.55% libceph-common.so.0 [.] crush_choose_indep 4.39% [kernel] [k] smp_call_function_many 3.17% libceph-common.so.0 [.] ceph::buffer::list::append 3.03% libceph-common.so.0 [.] ceph::buffer::list::append 3.02% libceph-common.so.0 [.] std::_Rb_tree<pg_t, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > >, std::_Select1st<std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > > >, std::less<pg_t>, mempoo 2.92% libceph-common.so.0 [.] ceph::buffer::ptr::release 2.65% libceph-common.so.0 [.] ceph::buffer::list::iterator_impl<false>::advance 2.57% ceph-mon [.] std::_Rb_tree<pg_t, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > >, std::_Select1st<std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > > >, std::less<pg_t>, mempoo 2.27% libceph-common.so.0 [.] ceph::buffer::ptr::ptr 1.99% libstdc++.so.6.0.19 [.] std::_Rb_tree_increment 1.93% libc-2.17.so [.] __memcpy_ssse3_back 1.91% libceph-common.so.0 [.] ceph::buffer::ptr::append 1.87% libceph-common.so.0 [.] crush_hash32_3@plt 1.84% libceph-common.so.0 [.] ceph::buffer::list::iterator_impl<false>::copy 1.75% libtcmalloc.so.4.4.5 [.] tcmalloc::CentralFreeList::FetchFromOneSpans 1.63% libceph-common.so.0 [.] ceph::encode<pg_t, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> >, std::less<pg_t>, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > > 1.57% libceph-common.so.0 [.] ceph::buffer::ptr::copy_out 1.55% libstdc++.so.6.0.19 [.] std::_Rb_tree_insert_and_rebalance 1.47% libceph-common.so.0 [.] ceph::buffer::ptr::raw_length 1.33% libtcmalloc.so.4.4.5 [.] tc_deletearray_nothrow 1.09% libceph-common.so.0 [.] ceph::decode<std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> >, denc_traits<std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> >, void> > 1.07% libtcmalloc.so.4.4.5 [.] operator new[] 1.02% libceph-common.so.0 [.] ceph::buffer::list::iterator::copy 1.01% libtcmalloc.so.4.4.5 [.] tc_posix_memalign 0.85% ceph-mon [.] ceph::buffer::ptr::release@plt 0.76% libceph-common.so.0 [.] ceph::buffer::ptr::copy_out@plt 0.74% libceph-common.so.0 [.] 
crc32_iscsi_00

strace
munmap(0x7f2eda736000, 2463941) = 0
open("/var/lib/ceph/mon/ceph-sephmon1/store.db/26299339.sst", O_RDONLY) = 429
stat("/var/lib/ceph/mon/ceph-sephmon1/store.db/26299339.sst", {st_mode=S_IFREG|0644, st_size=1658656, ...}) = 0
mmap(NULL, 1658656, PROT_READ, MAP_SHARED, 429, 0) = 0x7f2eea87e000
close(429) = 0
munmap(0x7f2ea8c97000, 2468005) = 0
open("/var/lib/ceph/mon/ceph-sephmon1/store.db/26299338.sst", O_RDONLY) = 429
stat("/var/lib/ceph/mon/ceph-sephmon1/store.db/26299338.sst", {st_mode=S_IFREG|0644, st_size=2484001, ...}) = 0
mmap(NULL, 2484001, PROT_READ, MAP_SHARED, 429, 0) = 0x7f2eda74b000
close(429) = 0
munmap(0x7f2ee21dc000, 2472343) = 0

Kevin

On 09/19/2018 06:50 AM, Sage Weil wrote:
On Wed, 19 Sep 2018, KEVIN MICHAEL HRPCEK wrote:
Sage,

Unfortunately the mon election problem came back yesterday and it makes it really hard to get the cluster to stay healthy. A brief unexpected network outage occurred and sent the cluster into a frenzy, and when I had it 95% healthy the mons started their nonstop reelections. In the previous logs I sent, were you able to identify why the mons are constantly electing? The elections seem to be triggered by the below paxos message, but do you know which lease timeout is being reached, or why the lease isn't renewed instead of calling for an election?

One thing I tried was to shut down the entire cluster and bring up only the mon and mgr. The mons weren't able to hold their quorum with no osds running and the ceph-mon ms_dispatch thread runs at 100% for > 60s at a time.

This is odd... with no other daemons running I'm not sure what would be eating up the CPU. Can you run a 'perf top -p `pidof ceph-mon`' (or similar) on the machine to see what the process is doing? You might need to install ceph-mon-dbg or ceph-debuginfo to get better symbols.

2018-09-19 03:56:21.729 7f4344ec1700 1 mon.sephmon2@1(peon).paxos(paxos active c 133382665..133383355) lease_timeout -- calling new election

A workaround is probably to increase the lease timeout. Try setting mon_lease = 15 (default is 5... could also go higher than 15) in the ceph.conf for all of the mons. This is a bit of a band-aid but should help you keep the mons in quorum until we sort out what is going on.

sage

Thanks
Kevin

On 09/10/2018 07:06 AM, Sage Weil wrote:
I took a look at the mon log you sent. A few things I noticed:
- The frequent mon elections seem to get only 2/3 mons about half of the time.
- The messages coming in are mostly osd_failure, and half of those seem to be recoveries (cancellation of the failure message).

It does smell a bit like a networking issue, or some tunable that relates to the messaging layer. It might be worth looking at an OSD log for an osd that reported a failure and seeing what error code is coming up on the failed ping connection? That might provide a useful hint (e.g., ECONNREFUSED vs EMFILE or something). I'd also confirm that with nodown set the mon quorum stabilizes...

sage

On Mon, 10 Sep 2018, Kevin Hrpcek wrote:
Update for the list archive.

I went ahead and finished the mimic upgrade with the osds in a fluctuating state of up and down. The cluster did start to normalize a lot easier after everything was on mimic since the random mass OSD heartbeat failures stopped and the constant mon election problem went away. I'm still battling with the cluster reacting poorly to host reboots or small map changes, but I feel like my current pg:osd ratio may be playing a factor in that since we are 2x normal pg count while migrating data to new EC pools.
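(Side note for the archive: the per-OSD PG count mentioned above is easy to check on a Luminous or newer cluster; a quick sketch, assuming the standard CLI.)

    # the PGS column shows how many placement groups each OSD currently holds
    ceph osd df tree

    # cluster-wide pool/PG totals for a rough pgs-per-osd estimate
    ceph -s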
I'm not sure of the root cause, but it seems like the mix of luminous and mimic did not play well together for some reason. Maybe it has to do with the scale of my cluster, 871 osd, or maybe I've missed some tuning as my cluster has scaled to this size.

Kevin

On 09/09/2018 12:49 PM, Kevin Hrpcek wrote:
Nothing too crazy for non-default settings. Some of those osd settings were in place while I was testing recovery speeds and need to be brought back closer to defaults. I was setting nodown before, but it seems to mask the problem. While it's good to stop the osdmap changes, OSDs would come up, get marked up, but at some point go down again (but the process is still running) and still stay up in the map. Then when I'd unset nodown the cluster would immediately mark 250+ osd down again and I'd be back where I started. This morning I went ahead and finished the osd upgrades to mimic to remove that variable.

I've looked for networking problems but haven't found any. 2 of the mons are on the same switch. I've also tried combinations of shutting down a mon to see if a single one was the problem, but they keep electing no matter the mix of them that are up. Part of it feels like a networking problem, but I haven't been able to find a culprit yet as everything was working normally before starting the upgrade.

Other than the constant mon elections, yesterday I had the cluster 95% healthy 3 or 4 times, but it doesn't last long since at some point the OSDs start trying to fail each other through their heartbeats.

2018-09-09 17:37:29.079 7eff774f5700 1 mon.sephmon1@0(leader).osd e991282 prepare_failure osd.39 10.1.9.2:6802/168438 from osd.49 10.1.9.3:6884/317908 is reporting failure:1
2018-09-09 17:37:29.079 7eff774f5700 0 log_channel(cluster) log [DBG] : osd.39 10.1.9.2:6802/168438 reported failed by osd.49 10.1.9.3:6884/317908
2018-09-09 17:37:29.083 7eff774f5700 1 mon.sephmon1@0(leader).osd e991282 prepare_failure osd.93 10.1.9.9:6853/287469 from osd.372 10.1.9.13:6801/275806 is reporting failure:1

I'm working on getting things mostly good again with everything on mimic and will see if it behaves better. Thanks for your input on this David.

[global]
mon_initial_members = sephmon1, sephmon2, sephmon3
mon_host = 10.1.9.201,10.1.9.202,10.1.9.203
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
public_network = 10.1.0.0/16
osd backfill full ratio = 0.92
osd failsafe nearfull ratio = 0.90
osd max object size = 21474836480
mon max pg per osd = 350

[mon]
mon warn on legacy crush tunables = false
mon pg warn max per osd = 300
mon osd down out subtree limit = host
mon osd nearfull ratio = 0.90
mon osd full ratio = 0.97
mon health preluminous compat warning = false
osd heartbeat grace = 60
rocksdb cache size = 1342177280

[mds]
mds log max segments = 100
mds log max expiring = 40
mds bal fragment size max = 200000
mds cache memory limit = 4294967296

[osd]
osd mkfs options xfs = -i size=2048 -d su=512k,sw=1
osd recovery delay start = 30
osd recovery max active = 5
osd max backfills = 3
osd recovery threads = 2
osd crush initial weight = 0
osd heartbeat interval = 30
osd heartbeat grace = 60

On 09/08/2018 11:24 PM, David Turner wrote:
What osd/mon/etc config settings do you have that are not default? It might be worth utilizing nodown to stop osds from marking each other down and finish the upgrade to be able to set the minimum osd version to mimic. Stop the osds in a node, manually mark them down, start them back up in mimic.
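(For the archive, that per-node procedure would look roughly like the following. The package and unit commands are assumptions for a stock CentOS/systemd install and weren't spelled out in the thread.)

    # keep the map stable while daemons bounce
    ceph osd set nodown

    # on one OSD host at a time: stop the OSDs, mark them down by hand,
    # upgrade the packages, then start them again on mimic
    systemctl stop ceph-osd.target
    for d in /var/lib/ceph/osd/ceph-*; do ceph osd down "${d##*-}"; done   # ids from dir names, assumes default layout
    yum update ceph                                                        # or however packages are rolled on these hosts
    systemctl start ceph-osd.target

    # once every OSD is running mimic
    ceph osd unset nodown
    ceph osd require-osd-release mimic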
Depending on how bad things are, setting pause on the cluster to just finish the upgrade faster might not be a bad idea either. This should be a simple question: have you confirmed that there are no networking problems between the MONs while the elections are happening?

On Sat, Sep 8, 2018, 7:52 PM Kevin Hrpcek <kevin.hrpcek@xxxxxxxxxxxxx> wrote:
Hey Sage,

I've posted the file with my email address for the user. It is with debug_mon 20/20, debug_paxos 20/20, and debug ms 1/5. The mons are calling for elections about every minute, so I let this run for a few elections and saw this node become the leader a couple times. Debug logs start around 23:27:30.

I had managed to get about 850/857 osds up, but it seems that within the last 30 min it has all gone bad again due to the OSDs reporting each other as failed. We relaxed the osd_heartbeat_interval to 30 and osd_heartbeat_grace to 60 in an attempt to slow down how quickly OSDs are trying to fail each other. I'll put in the rocksdb_cache_size setting.

Thanks for taking a look.

Kevin

On 09/08/2018 06:04 PM, Sage Weil wrote:
Hi Kevin,

I can't think of any major luminous->mimic changes off the top of my head that would impact CPU usage, but it's always possible there is something subtle. Can you ceph-post-file the full log from one of your mons (preferably the leader)?

You might try adjusting the rocksdb cache size; try setting

rocksdb_cache_size = 1342177280   # 10x the default, ~1.3 GB

on the mons and restarting?

Thanks!
sage

On Sat, 8 Sep 2018, Kevin Hrpcek wrote:
Hello,

I've had a Luminous -> Mimic upgrade go very poorly and my cluster is stuck with almost all pgs down. One problem is that the mons have started to re-elect a new quorum leader almost every minute. This is making it difficult to monitor the cluster and even run any commands on it, since at least half the time a ceph command times out or takes over a minute to return results. I've looked at the debug logs and it appears there is some timeout occurring with paxos of about a minute. The ms_dispatch thread of the mons is often running a core at 100% for about a minute (user time, no iowait). Running strace on it shows the process is going through all of the mon db files (about 6 GB in store.db/*.sst). Does anyone have an idea of what this timeout is or why my mons are always reelecting? One theory I have is that ms_dispatch can't process the SSTs fast enough, hits some timeout for a health check, and the mon drops itself from the quorum since it thinks it isn't healthy. I've been thinking of introducing a new mon to the cluster on hardware with a better cpu to see if that can process the SSTs within this timeout.

My cluster has the mons, mds, mgr and 30/41 osd servers on mimic, and 11/41 osd servers on luminous. The original problem started when I restarted the osds on one of the hosts. The cluster reacted poorly to them going down and went into a frenzy of taking down other osds and remapping.
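(Archive note: the 'no*' cluster flags that come up elsewhere in this thread are the usual way to damp that kind of flapping and remapping during planned restarts; a sketch only, not something that was necessarily set at this point.)

    # before planned restarts / while daemons are flapping
    ceph osd set noout          # don't start backfill for briefly-down OSDs
    ceph osd set nodown         # ignore failure reports so the osdmap stops churning
    ceph osd set norebalance    # hold off on data movement
    # ... restart / recover daemons ...
    ceph osd unset norebalance
    ceph osd unset nodown
    ceph osd unset noout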
I eventually got that stable and the PGs were 90% good with the finish line in sight, and then the mons started their issue of reelecting every minute. Now I can't keep any decent amount of PGs up for more than a few hours. This started on Wednesday... Any help would be greatly appreciated.

Thanks,
Kevin

--Debug snippet from a mon at reelection time
2018-09-07 20:08:08.655 7f57b92cd700 20 mon.sephmon2@1(leader).mds e14242 maybe_resize_cluster in 1 max 1
2018-09-07 20:08:08.655 7f57b92cd700 4 mon.sephmon2@1(leader).mds e14242 tick: resetting beacon timeouts due to mon delay (slow election?) of 59.8106s seconds
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(mdsmap 13504..14242) maybe_trim trim_to 13742 would only trim 238 < paxos_service_trim_min 250
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth v120657 auth
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth v120657 check_rotate updated rotating
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(auth 120594..120657) propose_pending
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth v120657 encode_pending v 120658
2018-09-07 20:08:08.655 7f57b92cd700 5 mon.sephmon2@1(leader).paxos(paxos updating c 132917556..132918214) queue_pending_finisher 0x55dce8e5b370
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxos(paxos updating c 132917556..132918214) trigger_propose not active, will propose later
2018-09-07 20:08:08.655 7f57b92cd700 4 mon.sephmon2@1(leader).mgr e2234 tick: resetting beacon timeouts due to mon delay (slow election?) of 59.8844s seconds
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(mgr 1513..2234) maybe_trim trim_to 1734 would only trim 221 < paxos_service_trim_min 250
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).health tick
2018-09-07 20:08:08.655 7f57b92cd700 20 mon.sephmon2@1(leader).health check_member_health
2018-09-07 20:08:08.657 7f57bcdd0700 1 -- 10.1.9.202:6789/0 - conn(0x55dcee55be00 :6789 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=447 -
2018-09-07 20:08:08.657 7f57bcdd0700 10 mon.sephmon2@1(leader) e17 ms_verify_authorizer 10.1.9.32:6823/4007 osd protocol 0
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader).health check_member_health avail 79% total 40 GiB, used 8.4 GiB, avail 32 GiB
2018-09-07 20:08:08.662 7f57b92cd700 20 mon.sephmon2@1(leader).health check_leader_health
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(health 1534..1720) maybe_trim trim_to 1715 would only trim 181 < paxos_service_trim_min 250
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader).config tick
2018-09-07 20:08:08.662 7f57b92cd700 20 mon.sephmon2@1(leader) e17 sync_trim_providers
2018-09-07 20:08:08.662 7f57b92cd700 -1 mon.sephmon2@1(leader) e17 get_health_metrics reporting 1940 slow ops, oldest is osd_failure(failed timeout osd.72 10.1.9.9:6800/68904 for 317sec e987498 v987498)
2018-09-07 20:08:08.662 7f57b92cd700 1 mon.sephmon2@1(leader).paxos(paxos updating c 132917556..132918214) accept timeout, calling fresh election
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 bootstrap
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 sync_reset_requester
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 unregister_cluster_logger
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 cancel_probe_timeout (none scheduled)
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 _reset
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 cancel_probe_timeout (none scheduled)
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 timecheck_finish
2018-09-07 20:08:08.662 7f57b92cd700 15 mon.sephmon2@1(probing) e17 health_tick_stop
2018-09-07 20:08:08.662 7f57b92cd700 15 mon.sephmon2@1(probing) e17 health_interval_stop
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 scrub_event_cancel
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 scrub_reset
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing).paxos(paxos updating c 132917556..132918214) restart -- canceling timeouts
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing).paxosservice(mdsmap 13504..14242) restart
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing).paxosservice(osdmap 917471..987503) restart
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com