Hi Kevin,
Do you have an update on the state of the cluster? I've opened a ticket http://tracker.ceph.com/issues/36163 to track the likely root cause we identified, and have a PR open at https://github.com/ceph/ceph/pull/24247
Thanks!
sage
On Thu, 20 Sep 2018, Sage Weil wrote: > On Thu, 20 Sep 2018, KEVIN MICHAEL HRPCEK wrote: > > Top results when both were taken with ms_dispatch at 100%. The mon one > > changes a lot so I've included 3 snapshots of those. I'll update > > mon_osd_cache_size. > > > > After disabling auth_cluster_required and a cluster reboot I am having > > fewer problems keeping OSDs in the cluster since they seem to not be > > having auth problems around the 2 hour uptime mark. The mons still have > > their problems but 859/861 OSDs are up with 2 crashing. I found a brief > > mention on a forum or somewhere that the mons will only trim their > > storedb when the cluster is healthy. If that's true, do you think it is > > likely that once all osds are healthy and I unset some no* cluster flags > > the mons will be able to trim their db and the result will be that > > ms_dispatch no longer takes too long to churn through the db? Our primary > > theory here is that ms_dispatch is taking too long and the mons reach a > > timeout and then reelect in a nonstop cycle. > > It's the PGs that need to all get healthy (active+clean) before the > osdmaps get trimmed. Other health warnings (e.g. about noout being set) > aren't related. > > > ceph-mon > > 34.24% 34.24% libpthread-2.17.so [.] pthread_rwlock_rdlock > > + 34.00% 34.00% libceph-common.so.0 [.] crush_hash32_3 > > If this is the -g output, you need to hit enter on lines like this to see > the call graph... Or you can do 'perf record -g -p <pid>' and then 'perf > report --stdio' (or similar) to dump it all to a file, fully expanded. > > Thanks! > sage > > > + 5.01% 5.01% libceph-common.so.0 [.] ceph::decode<pg_t, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> >, std::less<pg_t>, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t) > > + 4.66% 0.00% libceph-common.so.0 [.] ceph::buffer::raw_combined::~raw_combined > > + 2.37% 0.00% [unknown] [.] 0000000000000000 > > + 2.21% 2.21% libceph-common.so.0 [.] ceph::buffer::ptr::release > > + 2.14% 2.14% libceph-common.so.0 [.] std::_Rb_tree<pg_t, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > >, std::_Select1st<std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > > >, std::less<p > > + 1.88% 1.08% libc-2.17.so [.] __memcpy_ssse3_back > > + 1.70% 1.70% ceph-mon [.] std::_Rb_tree<pg_t, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > >, std::_Select1st<std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > > >, std::less<p > > + 1.65% 1.65% libceph-common.so.0 [.] ceph::buffer::list::append > > + 1.54% 1.54% libstdc++.so.6.0.19 [.] std::_Rb_tree_increment > > 1.44% 1.44% libstdc++.so.6.0.19 [.] std::_Rb_tree_insert_and_rebalance > > + 1.28% 1.28% libceph-common.so.0 [.] ceph::buffer::ptr::append > > + 1.23% 1.23% libceph-common.so.0 [.] ceph::buffer::list::append > > + 0.98% 0.98% libceph-common.so.0 [.] ceph::buffer::ptr::ptr > > + 0.79% 0.79% libceph-common.so.0 [.] ceph::buffer::list::iterator_impl<false>::copy > > + 0.79% 0.79% libceph-common.so.0 [.] 
ceph::buffer::list::iterator_impl<false>::advance > > + 0.66% 0.66% libceph-common.so.0 [.] ceph::buffer::ptr::unused_tail_length > > + 0.64% 0.64% libceph-common.so.0 [.] ceph::encode<pg_t, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> >, std::less<pg_t>, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t) > > + 0.60% 0.60% libceph-common.so.0 [.] ceph::buffer::ptr::operator= > > 0.53% 0.53% libtcmalloc.so.4.4.5 [.] tcmalloc::CentralFreeList::FetchFromOneSpans > > 0.52% 0.52% libceph-common.so.0 [.] ceph::buffer::ptr::copy_out > > + 0.51% 0.51% libtcmalloc.so.4.4.5 [.] operator new[] > > + 0.49% 0.49% libceph-common.so.0 [.] ceph::decode<std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> >, denc_traits<std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> >, void> > > > 0.49% 0.49% ceph-mon [.] std::_Rb_tree<pg_t, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > >, std::_Select1st<std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > > >, std::less<p > > + 0.29% 0.29% ceph-mon [.] ceph::buffer::list::contiguous_appender::~contiguous_appender > > 0.27% 0.27% libceph-common.so.0 [.] ceph::buffer::list::iterator_impl<false>::advance@plt > > + 0.27% 0.27% [kernel] [k] retint_userspace_restore_args > > > > ceph-mon > > Children Self Shared Object Symbol > > + 12.36% 12.36% libceph-common.so.0 [.] ceph::buffer::list::append > > + 11.58% 11.58% libceph-common.so.0 [.] ceph::decode<pg_t, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> >, std::less<pg_t>, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t) > > + 8.76% 0.00% libceph-common.so.0 [.] ceph::buffer::raw_combined::~raw_combined > > + 7.10% 7.10% libceph-common.so.0 [.] ceph::buffer::list::append > > + 6.51% 6.51% libceph-common.so.0 [.] ceph::buffer::ptr::ptr > > + 6.29% 6.29% libceph-common.so.0 [.] std::_Rb_tree<pg_t, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > >, std::_Select1st<std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > > >, std::less<p > > + 5.73% 5.73% ceph-mon [.] std::_Rb_tree<pg_t, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > >, std::_Select1st<std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > > >, std::less<p > > + 5.06% 5.06% libceph-common.so.0 [.] ceph::encode<pg_t, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> >, std::less<pg_t>, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t) > > + 4.35% 4.35% libceph-common.so.0 [.] ceph::buffer::ptr::release > > + 3.10% 0.00% libc-2.17.so [.] __memcpy_ssse3_back > > 2.94% 2.94% libstdc++.so.6.0.19 [.] std::_Rb_tree_insert_and_rebalance > > + 2.80% 0.00% [unknown] [.] 0000000000000000 > > + 2.72% 2.72% libceph-common.so.0 [.] ceph::decode<std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> >, denc_traits<std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> >, void> > > > + 2.35% 2.35% libstdc++.so.6.0.19 [.] std::_Rb_tree_increment > > 2.22% 2.22% libceph-common.so.0 [.] 
ceph::buffer::ptr::copy_out > > 2.09% 2.09% libceph-common.so.0 [.] ceph::buffer::list::iterator_impl<false>::advance@plt > > + 2.01% 2.01% [kernel] [k] retint_userspace_restore_args > > + 1.95% 1.95% ceph-mon [.] ceph::buffer::list::contiguous_appender::~contiguous_appender > > + 1.89% 1.89% libceph-common.so.0 [.] crush_hash32_3 > > 1.83% 1.83% libceph-common.so.0 [.] crc32_iscsi_00 > > + 1.60% 1.60% libceph-common.so.0 [.] ceph::buffer::list::iterator_impl<false>::copy > > + 1.57% 1.57% libtcmalloc.so.4.4.5 [.] operator new[] > > + 1.40% 1.40% libceph-common.so.0 [.] ceph::buffer::list::iterator_impl<false>::advance > > + 1.32% 1.32% libceph-common.so.0 [.] ceph::buffer::ptr::end_c_str > > + 1.23% 1.23% libceph-common.so.0 [.] ceph::buffer::ptr::append > > > > ceph-mon > > Children Self Shared Object Symbol > > + 32.46% 0.00% libceph-common.so.0 [.] OSDMap::_pg_to_up_acting_osds > > + 32.46% 32.46% libceph-common.so.0 [.] crush_init_workspace > > + 28.13% 28.13% libceph-common.so.0 [.] crush_hash32_3 > > + 18.91% 0.00% [kernel] [k] system_call_fastpath > > + 16.23% 0.00% [kernel] [k] unmap_region > > + 16.23% 16.23% [kernel] [k] unmap_page_range > > + 10.55% 10.55% [kernel] [k] system_call_after_swapgs > > + 7.25% 0.00% [unknown] [.] 0000000000000000 > > + 7.25% 7.25% ceph-mon [.] OSDMap::deepish_copy_from > > + 1.74% 1.74% libceph-common.so.0 [.] AsyncConnection::read_bulk > > + 1.45% 1.45% [kernel] [k] ___sys_sendmsg > > + 1.23% 1.23% [kernel] [k] native_sched_clock > > + 1.23% 0.00% [kernel] [k] sched_clock > > + 1.23% 0.00% [kernel] [k] __schedule > > + 1.23% 0.00% [kernel] [k] __perf_event_task_sched_out > > + 1.23% 0.00% [kernel] [k] ctx_sched_out > > + 0.95% 0.95% libceph-common.so.0 [.] ceph::buffer::list::append > > > > > > > > ms_dispatch thread > > Children Self Shared Object Symbol > > + 15.07% 15.07% libceph-common.so.0 [.] ceph::decode<pg_t, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> >, std::less<pg_t>, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t) > > + 10.24% 0.00% libceph-common.so.0 [.] ceph::buffer::raw_combined::~raw_combined > > + 7.27% 7.27% ceph-mon [.] std::_Rb_tree<pg_t, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > >, std::_Select1st<std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > > >, std::less<p > > + 6.19% 6.19% libceph-common.so.0 [.] ceph::buffer::ptr::release > > + 4.84% 4.84% libstdc++.so.6.0.19 [.] std::_Rb_tree_increment > > + 4.59% 4.59% libceph-common.so.0 [.] ceph::buffer::list::append > > + 4.20% 4.20% ceph-mon [.] std::_Rb_tree<std::string, std::string, std::_Identity<std::string>, std::less<std::string>, std::allocator<std::string> >::find > > + 4.04% 4.04% libceph-common.so.0 [.] std::_Rb_tree<pg_t, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > >, std::_Select1st<std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > > >, std::less<p > > 3.94% 3.94% libtcmalloc.so.4.4.5 [.] tcmalloc::CentralFreeList::FetchFromOneSpans > > + 3.71% 3.71% libceph-common.so.0 [.] ceph::buffer::ptr::ptr > > + 3.43% 3.43% libceph-common.so.0 [.] ceph::buffer::list::append > > + 3.03% 1.18% libc-2.17.so [.] 
__memcpy_ssse3_back > > + 2.66% 0.00% [kernel] [k] do_futex > > + 2.66% 0.00% [kernel] [k] futex_wait_queue_me > > + 2.66% 1.15% [kernel] [k] timerqueue_add > > + 2.49% 2.49% libceph-common.so.0 [.] ceph::decode<std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> >, denc_traits<std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> >, void> > > > + 2.02% 2.02% libceph-common.so.0 [.] ceph::buffer::ptr::unused_tail_length > > + 1.99% 1.99% libceph-common.so.0 [.] ceph::buffer::list::iterator_impl<false>::copy > > + 1.67% 0.00% [unknown] [k] 0000000000000000 > > 1.64% 1.64% libstdc++.so.6.0.19 [.] std::_Rb_tree_insert_and_rebalance > > + 1.57% 1.57% libtcmalloc.so.4.4.5 [.] operator new[] > > + 1.56% 1.56% libceph-common.so.0 [.] ceph::buffer::ptr::copy_out > > + 1.55% 1.55% libceph-common.so.0 [.] ceph::buffer::ptr::append > > 1.53% 1.53% libceph-common.so.0 [.] ceph::buffer::list::iterator_impl<false>::copy@plt > > + 1.51% 1.51% [kernel] [k] rb_insert_color > > + 1.36% 1.36% libceph-common.so.0 [.] ceph::buffer::list::iterator_impl<false>::advance > > + 1.27% 1.27% libceph-common.so.0 [.] ceph::encode<pg_t, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> >, std::less<pg_t>, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t) > > 1.13% 1.13% ceph-mon [.] std::_Rb_tree<pg_t, std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > >, std::_Select1st<std::pair<pg_t const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, int> > > >, std::less<p > > + 1.10% 1.10% ceph-mon [.] entity_addr_t::encode > > 1.08% 1.08% libceph-common.so.0 [.] ceph::buffer::ptr::unused_tail_length@plt > > 1.04% 1.04% libceph-common.so.0 [.] ceph::buffer::list::contiguous_appender::~contiguous_appender@plt > > 1.03% 1.03% libtcmalloc.so.4.4.5 [.] tcmalloc::ThreadCache::ReleaseToCentralCache > > + 1.03% 1.03% libtcmalloc.so.4.4.5 [.] tcmalloc::CentralFreeList::ReleaseToSpans > > 0.92% 0.92% libceph-common.so.0 [.] ceph::buffer::list::iterator::advance > > > > > > > > > > On 09/20/2018 08:21 AM, Sage Weil wrote: > > > > Can you try 'perf top -g -p <pid>' to see where all of the encoding > > activity is coming from? I see two possibilities (the mon attempts to > > cache encoded maps, and the MOSDMap message itself will also reencode > > if/when that fails). > > > > Also: mon_osd_cache_size = 10 by default... try making that 500 or > > something. > > > > sage > > > > > > > > On Wed, 19 Sep 2018, Kevin Hrpcek wrote: > > > > > > > > Majority of the clients are luminous with a few kraken stragglers. I looked at > > ceph features and 'ceph daemon mon.sephmon1 sessions'. Nothing is reporting as > > having mimic features, all mon,mgr,osd are running 13.2.1 but are reporting > > luminous features, and majority of the luminous clients are reporting jewel > > features. I shut down my compute cluster to get rid of majority of the clients > > that are reporting jewel features, and there is still a lot of time spent by > > ms_dispatch in ceph::decode<pg_t....... > > > > My small mimic test cluster actually shows similar in it's features, > > mon,mgr,mds,osd all report luminous features yet have 13.2.1 installed, so > > maybe that is normal. 
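For anyone trying to reproduce this kind of capture, here is a minimal sketch of the profiling and feature-check commands suggested above; the PID lookup, the 60-second duration, and the /tmp output paths are placeholders, and the ceph-mon-dbg / ceph-debuginfo packages mentioned earlier may be needed to get readable symbol names:

# interactive view with call graphs, as suggested above
perf top -g -p "$(pidof ceph-mon)"

# or record for ~60 seconds and dump the fully expanded call graph to a file
perf record -g -p "$(pidof ceph-mon)" -o /tmp/ceph-mon.perf.data -- sleep 60
perf report --stdio -i /tmp/ceph-mon.perf.data > /tmp/ceph-mon.perf.txt

# cross-check which feature bits daemons and clients report, per the discussion above
ceph features
ceph daemon mon.sephmon1 sessions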
> > > > Kevin > > > > On 09/19/2018 09:35 AM, Sage Weil wrote: > > > > > > It's hard to tell exactly from the below, but it looks to me like there is > > still a lot of OSDMap reencoding going on. Take a look at 'ceph features' > > output and see who in the cluster is using pre-luminous features.. I'm > > guessing all of the clients? For any of those sessions, fetching OSDMaps > > from the cluster will require reencoding. > > > > If it's all clients (well, non-OSDs), I think we could work around it by > > avoiding the reencode entirely (it is only really there for OSDs, which > > want a perfect OSDMap copy that will match the monitor's CRC). > > > > sage > > > > On Wed, 19 Sep 2018, KEVIN MICHAEL HRPCEK wrote: > > > > > > > > I set mon lease = 30 yesterday and it had no effect on the quorum > > election. To give you an idea of how much cpu ms_dispatch is using, from > > the last mon restart about 7.5 hours ago, the ms_dispatch thread has 5h > > 40m of cpu time. Below are 2 snippets from perf top. I took them while > > ms_dispatch was 100% of a core, the first is using the pid of the > > ceph-mon, the second is the pid of the ms_dispatch thread. The last thing > > is a snippet from stracing the ms_dispatch pid. It is running through all > > of the sst files. > > > > perf top ceph-mon > > Overhead Shared Object Symbol > > 17.71% libceph-common.so.0 [.] ceph::decode<pg_t, std::vector<int, > > mempool::pool_allocator<(mempool::pool_index_t)15, int> >, > > std::less<pg_t>, mempool::pool_allocator<(mempool::pool_index_t)15, > > std::pair<pg_t const, std::vector<int, > > mempool::pool_allocator<(mempool::pool_index_t)15, int> > > > > 6.67% libceph-common.so.0 [.] ceph::buffer::ptr::release > > 5.35% libceph-common.so.0 [.] std::_Rb_tree<pg_t, std::pair<pg_t > > const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, > > int> > >, std::_Select1st<std::pair<pg_t const, std::vector<int, > > mempool::pool_allocator<(mempool::pool_index_t)15, int> > > >, > > std::less<pg_t>, mempoo > > 5.20% libceph-common.so.0 [.] ceph::buffer::ptr::append > > 5.12% libceph-common.so.0 [.] > > ceph::buffer::list::iterator_impl<false>::copy > > 4.66% libceph-common.so.0 [.] ceph::buffer::list::append > > 4.33% libstdc++.so.6.0.19 [.] std::_Rb_tree_increment > > 4.27% ceph-mon [.] std::_Rb_tree<pg_t, std::pair<pg_t > > const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, > > int> > >, std::_Select1st<std::pair<pg_t const, std::vector<int, > > mempool::pool_allocator<(mempool::pool_index_t)15, int> > > >, > > std::less<pg_t>, mempoo > > 4.18% libceph-common.so.0 [.] ceph::buffer::list::append > > 3.10% libceph-common.so.0 [.] ceph::decode<std::vector<int, > > mempool::pool_allocator<(mempool::pool_index_t)15, int> >, > > denc_traits<std::vector<int, > > mempool::pool_allocator<(mempool::pool_index_t)15, int> >, void> > > > 2.90% libceph-common.so.0 [.] ceph::encode<pg_t, std::vector<int, > > mempool::pool_allocator<(mempool::pool_index_t)15, int> >, > > std::less<pg_t>, mempool::pool_allocator<(mempool::pool_index_t)15, > > std::pair<pg_t const, std::vector<int, > > mempool::pool_allocator<(mempool::pool_index_t)15, int> > > > > 2.56% libceph-common.so.0 [.] ceph::buffer::ptr::ptr > > 2.50% libstdc++.so.6.0.19 [.] std::_Rb_tree_insert_and_rebalance > > 2.39% libceph-common.so.0 [.] ceph::buffer::ptr::copy_out > > 2.33% libceph-common.so.0 [.] > > ceph::buffer::list::iterator_impl<false>::advance > > 2.21% libtcmalloc.so.4.4.5 [.] 
> > tcmalloc::CentralFreeList::FetchFromOneSpans > > 1.97% libtcmalloc.so.4.4.5 [.] > > tcmalloc::CentralFreeList::ReleaseToSpans > > 1.60% libceph-common.so.0 [.] crc32_iscsi_00 > > 1.42% libtcmalloc.so.4.4.5 [.] operator new[] > > 1.29% libceph-common.so.0 [.] ceph::buffer::ptr::unused_tail_length > > 1.28% libceph-common.so.0 [.] > > ceph::buffer::list::iterator_impl<false>::copy_shallow > > 1.25% libceph-common.so.0 [.] ceph::buffer::ptr::raw_length@plt > > 1.06% libceph-common.so.0 [.] ceph::buffer::ptr::end_c_str > > 1.06% libceph-common.so.0 [.] ceph::buffer::list::iterator::copy > > 0.99% libc-2.17.so [.] __memcpy_ssse3_back > > 0.94% libc-2.17.so [.] _IO_default_xsputn > > 0.89% libceph-common.so.0 [.] > > ceph::buffer::list::iterator_impl<false>::advance@plt > > 0.87% libtcmalloc.so.4.4.5 [.] > > tcmalloc::ThreadCache::ReleaseToCentralCache > > 0.76% libleveldb.so.1.0.7 [.] leveldb::FindFile > > 0.72% [vdso] [.] __vdso_clock_gettime > > 0.67% ceph-mon [.] std::_Rb_tree<pg_t, std::pair<pg_t > > const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, > > int> > >, std::_Select1st<std::pair<pg_t const, std::vector<int, > > mempool::pool_allocator<(mempool::pool_index_t)15, int> > > >, > > std::less<pg_t>, mempoo > > 0.63% libtcmalloc.so.4.4.5 [.] tc_deletearray_nothrow > > 0.59% libceph-common.so.0 [.] ceph::buffer::list::iterator::advance > > 0.52% libceph-common.so.0 [.] > > ceph::buffer::list::iterator::get_current_ptr > > > > > > perf top ms_dispatch > > 11.88% libceph-common.so.0 [.] ceph::decode<pg_t, std::vector<int, > > mempool::pool_allocator<(mempool::pool_index_t)15, int> >, > > std::less<pg_t>, mempool::pool_allocator<(mempool::pool_index_t)15, > > std::pair<pg_t const, std::vector<int, > > mempool::pool_allocator<(mempool::pool_index_t)15, int> > > > > 11.23% [kernel] [k] system_call_after_swapgs > > 9.36% libceph-common.so.0 [.] crush_hash32_3 > > 6.55% libceph-common.so.0 [.] crush_choose_indep > > 4.39% [kernel] [k] smp_call_function_many > > 3.17% libceph-common.so.0 [.] ceph::buffer::list::append > > 3.03% libceph-common.so.0 [.] ceph::buffer::list::append > > 3.02% libceph-common.so.0 [.] std::_Rb_tree<pg_t, std::pair<pg_t > > const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, > > int> > >, std::_Select1st<std::pair<pg_t const, std::vector<int, > > mempool::pool_allocator<(mempool::pool_index_t)15, int> > > >, > > std::less<pg_t>, mempoo > > 2.92% libceph-common.so.0 [.] ceph::buffer::ptr::release > > 2.65% libceph-common.so.0 [.] > > ceph::buffer::list::iterator_impl<false>::advance > > 2.57% ceph-mon [.] std::_Rb_tree<pg_t, std::pair<pg_t > > const, std::vector<int, mempool::pool_allocator<(mempool::pool_index_t)15, > > int> > >, std::_Select1st<std::pair<pg_t const, std::vector<int, > > mempool::pool_allocator<(mempool::pool_index_t)15, int> > > >, > > std::less<pg_t>, mempoo > > 2.27% libceph-common.so.0 [.] ceph::buffer::ptr::ptr > > 1.99% libstdc++.so.6.0.19 [.] std::_Rb_tree_increment > > 1.93% libc-2.17.so [.] __memcpy_ssse3_back > > 1.91% libceph-common.so.0 [.] ceph::buffer::ptr::append > > 1.87% libceph-common.so.0 [.] crush_hash32_3@plt > > 1.84% libceph-common.so.0 [.] > > ceph::buffer::list::iterator_impl<false>::copy > > 1.75% libtcmalloc.so.4.4.5 [.] > > tcmalloc::CentralFreeList::FetchFromOneSpans > > 1.63% libceph-common.so.0 [.] 
ceph::encode<pg_t, std::vector<int, > > mempool::pool_allocator<(mempool::pool_index_t)15, int> >, > > std::less<pg_t>, mempool::pool_allocator<(mempool::pool_index_t)15, > > std::pair<pg_t const, std::vector<int, > > mempool::pool_allocator<(mempool::pool_index_t)15, int> > > > > 1.57% libceph-common.so.0 [.] ceph::buffer::ptr::copy_out > > 1.55% libstdc++.so.6.0.19 [.] std::_Rb_tree_insert_and_rebalance > > 1.47% libceph-common.so.0 [.] ceph::buffer::ptr::raw_length > > 1.33% libtcmalloc.so.4.4.5 [.] tc_deletearray_nothrow > > 1.09% libceph-common.so.0 [.] ceph::decode<std::vector<int, > > mempool::pool_allocator<(mempool::pool_index_t)15, int> >, > > denc_traits<std::vector<int, > > mempool::pool_allocator<(mempool::pool_index_t)15, int> >, void> > > > 1.07% libtcmalloc.so.4.4.5 [.] operator new[] > > 1.02% libceph-common.so.0 [.] ceph::buffer::list::iterator::copy > > 1.01% libtcmalloc.so.4.4.5 [.] tc_posix_memalign > > 0.85% ceph-mon [.] ceph::buffer::ptr::release@plt > > 0.76% libceph-common.so.0 [.] ceph::buffer::ptr::copy_out@plt > > 0.74% libceph-common.so.0 [.] crc32_iscsi_00
> >
> > strace
> > munmap(0x7f2eda736000, 2463941) = 0
> > open("/var/lib/ceph/mon/ceph-sephmon1/store.db/26299339.sst", O_RDONLY) = 429
> > stat("/var/lib/ceph/mon/ceph-sephmon1/store.db/26299339.sst", {st_mode=S_IFREG|0644, st_size=1658656, ...}) = 0
> > mmap(NULL, 1658656, PROT_READ, MAP_SHARED, 429, 0) = 0x7f2eea87e000
> > close(429) = 0
> > munmap(0x7f2ea8c97000, 2468005) = 0
> > open("/var/lib/ceph/mon/ceph-sephmon1/store.db/26299338.sst", O_RDONLY) = 429
> > stat("/var/lib/ceph/mon/ceph-sephmon1/store.db/26299338.sst", {st_mode=S_IFREG|0644, st_size=2484001, ...}) = 0
> > mmap(NULL, 2484001, PROT_READ, MAP_SHARED, 429, 0) = 0x7f2eda74b000
> > close(429) = 0
> > munmap(0x7f2ee21dc000, 2472343) = 0
> >
> > Kevin
> > On 09/19/2018 06:50 AM, Sage Weil wrote: > > > > On Wed, 19 Sep 2018, KEVIN MICHAEL HRPCEK wrote: > > > > > > Sage, > > > > Unfortunately the mon election problem came back yesterday and it makes > > it really hard to get a cluster to stay healthy. A brief unexpected > > network outage occurred and sent the cluster into a frenzy and when I > > had it 95% healthy the mons started their nonstop reelections. In the > > previous logs I sent were you able to identify why the mons are > > constantly electing? The elections seem to be triggered by the below > > paxos message but do you know which lease timeout is being reached or > > why the lease isn't renewed instead of calling for an election? > > > > One thing I tried was to shut down the entire cluster and bring up only > > the mon and mgr. The mons weren't able to hold their quorum with no osds > > running and the ceph-mon ms_dispatch thread runs at 100% for > 60s at a > > time. > > > > > > > > This is odd... with no other daemons running I'm not sure what would be > > eating up the CPU. Can you run a 'perf top -p `pidof ceph-mon`' (or > > similar) on the machine to see what the process is doing? You might need > > to install ceph-mon-dbg or ceph-debuginfo to get better symbols. > > > > > > > > 2018-09-19 03:56:21.729 7f4344ec1700 1 mon.sephmon2@1(peon).paxos(paxos > > active c 133382665..133383355) lease_timeout -- calling new election > > > > > > > > A workaround is probably to increase the lease timeout. Try setting > > mon_lease = 15 (default is 5... could also go higher than 15) in the > > ceph.conf for all of the mons.
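For reference, a minimal sketch of what that workaround might look like in the [mon] section of ceph.conf, together with the other mon-side knobs suggested elsewhere in this thread (mon_osd_cache_size and rocksdb_cache_size); these are only the values floated in this discussion, not tested recommendations, and the mons need a restart to pick them up:

[mon]
mon lease = 15                   # default 5; band-aid for the lease timeouts / constant re-elections
mon osd cache size = 500         # default 10; lets the mon cache more encoded osdmaps
rocksdb cache size = 1342177280  # ~1.3 GB, 10x the default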
This is a bit of a band-aid but should > > help you keep the mons in quorum until we sort out what is going on. > > sage > > > > > > > > > > Thanks > > Kevin > > > > On 09/10/2018 07:06 AM, Sage Weil wrote: > > > > I took a look at the mon log you sent. A few things I noticed: > > > > - The frequent mon elections seem to get only 2/3 mons about half of the > > time. > > - The messages coming in are mostly osd_failure, and half of those seem to > > be recoveries (cancellation of the failure message). > > > > It does smell a bit like a networking issue, or some tunable that relates > > to the messaging layer. It might be worth looking at an OSD log for an > > osd that reported a failure and seeing what error code it is coming up with on the > > failed ping connection? That might provide a useful hint (e.g., > > ECONNREFUSED vs EMFILE or something). > > > > I'd also confirm that with nodown set the mon quorum stabilizes... > > > > sage > > > > > > > > > > On Mon, 10 Sep 2018, Kevin Hrpcek wrote: > > > > > > > > Update for the list archive. > > > > I went ahead and finished the mimic upgrade with the osds in a fluctuating > > state of up and down. The cluster did start to normalize a lot more easily > > after > > everything was on mimic since the random mass OSD heartbeat failures > > stopped > > and the constant mon election problem went away. I'm still battling with > > the > > cluster reacting poorly to host reboots or small map changes, but I feel > > like > > my current pg:osd ratio may be a factor in that since we are 2x > > normal > > pg count while migrating data to new EC pools. > > > > I'm not sure of the root cause but it seems like the mix of luminous and > > mimic > > did not play well together for some reason. Maybe it has to do with the > > scale > > of my cluster, 871 osds, or maybe I've missed some tuning as my > > cluster > > has scaled to this size. > > > > Kevin > > > > > > On 09/09/2018 12:49 PM, Kevin Hrpcek wrote: > > > > > > Nothing too crazy for non-default settings. Some of those osd settings > > were > > in place while I was testing recovery speeds and need to be brought back > > closer to defaults. I was setting nodown before but it seems to mask the > > problem. While it's good to stop the osdmap changes, OSDs would come up, > > get > > marked up, but at some point go down again (but the process is still > > running) and still stay up in the map. Then when I'd unset nodown the > > cluster would immediately mark 250+ osds down again and I'd be back where I > > started. > > > > This morning I went ahead and finished the osd upgrades to mimic to remove > > that variable. I've looked for networking problems but haven't found any. Two > > of the mons are on the same switch. I've also tried combinations of > > shutting > > down a mon to see if a single one was the problem, but they keep electing > > no > > matter the mix of them that are up. Part of it feels like a networking > > problem but I haven't been able to find a culprit yet as everything was > > working normally before starting the upgrade. Other than the constant mon > > elections, yesterday I had the cluster 95% healthy 3 or 4 times, but it > > doesn't last long since at some point the OSDs start trying to fail each > > other through their heartbeats.
> > 2018-09-09 17:37:29.079 7eff774f5700 1 mon.sephmon1@0(leader).osd e991282 prepare_failure osd.39 10.1.9.2:6802/168438 from osd.49 10.1.9.3:6884/317908 is reporting failure:1
> > 2018-09-09 17:37:29.079 7eff774f5700 0 log_channel(cluster) log [DBG] : osd.39 10.1.9.2:6802/168438 reported failed by osd.49 10.1.9.3:6884/317908
> > 2018-09-09 17:37:29.083 7eff774f5700 1 mon.sephmon1@0(leader).osd e991282 prepare_failure osd.93 10.1.9.9:6853/287469 from osd.372 10.1.9.13:6801/275806 is reporting failure:1
> >
> > I'm working on getting things mostly good again with everything on mimic and will see if it behaves better.
> >
> > Thanks for your input on this David.
> >
> > [global]
> > mon_initial_members = sephmon1, sephmon2, sephmon3
> > mon_host = 10.1.9.201,10.1.9.202,10.1.9.203
> > auth_cluster_required = cephx
> > auth_service_required = cephx
> > auth_client_required = cephx
> > filestore_xattr_use_omap = true
> > public_network = 10.1.0.0/16
> > osd backfill full ratio = 0.92
> > osd failsafe nearfull ratio = 0.90
> > osd max object size = 21474836480
> > mon max pg per osd = 350
> >
> > [mon]
> > mon warn on legacy crush tunables = false
> > mon pg warn max per osd = 300
> > mon osd down out subtree limit = host
> > mon osd nearfull ratio = 0.90
> > mon osd full ratio = 0.97
> > mon health preluminous compat warning = false
> > osd heartbeat grace = 60
> > rocksdb cache size = 1342177280
> >
> > [mds]
> > mds log max segments = 100
> > mds log max expiring = 40
> > mds bal fragment size max = 200000
> > mds cache memory limit = 4294967296
> >
> > [osd]
> > osd mkfs options xfs = -i size=2048 -d su=512k,sw=1
> > osd recovery delay start = 30
> > osd recovery max active = 5
> > osd max backfills = 3
> > osd recovery threads = 2
> > osd crush initial weight = 0
> > osd heartbeat interval = 30
> > osd heartbeat grace = 60
> >
> > On 09/08/2018 11:24 PM, David Turner wrote: > > > > > > What osd/mon/etc config settings do you have that are not default? It > > might be worth utilizing nodown to stop osds from marking each other down > > and finish the upgrade to be able to set the minimum osd version to mimic. > > Stop the osds in a node, manually mark them down, start them back up in > > mimic. Depending on how bad things are, setting pause on the cluster to > > just finish the upgrade faster might not be a bad idea either. > > > > This should be a simple question, have you confirmed that there are no > > networking problems between the MONs while the elections are happening? > > > > On Sat, Sep 8, 2018, 7:52 PM Kevin Hrpcek > > <kevin.hrpcek@xxxxxxxxxxxxx> > > wrote: > > > > Hey Sage, > > > > I've posted the file with my email address for the user. It is > > with debug_mon 20/20, debug_paxos 20/20, and debug ms 1/5. The > > mons are calling for elections about every minute so I let this > > run for a few elections and saw this node become the leader a > > couple times. Debug logs start around 23:27:30.
I had managed to > > get about 850/857 osds up, but it seems that within the last 30 > > min it has all gone bad again due to the OSDs reporting each > > other as failed. We relaxed the osd_heartbeat_interval to 30 and > > osd_heartbeat_grace to 60 in an attempt to slow down how quickly > > OSDs are trying to fail each other. I'll put in the > > rocksdb_cache_size setting. > > > > Thanks for taking a look. > > > > Kevin > > > > On 09/08/2018 06:04 PM, Sage Weil wrote: > > > > > > Hi Kevin, > > > > I can't think of any major luminous->mimic changes off the top of my > > head > > that would impact CPU usage, but it's always possible there is > > something > > subtle. Can you ceph-post-file the full log from one of your mons > > (preferably the leader)? > > > > You might try adjusting the rocksdb cache size.. try setting > > > > rocksdb_cache_size = 1342177280 # 10x the default, ~1.3 GB > > > > on the mons and restarting? > > > > Thanks! > > sage > > > > On Sat, 8 Sep 2018, Kevin Hrpcek wrote: > > > > > > > > Hello, > > > > I've had a Luminous -> Mimic upgrade go very poorly and my cluster > > is stuck > > with almost all pgs down. One problem is that the mons have > > started to > > re-elect a new quorum leader almost every minute. This is making > > it difficult > > to monitor the cluster and even run any commands on it since at > > least half the > > time a ceph command times out or takes over a minute to return > > results. I've > > looked at the debug logs and it appears there is some timeout > > occurring with > > paxos of about a minute. The msg_dispatch thread of the mons is > > often running > > a core at 100% for about a minute (user time, no iowait). Running > > strace on it > > shows the process is going through all of the mon db files (about > > 6gb in > > store.db/*.sst). Does anyone have an idea of what this timeout is > > or why my > > mons are always reelecting? One theory I have is that the > > msg_dispatch can't > > process the SSTs fast enough and hits some timeout for a health > > check and the > > mon drops itself from the quorum since it thinks it isn't healthy. > > I've been > > thinking of introducing a new mon to the cluster on hardware with > > a better cpu > > to see if that can process the SSTs within this timeout. > > > > My cluster has the mons,mds,mgr and 30/41 osd servers on mimic, > > and 11/41 osd > > servers on luminous. The original problem started when I restarted > > the osds on > > one of the hosts. The cluster reacted poorly to them going down > > and went into > > a frenzy of taking down other osds and remapping. I eventually got > > that stable > > and the PGs were 90% good with the finish line in sight and then > > the mons > > started their issue of reelecting every minute. Now I can't keep > > any decent > > amount of PGs up for more than a few hours. This started on > > Wednesday..... > > > > Any help would be greatly appreciated. > > > > Thanks, > > Kevin > >
> > --Debug snippet from a mon at reelection time
> > 2018-09-07 20:08:08.655 7f57b92cd700 20 mon.sephmon2@1(leader).mds e14242 maybe_resize_cluster in 1 max 1
> > 2018-09-07 20:08:08.655 7f57b92cd700 4 mon.sephmon2@1(leader).mds e14242 tick: resetting beacon timeouts due to mon delay (slow election?) of 59.8106s seconds
> > 2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(mdsmap 13504..14242) maybe_trim trim_to 13742 would only trim 238 < paxos_service_trim_min 250
> > 2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth v120657 auth
> > 2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth v120657 check_rotate updated rotating
> > 2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(auth 120594..120657) propose_pending
> > 2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth v120657 encode_pending v 120658
> > 2018-09-07 20:08:08.655 7f57b92cd700 5 mon.sephmon2@1(leader).paxos(paxos updating c 132917556..132918214) queue_pending_finisher 0x55dce8e5b370
> > 2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxos(paxos updating c 132917556..132918214) trigger_propose not active, will propose later
> > 2018-09-07 20:08:08.655 7f57b92cd700 4 mon.sephmon2@1(leader).mgr e2234 tick: resetting beacon timeouts due to mon delay (slow election?) of 59.8844s seconds
> > 2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(mgr 1513..2234) maybe_trim trim_to 1734 would only trim 221 < paxos_service_trim_min 250
> > 2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).health tick
> > 2018-09-07 20:08:08.655 7f57b92cd700 20 mon.sephmon2@1(leader).health check_member_health
> > 2018-09-07 20:08:08.657 7f57bcdd0700 1 -- 10.1.9.202:6789/0 - conn(0x55dcee55be00 :6789 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=447 -
> > 2018-09-07 20:08:08.657 7f57bcdd0700 10 mon.sephmon2@1(leader) e17 ms_verify_authorizer 10.1.9.32:6823/4007 osd protocol 0
> > 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader).health check_member_health avail 79% total 40 GiB, used 8.4 GiB, avail 32 GiB
> > 2018-09-07 20:08:08.662 7f57b92cd700 20 mon.sephmon2@1(leader).health check_leader_health
> > 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(health 1534..1720) maybe_trim trim_to 1715 would only trim 181 < paxos_service_trim_min 250
> > 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader).config tick
> > 2018-09-07 20:08:08.662 7f57b92cd700 20 mon.sephmon2@1(leader) e17 sync_trim_providers
> > 2018-09-07 20:08:08.662 7f57b92cd700 -1 mon.sephmon2@1(leader) e17 get_health_metrics reporting 1940 slow ops, oldest is osd_failure(failed timeout osd.72 10.1.9.9:6800/68904 for 317sec e987498 v987498)
> > 2018-09-07 20:08:08.662 7f57b92cd700 1 mon.sephmon2@1(leader).paxos(paxos updating c 132917556..132918214) accept timeout, calling fresh election
> > 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 bootstrap
> > 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 sync_reset_requester
> > 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 unregister_cluster_logger
> > 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 cancel_probe_timeout (none scheduled)
> > 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 _reset
> > 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 cancel_probe_timeout (none scheduled)
> > 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 timecheck_finish
> > 2018-09-07 20:08:08.662 7f57b92cd700 15 mon.sephmon2@1(probing) e17 health_tick_stop
> > 2018-09-07 20:08:08.662 7f57b92cd700 15 mon.sephmon2@1(probing) e17 health_interval_stop
> > 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 scrub_event_cancel
> > 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 scrub_reset
> > 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing).paxos(paxos updating c 132917556..132918214) restart -- canceling timeouts
> > 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing).paxosservice(mdsmap 13504..14242) restart
> > 2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing).paxosservice(osdmap 917471..987503) restart

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com