Re: 14.2.2 - OSD Crash

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Manuel,

as Brad pointed out timeouts and suicides are rather consequences of some other issues with OSDs.

I recall at least two recent relevant tickets:

https://tracker.ceph.com/issues/36482

https://tracker.ceph.com/issues/40741 (see last comments)

Both had massive and slow reads from RocksDB which caused timeouts..

Visible symptom for both cases was  unexpectedly high read I/O from underlying disks (main and/or DB).

You can use iotop for inspection..,

These were worsened by having significant part of DB at spinners due to spillovers. So wondering what's your layout in this respect:

what drives back troublesome OSDs, is there any spillover to slow device, how massive it is?

Also could you please inspect your OSD logs for the presence of lines containing "slow operation observed" substring. And share them if any..


Hope this helps.

Thanks,

Igor



On 8/7/2019 2:16 AM, EDH - Manuel Rios Fernandez wrote:

Hi

 

We got a pair of OSD located in  node that crash randomly since 14.2.2

 

OS Version : Centos 7.6

 

There’re a ton of lines before crash , I will unespected:

 

--

3045> 2019-08-07 00:39:32.013 7fe9a4996700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fe987e49700' had timed out after 15

-3044> 2019-08-07 00:39:32.013 7fe9a3994700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fe987e49700' had timed out after 15

-3043> 2019-08-07 00:39:32.033 7fe9a4195700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fe987e49700' had timed out after 15

-3042> 2019-08-07 00:39:32.033 7fe9a4996700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fe987e49700' had timed out after 15

--

-----

 

Some hundred lines of:

-164> 2019-08-07 00:47:36.628 7fe9a3994700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fe98964c700' had timed out after 60

  -163> 2019-08-07 00:47:36.632 7fe9a3994700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fe98964c700' had timed out after 60

  -162> 2019-08-07 00:47:36.632 7fe9a3994700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fe98964c700' had timed out after 60

-----

 

   -78> 2019-08-07 00:50:51.755 7fe995bfa700 10 monclient: tick

   -77> 2019-08-07 00:50:51.755 7fe995bfa700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2019-08-07 00:50:21.756453)

   -76> 2019-08-07 00:51:01.755 7fe995bfa700 10 monclient: tick

   -75> 2019-08-07 00:51:01.755 7fe995bfa700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2019-08-07 00:50:31.756604)

   -74> 2019-08-07 00:51:11.755 7fe995bfa700 10 monclient: tick

   -73> 2019-08-07 00:51:11.755 7fe995bfa700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2019-08-07 00:50:41.756788)

   -72> 2019-08-07 00:51:21.756 7fe995bfa700 10 monclient: tick

   -71> 2019-08-07 00:51:21.756 7fe995bfa700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2019-08-07 00:50:51.756982)

   -70> 2019-08-07 00:51:31.755 7fe995bfa700 10 monclient: tick

   -69> 2019-08-07 00:51:31.755 7fe995bfa700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2019-08-07 00:51:01.757206)

   -68> 2019-08-07 00:51:41.756 7fe995bfa700 10 monclient: tick

   -67> 2019-08-07 00:51:41.756 7fe995bfa700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2019-08-07 00:51:11.757364)

   -66> 2019-08-07 00:51:51.756 7fe995bfa700 10 monclient: tick

   -65> 2019-08-07 00:51:51.756 7fe995bfa700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2019-08-07 00:51:21.757535)

   -64> 2019-08-07 00:51:52.861 7fe987e49700  1 heartbeat_map clear_timeout 'OSD::osd_op_tp thread 0x7fe987e49700' had timed out after 15

   -63> 2019-08-07 00:51:52.861 7fe987e49700  1 heartbeat_map clear_timeout 'OSD::osd_op_tp thread 0x7fe987e49700' had suicide timed out after 150

   -62> 2019-08-07 00:51:52.948 7fe99966c700  5 bluestore.MempoolThread(0x55ff04ad6a88) _tune_cache_size target: 4294967296 heap: 6018998272 unmapped: 1721180160 mapped: 4297818112 old cache_size: 1994018210 new cache size: 1992784572

   -61> 2019-08-07 00:51:52.948 7fe99966c700  5 bluestore.MempoolThread(0x55ff04ad6a88) _trim_shards cache_size: 1992784572 kv_alloc: 763363328 kv_used: 749381098 meta_alloc: 763363328 meta_used: 654593191 data_alloc: 452984832 data_used: 455929856

   -60> 2019-08-07 00:51:57.923 7fe99966c700  5 bluestore.MempoolThread(0x55ff04ad6a88) _trim_shards cache_size: 1994110827 kv_alloc: 763363328 kv_used: 749381098 meta_alloc: 763363328 meta_used: 654590799 data_alloc: 452984832 data_used: 451538944

   -59> 2019-08-07 00:51:57.973 7fe99966c700  5 bluestore.MempoolThread(0x55ff04ad6a88) _tune_cache_size target: 4294967296 heap: 6018998272 unmapped: 1725702144 mapped: 4293296128 old cache_size: 1994110827 new cache size: 1994442069

   -58> 2019-08-07 00:52:01.756 7fe995bfa700 10 monclient: tick

   -57> 2019-08-07 00:52:01.756 7fe995bfa700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2019-08-07 00:51:31.757684)

   -56> 2019-08-07 00:52:02.933 7fe99966c700  5 bluestore.MempoolThread(0x55ff04ad6a88) _trim_shards cache_size: 1995765747 kv_alloc: 763363328 kv_used: 749381098 meta_alloc: 763363328 meta_used: 654590799 data_alloc: 452984832 data_used: 451538944

   -55> 2019-08-07 00:52:02.983 7fe99966c700  5 bluestore.MempoolThread(0x55ff04ad6a88) _tune_cache_size target: 4294967296 heap: 6018998272 unmapped: 1725702144 mapped: 4293296128 old cache_size: 1995765747 new cache size: 1996096345

   -54> 2019-08-07 00:52:07.943 7fe99966c700  5 bluestore.MempoolThread(0x55ff04ad6a88) _trim_shards cache_size: 1997417449 kv_alloc: 763363328 kv_used: 749381098 meta_alloc: 763363328 meta_used: 654590799 data_alloc: 452984832 data_used: 451538944

   -53> 2019-08-07 00:52:07.993 7fe99966c700  5 bluestore.MempoolThread(0x55ff04ad6a88) _tune_cache_size target: 4294967296 heap: 6018998272 unmapped: 1725702144 mapped: 4293296128 old cache_size: 1997417449 new cache size: 1997747404

   -52> 2019-08-07 00:52:11.757 7fe995bfa700 10 monclient: tick

   -51> 2019-08-07 00:52:11.757 7fe995bfa700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2019-08-07 00:51:41.757855)

   -50> 2019-08-07 00:52:12.952 7fe99966c700  5 bluestore.MempoolThread(0x55ff04ad6a88) _trim_shards cache_size: 1999065941 kv_alloc: 763363328 kv_used: 749381098 meta_alloc: 763363328 meta_used: 654590799 data_alloc: 452984832 data_used: 451538944

   -49> 2019-08-07 00:52:13.002 7fe99966c700  5 bluestore.MempoolThread(0x55ff04ad6a88) _tune_cache_size target: 4294967296 heap: 6018998272 unmapped: 1725702144 mapped: 4293296128 old cache_size: 1999065941 new cache size: 1999395254

   -48> 2019-08-07 00:52:17.962 7fe99966c700  5 bluestore.MempoolThread(0x55ff04ad6a88) _trim_shards cache_size: 2000711226 kv_alloc: 771751936 kv_used: 749381098 meta_alloc: 771751936 meta_used: 654590799 data_alloc: 452984832 data_used: 451538944

   -47> 2019-08-07 00:52:18.012 7fe99966c700  5 bluestore.MempoolThread(0x55ff04ad6a88) _tune_cache_size target: 4294967296 heap: 6018998272 unmapped: 1725702144 mapped: 4293296128 old cache_size: 2000711226 new cache size: 2001039899

   -46> 2019-08-07 00:52:21.756 7fe995bfa700 10 monclient: tick

   -45> 2019-08-07 00:52:21.756 7fe995bfa700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2019-08-07 00:51:51.758043)

   -44> 2019-08-07 00:52:22.971 7fe99966c700  5 bluestore.MempoolThread(0x55ff04ad6a88) _trim_shards cache_size: 2002353314 kv_alloc: 771751936 kv_used: 749381098 meta_alloc: 771751936 meta_used: 654590799 data_alloc: 452984832 data_used: 451538944

   -43> 2019-08-07 00:52:23.022 7fe99966c700  5 bluestore.MempoolThread(0x55ff04ad6a88) _tune_cache_size target: 4294967296 heap: 6018998272 unmapped: 1725702144 mapped: 4293296128 old cache_size: 2002353314 new cache size: 2002681348

   -42> 2019-08-07 00:52:27.982 7fe99966c700  5 bluestore.MempoolThread(0x55ff04ad6a88) _trim_shards cache_size: 2003992210 kv_alloc: 771751936 kv_used: 749381098 meta_alloc: 771751936 meta_used: 654590799 data_alloc: 452984832 data_used: 451538944

   -41> 2019-08-07 00:52:28.031 7fe99966c700  5 bluestore.MempoolThread(0x55ff04ad6a88) _tune_cache_size target: 4294967296 heap: 6018998272 unmapped: 1725702144 mapped: 4293296128 old cache_size: 2003992210 new cache size: 2004319607

   -40> 2019-08-07 00:52:31.756 7fe995bfa700 10 monclient: tick

   -39> 2019-08-07 00:52:31.756 7fe995bfa700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2019-08-07 00:52:01.758219)

   -38> 2019-08-07 00:52:32.991 7fe99966c700  5 bluestore.MempoolThread(0x55ff04ad6a88) _trim_shards cache_size: 2005627920 kv_alloc: 771751936 kv_used: 749381098 meta_alloc: 771751936 meta_used: 654590799 data_alloc: 452984832 data_used: 451538944

   -37> 2019-08-07 00:52:33.041 7fe99966c700  5 bluestore.MempoolThread(0x55ff04ad6a88) _tune_cache_size target: 4294967296 heap: 6018998272 unmapped: 1725702144 mapped: 4293296128 old cache_size: 2005627920 new cache size: 2005954680

   -36> 2019-08-07 00:52:38.001 7fe99966c700  5 bluestore.MempoolThread(0x55ff04ad6a88) _trim_shards cache_size: 2007260450 kv_alloc: 771751936 kv_used: 749381098 meta_alloc: 771751936 meta_used: 654590799 data_alloc: 452984832 data_used: 451538944

   -35> 2019-08-07 00:52:38.051 7fe99966c700  5 bluestore.MempoolThread(0x55ff04ad6a88) _tune_cache_size target: 4294967296 heap: 6018998272 unmapped: 1725702144 mapped: 4293296128 old cache_size: 2007260450 new cache size: 2007586575

   -34> 2019-08-07 00:52:41.757 7fe995bfa700 10 monclient: tick

   -33> 2019-08-07 00:52:41.757 7fe995bfa700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2019-08-07 00:52:11.758447)

   -32> 2019-08-07 00:52:43.011 7fe99966c700  5 bluestore.MempoolThread(0x55ff04ad6a88) _trim_shards cache_size: 2008889806 kv_alloc: 771751936 kv_used: 749381098 meta_alloc: 771751936 meta_used: 654590799 data_alloc: 452984832 data_used: 451538944

   -31> 2019-08-07 00:52:43.061 7fe99966c700  5 bluestore.MempoolThread(0x55ff04ad6a88) _tune_cache_size target: 4294967296 heap: 6018998272 unmapped: 1725702144 mapped: 4293296128 old cache_size: 2008889806 new cache size: 2009215297

   -30> 2019-08-07 00:52:48.021 7fe99966c700  5 bluestore.MempoolThread(0x55ff04ad6a88) _trim_shards cache_size: 2010515995 kv_alloc: 771751936 kv_used: 749381098 meta_alloc: 771751936 meta_used: 654590799 data_alloc: 452984832 data_used: 451538944

   -29> 2019-08-07 00:52:48.071 7fe99966c700  5 bluestore.MempoolThread(0x55ff04ad6a88) _tune_cache_size target: 4294967296 heap: 6018998272 unmapped: 1725702144 mapped: 4293296128 old cache_size: 2010515995 new cache size: 2010840853

   -28> 2019-08-07 00:52:51.757 7fe995bfa700 10 monclient: tick

   -27> 2019-08-07 00:52:51.757 7fe995bfa700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2019-08-07 00:52:21.758631)

   -26> 2019-08-07 00:52:53.031 7fe99966c700  5 bluestore.MempoolThread(0x55ff04ad6a88) _trim_shards cache_size: 2012139023 kv_alloc: 771751936 kv_used: 749381098 meta_alloc: 771751936 meta_used: 654590799 data_alloc: 452984832 data_used: 451538944

   -25> 2019-08-07 00:52:53.081 7fe99966c700  5 bluestore.MempoolThread(0x55ff04ad6a88) _tune_cache_size target: 4294967296 heap: 6018998272 unmapped: 1725702144 mapped: 4293296128 old cache_size: 2012139023 new cache size: 2012463250

   -24> 2019-08-07 00:52:58.042 7fe99966c700  5 bluestore.MempoolThread(0x55ff04ad6a88) _trim_shards cache_size: 2013758896 kv_alloc: 771751936 kv_used: 749381098 meta_alloc: 771751936 meta_used: 654590799 data_alloc: 452984832 data_used: 451538944

   -23> 2019-08-07 00:52:58.092 7fe99966c700  5 bluestore.MempoolThread(0x55ff04ad6a88) _tune_cache_size target: 4294967296 heap: 6018998272 unmapped: 1725702144 mapped: 4293296128 old cache_size: 2013758896 new cache size: 2014082492

   -22> 2019-08-07 00:53:01.758 7fe995bfa700 10 monclient: tick

   -21> 2019-08-07 00:53:01.758 7fe995bfa700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2019-08-07 00:52:31.758799)

   -20> 2019-08-07 00:53:03.052 7fe99966c700  5 bluestore.MempoolThread(0x55ff04ad6a88) _trim_shards cache_size: 2015375620 kv_alloc: 771751936 kv_used: 749381098 meta_alloc: 771751936 meta_used: 654590799 data_alloc: 452984832 data_used: 451538944

   -19> 2019-08-07 00:53:03.102 7fe99966c700  5 bluestore.MempoolThread(0x55ff04ad6a88) _tune_cache_size target: 4294967296 heap: 6018998272 unmapped: 1725702144 mapped: 4293296128 old cache_size: 2015375620 new cache size: 2015698587

   -18> 2019-08-07 00:53:08.062 7fe99966c700  5 bluestore.MempoolThread(0x55ff04ad6a88) _trim_shards cache_size: 2016989201 kv_alloc: 771751936 kv_used: 749381098 meta_alloc: 771751936 meta_used: 654590799 data_alloc: 452984832 data_used: 451538944

   -17> 2019-08-07 00:53:08.112 7fe99966c700  5 bluestore.MempoolThread(0x55ff04ad6a88) _tune_cache_size target: 4294967296 heap: 6018998272 unmapped: 1725702144 mapped: 4293296128 old cache_size: 2016989201 new cache size: 2017311541

   -16> 2019-08-07 00:53:11.758 7fe995bfa700 10 monclient: tick

   -15> 2019-08-07 00:53:11.758 7fe995bfa700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2019-08-07 00:52:41.759013)

   -14> 2019-08-07 00:53:13.071 7fe99966c700  5 bluestore.MempoolThread(0x55ff04ad6a88) _trim_shards cache_size: 2018599645 kv_alloc: 771751936 kv_used: 749381098 meta_alloc: 771751936 meta_used: 654590799 data_alloc: 452984832 data_used: 451538944

   -13> 2019-08-07 00:53:13.121 7fe99966c700  5 bluestore.MempoolThread(0x55ff04ad6a88) _tune_cache_size target: 4294967296 heap: 6018998272 unmapped: 1725702144 mapped: 4293296128 old cache_size: 2018599645 new cache size: 2018921358

   -12> 2019-08-07 00:53:18.081 7fe99966c700  5 bluestore.MempoolThread(0x55ff04ad6a88) _trim_shards cache_size: 2020206960 kv_alloc: 771751936 kv_used: 749381098 meta_alloc: 771751936 meta_used: 654590799 data_alloc: 452984832 data_used: 451538944

   -11> 2019-08-07 00:53:18.130 7fe99966c700  5 bluestore.MempoolThread(0x55ff04ad6a88) _tune_cache_size target: 4294967296 heap: 6018998272 unmapped: 1725702144 mapped: 4293296128 old cache_size: 2020206960 new cache size: 2020528048

   -10> 2019-08-07 00:53:21.757 7fe995bfa700 10 monclient: tick

    -9> 2019-08-07 00:53:21.757 7fe995bfa700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2019-08-07 00:52:51.759214)

    -8> 2019-08-07 00:53:23.090 7fe99966c700  5 bluestore.MempoolThread(0x55ff04ad6a88) _trim_shards cache_size: 2021811150 kv_alloc: 780140544 kv_used: 749381098 meta_alloc: 780140544 meta_used: 654590799 data_alloc: 461373440 data_used: 451538944

    -7> 2019-08-07 00:53:23.140 7fe99966c700  5 bluestore.MempoolThread(0x55ff04ad6a88) _tune_cache_size target: 4294967296 heap: 6018998272 unmapped: 1725702144 mapped: 4293296128 old cache_size: 2021811150 new cache size: 2022131613

    -6> 2019-08-07 00:53:28.100 7fe99966c700  5 bluestore.MempoolThread(0x55ff04ad6a88) _trim_shards cache_size: 2023412220 kv_alloc: 780140544 kv_used: 749381098 meta_alloc: 780140544 meta_used: 654590799 data_alloc: 461373440 data_used: 451538944

    -5> 2019-08-07 00:53:28.150 7fe99966c700  5 bluestore.MempoolThread(0x55ff04ad6a88) _tune_cache_size target: 4294967296 heap: 6018998272 unmapped: 1725702144 mapped: 4293296128 old cache_size: 2023412220 new cache size: 2023732060

    -4> 2019-08-07 00:53:31.758 7fe995bfa700 10 monclient: tick

    -3> 2019-08-07 00:53:31.758 7fe995bfa700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2019-08-07 00:53:01.759334)

    -2> 2019-08-07 00:53:33.110 7fe99966c700  5 bluestore.MempoolThread(0x55ff04ad6a88) _trim_shards cache_size: 2025010178 kv_alloc: 780140544 kv_used: 749381098 meta_alloc: 780140544 meta_used: 654590799 data_alloc: 461373440 data_used: 451538944

    -1> 2019-08-07 00:53:33.160 7fe99966c700  5 bluestore.MempoolThread(0x55ff04ad6a88) _tune_cache_size target: 4294967296 heap: 6018998272 unmapped: 1725702144 mapped: 4293296128 old cache_size: 2025010178 new cache size: 2025329397

     0> 2019-08-07 00:53:37.655 7fe987e49700 -1 *** Caught signal (Aborted) **

in thread 7fe987e49700 thread_name:tp_osd_tp

 

ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)

1: (()+0xf5d0) [0x7fe9a7cba5d0]

2: (pthread_kill()+0x31) [0x7fe9a7cb79d1]

3: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, unsigned long)+0x466) [0x55fef8748176]

4: (ceph::HeartbeatMap::clear_timeout(ceph::heartbeat_handle_d*)+0x7b) [0x55fef874878b]

5: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0xa9e) [0x55fef86085de]

6: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ObjectStore::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x7f) [0x55fef81cd7ff]

7: (OSD::dispatch_context_transaction(PG::RecoveryCtx&, PG*, ThreadPool::TPHandle*)+0x58) [0x55fef8118298]

8: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x202) [0x55fef81767c2]

9: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x50) [0x55fef83eb490]

10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x9f4) [0x55fef816aef4]

11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x433) [0x55fef8769ce3]

12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55fef876cd80]

13: (()+0x7dd5) [0x7fe9a7cb2dd5]

14: (clone()+0x6d) [0x7fe9a6b7202d]

 

About server load: 

 

[root@CEPH008 ~]# top

top - 00:57:30 up 186 days, 22 min,  1 user,  load average: 11.65, 13.42, 13.51

Tasks: 316 total,   1 running, 315 sleeping,   0 stopped,   0 zombie

%Cpu(s):  2.3 us,  1.2 sy,  0.0 ni, 74.1 id, 22.4 wa,  0.0 hi,  0.1 si,  0.0 st

KiB Mem : 65737480 total,   431824 free, 49046608 used, 16259048 buff/cache

KiB Swap: 29241340 total, 19406504 free,  9834836 used. 15917556 avail Mem

 

 

Currently the server is doing some deep-scrub that we got off during the last two weeks due a node evict and a new node install.

 

 

 


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[Index of Archives]     [Information on CEPH]     [Linux Filesystem Development]     [Ceph Development]     [Ceph Large]     [Ceph Dev]     [Linux USB Development]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [xfs]


  Powered by Linux