Re: 3 OSDs stopped and unable to restart

"ifedotov@xxxxxxx" <ifedotov@xxxxxxx> · Wed, 10 Jul 2019 22:32:39 +0300

You might want to try manual rocksdb compaction using ceph-kvstore-tool.. 

Sent from my Huawei tablet

-------- Original Message --------
Subject: Re: [ceph-users] 3 OSDs stopped and unable to restart
From: Brett Chancellor 
To: Igor Fedotov 
CC: Ceph Users 

Once backfilling finished, the cluster was super slow, most osd's were filled with heartbeat_map errors.  When an OSD restarts it causes a cascade of other osd's to follow suit and restart.. logs like..  -3> 2019-07-10 18:34:50.046 7f34abf5b700 -1 osd.69 1348581 get_health_metrics reporting 21 slow ops, oldest is osd_op(client.115295041.0:17575966 15.c37fa482 15.c37fa482 (undecoded) ack+ondisk+write+known_if_redirected e1348522)
    -2> 2019-07-10 18:34:50.967 7f34acf5d700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3493f2b700' had timed out after 90
    -1> 2019-07-10 18:34:50.967 7f34acf5d700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3493f2b700' had suicide timed out after 150
     0> 2019-07-10 18:34:51.025 7f3493f2b700 -1 *** Caught signal (Aborted) **
 in thread 7f3493f2b700 thread_name:tp_osd_tp

 ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable)
 1: (()+0xf5d0) [0x7f34b57c25d0]
 2: (pread64()+0x33) [0x7f34b57c1f63]
 3: (KernelDevice::read_random(unsigned long, unsigned long, char*, bool)+0x238) [0x55bfdae5a448]
 4: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned long, char*)+0xca) [0x55bfdae1271a]
 5: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const+0x20) [0x55bfdae3b440]
 6: (rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const+0x960) [0x55bfdb466ba0]
 7: (rocksdb::BlockFetcher::ReadBlockContents()+0x3e7) [0x55bfdb420c27]
 8: (()+0x11146a4) [0x55bfdb40d6a4]
 9: (rocksdb::BlockBasedTable::MaybeLoadDataBlockToCache(rocksdb::FilePrefetchBuffer*, rocksdb::BlockBasedTable::Rep*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::Slice, rocksdb::BlockBasedTable::CachableEntry<rocksdb::Block>*, bool, rocksdb::GetContext*)+0x2cc) [0x55bfdb40f63c]
 10: (rocksdb::DataBlockIter* rocksdb::BlockBasedTable::NewDataBlockIterator<rocksdb::DataBlockIter>(rocksdb::BlockBasedTable::Rep*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::DataBlockIter*, bool, bool, bool, rocksdb::GetContext*, rocksdb::Status, rocksdb::FilePrefetchBuffer*)+0x169) [0x55bfdb41cb29]
 11: (rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::InitDataBlock()+0xc8) [0x55bfdb41e588]
 12: (rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::FindKeyForward()+0x8d) [0x55bfdb41e89d]
 13: (()+0x10adde9) [0x55bfdb3a6de9]
 14: (rocksdb::MergingIterator::Next()+0x44) [0x55bfdb4357c4]
 15: (rocksdb::DBIter::FindNextUserEntryInternal(bool, bool)+0x762) [0x55bfdb32a092]
 16: (rocksdb::DBIter::Next()+0x1d6) [0x55bfdb32b6c6]
 17: (RocksDBStore::RocksDBWholeSpaceIteratorImpl::next()+0x2d) [0x55bfdad9fa8d]
 18: (BlueStore::_collection_list(BlueStore::Collection*, ghobject_t const&, ghobject_t const&, int, std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0xdf6) [0x55bfdad12466]
 19: (BlueStore::collection_list(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, ghobject_t const&, int, std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0x9b) [0x55bfdad1393b]
 20: (PG::_delete_some(ObjectStore::Transaction*)+0x1e0) [0x55bfda984120]
 21: (PG::RecoveryState::Deleting::react(PG::DeleteSome const&)+0x38) [0x55bfda985598]
 22: (boost::statechart::simple_state<PG::RecoveryState::Deleting, PG::RecoveryState::ToDelete, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x16a) [0x55bfda9c45ca]
 23: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x5a) [0x55bfda9a20ca]
 24: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x119) [0x55bfda991389]
 25: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x1b4) [0x55bfda8cb3c4]
 26: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&)+0x234) [0x55bfda8cb804]
 27: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x9f4) [0x55bfda8bfb44]
 28: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x433) [0x55bfdaeb9e93]
 29: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55bfdaebcf30]
 30: (()+0x7dd5) [0x7f34b57badd5]
 31: (clone()+0x6d) [0x7f34b4680ead]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 0 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 rgw_sync
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.69.log
--- end dump of recent events ---

On Tue, Jul 9, 2019 at 1:38 PM Igor Fedotov <ifedotov@xxxxxxx> wrote:

    This will cap single bluefs space allocation. Currently it
      attempts to allocate 70Gb which seems to overflow some 32-bit
      length fields. With the adjustment such allocation should be
      capped at ~700MB.
    I doubt there is any relation between this specific failure and
      the pool. At least at the moment. 

    In short the history is: starting OSD tries to flush bluefs data
      to disk, detects lack of space and asks for more from main device
      - allocations succeeds but returned extent has length field set to
      0. 

    On 7/9/2019 8:33 PM, Brett Chancellor
      wrote:

      What does bluestore_bluefs_gift_ratio do?  I can't
        find any documentation on it.  Also do you think this could be
        related to the .rgw.meta pool having too many objects per PG?
        The disks that die always seem to be backfilling a pg from that
        pool, and they have ~550k objects per PG.

        -Brett

        On Tue, Jul 9, 2019 at 1:03 PM
          Igor Fedotov <ifedotov@xxxxxxx> wrote:

            Please try to set bluestore_bluefs_gift_ratio to 0.0002

            On
              7/9/2019 7:39 PM, Brett Chancellor wrote:

              Too large for pastebin.. The problem is
                continually crashing new OSDs. Here is the latest one.

                On Tue, Jul 9, 2019 at
                  11:46 AM Igor Fedotov <ifedotov@xxxxxxx>
                  wrote:

                    could you please set debug bluestore to 20 and
                      collect startup log for this specific OSD once
                      again?

                    On
                      7/9/2019 6:29 PM, Brett Chancellor wrote:

                      I restarted most of the OSDs with
                        the stupid allocator (6 of them wouldn't start
                        unless bitmap allocator was set), but I'm still
                        seeing issues with OSDs crashing.  Interestingly
                        it seems that the dying OSDs are always working
                        on a pg from the .rgw.meta pool when they crash.

                        Log : https://pastebin.com/yuJKcPvX

                        On Tue, Jul 9,
                          2019 at 5:14 AM Igor Fedotov <ifedotov@xxxxxxx>
                          wrote:

                            Hi Brett,
                            in Nautilus you can do that via
                            ceph config set osd.N bluestore_allocator
                              stupid
                            ceph config set osd.N bluefs_allocator
                              stupid
                            See https://ceph.com/community/new-mimic-centralized-configuration-management/
                              for more details on a new way of
                              configuration options setting.

                            A known issue with Stupid allocator is
                              gradual write request latency increase
                              (occurred within several days after OSD
                              restart). Seldom observed though. There
                              were some posts about that behavior in the
                              mail list  this year.

                            Thanks,
                            Igor.

                            On 7/8/2019 8:33 PM, Brett Chancellor
                              wrote:

                              I'll give that a try.  Is
                                it something like...
                                ceph tell 'osd.*'
                                  bluestore_allocator stupid

                                  ceph tell 'osd.*'
                                    bluefs_allocator stupid

                                And should I expect any issues
                                  doing this?

                                On
                                  Mon, Jul 8, 2019 at 1:04 PM Igor
                                  Fedotov <ifedotov@xxxxxxx>
                                  wrote:

                                    I should read call stack more
                                      carefully... It's not about
                                      lacking free space - this is
                                      rather the bug from this ticket:
                                    http://tracker.ceph.com/issues/40080

                                    You should upgrade to v14.2.2
                                      (once it's available) or
                                      temporarily switch to stupid
                                      allocator as a workaround.

                                    Thanks,
                                    Igor

                                    On
                                      7/8/2019 8:00 PM, Igor Fedotov
                                      wrote:

                                      Hi Brett,
                                      looks like BlueStore is unable
                                        to allocate additional space for
                                        BlueFS at main device. It's
                                        either lacking free space or
                                        it's too fragmented...
                                      Would you share osd log,
                                        please?
                                      Also please run
                                        "ceph-bluestore-tool --path
                                        <substitute with
                                        path-to-osd!!!>
                                        bluefs-bdev-sizes" and share the
                                        output.

                                      Thanks,
                                      Igor
                                      On
                                        7/3/2019 9:59 PM, Brett
                                        Chancellor wrote:

                                        Hi All! Today
                                          I've had 3 OSDs stop
                                          themselves and are unable to
                                          restart, all with the same
                                          error. These OSDs are all on
                                          different hosts. All are
                                          running 14.2.1

                                          I did try the following
                                            two commands
                                          - ceph-kvstore-tool
                                            bluestore-kv
                                            /var/lib/ceph/osd/ceph-80
                                            list > keys
                                            ## This failed with the
                                            same error below
                                          - ceph-bluestore-tool
                                            --path
                                            /var/lib/ceph/osd/ceph-80
                                            fsck
                                           ## After a couple of
                                            hours returned...
                                          2019-07-03 18:30:02.095
                                            7fe7c1c1ef00 -1
                                            bluestore(/var/lib/ceph/osd/ceph-80)
                                            fsck warning: legacy statfs
                                            record found, suggest to run
                                            store repair to get
                                            consistent statistic reports

                                            fsck success

                                            ## Error when trying to
                                              start one of the OSDs
                                               -12>
                                                2019-07-03 18:36:57.450
                                                7f5e42366700 -1 ***
                                                Caught signal (Aborted)
                                                **

                                                 in thread 7f5e42366700
                                                thread_name:rocksdb:low0

                                                 ceph version 14.2.1
                                                (d555a9489eb35f84f2e1ef49b77e19da9d113972)
                                                nautilus (stable)

                                                 1: (()+0xf5d0)
                                                [0x7f5e50bd75d0]

                                                 2: (gsignal()+0x37)
                                                [0x7f5e4f9ce207]

                                                 3: (abort()+0x148)
                                                [0x7f5e4f9cf8f8]

                                                 4:
                                                (ceph::__ceph_assert_fail(char
                                                const*, char const*,
                                                int, char const*)+0x199)
                                                [0x55a7aaee96ab]

                                                 5:
                                                (ceph::__ceph_assertf_fail(char
                                                const*, char const*,
                                                int, char const*, char
                                                const*, ...)+0)
                                                [0x55a7aaee982a]

                                                 6:
                                                (interval_set<unsigned
                                                long,
                                                std::map<unsigned
                                                long, unsigned long,
                                                std::less<unsigned
                                                long>,
std::allocator<std::pair<unsigned long const, unsigned long>
                                                > >
                                                >::insert(unsigned
                                                long, unsigned long,
                                                unsigned long*, unsigned
                                                long*)+0x3c6)
                                                [0x55a7ab212a66]

                                                 7:
                                                (BlueStore::allocate_bluefs_freespace(unsigned
                                                long, unsigned long,
                                                std::vector<bluestore_pextent_t,
mempool::pool_allocator<(mempool::pool_index_t)4,
                                                bluestore_pextent_t>
                                                >*)+0x74e)
                                                [0x55a7ab48253e]

                                                 8:
                                                (BlueFS::_expand_slow_device(unsigned
                                                long,
                                                std::vector<bluestore_pextent_t,
mempool::pool_allocator<(mempool::pool_index_t)4,
                                                bluestore_pextent_t>
                                                >&)+0x111)
                                                [0x55a7ab59e921]

                                                 9:
                                                (BlueFS::_allocate(unsigned
                                                char, unsigned long,
                                                bluefs_fnode_t*)+0x68b)
                                                [0x55a7ab59f68b]

                                                 10:
                                                (BlueFS::_flush_range(BlueFS::FileWriter*,
                                                unsigned long, unsigned
                                                long)+0xe5)
                                                [0x55a7ab59fce5]

                                                 11:
                                                (BlueFS::_flush(BlueFS::FileWriter*,
                                                bool)+0x10b)
                                                [0x55a7ab5a1b4b]

                                                 12:
                                                (BlueRocksWritableFile::Flush()+0x3d)
                                                [0x55a7ab5bf84d]

                                                 13:
                                                (rocksdb::WritableFileWriter::Flush()+0x19e)
                                                [0x55a7abbedd0e]

                                                 14:
                                                (rocksdb::WritableFileWriter::Sync(bool)+0x2e)
                                                [0x55a7abbedfee]

                                                 15:
                                                (rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Status
                                                const&,
                                                rocksdb::CompactionJob::SubcompactionState*,
rocksdb::RangeDelAggregator*, CompactionIterationStats*, rocksdb::Slice
                                                const*)+0xbaa)
                                                [0x55a7abc3b73a]

                                                 16:
(rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState*)+0x7d0)
                                                [0x55a7abc3f150]

                                                 17:
                                                (rocksdb::CompactionJob::Run()+0x298)
                                                [0x55a7abc40618]

                                                 18:
                                                (rocksdb::DBImpl::BackgroundCompaction(bool*,
                                                rocksdb::JobContext*,
                                                rocksdb::LogBuffer*,
                                                rocksdb::DBImpl::PrepickedCompaction*)+0xcb7)
                                                [0x55a7aba7fb67]

                                                 19:
(rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*,
rocksdb::Env::Priority)+0xd0) [0x55a7aba813c0]

                                                 20:
                                                (rocksdb::DBImpl::BGWorkCompaction(void*)+0x3a)
                                                [0x55a7aba8190a]

                                                 21:
                                                (rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned
                                                long)+0x264)
                                                [0x55a7abc8d9c4]

                                                 22:
                                                (rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*)+0x4f)
                                                [0x55a7abc8db4f]

                                                 23: (()+0x129dfff)
                                                [0x55a7abd1afff]

                                                 24: (()+0x7dd5)
                                                [0x7f5e50bcfdd5]

                                                 25: (clone()+0x6d)
                                                [0x7f5e4fa95ead]

                                                 NOTE: a copy of the
                                                executable, or `objdump
                                                -rdS <executable>`
                                                is needed to interpret
                                                this.

                                        _______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

                                      _______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com