Re: ceph-osd crash

Hi. I just wanted to add a little more info, as it might be related.
I'm running Luminous and I've had my OSDs disappear a few times too.
I've been able to restart them by rebooting, or by restarting them
via the /etc/init.d/ceph script. At the time I did not know about the
/var/log/ceph/ceph-osd.*.log files, so I can't provide more info; I'll
upload them the next time I can isolate this. The missing daemons seem
to favor a specific machine, even though I've seen the problem on
multiple different hosts. These are freshly installed Ubuntu 16.04 LTS
boxes with 4GB of memory, running OSDs only. I saw OOM errors on a few
other OSD hosts, so I thought it might be related to that, but I don't
know whether there were OOM errors when the daemon died.
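
In case it helps, this is roughly what I run when an OSD drops out;
the OSD id and log paths below are just examples from my setup, so
treat it as a sketch rather than a recipe:

    # see which OSDs the cluster currently considers down
    ceph osd tree | grep down

    # restart a single OSD (systemd unit on 16.04; the old sysvinit
    # script still works for me as well)
    systemctl restart ceph-osd@16
    # or: /etc/init.d/ceph restart osd.16

    # check the daemon's own log for the reason it died
    tail -n 200 /var/log/ceph/ceph-osd.16.log

    # check whether the kernel OOM killer was involved
    dmesg -T | grep -iE 'out of memory|oom'
    grep -i oom /var/log/kern.log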

On Wed, Sep 6, 2017 at 7:34 PM, Wyllys Ingersoll
<wyllys.ingersoll@xxxxxxxxxxxxxx> wrote:
> Jewel 10.2.7 on Ubuntu 16.04.2,
>
> I have an OSD that keeps going down; these are the messages in the
> log.  Is this a known bug?
>
>     -3> 2017-09-06 22:33:14.509181 7f7e18429700  5 osd.16 pg_epoch:
> 107818 pg[9.cd( v 35684'19200 (35250'16102,35684'19200]
> local-les=107801 n=17 ec=1076 les/c/f 107801/107802/0
> 107816/107818/107818) [16,1,65] r=0 lpr=107818 pi=107484-107817/30
> crt=35684'19200 lcod 0'0 mlcod 0'0 peering] exit
> Started/Primary/Peering/GetMissing 0.000008 0 0.000000
>     -2> 2017-09-06 22:33:14.509193 7f7e18429700  5 osd.16 pg_epoch:
> 107818 pg[9.cd( v 35684'19200 (35250'16102,35684'19200]
> local-les=107801 n=17 ec=1076 les/c/f 107801/107802/0
> 107816/107818/107818) [16,1,65] r=0 lpr=107818 pi=107484-107817/30
> crt=35684'19200 lcod 0'0 mlcod 0'0 peering] enter
> Started/Primary/Peering/WaitUpThru
>     -1> 2017-09-06 22:33:14.525481 7f7e9cb37700  1 leveldb: Level-0
> table #803535: 2374544 bytes OK
>      0> 2017-09-06 22:33:14.526739 7f7e19c2c700 -1 *** Caught signal
> (Aborted) **
>  in thread 7f7e19c2c700 thread_name:tp_osd
>
>  ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
>  1: (()+0x9770ae) [0x56326b73d0ae]
>  2: (()+0x11390) [0x7f7ea9c64390]
>  3: (gsignal()+0x38) [0x7f7ea7c02428]
>  4: (abort()+0x16a) [0x7f7ea7c0402a]
>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x26b) [0x56326b83d54b]
>  6: (PGLog::merge_log(ObjectStore::Transaction&, pg_info_t&,
> pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&,
> bool&)+0x1a12) [0x56326b3f5ff2]
>  7: (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&,
> pg_shard_t)+0xcd) [0x56326b2034ad]
>  8: (PG::proc_master_log(ObjectStore::Transaction&, pg_info_t&,
> pg_log_t&, pg_missing_t&, pg_shard_t)+0xc8) [0x56326b20d948]
>  9: (PG::RecoveryState::GetLog::react(PG::RecoveryState::GotLog
> const&)+0x1e7) [0x56326b22faf7]
>  10: (boost::statechart::simple_state<PG::RecoveryState::GetLog,
> PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na,
> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
> mpl_::na, mpl_::na, mpl_::na, mpl_::na>,
> (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base
> const&, void const*)+0x1fe) [0x56326b271d1e]
>  11: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine,
> PG::RecoveryState::Initial, std::allocator<void>,
> boost::statechart::null_exception_translator>::process_queued_events()+0x131)
> [0x56326b24f891]
>  12: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine,
> PG::RecoveryState::Initial, std::allocator<void>,
> boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base
> const&)+0xee) [0x56326b24fdce]
>  13: (PG::handle_peering_event(std::shared_ptr<PG::CephPeeringEvt>,
> PG::RecoveryCtx*)+0x395) [0x56326b223025]
>  14: (OSD::process_peering_events(std::__cxx11::list<PG*,
> std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x2d4)
> [0x56326b16fdf4]
>  15: (ThreadPool::BatchWorkQueue<PG>::_void_process(void*,
> ThreadPool::TPHandle&)+0x25) [0x56326b1b88e5]
>  16: (ThreadPool::worker(ThreadPool::WorkThread*)+0xdb1) [0x56326b82f531]
>  17: (ThreadPool::WorkThread::entry()+0x10) [0x56326b830630]
>  18: (()+0x76ba) [0x7f7ea9c5a6ba]
>  19: (clone()+0x6d) [0x7f7ea7cd382d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> needed to interpret this.
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


