Re: [Jewel 10.2.11] OSD Segmentation fault

ceph@xxxxxxxxxx · Mon, 13 Aug 2018 07:23:56 +0200

On 3 August 2018 12:03:17 CEST, Alexandru Cucu <me@xxxxxxxxxxx> wrote:
>Hello,
>

Hello Alex,

>Another OSD started randomly crashing with segmentation fault. Haven't
>managed to add the last 3 OSDs back to the cluster as the daemons keep
>crashing.
>

One idea would be to remove the OSDs completely from the cluster and add them again after zapping the disks.
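As a rough sketch of what I mean, the usual manual removal sequence on Jewel looks something like the following. It only builds and prints the commands, it does not run anything. The OSD id 154 is taken from your log; the device path /dev/sdX and the helper name osd_removal_commands are placeholders I made up for illustration. This assumes the affected PGs still have healthy copies on other OSDs, because zapping destroys everything on the disk.

```python
import shlex
from typing import List


def osd_removal_commands(osd_id: int, device: str) -> List[List[str]]:
    """Return, in order, the commands to retire one OSD and wipe its disk.

    Hypothetical helper: nothing is executed here, the caller decides
    whether to run the commands.
    """
    name = "osd.{}".format(osd_id)
    return [
        ["ceph", "osd", "out", str(osd_id)],                  # stop placing data on it
        ["systemctl", "stop", "ceph-osd@{}".format(osd_id)],  # make sure the daemon is down
        ["ceph", "osd", "crush", "remove", name],             # drop it from the CRUSH map
        ["ceph", "auth", "del", name],                        # delete its cephx key
        ["ceph", "osd", "rm", str(osd_id)],                   # free the OSD id
        ["ceph-disk", "zap", device],                         # DESTROYS all data on the device
    ]


if __name__ == "__main__":
    # /dev/sdX is a placeholder, replace it with the real data disk.
    for cmd in osd_removal_commands(154, "/dev/sdX"):
        print(shlex.join(cmd))
```

Review the printed commands before running any of them by hand, one OSD at a time, and let the cluster settle in between.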

HTH
- Mehmet

>---
>
>    -2> 2018-08-03 12:12:52.670076 7f12b6b15700  4 rocksdb:
>EVENT_LOG_v1 {"time_micros": 1533287572670073, "job": 3, "event":
>"table_file_deletion", "file_number": 4350}
>  -1> 2018-08-03 12:12:53.146753 7f12c38d0a80  0 osd.154 89917 load_pgs
>     0> 2018-08-03 12:12:57.526910 7f12c38d0a80 -1 *** Caught signal
>(Segmentation fault) **
> in thread 7f12c38d0a80 thread_name:ceph-osd
> ceph version 10.2.11 (e4b061b47f07f583c92a050d9e84b1813a35671e)
> 1: (()+0x9f1c2a) [0x7f12c42ddc2a]
> 2: (()+0xf5e0) [0x7f12c1dc85e0]
> 3: (()+0x34484) [0x7f12c34a6484]
> 4: (rocksdb::BlockBasedTable::NewIndexIterator(rocksdb::ReadOptions
>const&, rocksdb::BlockIter*,
>rocksdb::BlockBasedTable::CachableEntry<rocksdb::BlockBasedTable::IndexReader>*)+0x466)
>[0x7f12c41e40d6]
> 5: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&,
>rocksdb::Slice const&, rocksdb::GetContext*, bool)+0x297)
>[0x7f12c41e4b27]
> 6: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&,
>rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor const&,
>rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::HistogramImpl*,
>bool, int)+0x2a4) [0x7f12c429ff94]
> 7: (rocksdb::Version::Get(rocksdb::ReadOptions const&,
>rocksdb::LookupKey const&, rocksdb::PinnableSlice*, rocksdb::Status*,
>rocksdb::MergeContext*, rocksdb::RangeDelAggregator*, bool*, bool*,
>unsigned long*)+0x810) [0x7f12c419bb80]
> 8: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&,
>rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&,
>rocksdb::PinnableSlice*, bool*)+0x5a4) [0x7f12c424e494]
> 9: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&,
>rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&,
>rocksdb::PinnableSlice*)+0x19) [0x7f12c424ea19]
> 10: (rocksdb::DB::Get(rocksdb::ReadOptions const&,
>rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&,
>std::string*)+0x95) [0x7f12c4252a45]
> 11: (rocksdb::DB::Get(rocksdb::ReadOptions const&, rocksdb::Slice
>const&, std::string*)+0x4a) [0x7f12c4251eea]
> 12: (RocksDBStore::get(std::string const&, std::string const&,
>ceph::buffer::list*)+0xff) [0x7f12c415c31f]
> 13: (DBObjectMap::_lookup_map_header(DBObjectMap::MapHeaderLock
>const&, ghobject_t const&)+0x5e4) [0x7f12c4110814]
> 14: (DBObjectMap::get_values(ghobject_t const&, std::set<std::string,
>std::less<std::string>, std::allocator<std::string> > const&,
>std::map<std::string, ceph::buffer::list, std::less<std::string>,
>std::allocator<std::pair<std::string const, ceph::buffer::list> > >*)+0x5f)
>[0x7f12c411111f]
> 15: (FileStore::omap_get_values(coll_t const&, ghobject_t const&,
>std::set<std::string, std::less<std::string>,
>std::allocator<std::string> > const&, std::map<std::string,
>ceph::buffer::list, std::less<std::string>,
>std::allocator<std::pair<std::string const,
>ceph::buffer::list> > >*)+0x197) [0x7f12c4031f77]
> 16: (PG::_has_removal_flag(ObjectStore*, spg_t)+0x151) [0x7f12c3d8f7c1]
> 17: (OSD::load_pgs()+0x5d5) [0x7f12c3cf43e5]
> 18: (OSD::init()+0x2086) [0x7f12c3d07096]
> 19: (main()+0x2c18) [0x7f12c3c1e088]
> 20: (__libc_start_main()+0xf5) [0x7f12c0374c05]
> 21: (()+0x3c8847) [0x7f12c3cb4847]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>needed to interpret this.
>---
>
>Any help would be appreciated.
>
>Thanks,
>Alex Cucu
>
>On Mon, Jul 30, 2018 at 4:55 PM Alexandru Cucu <me@xxxxxxxxxxx> wrote:
>>
>> Hello Ceph users,
>>
>> We have updated our cluster from 10.2.7 to 10.2.11. A few hours after
>> the update, 1 OSD crashed.
>> When trying to add the OSD back to the cluster, 2 other OSDs started
>> crashing with segmentation faults. We had to mark all 3 OSDs as down, as
>> we had stuck PGs and blocked operations and the cluster status was
>> HEALTH_ERR.
>>
>> We have tried various ways to re-add the OSDs to the cluster, but
>> after a while they start crashing and won't start anymore. After a
>> while they can be started again and marked as in, but after some
>> rebalancing they start crashing immediately after starting.
>>
>> Here are some logs:
>> https://pastebin.com/nCRamgRU
>>
>> Do you know of any existing bug report that might be related? (I
>> couldn't find anything).
>>
>> I will happily provide any information that would help solve this
>> issue.
>>
>> Thank you,
>> Alex Cucu
>_______________________________________________
>ceph-users mailing list
>ceph-users@xxxxxxxxxxxxxx
>http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com