Mimic (13.2.6) OSD daemon won't start up after system restart, with failed assert...

aoanla@xxxxxxxxx · Thu, 21 Nov 2019 16:57:42 -0000

Hi everyone, 

I'm looking for some advice on diagnosing an OSD issue. 

We have a Mimic (13.2.6) cluster with BlueStore OSDs; it is not very full.

We recently had to bring the cluster down to allow power testing in the host datacentre, and when we brought things up again, one OSD daemon would not start.

The log shows (cut to useful context):

  -314> 2019-11-21 15:55:15.561 7efdc049dd80  4 rocksdb:                               Options.ttl: 0
  -314> 2019-11-21 15:55:15.563 7efdc049dd80  4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/rocksdb/db/version_set.cc:3362] Recovered from manifest file:db/MANIFEST-000127 succeeded,manifest_file_number is 127, next_file_number is 264, last_sequence is 21956004, log_number is 0,prev_log_number is 0,max_column_family is 0,deleted_log_number is 123

  -314> 2019-11-21 15:55:15.563 7efdc049dd80  4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/rocksdb/db/version_set.cc:3370] Column family [default] (ID 0), log number is 255

  -314> 2019-11-21 15:55:15.563 7efdc049dd80  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1574351715564768, "job": 1, "event": "recovery_started", "log_files": [252, 255]}
  -314> 2019-11-21 15:55:15.563 7efdc049dd80  4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/rocksdb/db/db_impl_open.cc:551] Recovering log #252 mode 0
  -314> 2019-11-21 15:55:16.722 7efdc049dd80  4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/rocksdb/db/db_impl_open.cc:551] Recovering log #255 mode 0
  -314> 2019-11-21 15:55:17.885 7efdc049dd80 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/os/bluestore/KernelDevice.cc: In function 'virtual int KernelDevice::read(uint64_t, uint64_t, ceph::bufferlist*, IOContext*, bool)' thread 7efdc049dd80 time 2019-11-21 15:55:17.870632
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/os/bluestore/KernelDevice.cc: 825: FAILED assert((uint64_t)r == len)

 ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14b) [0x7efdb788036b]
 2: (()+0x26e4f7) [0x7efdb78804f7]
 3: (KernelDevice::read(unsigned long, unsigned long, ceph::buffer::list*, IOContext*, bool)+0x4b4) [0x5619ab313144]
 4: (BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*, unsigned long, unsigned long, ceph::buffer::list*, char*)+0x3c2) [0x5619ab2d59a2]
 5: (BlueRocksSequentialFile::Read(unsigned long, rocksdb::Slice*, char*)+0x34) [0x5619ab2f88f4]
 6: (rocksdb::SequentialFileReader::Read(unsigned long, rocksdb::Slice*, char*)+0x6b) [0x5619ab4e541b]
 7: (rocksdb::log::Reader::ReadMore(unsigned long*, int*)+0xd8) [0x5619ab3f3148]
 8: (rocksdb::log::Reader::ReadPhysicalRecord(rocksdb::Slice*, unsigned long*)+0x70) [0x5619ab3f3240]
 9: (rocksdb::log::Reader::ReadRecord(rocksdb::Slice*, std::string*, rocksdb::WALRecoveryMode)+0x12b) [0x5619ab3f351b]
 10: (rocksdb::DBImpl::RecoverLogFiles(std::vector<unsigned long, std::allocator<unsigned long> > const&, unsigned long*, bool)+0xea2) [0x5619ab3a3bf2]
 11: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool)+0xa59) [0x5619ab3a54e9]
 12: (rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::string const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**, bool)+0x689) [0x5619ab3a6299]
 13: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::string const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0x22) [0x5619ab3a7ac2]
 14: (RocksDBStore::do_open(std::ostream&, bool, std::vector<KeyValueDB::ColumnFamily, std::allocator<KeyValueDB::ColumnFamily> > const*)+0x164e) [0x5619ab27a43e]
 15: (BlueStore::_open_db(bool, bool)+0xd6a) [0x5619ab205f9a]
 16: (BlueStore::_mount(bool, bool)+0x4d1) [0x5619ab237071]
 17: (OSD::init()+0x28f) [0x5619aaddeedf]
 18: (main()+0x23a3) [0x5619aacbd7a3]
 19: (__libc_start_main()+0xf5) [0x7efdb33f2505]
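
In case it helps anyone reproduce the "cut to useful context" step on their own OSD log, here is a rough Python sketch that pulls out the RocksDB EVENT_LOG_v1 records and any FAILED assert lines. The function name summarize_osd_log and both regular expressions are my own guesses based only on the lines quoted above, not a documented Ceph log format.

```python
import json
import re

# Patterns inferred from the log excerpt in this post (an assumption,
# not an official description of the Ceph log format).
EVENT_RE = re.compile(r"rocksdb: EVENT_LOG_v1 (\{.*\})\s*$")
ASSERT_RE = re.compile(
    r"^(?P<file>\S+): (?P<line>\d+): FAILED assert\((?P<expr>.*)\)\s*$"
)


def summarize_osd_log(lines):
    """Return (events, asserts) found in an iterable of log lines.

    events  -- decoded JSON payloads of rocksdb EVENT_LOG_v1 lines
    asserts -- dicts with the source file name, line number and expression
    """
    events = []
    asserts = []
    for raw in lines:
        line = raw.strip()
        match = EVENT_RE.search(line)
        if match:
            try:
                events.append(json.loads(match.group(1)))
            except json.JSONDecodeError:
                pass  # skip truncated or otherwise malformed payloads
            continue
        match = ASSERT_RE.match(line)
        if match:
            asserts.append(
                {
                    # keep only the file name, not the long build path
                    "file": match.group("file").rsplit("/", 1)[-1],
                    "line": int(match.group("line")),
                    "expr": match.group("expr"),
                }
            )
    return events, asserts
```

Passing an open log file object as `lines` should work, since the function only iterates over it line by line.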

The disk behind this OSD is very new and hasn't been stressed very much, so I am not convinced this is a disk failure.
Is this a known bug in Mimic? I couldn't find a similar report in the bug tracker. If it isn't, how should I go about diagnosing it?
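
My reading of the failed assert, (uint64_t)r == len, is that a read returned a different number of bytes than was requested; I am assuming here that r is the result of the underlying read and len is the requested length. Below is a minimal Python sketch of that same check, which could be pointed at a device or file to see whether a given range reads back in full. The function name check_read and its arguments are my own and not anything from Ceph, and the offset and length the OSD actually asked for are not in the log excerpt above.

```python
import os


def check_read(path, offset, length):
    """Read `length` bytes at `offset` from `path`.

    Returns (complete, bytes_read), where `complete` is True only if the
    read returned exactly `length` bytes. An I/O error from the kernel is
    not caught here and surfaces as an OSError.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        data = os.pread(fd, length, offset)
    finally:
        os.close(fd)
    return len(data) == length, len(data)
```

A result of (False, n) would mean a short read, which is the condition I believe the assert is guarding against. This would need read permission on the device, and I would only try it with the OSD stopped.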

Sam
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx