Sudden loss of all SSD OSDs in a cluster, immediate abort on restart [Mimic 13.2.6]

I've opened a tracker issue at https://tracker.ceph.com/issues/41240

Background: a cluster of 13 hosts, 5 of which contain 14 SSD OSDs between them; there are 409 HDD OSDs as well.

The SSDs contain the RGW index and log pools, plus some smaller pools.
The HDDs contain all other pools, including the RGW data pool.

The RGW instance contains just over 1 billion objects across about 65k buckets. I don't know of any action on the cluster that would have caused this. There have been no changes to the CRUSH map in months, although HDDs were added a couple of weeks ago and backfilling from that is still in progress, now in the home stretch.

I don't know what I can do at this point, though the stack trace suggests the osdmap stored on these OSDs is wrong and/or corrupted. A log excerpt from the crash is included below; all of the OSD logs I checked look very similar.
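In case it is useful, this is roughly how I plan to check whether the map stored on one of the down OSDs differs from what the monitors have, once the daemon is stopped. I have not verified this procedure; osd.46, <epoch> and the /tmp paths are just placeholders for my setup:

# Dump the OSDMap stored on the stopped OSD (if this aborts while decoding,
# that by itself would point at a corrupt on-disk copy):
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-46 --op get-osdmap --file /tmp/osdmap.osd46
# Print it to find its epoch, then fetch the same epoch from the monitors:
osdmaptool --print /tmp/osdmap.osd46
ceph osd getmap <epoch> -o /tmp/osdmap.mon
# Compare the printable forms (the raw files may not be byte-identical even
# when healthy, so this is only a rough check):
osdmaptool --print /tmp/osdmap.mon > /tmp/osdmap.mon.txt
osdmaptool --print /tmp/osdmap.osd46 > /tmp/osdmap.osd46.txt
diff /tmp/osdmap.osd46.txt /tmp/osdmap.mon.txt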




2019-08-13 18:09:52.913 7f76484e9d80  4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/rocksdb/db/version_set.cc:3362] Recovered from manifest file:db/MANIFEST-245361 succeeded,manifest_file_number is 245361, next_file_number is 245364, last_sequence is 6066685646, log_number is 0,prev_log_number is 0,max_column_family is 0,deleted_log_number is 245359
2019-08-13 18:09:52.913 7f76484e9d80  4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/rocksdb/db/version_set.cc:3370] Column family [default] (ID 0), log number is 245360
2019-08-13 18:09:52.918 7f76484e9d80  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1565719792920682, "job": 1, "event": "recovery_started", "log_files": [245362]}
2019-08-13 18:09:52.918 7f76484e9d80  4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/rocksdb/db/db_impl_open.cc:551] Recovering log #245362 mode 0
2019-08-13 18:09:52.919 7f76484e9d80  4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/rocksdb/db/version_set.cc:2863] Creating manifest 245364
2019-08-13 18:09:52.933 7f76484e9d80  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1565719792935329, "job": 1, "event": "recovery_finished"}
2019-08-13 18:09:52.951 7f76484e9d80  4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/rocksdb/db/db_impl_open.cc:1218] DB pointer 0x56445a6c8000
2019-08-13 18:09:52.951 7f76484e9d80  1 bluestore(/var/lib/ceph/osd/ceph-46) _open_db opened rocksdb path db options compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152
2019-08-13 18:09:52.964 7f76484e9d80  1 freelist init
2019-08-13 18:09:52.976 7f76484e9d80  1 bluestore(/var/lib/ceph/osd/ceph-46) _open_alloc opening allocation metadata
2019-08-13 18:09:53.119 7f76484e9d80  1 bluestore(/var/lib/ceph/osd/ceph-46) _open_alloc loaded 926 GiB in 13292 extents
2019-08-13 18:09:53.133 7f76484e9d80 -1 *** Caught signal (Aborted) **
 in thread 7f76484e9d80 thread_name:ceph-osd

ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
 1: (()+0xf5d0) [0x7f763c4455d0]
 2: (gsignal()+0x37) [0x7f763b466207]
 3: (abort()+0x148) [0x7f763b4678f8]
 4: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f763bd757d5]
 5: (()+0x5e746) [0x7f763bd73746]
 6: (()+0x5e773) [0x7f763bd73773]
 7: (__cxa_rethrow()+0x49) [0x7f763bd739e9]
 8: (CrushWrapper::decode(ceph::buffer::list::iterator&)+0x18b8) [0x7f763fcb48d8]
 9: (OSDMap::decode(ceph::buffer::list::iterator&)+0x4ad) [0x7f763fa924ad]
 10: (OSDMap::decode(ceph::buffer::list&)+0x31) [0x7f763fa94db1]
 11: (OSDService::try_get_map(unsigned int)+0x4f8) [0x5644576e1e08]
 12: (OSDService::get_map(unsigned int)+0x1e) [0x564457743dae]
 13: (OSD::init()+0x1d32) [0x5644576ef982]
 14: (main()+0x23a3) [0x5644575cc7a3]
 15: (__libc_start_main()+0xf5) [0x7f763b4523d5]
 16: (()+0x385900) [0x5644576a4900]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
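The crash is in OSD::init while decoding the CRUSH map inside an OSDMap it is loading from its own store. If it does turn out that the stored map is corrupt, my tentative understanding (a sketch only, not something I have tried) is that a known-good copy of the failing epoch can be written back into the stopped OSD along these lines, again with osd.46 and <epoch> as placeholders:

# Fetch a good copy of the failing epoch from the monitors:
ceph osd getmap <epoch> -o /tmp/osdmap.good
# Write it into the stopped OSD's store, replacing the corrupt copy:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-46 --op set-osdmap --file /tmp/osdmap.good

I'd want confirmation from someone who has actually done this before trying it on all 14 OSDs, so any advice is appreciated.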


