Hi list,
While bringing a new storage server online we observed that a whole bunch of the SSD OSDs we use for metadata went offline, and they now crash every time they try to restart, with an abort signal in OSDMap::decode - brief log below.
We have seen this at least once in the past, and I suspect it might be related to high load (?) on the servers when lots of PGs are peering and/or a large amount of backfilling is happening. In that case it was only a single disk, so we "fixed" it by just recreating that OSD - but this time we need to get them working again to avoid losing metadata :-)
Based on previous posts to the mailing list and the bug tracker, I would guess this might be due to a corrupt osdmap on these OSDs.
Should we try to replace the osdmap, and if so, how do we do that for bluestore OSDs? A rough sketch of what I think the procedure looks like is below - corrections welcome.
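From what I understand of earlier threads (so please correct me if this is wrong), the idea is to do it offline with ceph-objectstore-tool while the OSD is stopped: pull a known-good copy of the suspect epoch from the monitors and inject it into the OSD's store. Something along these lines, where <epoch> is a placeholder for the corrupt map epoch and ceph-126 is just the OSD from the log above (I'm not certain whether --force is required for set-osdmap on mimic):

# stop the affected OSD first
systemctl stop ceph-osd@126

# fetch a known-good copy of that epoch from the monitors
ceph osd getmap <epoch> -o /tmp/osdmap.<epoch>

# optionally dump the copy the OSD currently has, to compare
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-126 --op get-osdmap --file /tmp/osdmap.broken

# inject the good copy into the OSD's store, then try starting it again
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-126 --op set-osdmap --file /tmp/osdmap.<epoch>

Is that roughly the right procedure for bluestore, or is there more to it (incremental maps etc.)?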
Cheers,
Erik
2019-04-26 17:56:08.123 7f4f2956ae00 4 rocksdb: [/build/ceph-13.2.5/src/rocksdb/db/version_set.cc:3362] Recovered from manifest file:db/MANIFEST-001493 succeeded,manifest_file_number is 1493, next_file_number is 1496, last_sequence is 45904669, log_number is 0,prev_log_number is 0,max_column_family is 0,deleted_log_number is 1491
2019-04-26 17:56:08.123 7f4f2956ae00 4 rocksdb: [/build/ceph-13.2.5/src/rocksdb/db/version_set.cc:3370] Column family [default] (ID 0), log number is 1492
2019-04-26 17:56:08.123 7f4f2956ae00 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1556294168125624, "job": 1, "event": "recovery_started", "log_files": [1494]}
2019-04-26 17:56:08.123 7f4f2956ae00 4 rocksdb: [/build/ceph-13.2.5/src/rocksdb/db/db_impl_open.cc:551] Recovering log #1494 mode 0
2019-04-26 17:56:08.123 7f4f2956ae00 4 rocksdb: [/build/ceph-13.2.5/src/rocksdb/db/version_set.cc:2863] Creating manifest 1496
2019-04-26 17:56:08.123 7f4f2956ae00 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1556294168126875, "job": 1, "event": "recovery_finished"}
2019-04-26 17:56:08.127 7f4f2956ae00 4 rocksdb: [/build/ceph-13.2.5/src/rocksdb/db/db_impl_open.cc:1218] DB pointer 0x5634c2f60000
2019-04-26 17:56:08.127 7f4f2956ae00 1 bluestore(/var/lib/ceph/osd/ceph-126) _open_db opened rocksdb path db options compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152
2019-04-26 17:56:08.135 7f4f2956ae00 1 freelist init
2019-04-26 17:56:08.143 7f4f2956ae00 1 bluestore(/var/lib/ceph/osd/ceph-126) _open_alloc opening allocation metadata
2019-04-26 17:56:08.147 7f4f2956ae00 1 bluestore(/var/lib/ceph/osd/ceph-126) _open_alloc loaded 223 GiB in 233 extents
2019-04-26 17:56:08.151 7f4f2956ae00 -1 *** Caught signal (Aborted) **
in thread 7f4f2956ae00 thread_name:ceph-osd
ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)
1: (()+0x92b730) [0x5634c0151730]
2: (()+0x12890) [0x7f4f1f02b890]
3: (gsignal()+0xc7) [0x7f4f1df06e97]
4: (abort()+0x141) [0x7f4f1df08801]
5: (()+0x8c8b7) [0x7f4f1e8fb8b7]
6: (()+0x92a06) [0x7f4f1e901a06]
7: (()+0x92a41) [0x7f4f1e901a41]
8: (()+0x92c74) [0x7f4f1e901c74]
9: (OSDMap::decode(ceph::buffer::list::iterator&)+0x1864) [0x7f4f20aff694]
10: (OSDMap::decode(ceph::buffer::list&)+0x31) [0x7f4f20b00af1]
11: (OSDService::try_get_map(unsigned int)+0x508) [0x5634bfbf73a8]
12: (OSDService::get_map(unsigned int)+0x1e) [0x5634bfc56ffe]
13: (OSD::init()+0x1d5f) [0x5634bfc048ef]
14: (main()+0x383d) [0x5634bfaef8cd]
15: (__libc_start_main()+0xe7) [0x7f4f1dee9b97]
16: (_start()+0x2a) [0x5634bfbb97aa]