I've opened a tracker issue at https://tracker.ceph.com/issues/41240
Background: The cluster has 13 hosts, 5 of which contain 14 SSD OSDs
between them. There are 409 HDDs as well.
The SSDs contain the RGW index and log pools, and some smaller pools.
The HDDs contain all other pools, including the RGW data pool.
The RGW instance contains just over 1 billion objects across about 65k
buckets. I don't know of any action on the cluster that would have
caused this. There have been no changes to the crush map in months, but
HDDs were added a couple of weeks ago, and backfilling is still in
progress, though in the home stretch.
I don't know what I can do at this point, though the backtrace (an
exception rethrown in CrushWrapper::decode, called from OSDMap::decode)
points to the osdmap on these OSDs being wrong and/or corrupted. A log
excerpt from the crash is included below. All of the OSD logs I checked
look very similar.
2019-08-13 18:09:52.913 7f76484e9d80 4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/rocksdb/db/version_set.cc:3362] Recovered from manifest file:db/MANIFEST-245361 succeeded,manifest_file_number is 245361, next_file_number is 245364, last_sequence is 6066685646, log_number is 0,prev_log_number is 0,max_column_family is 0,deleted_log_number is 245359
2019-08-13 18:09:52.913 7f76484e9d80 4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/rocksdb/db/version_set.cc:3370] Column family [default] (ID 0), log number is 245360
2019-08-13 18:09:52.918 7f76484e9d80 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1565719792920682, "job": 1, "event": "recovery_started", "log_files": [245362]}
2019-08-13 18:09:52.918 7f76484e9d80 4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/rocksdb/db/db_impl_open.cc:551] Recovering log #245362 mode 0
2019-08-13 18:09:52.919 7f76484e9d80 4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/rocksdb/db/version_set.cc:2863] Creating manifest 245364
2019-08-13 18:09:52.933 7f76484e9d80 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1565719792935329, "job": 1, "event": "recovery_finished"}
2019-08-13 18:09:52.951 7f76484e9d80 4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/rocksdb/db/db_impl_open.cc:1218] DB pointer 0x56445a6c8000
2019-08-13 18:09:52.951 7f76484e9d80 1 bluestore(/var/lib/ceph/osd/ceph-46) _open_db opened rocksdb path db options compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152
2019-08-13 18:09:52.964 7f76484e9d80 1 freelist init
2019-08-13 18:09:52.976 7f76484e9d80 1 bluestore(/var/lib/ceph/osd/ceph-46) _open_alloc opening allocation metadata
2019-08-13 18:09:53.119 7f76484e9d80 1 bluestore(/var/lib/ceph/osd/ceph-46) _open_alloc loaded 926 GiB in 13292 extents
2019-08-13 18:09:53.133 7f76484e9d80 -1 *** Caught signal (Aborted) **
in thread 7f76484e9d80 thread_name:ceph-osd
ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
1: (()+0xf5d0) [0x7f763c4455d0]
2: (gsignal()+0x37) [0x7f763b466207]
3: (abort()+0x148) [0x7f763b4678f8]
4: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f763bd757d5]
5: (()+0x5e746) [0x7f763bd73746]
6: (()+0x5e773) [0x7f763bd73773]
7: (__cxa_rethrow()+0x49) [0x7f763bd739e9]
8: (CrushWrapper::decode(ceph::buffer::list::iterator&)+0x18b8) [0x7f763fcb48d8]
9: (OSDMap::decode(ceph::buffer::list::iterator&)+0x4ad) [0x7f763fa924ad]
10: (OSDMap::decode(ceph::buffer::list&)+0x31) [0x7f763fa94db1]
11: (OSDService::try_get_map(unsigned int)+0x4f8) [0x5644576e1e08]
12: (OSDService::get_map(unsigned int)+0x1e) [0x564457743dae]
13: (OSD::init()+0x1d32) [0x5644576ef982]
14: (main()+0x23a3) [0x5644575cc7a3]
15: (__libc_start_main()+0xf5) [0x7f763b4523d5]
16: (()+0x385900) [0x5644576a4900]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
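
For anyone who wants to check mechanically whether several OSD logs show the same crash, here is a minimal Python sketch. It is only an illustration: the function name `backtrace_signature` is made up for this example, and it assumes the backtrace frames look like the ones above ("N: (symbol+0xoffset) [0xaddress]", with the address possibly wrapped onto the next line). It drops offsets and addresses and keeps only the symbol names, so two logs can be compared frame by frame.

```python
import re

# Matches frames such as " 8: (CrushWrapper::decode(...)+0x18b8) [0x7f...]".
# The trailing address is optional because mail clients may wrap it
# onto the following line.
FRAME_RE = re.compile(
    r"^\s*(\d+):\s*\((.*)\+0x[0-9a-f]+\)\s*(?:\[0x[0-9a-f]+\])?\s*$"
)


def backtrace_signature(log_text):
    """Return the symbol names of all backtrace frames found in log_text.

    Offsets and load addresses are ignored. Frames without a resolved
    symbol show up as "()".
    """
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.match(line)
        if match:
            frames.append(match.group(2))
    return frames
```

Two logs then show the same call chain when `backtrace_signature(text_a) == backtrace_signature(text_b)`. Reading the log files themselves (for example from /var/log/ceph/) is left out, since paths differ between deployments.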