Hi David,
this issue looks like the one reported here:
http://tracker.ceph.com/issues/37282
Could you please comment there? That will help raise the bug's priority.
Besides that, could you please share the disk layout for this specific OSD:
what volumes do you have (main, DB, WAL), and which drives are behind them?
Did you run 'ceph-bluestore-tool fsck' for this specific OSD?
Also how did you make sure these drives don't have HW defects?
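In case it helps, here is a rough sketch of the checks I have in mind. The OSD id (12), the drive name (/dev/sdb) and the default data path under /var/lib/ceph/osd are placeholders and assumptions, not taken from your report, so please adjust them. The helper name print_osd_checks is made up for illustration; it only prints the commands so you can review them before running anything.

```shell
#!/bin/sh
# Print (not run) the checks for one OSD so they can be reviewed first.
# Arguments are placeholders: $1 = OSD id, $2 = a drive backing that OSD.
print_osd_checks() {
    osd_id="$1"
    dev="$2"
    # Assumed default data path; adjust if your deployment differs.
    osd_path="/var/lib/ceph/osd/ceph-${osd_id}"

    # Layout: block, block.db and block.wal are symlinks to the backing
    # devices (block.db / block.wal exist only with separate DB/WAL volumes).
    echo "ls -l ${osd_path}/"
    echo "ceph-bluestore-tool show-label --path ${osd_path}"

    # Consistency check: the OSD must be stopped while fsck runs.
    echo "systemctl stop ceph-osd@${osd_id}"
    echo "ceph-bluestore-tool fsck --path ${osd_path}"

    # Hardware: SMART data for the drive and kernel messages about I/O errors.
    echo "smartctl -a ${dev}"
    echo "dmesg -T | grep -i -E 'i/o error|medium error'"
}

print_osd_checks 12 /dev/sdb
```

The output of the first two commands answers the layout question, and the last two are only a starting point for ruling out hardware problems.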
Thanks,
Igor
On 2/1/2019 2:45 PM, David Sieger wrote:
Hi everyone,
I am facing an OSD in a crash loop, and following the troubleshooting
procedure leads me to contact the mailing list. (As far as I can
tell, this is neither a hardware nor a configuration issue. The system
has been running in this setup for months. Several other OSDs running on
the same host are fine.) Also, I have one PG that is currently marked
inconsistent. This PG is not mentioned in the log file, though.
The OSD in question crashes about 15 to 20 seconds after starting up.
The logged reasons for the crash are, as far as I can tell:
-1> 2019-02-01 12:22:46.111821 7fe079d53700 -1 rocksdb:
submit_transaction error: Corruption: block checksum mismatch code = 2
Rocksdb transaction:
Put( Prefix = O key =
0x7f80000000000000021600000021213dfffffffffffffffeffffffffffffffff'o'
Value size = 30)
0> 2019-02-01 12:22:46.117761 7fe079d53700 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.9/rpm/el7/BUILD/ceph-12.2.9/src/os/bluestore/BlueStore.cc:
In function 'void BlueStore::_kv_sync_thread()' thread 7fe079d53700 time
2019-02-01 12:22:46.111884
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.9/rpm/el7/BUILD/ceph-12.2.9/src/os/bluestore/BlueStore.cc:
8717: FAILED assert(r == 0)
ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217) luminous
(stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x110) [0x562af51e5e90]
2: (BlueStore::_kv_sync_thread()+0x3482) [0x562af5090162]
3: (BlueStore::KVSyncThread::entry()+0xd) [0x562af50d701d]
4: (()+0x7e25) [0x7fe089e12e25]
5: (clone()+0x6d) [0x7fe088f03bad]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
I have a full log of the crash cycle available, if it helps.
Is there anything I can do to fix this? Is this a bug that I should
report somewhere else?
David Sieger