Re: Crashing OSD: "rocksdb: submit_transaction error: Corruption: block checksum mismatch code"

Hi Igor,
thank you for your response.
I created an account for the bugtracker and will copy the relevant parts
of my email to the ticket.

This specific OSD has its main device on a spinning drive and its
database on a partition of an SSD. The node contains a total of eight
spinning drives and one SSD. The SSD is split into eight partitions,
one per OSD. Since the other seven OSDs on the node work fine, I think
it quite unlikely that there is a hardware issue with the SSD.
I did not check the HDD in depth, but syslog shows no related entries
around the time of the initial failure, the SMART values are fine, and I
can still look around the file system on the disk. None of this is
conclusive, but there is also no further indication of a hardware
failure. If there is something else obvious I should check, I'd be glad
if you let me know.

I had not run "ceph-bluestore-tool fsck" when I wrote my first mail, but
I have in the meantime. (I had not run it earlier because it isn't
listed in the troubleshooting guide. Maybe we should add it?)
The output looks like this:

$ sudo ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-25
2019-02-01 14:00:16.482736 7f39b18cbec0 -1
bluestore(/var/lib/ceph/osd/ceph-25) fsck error: stray shard 0x300000
2019-02-01 14:00:16.482753 7f39b18cbec0 -1
bluestore(/var/lib/ceph/osd/ceph-25) fsck error:
0x7f8000000000000002df7e2d92217262'.0.5a6ca9.238e1f29.0000000082fc!='0xfffffffffffffffeffffffffffffffff6f00300000'x'
is unexpected
2019-02-01 14:00:16.482773 7f39b18cbec0 -1
bluestore(/var/lib/ceph/osd/ceph-25) fsck error: stray shard 0x380000
2019-02-01 14:00:16.482774 7f39b18cbec0 -1
bluestore(/var/lib/ceph/osd/ceph-25) fsck error:
0x7f8000000000000002df7e2d92217262'.0.5a6ca9.238e1f29.0000000082fc!='0xfffffffffffffffeffffffffffffffff6f00380000'x'
is unexpected
2019-02-01 14:00:44.644396 7f39b18cbec0 -1
bluestore(/var/lib/ceph/osd/ceph-25) fsck error: actual
store_statfs(0x49108d0000/0xe8e0c00000, stored
0x9dfe74e1aa/0x9f90320000, compress 0x0/0x0/0x0) != expected
store_statfs(0x49108d0000/0xe8e0c00000, stored
0x9dfe34e1aa/0x9f8ff20000, compress 0x0/0x0/0x0)
2019-02-01 14:00:46.974661 7f39b18cbec0 -1
bluestore(/var/lib/ceph/osd/ceph-25) fsck error: leaked extent
0xb29b0a0000~400000
fsck success
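
A side note on the numbers above: the "actual" and "expected" store_statfs
lines differ only in the two values after "stored", and the difference
looks like it matches the leaked extent. A minimal sketch to check that
arithmetic, assuming the part after "~" in the extent is a hexadecimal
length (the variable and function names are made up for illustration):

```python
import re

# Values copied from the fsck output above, re-joined onto single lines.
ACTUAL = ("store_statfs(0x49108d0000/0xe8e0c00000, stored "
          "0x9dfe74e1aa/0x9f90320000, compress 0x0/0x0/0x0)")
EXPECTED = ("store_statfs(0x49108d0000/0xe8e0c00000, stored "
            "0x9dfe34e1aa/0x9f8ff20000, compress 0x0/0x0/0x0)")
LEAKED = "0xb29b0a0000~400000"


def statfs_fields(text):
    """Return every 0x-prefixed number in the string as an int, in order."""
    return [int(h, 16) for h in re.findall(r"0x([0-9a-fA-F]+)", text)]


def extent_length(extent):
    """Length of an 'offset~length' extent (assumption: length is hex)."""
    _offset, length = extent.split("~")
    return int(length, 16)


# Field-by-field difference between what fsck found and what it expected.
deltas = [a - e for a, e in zip(statfs_fields(ACTUAL), statfs_fields(EXPECTED))]
leak = extent_length(LEAKED)

print("actual - expected per field:", [hex(d) for d in deltas])
print("leaked extent length:", hex(leak), "=", leak // 2**20, "MiB")
```

Both "stored" values come out 0x400000 (4 MiB) higher than expected, which
is the same size as the leaked extent, so those two fsck errors may well
describe the same 4 MiB. That is only a guess from the arithmetic.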

...but the OSD won't start, showing the same error as before:

    -1> 2019-02-01 14:01:11.774223 7f2bc78bb700 -1 rocksdb:
submit_transaction error: Corruption: block checksum mismatch code = 2
Rocksdb transaction:
Put( Prefix = O key =
0x7f8000000000000002c000000021213dfffffffffffffffeffffffffffffffff'o'
Value size = 29)
     0> 2019-02-01 14:01:11.778249 7f2bc78bb700 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.9/rpm/el7/BUILD/ceph-12.2.9/src/os/bluestore/BlueStore.cc:
In function 'void BlueStore::_kv_sync_thread()' thread 7f2bc78bb700 time
2019-02-01 14:01:11.774284
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.9/rpm/el7/BUILD/ceph-12.2.9/src/os/bluestore/BlueStore.cc:
8717: FAILED assert(r == 0)

 ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217) luminous
(stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x110) [0x5635d3657e90]
 2: (BlueStore::_kv_sync_thread()+0x3482) [0x5635d3502162]
 3: (BlueStore::KVSyncThread::entry()+0xd) [0x5635d354901d]
 4: (()+0x7e25) [0x7f2bd797ae25]
 5: (clone()+0x6d) [0x7f2bd6a6bbad]
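
In case it helps with reading the trace: the key in the failing Put can be
picked apart by hand. The sketch below assumes the object-key layout as I
understand it from the comment in BlueStore.cc (one byte shard + 0x80,
eight bytes pool + 2^63, four bytes bit-reversed hash, then '!'-terminated
namespace and name, a '<', '=' or '>' marker, then snap and generation).
Please treat that layout as an assumption; the function names are made up,
and escape sequences inside the strings are not undone.

```python
import struct

# Key bytes of the failing Put, copied from the log above.
KEY = bytes.fromhex(
    "7f" "8000000000000002" "c0000000" "21" "21" "3d"
    "fffffffffffffffe" "ffffffffffffffff"
) + b"o"


def read_escaped(buf, pos):
    """Read up to the '!' terminator; return (text, position after it)."""
    end = buf.index(b"!", pos)
    return buf[pos:end].decode("ascii", "replace"), end + 1


def decode_object_key(key):
    """Split a key into its fields under the layout assumed above."""
    shard, pool, rev_hash = struct.unpack_from(">BQI", key, 0)
    pos = 13  # 1 + 8 + 4 bytes consumed so far
    namespace, pos = read_escaped(key, pos)
    name, pos = read_escaped(key, pos)
    marker = chr(key[pos])
    pos += 1
    if marker != "=":
        # Assumed: the first string was the object key, the name follows.
        name, pos = read_escaped(key, pos)
    snap, generation = struct.unpack_from(">QQ", key, pos)
    return {
        "shard": shard - 0x80,
        "pool": pool - 2**63,
        "hash": int(f"{rev_hash:032b}"[::-1], 2),  # undo the bit reversal
        "namespace": namespace,
        "name": name,
        "snap": snap,
        "generation": generation,
        "suffix": key[pos + 16:],
    }


decoded = decode_object_key(KEY)
for field, value in decoded.items():
    print(field, "=", value)
```

Under those assumptions the key refers to an object in pool 2 with an
empty namespace and an empty name. Feeding in the key from the fsck output
the same way gives the name "rb.0.5a6ca9.238e1f29.0000000082fc", also in
pool 2.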

Anything else I can do?

David Sieger


On 2/1/19 1:18 PM, Igor Fedotov wrote:
> Hi David,
> 
> this issue looks like the one reported here:
> 
> http://tracker.ceph.com/issues/37282
> 
> 
> Could you please comment there? That will raise the bug's priority.
> 
> Besides that, could you please share the disk layout for this specific OSD:
> what volumes do you have (main, DB, WAL)? Which drives stand behind them?
> 
> Did you run 'ceph-bluestore-tool fsck' for this specific OSD?
> 
> Also how did you make sure these drives don't have HW defects?
> 
> 
> Thanks,
> 
> Igor
> 
> On 2/1/2019 2:45 PM, David Sieger wrote:
>> Hi everyone,
>> I am facing an OSD in a crash loop and following the troubleshooting
>> procedure leads me to contacting the mailing list. (As far as I can
>> tell, this is neither a hardware nor a configuration issue. The system
>> has been running in this setup for months. Several other OSDs running on
>> the same host are fine.) Also I have one PG that is currently marked
>> inconsistent. This PG is not mentioned in the log file, though.
>> The OSD in question crashes about 15 to 20 seconds after starting up.
>> The logged reasons for the crash are, as far as I can tell:
>>
>>      -1> 2019-02-01 12:22:46.111821 7fe079d53700 -1 rocksdb:
>> submit_transaction error: Corruption: block checksum mismatch code = 2
>> Rocksdb transaction:
>> Put( Prefix = O key =
>> 0x7f80000000000000021600000021213dfffffffffffffffeffffffffffffffff'o'
>> Value size = 30)
>>       0> 2019-02-01 12:22:46.117761 7fe079d53700 -1
>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.9/rpm/el7/BUILD/ceph-12.2.9/src/os/bluestore/BlueStore.cc:
>>
>> In function 'void BlueStore::_kv_sync_thread()' thread 7fe079d53700 time
>> 2019-02-01 12:22:46.111884
>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.9/rpm/el7/BUILD/ceph-12.2.9/src/os/bluestore/BlueStore.cc:
>>
>> 8717: FAILED assert(r == 0)
>>
>>   ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217) luminous
>> (stable)
>>   1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x110) [0x562af51e5e90]
>>   2: (BlueStore::_kv_sync_thread()+0x3482) [0x562af5090162]
>>   3: (BlueStore::KVSyncThread::entry()+0xd) [0x562af50d701d]
>>   4: (()+0x7e25) [0x7fe089e12e25]
>>   5: (clone()+0x6d) [0x7fe088f03bad]
>>   NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>> needed to interpret this.
>>
>>
>> I have a full log of the crash cycle available, if it helps.
>>
>> Is there anything I can do to fix this? Is this a bug that I should
>> report somewhere else?
>>
>> David Sieger

-- 
iTerra GmbH
Böhmestraße 18          Geschäftsführung: Dr. Peter Brodersen
25899 Niebüll           Prokurist: Christian Feddersen
Tel. 04661 18540-40     www.iterra-gmbh.de

Rechtsform: Gesellschaft mit begrenzter Haftung, Sitz der Gesellschaft:
Niebüll, Amtsgericht Flensburg HRB 1225 NI


