Re: Bluestore OSDs keep crashing in BlueStore.cc: 8808: FAILED assert(r == 0)

It sounds like the OSD is "recovering" after the checksum error.

I.e. an OSD that has just failed shows no errors in fsck, and is able to restart and process new write requests for a long enough period (longer than just a couple of minutes). Are these statements true? If so, I would suppose this is an accidental/volatile issue rather than data-at-rest corruption. Something like data being read incorrectly from disk.

Are you using a standalone disk drive for DB/WAL, or is it shared with the main one? Just in case, as low-hanging fruit, I'd suggest checking dmesg and smartctl for drive errors...
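
A minimal sketch of the dmesg part of that check, in Python. The pattern list is an assumption: it covers a few kernel messages commonly seen with failing block devices and is not exhaustive. The function name `find_disk_errors` is hypothetical. For the SMART side, `smartctl -H <device>` and `smartctl -a <device>` can be run by hand against each OSD drive.

```python
import re

# Assumed, non-exhaustive patterns for kernel messages that often
# accompany failing drives or bad links to them.
DISK_ERROR_RE = re.compile(
    r"i/o error|medium error|failed command|uncorrect",
    re.IGNORECASE,
)


def find_disk_errors(dmesg_text):
    """Return the kernel log lines that match one of the assumed patterns."""
    return [
        line for line in dmesg_text.splitlines()
        if DISK_ERROR_RE.search(line)
    ]


def scan_live_system():
    """Run `dmesg` on the local host and filter its output.

    Not called automatically: reading the kernel log may need root.
    """
    import subprocess

    result = subprocess.run(["dmesg"], capture_output=True, text=True)
    return find_disk_errors(result.stdout)
```

An empty result does not prove the drive is healthy; it only means none of the assumed patterns appeared in the current kernel ring buffer.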

FYI: one more reference for a similar issue: https://tracker.ceph.com/issues/24968

That one turned out to be a HW issue...


Also, I recall an issue with some kernels that caused occasional invalid data reads under high memory pressure/swapping: https://tracker.ceph.com/issues/22464

IMO memory usage is worth checking as well...
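
A rough sketch of such a memory check, assuming a Linux host where `/proc/meminfo` reports `MemTotal`, `MemAvailable`, `SwapTotal` and `SwapFree` in kB. The function names are hypothetical, and what counts as "too much" swap or "too little" available memory is left to the operator.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def memory_summary(text):
    """Summarise swap usage and available memory from meminfo text."""
    info = parse_meminfo(text)
    total = info.get("MemTotal", 0)
    available = info.get("MemAvailable", 0)
    swap_used = info.get("SwapTotal", 0) - info.get("SwapFree", 0)
    available_pct = 100.0 * available / total if total else 0.0
    return {"swap_used_kb": swap_used, "available_pct": available_pct}


def read_local_meminfo():
    """Read the live values on the local host (Linux only)."""
    with open("/proc/meminfo") as f:
        return memory_summary(f.read())
```

Sampling this on an OSD host around the time of a crash would show whether the node was swapping or short on available memory at that point.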


Igor


On 8/27/2019 4:52 PM, Stefan Priebe - Profihost AG wrote:
see inline

On 27.08.19 at 15:43, Igor Fedotov wrote:
see inline

On 8/27/2019 4:41 PM, Stefan Priebe - Profihost AG wrote:
Hi Igor,

On 27.08.19 at 14:11, Igor Fedotov wrote:
Hi Stefan,

this looks like a duplicate of

https://tracker.ceph.com/issues/37282

Actually, the range of possible root causes is quite wide:
from HW issues to broken logic in RocksDB/BlueStore/BlueFS etc.

As far as I understand, you have different OSDs which are failing, right?
Yes, I've seen this on around 50 different OSDs running on different HW, but
all run Ceph 12.2.12. I did not see this with 12.2.10, which we were
running before.

Is the set of these broken OSDs limited somehow?
No, at least not in any way I'm able to find.


Any specific subset which is failing or something? E.g. just N of them
are failing from time to time.
No, it seems totally random.

Any similarities between the broken OSDs (e.g. specific hardware)?
All run Intel Xeon CPUs and all run Linux ;-)

Did you run fsck for any of the broken OSDs? Any reports?
Yes, but no reports.
Are you saying that fsck is fine for OSDs that showed this sort of error?
Yes, fsck does not show a single error - everything is fine.

Any other errors/crashes in the logs before this sort of issue happens?
No.


Just in case - what allocator are you using?
tcmalloc
I meant the BlueStore allocator - is it stupid or bitmap?
Ah, the default one. I think this is stupid.

Greets,
Stefan

Greets,
Stefan

Thanks,

Igor



On 8/27/2019 1:03 PM, Stefan Priebe - Profihost AG wrote:
Hello,

For some months now, all our BlueStore OSDs have been crashing from time
to time, currently about 5 OSDs per day.

All of them show the following trace:
Trace:
2019-07-24 08:36:48.995397 7fb19a711700 -1 rocksdb: submit_transaction error: Corruption: block checksum mismatch code = 2 Rocksdb transaction:
Put( Prefix = M key = 0x00000000000009a5'.0000916366.00000000000074680351' Value size = 184)
Put( Prefix = M key = 0x00000000000009a5'._fastinfo' Value size = 186)
Put( Prefix = O key = 0x7f8000000000000003bb605f'd!rbd_data.afe49a6b8b4567.0000000000003c11!='0xfffffffffffffffeffffffffffffffff6f00240000'x' Value size = 530)
Put( Prefix = O key = 0x7f8000000000000003bb605f'd!rbd_data.afe49a6b8b4567.0000000000003c11!='0xfffffffffffffffeffffffffffffffff'o' Value size = 510)
Put( Prefix = L key = 0x0000000010ba60f1 Value size = 4135)
2019-07-24 08:36:49.012110 7fb19a711700 -1 /build/ceph/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_kv_sync_thread()' thread 7fb19a711700 time 2019-07-24 08:36:48.995415
/build/ceph/src/os/bluestore/BlueStore.cc: 8808: FAILED assert(r == 0)

ceph version 12.2.12-7-g1321c5e91f (1321c5e91f3d5d35dd5aa5a0029a54b9a8ab9498) luminous (stable)
    1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x5653a010e222]
    2: (BlueStore::_kv_sync_thread()+0x24c5) [0x56539ff964b5]
    3: (BlueStore::KVSyncThread::entry()+0xd) [0x56539ffd708d]
    4: (()+0x7494) [0x7fb1ab2f6494]
    5: (clone()+0x3f) [0x7fb1aa37dacf]
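
To put a number on "about 5 OSDs per day", the checksum errors can be counted per day from the collected OSD logs. The following is only a sketch: it assumes each log entry sits on a single line in the log file and starts with the date, time and thread id as in the trace shown here. The function name `count_checksum_errors_per_day` is hypothetical.

```python
import re
from collections import Counter

# Matches the start of the RocksDB error line from the trace:
# "<date> <time> <thread> -1 rocksdb: submit_transaction error: ..."
CHECKSUM_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}) \S+ \S+ -1 rocksdb: submit_transaction error: "
    r"Corruption: block checksum mismatch"
)


def count_checksum_errors_per_day(log_text):
    """Count 'block checksum mismatch' submit errors per calendar day."""
    per_day = Counter()
    for line in log_text.splitlines():
        match = CHECKSUM_RE.match(line)
        if match:
            per_day[match.group(1)] += 1
    return per_day
```

Running this over the concatenated logs of all OSDs on a host would show whether the failures cluster on particular days or hosts, which might help tell a hardware problem from a software regression.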

I have already opened a tracker issue:
https://tracker.ceph.com/issues/41367

Can anybody help? Is this known?

Greets,
Stefan
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com