Hello Igor,

thanks for all your feedback and all your help. The first thing I'll try
is to upgrade a bunch of systems from kernel 4.19.66 to 4.19.97 and see
what happens. I'll report back in 7-10 days to verify whether this helps.

Greets,
Stefan

On 20.01.20 at 13:12, Igor Fedotov wrote:
> Hi Stefan,
>
> these lines are the result of a transaction dump performed on a failure
> during transaction submission (which is shown as
>
> "submit_transaction error: Corruption: block checksum mismatch code = 2").
>
> Most probably they are of no interest (checksum errors are unlikely to
> be caused by transaction content), and hence we need earlier stuff to
> learn what caused that checksum mismatch.
>
> It's hard to give any formal overview of what you should look for; from
> my troubleshooting experience, one may generally try to find:
>
> - some previous error/warning indications (e.g. allocation, disk access,
>   etc.)
>
> - prior OSD crashes (sometimes they might have different causes/stack
>   traces/assertion messages)
>
> - any timeout or retry indications
>
> - any uncommon log patterns which aren't present during regular operation
>   but happen each time before the crash/failure.
>
> Anyway, I think the inspection depth should be much(?) deeper than it
> presumably is right now (from what I can see from your log snippets).
>
> Ceph keeps the last 10000 log events at an increased log level and dumps
> them on crash, with a negative index from -9999 up to -1 as a prefix:
>
> -1> 2020-01-16 01:10:13.404090 7f3350a14700 -1 rocksdb:
>
> It would be great if you could share several log snippets for different
> crashes containing these last 10000 lines.
>
> Thanks,
>
> Igor
>
> On 1/19/2020 9:42 PM, Stefan Priebe - Profihost AG wrote:
>> Hello Igor,
>>
>> there's absolutely nothing in the logs before.
>>
>> What do those lines mean:
>> Put( Prefix = O key =
>> 0x7f8000000000000001cc45c881217262'd_data.4303206b8b4567.0000000000009632!='0xfffffffffffffffeffffffffffffffff6f00120000'x'
>> Value size = 480)
>> Put( Prefix = O key =
>> 0x7f8000000000000001cc45c881217262'd_data.4303206b8b4567.0000000000009632!='0xfffffffffffffffeffffffffffffffff'o'
>> Value size = 510)
>>
>> On the right side I always see 0xfffffffffffffffeffffffffffffffff on all
>> failed OSDs.
>>
>> Greets,
>> Stefan
>>
>> On 19.01.20 at 14:07, Stefan Priebe - Profihost AG wrote:
>>> Yes, except that this happens on 8 different clusters with different
>>> HW but the same Ceph version and the same kernel version.
>>>
>>> Greets,
>>> Stefan
>>>
>>>> On 19.01.2020 at 11:53, Igor Fedotov <ifedotov@xxxxxxx> wrote:
>>>>
>>>> So the intermediate summary is:
>>>>
>>>> Any OSD in the cluster can experience an interim RocksDB checksum
>>>> failure, which isn't present after an OSD restart.
>>>>
>>>> No HW issues observed, no persistent artifacts (except the OSD log)
>>>> afterwards.
>>>>
>>>> And it looks like the issue is rather specific to the cluster, as no
>>>> similar reports from other users seem to be present.
>>>>
>>>> Sorry, I'm out of ideas other than to collect all the failure logs and
>>>> try to find something common in them. Maybe this will shed some
>>>> light...
>>>>
>>>> BTW, from my experience it might make sense to inspect the OSD log
>>>> prior to the failure (any error messages and/or prior restarts, etc.);
>>>> sometimes this might provide some hints.
>>>>
>>>> Thanks,
>>>>
>>>> Igor
>>>>
>>>>> On 1/17/2020 2:30 PM, Stefan Priebe - Profihost AG wrote:
>>>>> Hi Igor,
>>>>>
>>>>>> On 17.01.20 at 12:10, Igor Fedotov wrote:
>>>>>> hmmm..
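To collect the per-crash snippets Igor asks for above, one minimal sketch is to cut the most recent "dump of recent events" block (the lines prefixed -9999> ... -1>) out of the OSD log. This assumes Ceph's usual begin/end dump markers and the default log path; osd.27 is only an example:

# Extract the most recent crash dump (the events indexed -9999> .. -1>)
# from an OSD log. Assumes the usual "--- begin dump of recent events ---" /
# "--- end dump of recent events ---" markers Ceph prints around the dump;
# adjust LOG to the OSD in question.
LOG=/var/log/ceph/ceph-osd.27.log
awk '/--- begin dump of recent events ---/ { dump = ""; on = 1 }
     on { dump = dump $0 "\n" }
     /--- end dump of recent events ---/ { on = 0 }
     END { printf "%s", dump }' "$LOG" > /tmp/ceph-osd.27.last-crash.log

Run on each affected node right after a crash, this yields one self-contained snippet per failure that can be attached to the thread.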
>>>>>> Just in case - I suggest checking for H/W errors with dmesg.
>>>>> this happens on around 80 nodes - I don't expect all of those to have
>>>>> unidentified HW errors. Also, all of them are monitored - no dmesg
>>>>> output contains any errors.
>>>>>
>>>>>> Also there is some (though not very high) chance this is another
>>>>>> incarnation of the following bug:
>>>>>> https://tracker.ceph.com/issues/22464
>>>>>> https://github.com/ceph/ceph/pull/24649
>>>>>>
>>>>>> The corresponding PR works around it for main device reads (user data
>>>>>> only!), but theoretically it might still happen
>>>>>> either for the DB device or for DB data at the main device.
>>>>>>
>>>>>> Can you observe any BlueFS spillovers? Is there any correlation
>>>>>> between failing OSDs and spillover presence, e.g. failing OSDs always
>>>>>> have a spillover, while OSDs without spillovers never face the
>>>>>> issue...
>>>>>>
>>>>>> To validate this hypothesis one can try to monitor/check (e.g. once a
>>>>>> day for a week or so) the "bluestore_reads_with_retries" counter over
>>>>>> the OSDs to learn whether the issue is happening in the system.
>>>>>> Non-zero values mean it's there for user data / the main device and
>>>>>> hence is likely to happen for the DB ones as well (which don't have
>>>>>> any workaround yet).
>>>>> OK, I checked bluestore_reads_with_retries on 360 OSDs, but all of
>>>>> them say 0.
>>>>>
>>>>>> Additionally you might want to monitor memory usage, as the
>>>>>> above-mentioned PR denotes high memory pressure as a potential
>>>>>> trigger for these read errors. So if such pressure happens, the
>>>>>> hypothesis becomes more valid.
>>>>> we already do this heavily and have around 10GB of memory per OSD.
>>>>> Also, none of those machines show any I/O pressure at all.
>>>>>
>>>>> All hosts constantly show around 38GB to 45GB of memory available in
>>>>> /proc/meminfo.
>>>>>
>>>>> Stefan
>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Igor
>>>>>>
>>>>>> PS. Everything above is rather speculation for now. The available
>>>>>> information is definitely not enough for extensive troubleshooting of
>>>>>> cases which happen that rarely.
>>>>>>
>>>>>> You might want to start collecting failure-related information
>>>>>> (including but not limited to failure logs, perf counter dumps,
>>>>>> system resource reports, etc.) for future analysis.
>>>>>>
>>>>>> On 1/16/2020 11:58 PM, Stefan Priebe - Profihost AG wrote:
>>>>>>> Hi Igor,
>>>>>>>
>>>>>>> answers inline.
>>>>>>>
>>>>>>> On 16.01.20 at 21:34, Igor Fedotov wrote:
>>>>>>>> you may want to run fsck against the failing OSDs. Hopefully it
>>>>>>>> will shed some light.
>>>>>>> fsck just says everything is fine:
>>>>>>>
>>>>>>> # ceph-bluestore-tool --command fsck --path /var/lib/ceph/osd/ceph-27/
>>>>>>> fsck success
>>>>>>>
>>>>>>>> Also wondering if the OSD is able to recover (start up and proceed
>>>>>>>> working) after facing the issue?
>>>>>>> no recovery needed. It just keeps running after restarting.
>>>>>>>
>>>>>>>> If so, do you have any that failed multiple times? Do you have logs
>>>>>>>> for these occurrences?
>>>>>>> Maybe, but there are most probably weeks or months between those
>>>>>>> failures - most probably the logs are already deleted.
>>>>>>>
>>>>>>>> Also please note that the patch you mentioned doesn't fix previous
>>>>>>>> issues (i.e. duplicate allocations); it only prevents new ones.
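For the periodic bluestore_reads_with_retries check Igor suggests above, a minimal per-host sketch could be run from cron once a day. It assumes the OSD admin sockets live under /var/run/ceph, that jq is installed, and that the counter sits in the "bluestore" section of "perf dump" (the exact JSON path may differ between releases):

# Report bluestore_reads_with_retries for every OSD admin socket on this
# host (hypothetical helper; alert on any non-zero value).
for sock in /var/run/ceph/ceph-osd.*.asok; do
    osd="$(basename "$sock" .asok)"
    retries="$(ceph daemon "$sock" perf dump 2>/dev/null \
        | jq -r '.bluestore.bluestore_reads_with_retries')"
    echo "${osd}: bluestore_reads_with_retries=${retries}"
done

Any non-zero value would indicate that the retry-on-checksum-error path of the PR referenced above is actually being exercised on that host.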
>>>>>>>> >>>>>>>> But fsck should show them if any... >>>>>>> None showed. >>>>>>> >>>>>>> Stefan >>>>>>> >>>>>>>> Thanks, >>>>>>>> >>>>>>>> Igor >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On 1/16/2020 10:04 PM, Stefan Priebe - Profihost AG wrote: >>>>>>>>> Hi Igor, >>>>>>>>> >>>>>>>>> ouch sorry. Here we go: >>>>>>>>> >>>>>>>>> -1> 2020-01-16 01:10:13.404090 7f3350a14700 -1 rocksdb: >>>>>>>>> submit_transaction error: Corruption: block checksum mismatch >>>>>>>>> code = 2 >>>>>>>>> Rocksdb transaction: >>>>>>>>> Put( Prefix = M key = >>>>>>>>> 0x0000000000000402'.OBJ_0000000000000002.953BFD0A.bb85c.rbd%udata%e3e8eac6b8b4567%e0000000000001f2e..' >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Value size = 97) >>>>>>>>> Put( Prefix = M key = >>>>>>>>> 0x0000000000000402'.MAP_00000000000BB85C_0000000000000002.953BFD0A.bb85c.rbd%udata%e3e8eac6b8b4567%e0000000000001f2e..' >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Value size = 93) >>>>>>>>> Put( Prefix = M key = >>>>>>>>> 0x0000000000000916'.0000823257.00000000000073922044' Value size >>>>>>>>> = 196) >>>>>>>>> Put( Prefix = M key = >>>>>>>>> 0x0000000000000916'.0000823257.00000000000073922045' Value size >>>>>>>>> = 184) >>>>>>>>> Put( Prefix = M key = 0x0000000000000916'._info' Value size = 899) >>>>>>>>> Put( Prefix = O key = >>>>>>>>> 0x7f80000000000000029acdfb05217262'd_data.3e8eac6b8b4567.0000000000001f2e!='0x00000000000bb85cffffffffffffffff6f00000000'x' >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Value size = 418) >>>>>>>>> Put( Prefix = O key = >>>>>>>>> 0x7f80000000000000029acdfb05217262'd_data.3e8eac6b8b4567.0000000000001f2e!='0x00000000000bb85cffffffffffffffff6f00030000'x' >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Value size = 474) >>>>>>>>> Put( Prefix = O key = >>>>>>>>> 0x7f80000000000000029acdfb05217262'd_data.3e8eac6b8b4567.0000000000001f2e!='0x00000000000bb85cffffffffffffffff6f0007c000'x' >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Value size = 392) >>>>>>>>> Put( Prefix = O key = >>>>>>>>> 0x7f80000000000000029acdfb05217262'd_data.3e8eac6b8b4567.0000000000001f2e!='0x00000000000bb85cffffffffffffffff6f00090000'x' >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Value size = 317) >>>>>>>>> Put( Prefix = O key = >>>>>>>>> 0x7f80000000000000029acdfb05217262'd_data.3e8eac6b8b4567.0000000000001f2e!='0x00000000000bb85cffffffffffffffff6f000a0000'x' >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Value size = 521) >>>>>>>>> Put( Prefix = O key = >>>>>>>>> 0x7f80000000000000029acdfb05217262'd_data.3e8eac6b8b4567.0000000000001f2e!='0x00000000000bb85cffffffffffffffff6f000f4000'x' >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Value size = 558) >>>>>>>>> Put( Prefix = O key = >>>>>>>>> 0x7f80000000000000029acdfb05217262'd_data.3e8eac6b8b4567.0000000000001f2e!='0x00000000000bb85cffffffffffffffff6f00130000'x' >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Value size = 649) >>>>>>>>> Put( Prefix = O key = >>>>>>>>> 0x7f80000000000000029acdfb05217262'd_data.3e8eac6b8b4567.0000000000001f2e!='0x00000000000bb85cffffffffffffffff6f00194000'x' >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Value size = 449) >>>>>>>>> Put( Prefix = O key = >>>>>>>>> 0x7f80000000000000029acdfb05217262'd_data.3e8eac6b8b4567.0000000000001f2e!='0x00000000000bb85cffffffffffffffff6f001cc000'x' >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Value size = 580) >>>>>>>>> Put( Prefix = O key = >>>>>>>>> 0x7f80000000000000029acdfb05217262'd_data.3e8eac6b8b4567.0000000000001f2e!='0x00000000000bb85cffffffffffffffff6f00200000'x' >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Value size = 435) >>>>>>>>> Put( Prefix = O key = >>>>>>>>> 
0x7f80000000000000029acdfb05217262'd_data.3e8eac6b8b4567.0000000000001f2e!='0x00000000000bb85cffffffffffffffff6f00240000'x' >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Value size = 569) >>>>>>>>> Put( Prefix = O key = >>>>>>>>> 0x7f80000000000000029acdfb05217262'd_data.3e8eac6b8b4567.0000000000001f2e!='0x00000000000bb85cffffffffffffffff6f00290000'x' >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Value size = 465) >>>>>>>>> Put( Prefix = O key = >>>>>>>>> 0x7f80000000000000029acdfb05217262'd_data.3e8eac6b8b4567.0000000000001f2e!='0x00000000000bb85cffffffffffffffff6f002e0000'x' >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Value size = 710) >>>>>>>>> Put( Prefix = O key = >>>>>>>>> 0x7f80000000000000029acdfb05217262'd_data.3e8eac6b8b4567.0000000000001f2e!='0x00000000000bb85cffffffffffffffff6f00300000'x' >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Value size = 599) >>>>>>>>> Put( Prefix = O key = >>>>>>>>> 0x7f80000000000000029acdfb05217262'd_data.3e8eac6b8b4567.0000000000001f2e!='0x00000000000bb85cffffffffffffffff6f0036c000'x' >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Value size = 372) >>>>>>>>> Put( Prefix = O key = >>>>>>>>> 0x7f80000000000000029acdfb05217262'd_data.3e8eac6b8b4567.0000000000001f2e!='0x00000000000bb85cffffffffffffffff6f003a6000'x' >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Value size = 130) >>>>>>>>> Put( Prefix = O key = >>>>>>>>> 0x7f80000000000000029acdfb05217262'd_data.3e8eac6b8b4567.0000000000001f2e!='0x00000000000bb85cffffffffffffffff6f003b4000'x' >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Value size = 540) >>>>>>>>> Put( Prefix = O key = >>>>>>>>> 0x7f80000000000000029acdfb05217262'd_data.3e8eac6b8b4567.0000000000001f2e!='0x00000000000bb85cffffffffffffffff6f003fc000'x' >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Value size = 47) >>>>>>>>> Put( Prefix = O key = >>>>>>>>> 0x7f80000000000000029acdfb05217262'd_data.3e8eac6b8b4567.0000000000001f2e!='0x00000000000bb85cffffffffffffffff'o' >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Value size = 1731) >>>>>>>>> Put( Prefix = O key = >>>>>>>>> 0x7f80000000000000029acdfb05217262'd_data.3e8eac6b8b4567.0000000000001f2e!='0xfffffffffffffffeffffffffffffffff6f00040000'x' >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Value size = 675) >>>>>>>>> Put( Prefix = O key = >>>>>>>>> 0x7f80000000000000029acdfb05217262'd_data.3e8eac6b8b4567.0000000000001f2e!='0xfffffffffffffffeffffffffffffffff6f00080000'x' >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Value size = 395) >>>>>>>>> Put( Prefix = O key = >>>>>>>>> 0x7f80000000000000029acdfb05217262'd_data.3e8eac6b8b4567.0000000000001f2e!='0xfffffffffffffffeffffffffffffffff'o' >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Value size = 1328) >>>>>>>>> Put( Prefix = X key = 0x0000000018a38deb Value size = 14) >>>>>>>>> Put( Prefix = X key = 0x0000000018a38dea Value size = 14) >>>>>>>>> Put( Prefix = X key = 0x000000000d7a035b Value size = 14) >>>>>>>>> Put( Prefix = X key = 0x000000000d7a035c Value size = 14) >>>>>>>>> Put( Prefix = X key = 0x000000000d7a0355 Value size = 14) >>>>>>>>> Put( Prefix = X key = 0x000000000d7a0356 Value size = 17) >>>>>>>>> Put( Prefix = X key = 0x000000001a54f6e4 Value size = 14) >>>>>>>>> Put( Prefix = X key = 0x000000001b1c061e Value size = 14) >>>>>>>>> Put( Prefix = X key = 0x000000000d7a038f Value size = 14) >>>>>>>>> Put( Prefix = X key = 0x000000000d7a0389 Value size = 14) >>>>>>>>> Put( Prefix = X key = 0x000000000d7a0358 Value size = 14) >>>>>>>>> Put( Prefix = X key = 0x000000000d7a035f Value size = 14) >>>>>>>>> Put( Prefix = X key = 0x000000000d7a0357 Value size = 14) >>>>>>>>> Put( Prefix = X key = 
0x000000000d7a0387 Value size = 14) >>>>>>>>> Put( Prefix = X key = 0x000000000d7a038a Value size = 14) >>>>>>>>> Put( Prefix = X key = 0x000000000d7a0388 Value size = 14) >>>>>>>>> Put( Prefix = X key = 0x00000000134c3fbe Value size = 14) >>>>>>>>> Put( Prefix = X key = 0x00000000134c3fb5 Value size = 14) >>>>>>>>> Put( Prefix = X key = 0x000000000d7a036e Value size = 14) >>>>>>>>> Put( Prefix = X key = 0x000000000d7a036d Value size = 14) >>>>>>>>> Put( Prefix = X key = 0x00000000134c3fb8 Value size = 14) >>>>>>>>> Put( Prefix = X key = 0x000000000d7a0371 Value size = 14) >>>>>>>>> Put( Prefix = X key = 0x000000000d7a036a Value size = 14) >>>>>>>>> 0> 2020-01-16 01:10:13.413759 7f3350a14700 -1 >>>>>>>>> /build/ceph/src/os/bluestore/BlueStore.cc: In function 'void >>>>>>>>> BlueStore::_kv_sync_thread()' thread 7f3350a14700 time 2020-01-16 >>>>>>>>> 01:10:13.404113 >>>>>>>>> /build/ceph/src/os/bluestore/BlueStore.cc: 8808: FAILED >>>>>>>>> assert(r == 0) >>>>>>>>> >>>>>>>>> ceph version 12.2.12-11-gd3eae83543 >>>>>>>>> (d3eae83543bffc0fc6c43823feb637fa851b6213) luminous (stable) >>>>>>>>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, >>>>>>>>> char >>>>>>>>> const*)+0x102) [0x55c9a712d232] >>>>>>>>> 2: (BlueStore::_kv_sync_thread()+0x24c5) [0x55c9a6fb54b5] >>>>>>>>> 3: (BlueStore::KVSyncThread::entry()+0xd) [0x55c9a6ff608d] >>>>>>>>> 4: (()+0x7494) [0x7f33615f9494] >>>>>>>>> 5: (clone()+0x3f) [0x7f3360680acf] >>>>>>>>> >>>>>>>>> I already picked those: >>>>>>>>> https://github.com/ceph/ceph/pull/28644 >>>>>>>>> >>>>>>>>> Greets, >>>>>>>>> Stefan >>>>>>>>> Am 16.01.20 um 17:00 schrieb Igor Fedotov: >>>>>>>>>> Hi Stefan, >>>>>>>>>> >>>>>>>>>> would you please share log snippet prior the assertions? Looks >>>>>>>>>> like >>>>>>>>>> RocksDB is failing during transaction submission... >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Thanks, >>>>>>>>>> >>>>>>>>>> Igor >>>>>>>>>> >>>>>>>>>> On 1/16/2020 11:56 AM, Stefan Priebe - Profihost AG wrote: >>>>>>>>>>> Hello, >>>>>>>>>>> >>>>>>>>>>> does anybody know a fix for this ASSERT / crash? >>>>>>>>>>> >>>>>>>>>>> 2020-01-16 02:02:31.316394 7f8c3f5ab700 -1 >>>>>>>>>>> /build/ceph/src/os/bluestore/BlueStore.cc: In function 'void >>>>>>>>>>> BlueStore::_kv_sync_thread()' thread 7f8c3f5ab700 time >>>>>>>>>>> 2020-01-16 >>>>>>>>>>> 02:02:31.304993 >>>>>>>>>>> /build/ceph/src/os/bluestore/BlueStore.cc: 8808: FAILED assert(r >>>>>>>>>>> == 0) >>>>>>>>>>> >>>>>>>>>>> ceph version 12.2.12-11-gd3eae83543 >>>>>>>>>>> (d3eae83543bffc0fc6c43823feb637fa851b6213) luminous (stable) >>>>>>>>>>> 1: (ceph::__ceph_assert_fail(char const*, char const*, >>>>>>>>>>> int, char >>>>>>>>>>> const*)+0x102) [0x55e6df9d9232] >>>>>>>>>>> 2: (BlueStore::_kv_sync_thread()+0x24c5) [0x55e6df8614b5] >>>>>>>>>>> 3: (BlueStore::KVSyncThread::entry()+0xd) [0x55e6df8a208d] >>>>>>>>>>> 4: (()+0x7494) [0x7f8c50190494] >>>>>>>>>>> 5: (clone()+0x3f) [0x7f8c4f217acf] >>>>>>>>>>> >>>>>>>>>>> all bluestore OSDs are randomly crashing sometimes (once a >>>>>>>>>>> week). 
>>>>>>>>>>> Greets,
>>>>>>>>>>> Stefan
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com