Re: Luminous Bluestore OSDs crashing with ASSERT

Hi Stefan,

The proper Ceph way of sending logs for developer analysis is ceph-post-file, but I'm not very good at retrieving them from there...

Ideally I'd prefer to start with log snippets covering the 20K lines prior to each crash, 3 or 4 of them. This wouldn't take much space, and you can send them by email, attach them to a ticket, or share them via some publicly available URL. Access to the whole logs also works for me.
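Something along these lines would cut out one such snippet (the log path, OSD id and crash marker are just examples; adjust to your setup):

  n=$(grep -n -m1 'submit_transaction error' /var/log/ceph/ceph-osd.27.log | cut -d: -f1)
  head -n "$n" /var/log/ceph/ceph-osd.27.log | tail -n 20000 > osd.27-pre-crash.log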

If this doesn't work, then let's go the ceph-post-file way; I have to walk that trail one day anyway...
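(For reference, usage is roughly:

  ceph-post-file -d 'bluestore checksum mismatch crash' /var/log/ceph/ceph-osd.27.log

which uploads the file and prints a tag you can post back here; the description and path above are just examples.)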


Thanks,

Igor

On 1/28/2020 3:09 PM, Stefan Priebe - Profihost AG wrote:
Hello Igor,

I updated all servers to the latest 4.19.97 kernel, but this doesn't fix the situation.

I can provide you with all those logs - any idea where to upload them / how to send them to you?

Greets,
Stefan

On 20.01.20 at 13:12, Igor Fedotov wrote:
Hi Stefan,

these lines are the result of a transaction dump performed on a failure during transaction submission (which shows up as

"submit_transaction error: Corruption: block checksum mismatch code = 2").

Most probably they are of no interest (checksum errors are unlikely to be caused by transaction content), and hence we need earlier material to learn what caused that checksum mismatch.

It's hard to give a formal overview of what you should look for, but from my troubleshooting experience one may generally try to find:

- some previous error/warning indications (e.g. allocation, disk access, etc.)

- prior OSD crashes (sometimes they might have different causes/stack traces/assertion messages)

- any timeout or retry indications

- any uncommon log patterns which aren't present during regular operation but appear each time before the crash/failure.

Anyway, I think the inspection should go much(?) deeper than it presumably has so far (judging from your log snippets).
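As a crude first pass over a log, something like the following can help surface such patterns (the path is just an example):

  grep -inE 'error|fail|assert|timeout|retr' /var/log/ceph/ceph-osd.27.log | less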

Ceph keeps the last 10,000 log events at an increased log level and dumps them on crash, using a negative index from -9999 up to -1 as a prefix, e.g.:

-1> 2020-01-16 01:10:13.404090 7f3350a14700 -1 rocksdb:


It would be great if you could share several log snippets from different crashes containing these last 10,000 lines.
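Something like this should extract one such dump, from the "-9999>" marker up to the final "0>" line (paths are examples):

  sed -n '/ -9999> /,/ 0> /p' /var/log/ceph/ceph-osd.27.log > osd.27-crash-events.log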


Thanks,

Igor


On 1/19/2020 9:42 PM, Stefan Priebe - Profihost AG wrote:
Hello Igor,

there's absolutely nothing in the logs before.

What do those lines mean:

Put( Prefix = O key = 0x7f8000000000000001cc45c881217262'd_data.4303206b8b4567.0000000000009632!='0xfffffffffffffffeffffffffffffffff6f00120000'x' Value size = 480)
Put( Prefix = O key = 0x7f8000000000000001cc45c881217262'd_data.4303206b8b4567.0000000000009632!='0xfffffffffffffffeffffffffffffffff'o' Value size = 510)

On the right side I always see 0xfffffffffffffffeffffffffffffffff on all failed OSDs.

Greets,
Stefan
On 19.01.20 at 14:07, Stefan Priebe - Profihost AG wrote:
Yes, except that this happens on 8 different clusters with different hardware, but the same Ceph version and the same kernel version.

Greets,
Stefan

On 19.01.2020 at 11:53, Igor Fedotov <ifedotov@xxxxxxx> wrote:

So the intermediate summary is:

Any OSD in the cluster can experience an interim RocksDB checksum failure, which is no longer present after an OSD restart.

No H/W issues observed, and no persistent artifacts (except the OSD log) afterwards.

And it looks like the issue is rather specific to this cluster, as no similar reports from other users seem to be present.


Sorry, I'm out of ideas, other than to collect all the failure logs and try to find something common in them. Maybe this will shed some light...

BTW, from my experience it might make sense to inspect the OSD log prior to the failure (any error messages and/or prior restarts, etc.); sometimes this provides some hints.


Thanks,

Igor


On 1/17/2020 2:30 PM, Stefan Priebe - Profihost AG wrote:
Hi Igor,

On 17.01.20 at 12:10, Igor Fedotov wrote:
hmmm..

Just in case, I'd suggest checking for H/W errors with dmesg.
This happens on around 80 nodes; I don't expect all of those to have unidentified H/W errors. Also, all of them are monitored, and no dmesg output contains any errors.

Also, there is some (though not much) chance this is another incarnation of the following bug:
https://tracker.ceph.com/issues/22464
https://github.com/ceph/ceph/pull/24649

The corresponding PR works around it for main-device reads (user data only!), but theoretically it might still happen either for the DB device or for DB data on the main device.

Can you observe any BlueFS spillovers? Is there any correlation between failing OSDs and spillover presence, e.g. failing OSDs always have a spillover, while OSDs without spillovers never face the issue?
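(A quick way to check this, assuming the OSD admin socket is available and jq is installed, is e.g.:

  ceph daemon osd.27 perf dump | jq '.bluefs.slow_used_bytes'

where a non-zero value on an OSD with a separate DB device indicates spillover; the OSD id is just an example.)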

To validate this hypothesis, one can try to monitor/check (e.g. once a day for a week or so) the "bluestore_reads_with_retries" counter over the OSDs to learn whether the issue is happening in the system. Non-zero values mean it's there for user data on the main device, and hence it is likely to happen for DB reads as well (which don't have any workaround yet).
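For example (the OSD id is illustrative, jq optional):

  ceph daemon osd.27 perf dump | jq '.bluestore.bluestore_reads_with_retries'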
OK, I checked bluestore_reads_with_retries on 360 OSDs, but all of them say 0.


Additionally, you might want to monitor memory usage, as the above-mentioned PR names high memory pressure as a potential trigger for these read errors. So if such pressure occurs, the hypothesis becomes more plausible.
We already monitor this heavily and have around 10GB of memory per OSD. Also, none of those machines show any I/O pressure at all.

All hosts show a constant 38GB to 45GB of available memory in /proc/meminfo.

Stefan

Thanks,

Igor

PS. Everything above is rather speculation for now. The available information is definitely not enough for extensive troubleshooting of cases that happen this rarely.

You might want to start collecting failure-related information (including but not limited to failure logs, perf counter dumps, system resource reports, etc.) for future analysis.



On 1/16/2020 11:58 PM, Stefan Priebe - Profihost AG wrote:
Hi Igor,

answers inline.

On 16.01.20 at 21:34, Igor Fedotov wrote:
you may want to run fsck against the failing OSDs. Hopefully it will shed some light.
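Something like (the path is an example; adding --deep also verifies object data checksums and takes considerably longer):

  ceph-bluestore-tool --command fsck --deep --path /var/lib/ceph/osd/ceph-NNN/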
fsck just says everything fine:

# ceph-bluestore-tool --command fsck --path /var/lib/ceph/osd/ceph-27/
fsck success


Also wondering whether the OSD is able to recover (start up and proceed working) after facing the issue?
No recovery needed; it just keeps running after a restart.

If so, do you have any that failed multiple times? Do you have logs for these occurrences?
Maybe, but there are most probably weeks or months between those failures; most probably the logs have already been deleted.

Also please note that the patch you mentioned doesn't fix pre-existing issues (i.e. duplicate allocations); it only prevents new ones.

But fsck should show them if there are any...
None showed.

Stefan

Thanks,

Igor



On 1/16/2020 10:04 PM, Stefan Priebe - Profihost AG wrote:
Hi Igor,

ouch sorry. Here we go:

    -1> 2020-01-16 01:10:13.404090 7f3350a14700 -1 rocksdb: submit_transaction error: Corruption: block checksum mismatch code = 2
Rocksdb transaction:
Put( Prefix = M key = 0x0000000000000402'.OBJ_0000000000000002.953BFD0A.bb85c.rbd%udata%e3e8eac6b8b4567%e0000000000001f2e..' Value size = 97)
Put( Prefix = M key = 0x0000000000000402'.MAP_00000000000BB85C_0000000000000002.953BFD0A.bb85c.rbd%udata%e3e8eac6b8b4567%e0000000000001f2e..' Value size = 93)
Put( Prefix = M key = 0x0000000000000916'.0000823257.00000000000073922044' Value size = 196)
Put( Prefix = M key = 0x0000000000000916'.0000823257.00000000000073922045' Value size = 184)
Put( Prefix = M key = 0x0000000000000916'._info' Value size = 899)
Put( Prefix = O key = 0x7f80000000000000029acdfb05217262'd_data.3e8eac6b8b4567.0000000000001f2e!='0x00000000000bb85cffffffffffffffff6f00000000'x' Value size = 418)
Put( Prefix = O key = 0x7f80000000000000029acdfb05217262'd_data.3e8eac6b8b4567.0000000000001f2e!='0x00000000000bb85cffffffffffffffff6f00030000'x' Value size = 474)
Put( Prefix = O key = 0x7f80000000000000029acdfb05217262'd_data.3e8eac6b8b4567.0000000000001f2e!='0x00000000000bb85cffffffffffffffff6f0007c000'x' Value size = 392)
Put( Prefix = O key = 0x7f80000000000000029acdfb05217262'd_data.3e8eac6b8b4567.0000000000001f2e!='0x00000000000bb85cffffffffffffffff6f00090000'x' Value size = 317)
Put( Prefix = O key = 0x7f80000000000000029acdfb05217262'd_data.3e8eac6b8b4567.0000000000001f2e!='0x00000000000bb85cffffffffffffffff6f000a0000'x' Value size = 521)
Put( Prefix = O key = 0x7f80000000000000029acdfb05217262'd_data.3e8eac6b8b4567.0000000000001f2e!='0x00000000000bb85cffffffffffffffff6f000f4000'x' Value size = 558)
Put( Prefix = O key = 0x7f80000000000000029acdfb05217262'd_data.3e8eac6b8b4567.0000000000001f2e!='0x00000000000bb85cffffffffffffffff6f00130000'x' Value size = 649)
Put( Prefix = O key = 0x7f80000000000000029acdfb05217262'd_data.3e8eac6b8b4567.0000000000001f2e!='0x00000000000bb85cffffffffffffffff6f00194000'x' Value size = 449)
Put( Prefix = O key = 0x7f80000000000000029acdfb05217262'd_data.3e8eac6b8b4567.0000000000001f2e!='0x00000000000bb85cffffffffffffffff6f001cc000'x' Value size = 580)
Put( Prefix = O key = 0x7f80000000000000029acdfb05217262'd_data.3e8eac6b8b4567.0000000000001f2e!='0x00000000000bb85cffffffffffffffff6f00200000'x' Value size = 435)
Put( Prefix = O key = 0x7f80000000000000029acdfb05217262'd_data.3e8eac6b8b4567.0000000000001f2e!='0x00000000000bb85cffffffffffffffff6f00240000'x' Value size = 569)
Put( Prefix = O key = 0x7f80000000000000029acdfb05217262'd_data.3e8eac6b8b4567.0000000000001f2e!='0x00000000000bb85cffffffffffffffff6f00290000'x' Value size = 465)
Put( Prefix = O key = 0x7f80000000000000029acdfb05217262'd_data.3e8eac6b8b4567.0000000000001f2e!='0x00000000000bb85cffffffffffffffff6f002e0000'x' Value size = 710)
Put( Prefix = O key = 0x7f80000000000000029acdfb05217262'd_data.3e8eac6b8b4567.0000000000001f2e!='0x00000000000bb85cffffffffffffffff6f00300000'x' Value size = 599)
Put( Prefix = O key = 0x7f80000000000000029acdfb05217262'd_data.3e8eac6b8b4567.0000000000001f2e!='0x00000000000bb85cffffffffffffffff6f0036c000'x' Value size = 372)
Put( Prefix = O key = 0x7f80000000000000029acdfb05217262'd_data.3e8eac6b8b4567.0000000000001f2e!='0x00000000000bb85cffffffffffffffff6f003a6000'x' Value size = 130)
Put( Prefix = O key = 0x7f80000000000000029acdfb05217262'd_data.3e8eac6b8b4567.0000000000001f2e!='0x00000000000bb85cffffffffffffffff6f003b4000'x' Value size = 540)
Put( Prefix = O key = 0x7f80000000000000029acdfb05217262'd_data.3e8eac6b8b4567.0000000000001f2e!='0x00000000000bb85cffffffffffffffff6f003fc000'x' Value size = 47)
Put( Prefix = O key = 0x7f80000000000000029acdfb05217262'd_data.3e8eac6b8b4567.0000000000001f2e!='0x00000000000bb85cffffffffffffffff'o' Value size = 1731)
Put( Prefix = O key = 0x7f80000000000000029acdfb05217262'd_data.3e8eac6b8b4567.0000000000001f2e!='0xfffffffffffffffeffffffffffffffff6f00040000'x' Value size = 675)
Put( Prefix = O key = 0x7f80000000000000029acdfb05217262'd_data.3e8eac6b8b4567.0000000000001f2e!='0xfffffffffffffffeffffffffffffffff6f00080000'x' Value size = 395)
Put( Prefix = O key = 0x7f80000000000000029acdfb05217262'd_data.3e8eac6b8b4567.0000000000001f2e!='0xfffffffffffffffeffffffffffffffff'o' Value size = 1328)
Put( Prefix = X key = 0x0000000018a38deb Value size = 14)
Put( Prefix = X key = 0x0000000018a38dea Value size = 14)
Put( Prefix = X key = 0x000000000d7a035b Value size = 14)
Put( Prefix = X key = 0x000000000d7a035c Value size = 14)
Put( Prefix = X key = 0x000000000d7a0355 Value size = 14)
Put( Prefix = X key = 0x000000000d7a0356 Value size = 17)
Put( Prefix = X key = 0x000000001a54f6e4 Value size = 14)
Put( Prefix = X key = 0x000000001b1c061e Value size = 14)
Put( Prefix = X key = 0x000000000d7a038f Value size = 14)
Put( Prefix = X key = 0x000000000d7a0389 Value size = 14)
Put( Prefix = X key = 0x000000000d7a0358 Value size = 14)
Put( Prefix = X key = 0x000000000d7a035f Value size = 14)
Put( Prefix = X key = 0x000000000d7a0357 Value size = 14)
Put( Prefix = X key = 0x000000000d7a0387 Value size = 14)
Put( Prefix = X key = 0x000000000d7a038a Value size = 14)
Put( Prefix = X key = 0x000000000d7a0388 Value size = 14)
Put( Prefix = X key = 0x00000000134c3fbe Value size = 14)
Put( Prefix = X key = 0x00000000134c3fb5 Value size = 14)
Put( Prefix = X key = 0x000000000d7a036e Value size = 14)
Put( Prefix = X key = 0x000000000d7a036d Value size = 14)
Put( Prefix = X key = 0x00000000134c3fb8 Value size = 14)
Put( Prefix = X key = 0x000000000d7a0371 Value size = 14)
Put( Prefix = X key = 0x000000000d7a036a Value size = 14)
     0> 2020-01-16 01:10:13.413759 7f3350a14700 -1 /build/ceph/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_kv_sync_thread()' thread 7f3350a14700 time 2020-01-16 01:10:13.404113
/build/ceph/src/os/bluestore/BlueStore.cc: 8808: FAILED assert(r == 0)

     ceph version 12.2.12-11-gd3eae83543 (d3eae83543bffc0fc6c43823feb637fa851b6213) luminous (stable)
     1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x55c9a712d232]
     2: (BlueStore::_kv_sync_thread()+0x24c5) [0x55c9a6fb54b5]
     3: (BlueStore::KVSyncThread::entry()+0xd) [0x55c9a6ff608d]
     4: (()+0x7494) [0x7f33615f9494]
     5: (clone()+0x3f) [0x7f3360680acf]

I already cherry-picked these:
https://github.com/ceph/ceph/pull/28644

Greets,
Stefan
Am 16.01.20 um 17:00 schrieb Igor Fedotov:
Hi Stefan,

would you please share a log snippet prior to the assertion? Looks like RocksDB is failing during transaction submission...


Thanks,

Igor

On 1/16/2020 11:56 AM, Stefan Priebe - Profihost AG wrote:
Hello,

Does anybody know a fix for this ASSERT / crash?

2020-01-16 02:02:31.316394 7f8c3f5ab700 -1 /build/ceph/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_kv_sync_thread()' thread 7f8c3f5ab700 time 2020-01-16 02:02:31.304993
/build/ceph/src/os/bluestore/BlueStore.cc: 8808: FAILED assert(r == 0)

      ceph version 12.2.12-11-gd3eae83543 (d3eae83543bffc0fc6c43823feb637fa851b6213) luminous (stable)
      1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x55e6df9d9232]
      2: (BlueStore::_kv_sync_thread()+0x24c5) [0x55e6df8614b5]
      3: (BlueStore::KVSyncThread::entry()+0xd) [0x55e6df8a208d]
      4: (()+0x7494) [0x7f8c50190494]
      5: (clone()+0x3f) [0x7f8c4f217acf]

All BlueStore OSDs are crashing randomly from time to time (about once a week).

Greets,
Stefan
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



