Thanks again for your responsiveness and your advice. You saved our lives!
We reactivated recovery/backfilling/rebalancing and the recovery started. We now have to wait and see how it evolves.
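(For reference, a minimal sketch of the re-enabling, assuming it simply reverses the flags we had set earlier in this thread:
ceph osd unset pause
ceph osd unset nobackfill
ceph osd unset norebalance
ceph osd unset norecover
)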
Last question: we noticed (a few days ago, and it still occurs) that after ~1 hour the recovery stops (recovery throughput drops to 0). We can restart it by restarting ceph-mgr.target on the active manager... for another ~1 hour! It's strange because I cannot see any crash or relevant info in the logs!
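(In practice that restart was something like the following on the active manager node, assuming a systemd-managed deployment:
systemctl restart ceph-mgr.target
)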
Moreover, the ceph crash command hangs and there is no way to get any output. Maybe it's because of the huge number of failures on the OSDs!
Do you think these two problems could be related to the OSDs crashing? I will continue to investigate and may open a separate thread on this topic.
F.
On 30/04/2020 at 10:57, Igor Fedotov wrote:
I created the following ticket and PR to track/fix the issue with incomplete large writes when bluefs_buffered_io=1:
https://tracker.ceph.com/issues/45337
https://github.com/ceph/ceph/pull/34836
But in fact, setting bluefs_buffered_io to false is the mainstream approach for now, see https://github.com/ceph/ceph/pull/34224
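As a minimal sketch, assuming the centralized config store is in use on this Nautilus cluster (otherwise the option can go into the [osd] section of ceph.conf), the cluster-wide change could be applied with:
ceph config set osd bluefs_buffered_io false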
Francois, you can proceed with OSD.21 & .49; I can reproduce the issue locally, hence not much need for them now.
Still investigating what's happening with OSD.8...
As for reactivating recovery/backfill/rebalancing - I can't say for sure whether it's safe or not.
Thanks,
Igor
On 4/30/2020 1:39 AM, Francois Legrand wrote:
Hello,
We set bluefs_buffered_io to false for the whole cluster except for 2 OSDs (21 and 49), for which we decided to keep the value at true for future experiments/troubleshooting, as you asked.
We then restarted all 25 down OSDs and they started... except one (number 8), which still continues to crash with the same kind of errors.
I tried an fsck on this OSD, which ended with success. I set the debug level to 20 and recorded the logs.
You will find the logs here if you want to have a look: https://we.tl/t-GDvvvi2Gmm
Now we plan to reactivate the recovery, backfill and rebalancing if
you think it's safe.
F.
On 29/04/2020 at 16:45, Igor Fedotov wrote:
So the crash seems to be caused by the same issue - big (and
presumably incomplete) write and subsequent read failure.
I've managed to repro this locally. So disabling bluefs_buffered_io seems to be a remedy for now.
Additionally, I can observe multiple slow-op indications in this new log, and I think they cause those big writes. And I presume some RGW-triggered ops are in flight - bucket resharding or removal or something...
I've seen multiple reports about this stuff causing OSD slowdown,
high memory utilization and finally huge reads and/or writes from
RocksDB.
Don't know how to deal with this at the moment...
Thanks,
Igor
On 4/29/2020 5:33 PM, Francois Legrand wrote:
Here are the logs of the newly crashed osd.
F.
On 29/04/2020 at 16:21, Igor Fedotov wrote:
Sounds interesting - could you please share the crash log for
these new OSDs? They presumably suffer from another issue. At
least that first crash is caused by something else.
"bluefs buffered io" can be injected on the fly but I expect it to
help when OSD isn't starting up only.
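For on-the-fly injection, a sketch (adjust the target from osd.* to individual OSDs as needed):
ceph tell osd.* injectargs '--bluefs_buffered_io=false'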
On 4/29/2020 5:17 PM, Francois Legrand wrote:
Ok we will try that.
Indeed, restarting osd.5 triggered the crash of two other OSDs in the cluster.
Thus we will set bluefs buffered io = false for all OSDs and force bluefs buffered io = true for one of the down OSDs.
Does that modification require injectargs, or is changing it in the configuration enough to have it applied on the fly?
F.
On 29/04/2020 at 15:56, Igor Fedotov wrote:
That's bluefs buffered io = false which did the trick.
It modified the write path and this presumably fixed the large write(s).
I'm trying to reproduce this locally, but please preserve at least one failing OSD (i.e. do not start it with buffered io disabled) for future experiments/troubleshooting for a while, if possible.
Thanks,
Igor
On 4/29/2020 4:50 PM, Francois Legrand wrote:
Hi,
It seems much better with these options. The OSD has now been up for 10 minutes without crashing (before, it was restarting after ~1 minute).
F.
On 29/04/2020 at 15:16, Igor Fedotov wrote:
Hi Francois,
I'll write a more thorough response a bit later.
Meanwhile could you please try OSD startup with the following
settings now:
debug-bluefs and debug-bdev = 20
bluefs sync write = false
bluefs buffered io = false
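As a sketch, assuming osd.5 is still the one under test, these could also be passed directly on a foreground startup:
ceph-osd -f -i 5 --debug-bluefs 20 --debug-bdev 20 --bluefs_sync_write=false --bluefs_buffered_io=false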
Thanks,
Igor
On 4/29/2020 3:35 PM, Francois Legrand wrote:
Hi Igor,
Here is what we did:
First, as other OSDs were going down, we stopped all operations with
ceph osd set norecover
ceph osd set norebalance
ceph osd set nobackfill
ceph osd set pause
to avoid further crashes!
Then we followed your recommendations (still testing on osd.5): in /etc/ceph/ceph.conf we added:
[osd.5]
debug bluefs = 20
debug bdev = 20
We ran:
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-5 -l
/var/log/ceph/bluestore-tool-fsck-osd-5.log --log-level 20 >
/var/log/ceph/bluestore-tool-fsck-osd-5.out 2>&1
It ended with "fsck success".
It seems that the default value for bluefs sync write is false (https://github.com/ceph/ceph/blob/v14.2.8/src/common/options.cc), thus we changed /etc/ceph/ceph.conf to:
[osd.5]
debug bluefs = 20
debug bdev = 20
bluefs sync write = true
and restarted the OSD. It crashed!
We tried explicitly setting bluefs sync write = false and restarted... same result!
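(One way to double-check the value actually picked up by the running OSD, assuming it stays up long enough for its admin socket to answer, would be something like:
ceph daemon osd.5 config get bluefs_sync_write
)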
The logs are here: https://we.tl/t-HMiFDu22XH
Moreover, we have a RADOS Gateway pool with hundreds of 4 GB files. Can this be the origin of the problem? Do you think that removing this pool could help?
Thanks again for your expertise.
F.
On 28/04/2020 at 18:52, Igor Fedotov wrote:
Short update - please use the bluefs_sync_write parameter instead of bdev-aio. Changing the latter isn't actually supported.
On 4/28/2020 7:35 PM, Igor Fedotov wrote:
Francois,
here are some observations gathered from your log.
1) Rocksdb reports error on the following .sst file:
-35> 2020-04-28 15:23:47.612 7f4856e82a80 -1 rocksdb: Corruption: Bad table magic number: expected 9863518390377041911, found 12950032858166034944 in db/068269.sst
2) which relates to BlueFS ino 53361:
-50> 2020-04-28 15:23:45.103 7f4856e82a80 10 bluefs open_for_read db/068269.sst (random)
-49> 2020-04-28 15:23:45.103 7f4856e82a80 10 bluefs open_for_read h 0x557914fb80b0 on file(ino 53361 size 0xc496f919 mtime 2020-04-28 15:23:39.827515 bdev 1 allocated c4970000 extents [1:0x383db280000~c4970000])
3) and the failed read happens at the end (0xc496f8e4~35, the last 0x35 bytes) of this huge (3+ GB) file:
-44> 2020-04-28 15:23:47.514 7f4856e82a80 10 bluefs _read_random h 0x557914fb80b0 0xc496f8e4~35 from file(ino 53361 size 0xc496f919 mtime 2020-04-28 15:23:39.827515 bdev 1 allocated c4970000 extents [1:0x383db280000~c4970000])
-43> 2020-04-28 15:23:47.514 7f4856e82a80 20 bluefs _read_random left 0x71c 0xc496f8e4~35
-42> 2020-04-28 15:23:47.514 7f4856e82a80 20 bluefs _read_random got 53
4) This .sst file was created from scratch shortly before, with a single-shot 3+ GB write:
-88> 2020-04-28 15:23:35.661 7f4856e82a80 10 bluefs open_for_write db/068269.sst
-87> 2020-04-28 15:23:35.661 7f4856e82a80 20 bluefs open_for_write mapping db/068269.sst to bdev 1
-86> 2020-04-28 15:23:35.662 7f4856e82a80 10 bluefs open_for_write h 0x5579145e7a40 on file(ino 53361 size 0x0 mtime 2020-04-28 15:23:35.663142 bdev 1 allocated 0 extents [])
-85> 2020-04-28 15:23:39.826 7f4856e82a80 10 bluefs _flush 0x5579145e7a40 0x0~c496f919 to file(ino 53361 size 0x0 mtime 2020-04-28 15:23:35.663142 bdev 1 allocated 0 extents [])
5) Presumably RocksDB created this file in an attempt to recover/compact/process another existing file (ino 52405), which is pretty large as well.
Please find multiple earlier reads; the last one:
-92> 2020-04-28 15:23:29.857 7f4856e82a80 10 bluefs _read h 0x5579147286e0 0xc6788000~8000 from file(ino 52405 size 0xc67888a0 mtime 2020-04-25 13:34:55.325699 bdev 0 allocated c6790000 extents [1:0x381c8220000~10000,1:
The rationale for tying these two files together is their rather uncommon file sizes.
So you have a 3+ GB single-shot BlueFS write and an immediate read from the end of the written extent which returns an unexpected magic number.
It's well known in the software world that large (2+ GB) data processing implementations tend to be error-prone, and Ceph is no exception. Here are a couple of recent examples which are pretty close to your case:
https://github.com/ceph/ceph/commit/4d33114a40d5ae0d541c36175977ca22789a3b88
https://github.com/ceph/ceph/commit/207806abaa91259d9bb726c2555e7e21ac541527
Although these are already fixed in Nautilus 14.2.8, there might be other issues along the write path (including H/W firmware).
The good news is that the failure happens on a newly written file (remember, the invalid magic is read at the end(!) of this new large file shortly(!) after the write). So it looks like the data hasn't landed on disk yet. And hopefully the original data is untouched.
You can try to run ceph-bluestore-tool's fsck command on the broken OSD. This will open the DB read-only and hopefully prove the idea that the issue is on the write path rather than already-corrupted data.
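A sketch of the invocation (substitute the failing OSD's id; assuming the default data path layout):
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-<id>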
Also, you might want to set bdev-aio to false (or true if it's already false) and try OSD startup once more - this will go via a slightly different write path and may provide a workaround.
Also, please collect debug logs for OSD startup (with both the current and the updated bdev-aio parameter) and --debug-bdev/--debug-bluefs set to 20. You can omit the --debug-bluestore increase for now to reduce the log size.
Thanks,
Igor
On 4/28/2020 5:16 PM, Francois Legrand wrote:
Here is the output of ceph-bluestore-tool bluefs-bdev-sizes
inferring bluefs devices from bluestore path
slot 1 /var/lib/ceph/osd/ceph-5/block -> /dev/dm-17
1 : device size 0x746c0000000 : own 0x[37e1eb00000~4a82900000] = 0x4a82900000 : using 0x5bc780000(23 GiB)
The result with debug-bluestore (and debug-bluefs) set to 20 for osd.5 is at the following address (28 MB):
https://wetransfer.com/downloads/a193ab15ab5e2395fe2462c963507a7f20200428141355/5da2ebf0d33750a5fde85bf662cf0e6d20200428141415/55849f?utm_campaign=WT_email_tracking&utm_content=general&utm_medium=download_button&utm_source=notify_recipient_email
Thanks for your help.
F.
On 28/04/2020 at 13:33, Igor Fedotov wrote:
Hi Francois,
Could you please share OSD startup log with
debug-bluestore (and debug-bluefs) set to 20.
Also please run ceph-bluestore-tool's bluefs-bdev-sizes
command and share the output.
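Something along these lines, assuming the default data path for the OSD in question:
ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-5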
Thanks,
Igor
On 4/28/2020 12:55 AM, Francois Legrand wrote:
Hi all,
*** Short version ***
Is there a way to repair a RocksDB from the errors "Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch" and "_open_db erroring opening db"?
*** Long version ***
We operate a Nautilus Ceph cluster (with 100 disks of 8 TB in 6 servers + 4 mons/mgrs + 3 MDSes).
We recently (Monday 20) upgraded from 14.2.7 to 14.2.8.
This triggered a rebalancing of some data.
Two days later (Wednesday 22) we had a very short power outage. Only one of the OSD servers went down (and unfortunately died).
This triggered a reconstruction of the lost OSDs.
Operations went fine until Saturday 25, when some OSDs in the 5 remaining servers started to crash, apparently for no reason.
We tried to restart them, but they crashed again. We ended up with 18 OSDs down (+ 16 in the dead server, so 34 OSDs down out of 100).
Looking at the logs, we found for all the crashed OSDs:
-237> 2020-04-25 16:32:51.835 7f1f45527a80 3 rocksdb:
[table/block_based_table_reader.cc:1117] Encountered
error while reading data from compression dictionary
block Corruption: block checksum mismatch: expected 0,
got 2729370997 in db/181355.sst offset
18446744073709551615 size 18446744073709551615
and
2020-04-25 16:05:47.251 7fcbd1e46a80 -1
bluestore(/var/lib/ceph/osd/ceph-3) _open_db erroring
opening db:
We also noticed that the "Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch" error was already present a few days before the crashes.
We also have some OSDs with this error that are still up.
We tried to repair with:
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-3
repair
but with no success (it ends with "_open_db erroring opening db").
So, does anybody have an idea how to fix this, or at least know whether it's possible to repair and correct the "Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch" and "_open_db erroring opening db" errors?
Thanks for your help (we are desperate because we are going to lose data and are fighting to save whatever we can)!!!
F.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx