Excellent analysis, Igor!
Mark
On 4/28/20 11:35 AM, Igor Fedotov wrote:
Francois,
here are some observations gathered from your log.
1) RocksDB reports an error on the following .sst file:
-35> 2020-04-28 15:23:47.612 7f4856e82a80 -1 rocksdb: Corruption: Bad table magic number: expected 9863518390377041911, found 12950032858166034944 in db/068269.sst
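(Note: the "expected" value got wrapped by the mailer; rejoined it is 9863518390377041911, which is exactly RocksDB's block-based table magic. A quick Python check:)

```python
# "expected" magic from the log, rejoined across the mail line wrap:
expected = 9863518390377041911
print(hex(expected))  # 0x88e241b785f4cff7 -- RocksDB's block-based table magic

# the value actually found at the tail of db/068269.sst:
found = 12950032858166034944
print(hex(found))
```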
2) which relates to BlueFS ino 53361:
-50> 2020-04-28 15:23:45.103 7f4856e82a80 10 bluefs open_for_read db/068269.sst (random)
-49> 2020-04-28 15:23:45.103 7f4856e82a80 10 bluefs open_for_read h 0x557914fb80b0 on file(ino 53361 size 0xc496f919 mtime 2020-04-28 15:23:39.827515 bdev 1 allocated c4970000 extents [1:0x383db280000~c4970000])
3) and the failed read happens at the end (0xc496f8e4~35, i.e. the last 0x35 bytes) of this huge (3+ GB) file:
-44> 2020-04-28 15:23:47.514 7f4856e82a80 10 bluefs _read_random h 0x557914fb80b0 0xc496f8e4~35 from file(ino 53361 size 0xc496f919 mtime 2020-04-28 15:23:39.827515 bdev 1 allocated c4970000 extents [1:0x383db280000~c4970000])
-43> 2020-04-28 15:23:47.514 7f4856e82a80 20 bluefs _read_random left 0x71c 0xc496f8e4~35
-42> 2020-04-28 15:23:47.514 7f4856e82a80 20 bluefs _read_random got 53
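A quick check of the numbers confirms the read sits exactly at end-of-file (the .sst footer/magic area) and that the file is well past the 2 GiB boundary:

```python
# numbers taken from the bluefs lines above
read_off = 0xc496f8e4   # start of the failed read
read_len = 0x35         # 53 bytes -- the sst footer/magic area
file_size = 0xc496f919  # size of ino 53361

print(read_off + read_len == file_size)  # True: the read ends exactly at EOF
print(file_size > 2**31)                 # True: ~3.07 GiB, past the 2 GiB mark
```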
4) This .sst file was created from scratch shortly before with a single-shot 3+GB write:
-88> 2020-04-28 15:23:35.661 7f4856e82a80 10 bluefs open_for_write db/068269.sst
-87> 2020-04-28 15:23:35.661 7f4856e82a80 20 bluefs open_for_write mapping db/068269.sst to bdev 1
-86> 2020-04-28 15:23:35.662 7f4856e82a80 10 bluefs open_for_write h 0x5579145e7a40 on file(ino 53361 size 0x0 mtime 2020-04-28 15:23:35.663142 bdev 1 allocated 0 extents [])
-85> 2020-04-28 15:23:39.826 7f4856e82a80 10 bluefs _flush 0x5579145e7a40 0x0~c496f919 to file(ino 53361 size 0x0 mtime 2020-04-28 15:23:35.663142 bdev 1 allocated 0 extents [])
5) Presumably RocksDB created this file in an attempt to recover/compact/process another existing file (ino 52405), which is pretty large as well. The log shows multiple earlier reads of it; here is the last one:
-92> 2020-04-28 15:23:29.857 7f4856e82a80 10 bluefs _read h 0x5579147286e0 0xc6788000~8000 from file(ino 52405 size 0xc67888a0 mtime 2020-04-25 13:34:55.325699 bdev 0 allocated c6790000 extents [1:0x381c8220000~10000,1:
The rationale for tying these two files together is their uncommon file sizes.
So you have a 3+ GB single-shot BlueFS write and an immediate read from the end of the written extent that returns an unexpected magic number.
It's well known in the software world that implementations handling large (2+ GB) chunks of data tend to be error-prone, and Ceph is no exception. Here are a couple of recent examples pretty close to your case:
https://github.com/ceph/ceph/commit/4d33114a40d5ae0d541c36175977ca22789a3b88
https://github.com/ceph/ceph/commit/207806abaa91259d9bb726c2555e7e21ac541527
Although these are already fixed in Nautilus 14.2.8, there might be others along the write path (including H/W firmware).
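Purely as an illustration of that bug class (this is not the actual Ceph code, just a sketch): if any layer along the write path keeps a length in a signed 32-bit integer, this very write size wraps negative:

```python
size = 0xc496f919  # the 3+ GB single-shot write size from the log

# sketch of a careless signed 32-bit cast somewhere on the write path
as_int32 = (size & 0xffffffff) - (2**32 if size & 0x80000000 else 0)
print(as_int32)      # -996738791: the length no longer makes sense
print(size > 2**31)  # True: firmly in the >2 GiB danger zone
```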
The good news is that the failure happens on a newly written file (remember, the invalid magic is read at the end(!) of this new large file shortly(!) after the write). So it looks like the data hasn't properly landed on disk yet, and hopefully the original data is untouched.
You can try running ceph-bluestore-tool's fsck command against the broken OSD, e.g.:
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-5
This opens the DB read-only and will hopefully confirm that the issue is on the write path rather than in already-corrupted on-disk data.
Also, you might want to set bdev_aio to false (or to true if it's already false) and try OSD startup once more - this takes a somewhat different write path and might provide a workaround.
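For reference, a minimal ceph.conf fragment for that toggle (assuming the affected OSD is osd.5, as elsewhere in this thread; flip the value relative to your current setting - the default is true):

```ini
[osd.5]
# BlueStore block-device AIO toggle; default: true
bdev_aio = false
```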
Also, please collect debug logs for the OSD startup attempts (with both the current and the updated bdev_aio setting) with --debug-bdev and --debug-bluefs set to 20. You can skip raising --debug-bluestore for now to reduce the log size.
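The equivalent ceph.conf fragment, if you prefer setting the debug levels there instead of on the command line (osd.5 assumed as the affected OSD):

```ini
[osd.5]
debug_bdev = 20
debug_bluefs = 20
# debug_bluestore left at its default to keep the log smaller
```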
Thanks,
Igor
On 4/28/2020 5:16 PM, Francois Legrand wrote:
Here is the output of ceph-bluestore-tool bluefs-bdev-sizes
inferring bluefs devices from bluestore path
slot 1 /var/lib/ceph/osd/ceph-5/block -> /dev/dm-17
1 : device size 0x746c0000000 : own 0x[37e1eb00000~4a82900000] = 0x4a82900000 : using 0x5bc780000(23 GiB)
The result with debug-bluestore (and debug-bluefs) set to 20 for osd.5 is at the following address (28 MB):
https://wetransfer.com/downloads/a193ab15ab5e2395fe2462c963507a7f20200428141355/5da2ebf0d33750a5fde85bf662cf0e6d20200428141415/55849f?utm_campaign=WT_email_tracking&utm_content=general&utm_medium=download_button&utm_source=notify_recipient_email
Thanks for your help.
F.
On 28/04/2020 at 13:33, Igor Fedotov wrote:
Hi Francois,
Could you please share an OSD startup log with debug-bluestore (and debug-bluefs) set to 20?
Also please run ceph-bluestore-tool's bluefs-bdev-sizes command and
share the output.
Thanks,
Igor
On 4/28/2020 12:55 AM, Francois Legrand wrote:
Hi all,
*** Short version ***
Is there a way to repair a RocksDB that reports "Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch" and "_open_db erroring opening db"?
*** Long version ***
We operate a Nautilus Ceph cluster (with 100 disks of 8 TB in 6 servers + 4 mons/mgrs + 3 MDSes).
We recently (Monday the 20th) upgraded from 14.2.7 to 14.2.8, which triggered a rebalancing of some data.
Two days later (Wednesday the 22nd) we had a very short power outage. Only one of the OSD servers went down (and unfortunately died). This triggered a reconstruction of the lost OSDs. Operations went fine until Saturday the 25th, when some OSDs on the 5 remaining servers started to crash for no apparent reason.
We tried to restart them, but they crashed again. We ended up with 18 OSDs down (+ 16 in the dead server, so 34 OSDs down out of 100).
Looking at the logs, we found for all the crashed OSDs:
-237> 2020-04-25 16:32:51.835 7f1f45527a80 3 rocksdb: [table/block_based_table_reader.cc:1117] Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch: expected 0, got 2729370997 in db/181355.sst offset 18446744073709551615 size 18446744073709551615
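(Incidentally, the strange offset/size printed there are both 2**64 - 1, i.e. presumably a uint64(-1) "not set" sentinel rather than a real file position:)

```python
# the odd offset/size from the rocksdb error message above:
val = 18446744073709551615
print(val == 2**64 - 1)  # True -- looks like uint64(-1), a sentinel,
print(hex(val))          # 0xffffffffffffffff     not a real offset
```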
and
2020-04-25 16:05:47.251 7fcbd1e46a80 -1 bluestore(/var/lib/ceph/osd/ceph-3) _open_db erroring opening db:
We also noticed that the "Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch" error was already present a few days before the crashes. We also have some OSDs with this error that are still up.
We tried to repair with:
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-3 repair
but with no success (it ends with _open_db erroring opening db).
So, does somebody have an idea how to fix this, or at least know whether it's possible to repair and correct the "Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch" and "_open_db erroring opening db" errors?
Thanks for your help (we are desperate because we will lose data and are fighting to save what we can)!!!
F.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx