Re: rocksdb: Corruption: missing start of fragmented record

On Mon, Nov 20, 2017 at 9:27 AM, Michael Schmid <meheschmid@xxxxxx> wrote:
> Gregory Farnum wrote:
>> Your hardware and configuration is very relevant.
>> [...]
>> I'd look at whether you have a writeback cache somewhere that isn't
>> reflecting ordering requirements, or if your disk passes a crash consistency
>> tester. (No, I don't know one off-hand. But many disks lie horribly even
>> about stuff like flushes.)
> It may certainly have had something to do with how I managed to end up with
> the broken rocksdb WAL log. Maybe this is not the best possible behavior
> when one simulates a crash or drive disconnect. Perhaps if I can
> get this OSD back in action, and the issue occurs entirely predictably on
> another test, I'll eventually start to see a pattern where / how it happens
> & maybe even find out what hardware / configuration changes might be needed
> to prevent the WAL from corrupting. Perhaps.
>
> --
>
> However, my actual & immediate Ceph + ceph-users relevant problem with this
> is basically only that I cannot seem to figure out how one could deal with
> such an already broken rocksdb WAL log.
>
> 1. Ceph's tooling and rocksdb don't *appear* to be able to deal with this
> WAL file once it has been corrupted, certainly not with the commands that
> I tried.
> I initially had hoped for some tool to be able to do something - drop the
> log, revert to an earlier backup of a consistent db - any option like that
> that I might have missed. Judging by this ML so far, I'm going to guess
> there is no such thing? So the subsequent problem is:

The error is pretty clear: "Corruption: missing start of fragmented record(2)".
What that says to me is that rocksdb has a journal entry saying that
record *does* exist, but it's missing the opening block or something;
i.e., during an atomic write it (1) wrote down a lookaside block, (2)
flushed that block to disk, (3) journaled that it had written the
block. But now on restart, it's finding out that (2) apparently didn't
happen.
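
To make that concrete, here's a toy sketch of that ordering (made-up
names and a made-up on-disk format, nothing like the real RocksDB/BlueFS
code; it's just the shape of the invariant):

// Toy illustration of the write-ahead ordering described above, NOT the
// real RocksDB/BlueFS code. The invariant: the data block must be durable
// before the journal record that claims it exists.
#include <sys/types.h>
#include <unistd.h>
#include <string>

bool atomic_update(int data_fd, int journal_fd,
                   const std::string& block, off_t block_off) {
    // (1) write the lookaside block
    if (pwrite(data_fd, block.data(), block.size(), block_off) !=
        (ssize_t)block.size())
        return false;
    // (2) flush it; if the device acks the flush but only cached the
    //     data, a crash here leaves step (3) pointing at nothing
    if (fsync(data_fd) != 0)
        return false;
    // (3) journal that the block now exists at block_off
    std::string rec = "block@" + std::to_string(block_off) + "\n";
    if (write(journal_fd, rec.data(), rec.size()) != (ssize_t)rec.size())
        return false;
    return fsync(journal_fd) == 0;
}
// Replay on restart reads the journal, sees the record from (3), goes
// looking for the block from (1)/(2), and finds it never made it to disk.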

>
> 2. I do not know how I can get manual, filewise access to the rocksdb WAL
> logs. This may be immensely simple, but I simply don't know how.
> I don't have any indication that either 1. or 2. is failing due to hardware
> or configuration specifics (...beyond having this broken WAL log) so far.

Rocksdb may offer repair tools for this (I have no idea), but the
fundamental issue is that as far as the program can tell, the
underlying hardware lied, the disk state is corrupted, and it has no
idea what data it can trust or not at this point. Same with Ceph; the
OSD has no desire to believe anything a corrupted disk tells it since
that can break all of our invariants.
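
For what it's worth, glancing at the RocksDB headers there is a
RepairDB() entry point (presumably what the Repairer wiki page mentioned
further down wraps), but it operates on an ordinary DB directory, so
you'd first need filewise access (more on that below), and it simply
drops whatever it can't parse. Even a "successful" repair still leaves
you not knowing which data to trust. A rough sketch in case somebody
wants to experiment on a copy; the path is a placeholder:

// Illustrative only: run RocksDB's repairer over a standalone copy of
// the DB files (the path below is a placeholder). RepairDB() rebuilds
// what it can and silently drops anything it cannot parse.
#include <rocksdb/db.h>
#include <rocksdb/options.h>
#include <iostream>

int main() {
    rocksdb::Options options;
    rocksdb::Status s = rocksdb::RepairDB("/path/to/exported/db", options);
    std::cout << "RepairDB: " << s.ToString() << std::endl;
    return s.ok() ? 0 : 1;
}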
BlueStore is a custom block device-managing system; we have a way to
mount and poke at it via FUSE, but that assumes the data on disk makes
sense. In this case, it doesn't (RocksDB stores the disk layout
metadata). Somebody more familiar with BlueStore development may know
if there's a way to mount only the "BlueFS" portion that RocksDB
writes its own data to; if there is, it's just a bunch of .ldb files or
whatever, but those are again a custom data format that you'll need
rocksdb expertise to do anything with...
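
One knob that might be worth knowing about if a standalone copy of
those files ever materializes: RocksDB's open-time options include
wal_recovery_mode, which can be told to skip corrupted WAL records
rather than refuse to replay them. The skip modes silently throw data
away, so this is strictly for poking at a copy, never for putting the
OSD back in service. A minimal sketch, again with a placeholder path:

// Illustrative only: open a standalone copy of the DB while telling
// RocksDB to tolerate a corrupted WAL tail (the path is a placeholder).
// Records skipped this way are simply lost.
#include <rocksdb/db.h>
#include <rocksdb/options.h>
#include <iostream>

int main() {
    rocksdb::Options options;
    options.wal_recovery_mode =
        rocksdb::WALRecoveryMode::kSkipAnyCorruptedRecords;
    rocksdb::DB* db = nullptr;
    rocksdb::Status s =
        rocksdb::DB::Open(options, "/path/to/exported/db", &db);
    std::cout << "Open: " << s.ToString() << std::endl;
    delete db;
    return s.ok() ? 0 : 1;
}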

Toss this disk and let Ceph do its recovery thing. Look hard at what
your hardware configuration is doing to make sure it doesn't happen
again. *shrug*
-Greg

>
>> As you note, the WAL should be able to handle being incompletely-written
> Yes, I'd also have thought so? But it apparently just isn't able to deal
> with this log file corruption. Maybe it is not an extremely specific bug.
> Maybe a lot of possible WAL corruptions might throw a comparable error and
> prevent replay.
>
>> and both Ceph and RocksDB are designed to handle failures mid-write.
> As far as I can tell, not in this WAL log case, no. It would certainly be
> really interesting to see at this point if just moving or deleting that WAL
> log allows everything to continue and the OSD to go online, and if then
> doing a scrub fixes the entirety of this issue. Maybe everything is
> essentially fine apart from JUST the WAL log replay and maybe one or another
> bit of a page on the OSD.
>
>> That RocksDB *isn't* doing that here implies either 1) there's a fatal bug
>> in rocksdb
> Not so sure. Ultimately rocksdb does seem to throw a fairly indicative
> error: "db/001005.log: dropping 3225 bytes; Corruption: missing start of
> fragmented record(2)".
> Maybe they intend that users use a repair tool at
> https://github.com/facebook/rocksdb/wiki/RocksDB-Repairer . Or maybe it's a
> case for manual interaction with the file.
>
> But my point 2 (namely that I don't even understand how to get filewise
> access to rocksdb's files) has so far prevented me from trying either.
>
>
>
> Thanks for your input!
>
> -Michael
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


