Re: 15.2.2 Upgrade - Corruption: error in middle of record

Chris Palmer <chris.palmer@xxxxxxxxx> · Sat, 23 May 2020 08:53:42 +0100

Hi Ashley

Igor has done a great job of tracking down the problem, and we have 
finally shown evidence of the type of corruption it would produce in one 
of my WALs. Our feeling at the moment is that the problem can be 
worked around by setting bluefs_preextend_wal_files to false on affected OSDs 
while they are running (but see below), although Igor does note that 
there is a small risk in doing this. I've agreed a plan of action based 
on this route, recreating the failed OSDs, and then cycling through the 
others until all are healthy. I've started this now, and so far it looks 
promising, although of course I have to wait for recovery/rebalancing. 
This is the fastest route to recovery, although there are other options.
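
For anyone following along, here is a minimal sketch of how the per-OSD setting change might be scripted. It only builds and prints the command lines (a dry run) and executes nothing. The `ceph config set <who> <option> <value>` form is the standard centralized-config command; whether the new value actually takes effect on a running OSD without a restart is an assumption that should be confirmed against the ticket first. The OSD ids in the usage note are made up for illustration.

```python
import shlex

# Option name as given in this thread.
OPTION = "bluefs_preextend_wal_files"


def build_commands(osd_ids, value="false"):
    """Return one `ceph config set` argv list per OSD id (nothing is executed)."""
    return [
        ["ceph", "config", "set", f"osd.{osd_id}", OPTION, value]
        for osd_id in osd_ids
    ]


def render(commands):
    """Render argv lists as shell-quoted command lines for review."""
    return [shlex.join(argv) for argv in commands]
```

Calling `print("\n".join(render(build_commands([9, 12]))))` prints the two command lines so they can be reviewed before anything is run by hand.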

I'll post as it progresses. The good news seems to be that there 
shouldn't be any actual data corruption or loss, providing that this can 
be done before OSDs are taken down (other than as part of this process). 
My understanding is that there will be some degree of performance penalty 
until the root cause is fixed in the next release and preextending can 
be turned back on. However it does seem like I can get back to a 
stable/safe position without waiting for a software release.

I'm just working through this at the moment though, so please don't take 
the above as any form of recommendation. It is important not to try to 
restart OSDs though in the meantime. I'm sure Igor will publish some 
more expert recommendations in due course...

Regards, Chris

On 23/05/2020 06:54, Ashley Merrick wrote:
Thanks Igor,

Do you have any idea of an ETA or plan for people who are running 
15.2.2 to be able to patch / fix the issue?

I had a read of the ticket, and it seems the corruption is happening but 
the WAL is not read until the OSD restarts, so I imagine we will need some 
form of fix / patch we can apply to a running OSD before we then 
restart the OSD, as a normal OSD upgrade will require the OSD to 
restart to apply the code, resulting in a corrupt OSD.

Thanks

---- On Sat, 23 May 2020 00:12:59 +0800 *Igor Fedotov 
<ifedotov@xxxxxxx>* wrote ----

    Status update:

    Finally we have the first patch to fix the issue in master:
    https://github.com/ceph/ceph/pull/35201

    And ticket has been updated with root cause analysis:
    https://tracker.ceph.com/issues/45613

    On 5/21/2020 2:07 PM, Igor Fedotov wrote:

    @Chris - unfortunately it looks like the corruption is permanent,
    since valid WAL data has presumably been overwritten with other
    data. Hence I don't know of any way to recover. Perhaps you can try
    cutting the WAL file off, which will allow the OSD to start, with
    some of the latest ops lost. One can use the exported BlueFS as a
    drop-in replacement for the regular DB volume, but I'm not aware of
    the details.

    And the above is just speculation; I can't say for sure whether it
    helps...

    I can't explain why the WAL doesn't have a zero block in your case
    though. There is a small chance this is a different issue. Just in
    case, could you please search for 32K zero blocks over the whole
    file? And the same for another OSD?
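
The zero-block search requested above can be sketched in a few lines of Python. This is an illustration, not a tool from the Ceph tree: the function names are made up, and it assumes the zero blocks are aligned to 32K boundaries (the offset 0x470000 mentioned elsewhere in this thread is a multiple of 32K). An unaligned run of zeros would not be reported by the scan.

```python
BLOCK_SIZE = 32 * 1024  # 32K, the block size discussed in this thread


def find_zero_blocks(path, block_size=BLOCK_SIZE):
    """Return offsets of block-aligned chunks that contain only zero bytes."""
    zero = bytes(block_size)
    offsets = []
    with open(path, "rb") as f:
        offset = 0
        while True:
            chunk = f.read(block_size)
            if len(chunk) < block_size:
                break  # a short trailing chunk is ignored
            if chunk == zero:
                offsets.append(offset)
            offset += block_size
    return offsets


def is_zero_at(path, offset, length=BLOCK_SIZE):
    """Check whether `length` bytes starting at `offset` are all zero."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(length)
    return len(data) == length and data == bytes(length)
```

Run against an exported WAL file, `[hex(o) for o in find_zero_blocks(path)]` lists candidate offsets, and `is_zero_at(path, 0x470000)` checks one specific location.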

    Thanks,

    Igor

    > Short update on the issue:
    >
    > Finally we're able to reproduce the issue in master (not octopus),
    > investigating further..
    >
    > @Chris - to make sure you're facing the same issue could you please
    > check the content of the broken file. To do so:
    >
    > 1) run "ceph-bluestore-tool --path <path-to-osd> --out-dir <target
    > dir> --command bluefs-export"
    >
    > This will export bluefs files to <target dir>
    >
    > 2) Check the content for file db.wal/002040.log at offset 0x470000
    >
    > This will presumably contain 32K of zero bytes. Is this the case?
    >
    >
    > No hurry as I'm just making sure symptoms in Octopus are the same...
    >
    >
    > Thanks,
    >
    > Igor
    >
    > On 5/20/2020 5:24 PM, Igor Fedotov wrote:
    >> Chris,
    >>
    >> got them, thanks!
    >>
    >> Investigating....
    >>
    >>
    >> Thanks,
    >>
    >> Igor
    >>
    >> On 5/20/2020 5:23 PM, Chris Palmer wrote:
    >>> Hi Igor
    >>> I've sent you these directly as they're a bit chunky. Let me know if
    >>> you haven't got them.
    >>> Thx, Chris
    >>>
    >>> On 20/05/2020 14:43, Igor Fedotov wrote:
    >>>> Hi Chris,
    >>>>
    >>>> could you please share the full log prior to the first failure?
    >>>>
    >>>> Also if possible please set debug-bluestore/debug bluefs to 20 and
    >>>> collect another one for failed OSD startup.
    >>>>
    >>>>
    >>>> Thanks,
    >>>>
    >>>> Igor
    >>>>
    >>>>
    >>>> On 5/20/2020 4:39 PM, Chris Palmer wrote:
    >>>>> I'm getting similar errors after rebooting a node. Cluster was
    >>>>> upgraded 15.2.1 -> 15.2.2 yesterday. No problems after rebooting
    >>>>> during upgrade.
    >>>>>
    >>>>> On the node I just rebooted, 2/4 OSDs won't restart. Similar logs
    >>>>> from both. Logs from one below.
    >>>>> Neither OSD has compression enabled, although there is a
    >>>>> compression-related error in the log.
    >>>>> Both are replicated x3. One has data on HDD & separate WAL/DB on
    >>>>> NVMe partition, the other is everything on NVMe partition only.
    >>>>>
    >>>>> Feeling kinda nervous here - advice welcomed!!
    >>>>>
    >>>>> Thx, Chris
    >>>>>
    >>>>>
    >>>>>
    >>>>> 2020-05-20T13:14:00.837+0100 7f2e0d273700  3 rocksdb:
    >>>>> [table/block_based_table_reader.cc:1117] Encountered error while
    >>>>> reading data from compression dictionary block Corruption: block
    >>>>> checksum mismatch: expected 0, got 3423870535  in db/000304.sst
    >>>>> offset 18446744073709551615 size 18446744073709551615
    >>>>> 2020-05-20T13:14:00.841+0100 7f2e1957ee00  4 rocksdb:
    >>>>> [db/version_set.cc:3757] Recovered from manifest
    >>>>> file:db/MANIFEST-000312 succeeded,manifest_file_number is 312,
    >>>>> next_file_number is 314, last_sequence is 22320582, log_number is
    >>>>> 309,prev_log_number is 0,max_column_family is
    >>>>> 0,min_log_number_to_keep is 0
    >>>>>
    >>>>> 2020-05-20T13:14:00.841+0100 7f2e1957ee00  4 rocksdb:
    >>>>> [db/version_set.cc:3766] Column family [default] (ID 0), log
    >>>>> number is 309
    >>>>>
    >>>>> 2020-05-20T13:14:00.841+0100 7f2e1957ee00  4 rocksdb: EVENT_LOG_v1
    >>>>> {"time_micros": 1589976840843199, "job": 1, "event":
    >>>>> "recovery_started", "log_files": [313]}
    >>>>> 2020-05-20T13:14:00.841+0100 7f2e1957ee00  4 rocksdb:
    >>>>> [db/db_impl_open.cc:583] Recovering log #313 mode 0
    >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00  3 rocksdb:
    >>>>> [db/db_impl_open.cc:518] db.wal/000313.log: dropping 9044 bytes;
    >>>>> Corruption: error in middle of record
    >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00  3 rocksdb:
    >>>>> [db/db_impl_open.cc:518] db.wal/000313.log: dropping 86 bytes;
    >>>>> Corruption: missing start of fragmented record(2)
    >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00  4 rocksdb:
    >>>>> [db/db_impl.cc:390] Shutdown: canceling all background work
    >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00  4 rocksdb:
    >>>>> [db/db_impl.cc:563] Shutdown complete
    >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 -1 rocksdb: Corruption:
    >>>>> error in middle of record
    >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 -1
    >>>>> bluestore(/var/lib/ceph/osd/ceph-9) _open_db erroring opening db:
    >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00  1 bluefs umount
    >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00  1 fbmap_alloc
    >>>>> 0x55daf2b3a900 shutdown
    >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00  1 bdev(0x55daf3838700
    >>>>> /var/lib/ceph/osd/ceph-9/block) close
    >>>>> 2020-05-20T13:14:01.093+0100 7f2e1957ee00  1 bdev(0x55daf3838000
    >>>>> /var/lib/ceph/osd/ceph-9/block) close
    >>>>> 2020-05-20T13:14:01.341+0100 7f2e1957ee00 -1 osd.9 0 OSD:init:
    >>>>> unable to mount object store
    >>>>> 2020-05-20T13:14:01.341+0100 7f2e1957ee00 -1 ** ERROR:
    >>>>> osd init failed: (5) Input/output error
    >>>>> _______________________________________________
    >>>>> ceph-users mailing list -- ceph-users@xxxxxxx
    >>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx