Hi Ashley
Setting bluefs_preextend_wal_files to false should stop any further
corruption of the WAL (subject to the small risk of doing this while the
OSD is active). Over time WAL blocks will be recycled and overwritten
with new, good blocks, so the extent of the corruption may decrease or
even disappear entirely. However, you can't tell whether this has
happened, so leaving each OSD running for a while may decrease the
chances of having to recreate it.
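For what it's worth, injecting it into running OSDs can be done along
these lines (just a sketch - check the option name and syntax against
your own version before trying it):

    # push the override into all running OSDs without restarting them
    ceph tell osd.* injectargs '--bluefs_preextend_wal_files=false'

    # confirm the running value on a particular OSD (run on its host)
    ceph daemon osd.9 config get bluefs_preextend_wal_files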
Having tried changing the parameter on one, then another, I've taken the
risk of resetting it on all (running) OSDs, and nothing untoward seems
to have happened. I have removed and recreated both failed OSDs (both on
the node that was rebooted). They are in different crush device classes
so I know that they are used by disjoint sets of PGs. osd.9 has been
recreated, backfilled, and stopped/started without issue. osd.2 has been
recreated and is currently backfilling. When that has finished I will
restart osd.2 and expect that the restart will not find any corruption.
Following that I will cycle through all other OSDs, stopping and
starting each in turn. If one fails to restart, I will replace it, wait
until it backfills, then stop/start it.
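(The stop/start cycle itself is nothing special. On my plain systemd,
non-containerized deployment it is roughly the following for each OSD in
turn, waiting for the cluster to settle before moving on - adapt it to
however your OSDs are managed:)

    systemctl stop ceph-osd@2    # an OSD with a corrupt WAL will fail to come back up
    systemctl start ceph-osd@2
    ceph -s                      # wait for HEALTH_OK before touching the next OSD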
Do be aware that you can set the parameter globally (for all OSDs)
and/or individually. I made sure the global setting was in place before
creating new OSDs. (There might be other ways to achieve this on the
command line when creating a new one.)
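As an illustration only, the global setting can go into the mon config
database roughly as below, and the same option can be set per OSD.
Again, verify the option name and behaviour on your own version:

    ceph config set osd bluefs_preextend_wal_files false    # global: all OSDs, including new ones
    ceph config set osd.2 bluefs_preextend_wal_files false  # or individually, per OSD
    ceph config get osd.2 bluefs_preextend_wal_files        # check what a given OSD will pick up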
Hope that's clear. But once again, please don't take this as advice on
what you should do. That should come from the experts!
Regards, Chris
On 23/05/2020 10:03, Ashley Merrick wrote:
Hello Chris,
Great to hear, few questions.
Once you have injected bluefs_preextend_wal_files as false, are
you just rebuilding the OSDs that failed, or are you going through
and rebuilding every OSD, even the working ones?
Or does setting the bluefs_preextend_wal_files value to false and
leaving the OSD running fix the WAL automatically?
Thanks
---- On Sat, 23 May 2020 15:53:42 +0800 Chris Palmer
<chris.palmer@xxxxxxxxx> wrote ----
Hi Ashley
Igor has done a great job of tracking down the problem, and we
have finally shown evidence of the type of corruption it would
produce in one of my WALs. Our feeling at the moment is that the
problem can be worked around by setting bluefs_preextend_wal_files to
false on affected OSDs while they are running (but see below),
although Igor does note that there is a small risk in doing this.
I've agreed a plan of action based on this route, recreating the
failed OSDs, and then cycling through the others until all are
healthy. I've started this now, and so far it looks promising,
although of course I have to wait for recovery/rebalancing. This
is the fastest route to recovery, although there are other options.
I'll post as it progresses. The good news seems to be that there
shouldn't be any actual data corruption or loss, providing that
this can be done before OSDs are taken down (other than as part of
this process). My understanding is that there will be some degree of
performance penalty until the root cause is fixed in the next
release and preextending can be turned back on. However it does
seem like I can get back to a stable/safe position without waiting
for a software release.
I'm just working through this at the moment though, so please
don't take the above as any form of recommendation. It is
important not to try to restart OSDs in the meantime, though. I'm
sure Igor will publish some more expert recommendations in due
course...
Regards, Chris
On 23/05/2020 06:54, Ashley Merrick wrote:
Thanks Igor,
Do you have any idea of an ETA or plan for people who are
running 15.2.2 to be able to patch/fix the issue?
I had a read of the ticket, and it seems the corruption is
happening but the WAL is not read until OSD restart, so I
imagine we will need some form of fix/patch we can apply to
a running OSD before we then restart it, as a normal OSD
upgrade requires the OSD to restart to apply the new code,
resulting in a corrupt OSD.
Thanks
---- On Sat, 23 May 2020 00:12:59 +0800 Igor Fedotov
<ifedotov@xxxxxxx> wrote ----
Status update:
Finally we have the first patch to fix the issue in master:
https://github.com/ceph/ceph/pull/35201
And the ticket has been updated with the root cause analysis:
https://tracker.ceph.com/issues/45613
On 5/21/2020 2:07 PM, Igor Fedotov wrote:
@Chris - unfortunately it looks like the corruption is
permanent, since valid WAL data have presumably been
overwritten with other data. Hence I don't know any way to
recover - perhaps you could try cutting the WAL file off,
which would allow the OSD to start, with some of the latest
ops lost.
One could use the exported BlueFS as a drop-in replacement
for the regular DB volume, but I'm not aware of the details.
And the above are just speculations; I can't say for sure
whether it would help...
I can't explain why the WAL doesn't have a zero block in
your case, though. There is a small chance this is a
different issue. Just in case, could you please search for
32K zero blocks over the whole file? And the same for
another OSD?
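Something quick along these lines would do (just a rough bash
sketch; the file name comes from your bluefs-export output,
adjust the path as needed):

    # scan a bluefs-exported WAL file for 32 KiB blocks that are entirely zero
    f=db.wal/002040.log
    size=$(stat -c %s "$f")
    for ((off = 0; off < size; off += 32768)); do
        if cmp -s <(dd if="$f" bs=32768 skip=$((off / 32768)) count=1 2>/dev/null) \
                  <(head -c 32768 /dev/zero); then
            printf 'all-zero 32K block at offset 0x%x\n' "$off"
        fi
    done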
Thanks,
Igor
> Short update on the issue:
>
> Finally we're able to reproduce the issue in master (not
octopus),
> investigating further..
>
> @Chris - to make sure you're facing the same issue could
you please
> check the content of the broken file. To do so:
>
> 1) run "ceph-bluestore-tool --path <path-to-osd>
--our-dir <target
> dir> --command bluefs-export
>
> This will export bluefs files to <target dir>
>
> 2) Check the content of file db.wal/002040.log at offset 0x470000
>
> This will presumably contain 32K of zero bytes. Is this
the case?
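> For example (just a sketch, using the file exported above), something
> like the following would show whether that region is all zeros:
>
>     # dump 32K starting at offset 0x470000 of the exported WAL file
>     dd if=db.wal/002040.log bs=1 skip=$((0x470000)) count=32768 2>/dev/null | hexdump -C | head
>
> If the corruption is the same, this should collapse to a single line
> of zeros followed by "*".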
>
>
> No hurry as I'm just making sure symptoms in Octopus are
the same...
>
>
> Thanks,
>
> Igor
>
> On 5/20/2020 5:24 PM, Igor Fedotov wrote:
>> Chris,
>>
>> got them, thanks!
>>
>> Investigating....
>>
>>
>> Thanks,
>>
>> Igor
>>
>> On 5/20/2020 5:23 PM, Chris Palmer wrote:
>>> Hi Igor
>>> I've sent you these directly as they're a bit chunky.
Let me know if
>>> you haven't got them.
>>> Thx, Chris
>>>
>>> On 20/05/2020 14:43, Igor Fedotov wrote:
>>>> Hi Chris,
>>>>
>>>> could you please share the full log prior to the
first failure?
>>>>
>>>> Also if possible please set debug-bluestore/debug-bluefs to 20 and
>>>> collect another one for a failed OSD startup.
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Igor
>>>>
>>>>
>>>> On 5/20/2020 4:39 PM, Chris Palmer wrote:
>>>>> I'm getting similar errors after rebooting a node.
Cluster was
>>>>> upgraded 15.2.1 -> 15.2.2 yesterday. No problems
after rebooting
>>>>> during upgrade.
>>>>>
>>>>> On the node I just rebooted, 2/4 OSDs won't restart.
Similar logs
>>>>> from both. Logs from one below.
>>>>> Neither OSD has compression enabled, although there is a
>>>>> compression-related error in the log.
>>>>> Both are replicated x3. One has data on HDD and a separate
>>>>> WAL/DB on an NVMe partition; the other has everything on an
>>>>> NVMe partition only.
>>>>>
>>>>> Feeling kinda nervous here - advice welcomed!!
>>>>>
>>>>> Thx, Chris
>>>>>
>>>>>
>>>>>
>>>>> 2020-05-20T13:14:00.837+0100 7f2e0d273700 3 rocksdb:
>>>>> [table/block_based_table_reader.cc:1117] Encountered
error while
>>>>> reading data from compression dictionary block
Corruption: block
>>>>> checksum mismatch: expected 0, got 3423870535 in
db/000304.sst
>>>>> offset 18446744073709551615 size 18446744073709551615
>>>>> 2020-05-20T13:14:00.841+0100 7f2e1957ee00 4 rocksdb:
>>>>> [db/version_set.cc:3757] Recovered from manifest
>>>>> file:db/MANIFEST-000312
succeeded,manifest_file_number is 312,
>>>>> next_file_number is 314, last_sequence is 22320582,
log_number is
>>>>> 309,prev_log_number is 0,max_column_family is
>>>>> 0,min_log_number_to_keep is 0
>>>>>
>>>>> 2020-05-20T13:14:00.841+0100 7f2e1957ee00 4 rocksdb:
>>>>> [db/version_set.cc:3766] Column family [default] (ID
0), log
>>>>> number is 309
>>>>>
>>>>> 2020-05-20T13:14:00.841+0100 7f2e1957ee00 4
rocksdb: EVENT_LOG_v1
>>>>> {"time_micros": 1589976840843199, "job": 1, "event":
>>>>> "recovery_started", "log_files": [313]}
>>>>> 2020-05-20T13:14:00.841+0100 7f2e1957ee00 4 rocksdb:
>>>>> [db/db_impl_open.cc:583] Recovering log #313 mode 0
>>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 3 rocksdb:
>>>>> [db/db_impl_open.cc:518] db.wal/000313.log: dropping
9044 bytes;
>>>>> Corruption: error in middle of record
>>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 3 rocksdb:
>>>>> [db/db_impl_open.cc:518] db.wal/000313.log: dropping
86 bytes;
>>>>> Corruption: missing start of fragmented record(2)
>>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 4 rocksdb:
>>>>> [db/db_impl.cc:390] Shutdown: canceling all
background work
>>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 4 rocksdb:
>>>>> [db/db_impl.cc:563] Shutdown complete
>>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 -1
rocksdb: Corruption:
>>>>> error in middle of record
>>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 -1
>>>>> bluestore(/var/lib/ceph/osd/ceph-9) _open_db
erroring opening db:
>>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 1 bluefs
umount
>>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 1
fbmap_alloc
>>>>> 0x55daf2b3a900 shutdown
>>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 1
bdev(0x55daf3838700
>>>>> /var/lib/ceph/osd/ceph-9/block) close
>>>>> 2020-05-20T13:14:01.093+0100 7f2e1957ee00 1
bdev(0x55daf3838000
>>>>> /var/lib/ceph/osd/ceph-9/block) close
>>>>> 2020-05-20T13:14:01.341+0100 7f2e1957ee00 -1 osd.9 0
OSD:init:
>>>>> unable to mount object store
>>>>> 2020-05-20T13:14:01.341+0100 7f2e1957ee00 -1  ** ERROR:
>>>>> osd init failed: (5) Input/output error
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx