Re: 15.2.2 Upgrade - Corruption: error in middle of record

Igor Fedotov <ifedotov@xxxxxxx> · Wed, 20 May 2020 15:48:58 +0300

Do you still have any original failure logs?
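
If the startup log does get captured again, the WAL recovery lines can be tallied quickly before posting it. Below is a minimal sketch, not an official Ceph or RocksDB tool. It assumes one log entry per line (unwrapped, as the journal emits them) in the same "dropping N bytes; Corruption: ..." shape as the lines quoted further down in this thread. The function name `summarize_wal_drops` is made up for illustration.

```python
import re
from collections import Counter

# Matches RocksDB WAL-recovery lines of the form seen in this thread, e.g.:
#   ... rocksdb: [db/db_impl_open.cc:518] db/002769.log: dropping 537526 bytes; Corruption: error in middle of record
# Assumption: each log entry sits on a single line (no mail-client wrapping).
DROP_RE = re.compile(r"(\S+\.log): dropping (\d+) bytes; Corruption: (.+?)\s*$")


def summarize_wal_drops(lines):
    """Return (bytes dropped per WAL file, number of lines per corruption reason)."""
    dropped = Counter()
    reasons = Counter()
    for line in lines:
        match = DROP_RE.search(line)
        if match is None:
            continue  # not a "dropping N bytes" line
        wal_file = match.group(1)
        dropped[wal_file] += int(match.group(2))
        reasons[match.group(3)] += 1
    return dropped, reasons
```

The function accepts any iterable of lines, so an open log file or `sys.stdin` can be passed to it directly. It only summarizes what RocksDB already reported and does not attempt any repair.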

On 5/20/2020 3:45 PM, Ashley Merrick wrote:
It's a single shared main device.

Sadly I had already rebuilt the failed OSDs to bring me back into the
green after a while.
I have just tried a few restarts and none are failing (seems after a 
rebuild using 15.2.2 they are stable?)

I don't have any other servers/OSDs I am willing to risk not starting
right this minute; if it does happen again I will grab the logs.

*@Dan* yeah, it is using lz4

Thanks
---- On Wed, 20 May 2020 20:30:27 +0800 *Igor Fedotov 
<ifedotov@xxxxxxx>* wrote ----

    I don't believe compression is related to be honest.

    Wondering if these OSDs have standalone WAL and/or DB devices or
    just a single shared main device.

    Also could you please set debug-bluefs/debug-bluestore to 20 and
    collect the startup log for a broken OSD.

    Kind regards,

    Igor

    On 5/20/2020 3:27 PM, Ashley Merrick wrote:

        Thanks. FYI, the OSDs that went down back two pools: an
        erasure-code meta pool (RBD) and a CephFS meta pool. The CephFS
        pool does have compression enabled (I noticed it mentioned in the
        Ceph tracker).

        Thanks

        ---- On Wed, 20 May 2020 20:17:33 +0800 *Igor Fedotov
        <ifedotov@xxxxxxx>* wrote ----

            Hi Ashley,

            looks like this is a regression. Neha observed similar
            error(s) during her QA run, see
            https://tracker.ceph.com/issues/45613

            Please preserve broken OSDs for a while if possible,
            likely I'll come
            back to you for more information to troubleshoot.

            Thanks,

            Igor

            On 5/20/2020 1:26 PM, Ashley Merrick wrote:

            > So reading online it looked like a dead-end error, so I
            recreated the 3 OSDs on that node and they are now working fine
            after a reboot.
            >
            >
            >
            > However, I restarted the next server with 3 OSDs and one
            of them is now facing the same issue.
            >
            >
            >
            > Let me know if you need any more logs.
            >
            >
            >
            > Thanks
            >
            >
            >
            > ---- On Wed, 20 May 2020 17:02:31 +0800 Ashley Merrick
            <singapore@xxxxxxxxxxxxxx> wrote ----
            >
            >
            > I just upgraded a cephadm cluster from 15.2.1 to 15.2.2.
            >
            >
            >
            > Everything went fine on the upgrade; however, after
            restarting one node that has 3 OSDs for ecmeta, two of the
            3 OSDs now won't boot with the following error:
            >
            >
            >
            > May 20 08:29:42 sn-m01 bash[6833]: debug
            2020-05-20T08:29:42.598+0000 7fbcc46f7ec0  4 rocksdb:
            [db/version_set.cc:3757] Recovered from manifest
            file:db/MANIFEST-002768 succeeded,manifest_file_number is
            2768, next_file_number is 2775, last_sequence is
            188026749, log_number is 2767,prev_log_number is
            0,max_column_family is 0,min_log_number_to_keep is 0
            >
            > May 20 08:29:42 sn-m01 bash[6833]: debug
            2020-05-20T08:29:42.598+0000 7fbcc46f7ec0  4 rocksdb:
            [db/version_set.cc:3766] Column family [default] (ID 0),
            log number is 2767
            >
            > May 20 08:29:42 sn-m01 bash[6833]: debug
            2020-05-20T08:29:42.598+0000 7fbcc46f7ec0  4 rocksdb:
            EVENT_LOG_v1 {"time_micros": 1589963382599157, "job": 1,
            "event": "recovery_started", "log_files": [2769]}
            >
            > May 20 08:29:42 sn-m01 bash[6833]: debug
            2020-05-20T08:29:42.598+0000 7fbcc46f7ec0  4 rocksdb:
            [db/db_impl_open.cc:583] Recovering log #2769 mode 0
            >
            > May 20 08:29:42 sn-m01 bash[6833]: debug
            2020-05-20T08:29:42.598+0000 7fbcc46f7ec0  3 rocksdb:
            [db/db_impl_open.cc:518] db/002769.log: dropping 537526
            bytes; Corruption: error in middle of record
            >
            > May 20 08:29:42 sn-m01 bash[6833]: debug
            2020-05-20T08:29:42.598+0000 7fbcc46f7ec0  3 rocksdb:
            [db/db_impl_open.cc:518] db/002769.log: dropping 32757
            bytes; Corruption: missing start of fragmented record(1)
            >
            > May 20 08:29:42 sn-m01 bash[6833]: debug
            2020-05-20T08:29:42.602+0000 7fbcc46f7ec0  3 rocksdb:
            [db/db_impl_open.cc:518] db/002769.log: dropping 32757
            bytes; Corruption: missing start of fragmented record(1)
            >
            > May 20 08:29:42 sn-m01 bash[6833]: debug
            2020-05-20T08:29:42.602+0000 7fbcc46f7ec0  3 rocksdb:
            [db/db_impl_open.cc:518] db/002769.log: dropping 32757
            bytes; Corruption: missing start of fragmented record(1)
            >
            > May 20 08:29:42 sn-m01 bash[6833]: debug
            2020-05-20T08:29:42.602+0000 7fbcc46f7ec0  3 rocksdb:
            [db/db_impl_open.cc:518] db/002769.log: dropping 32757
            bytes; Corruption: missing start of fragmented record(1)
            >
            > May 20 08:29:42 sn-m01 bash[6833]: debug
            2020-05-20T08:29:42.602+0000 7fbcc46f7ec0  3 rocksdb:
            [db/db_impl_open.cc:518] db/002769.log: dropping 32757
            bytes; Corruption: missing start of fragmented record(1)
            >
            > May 20 08:29:42 sn-m01 bash[6833]: debug
            2020-05-20T08:29:42.602+0000 7fbcc46f7ec0  3 rocksdb:
            [db/db_impl_open.cc:518] db/002769.log: dropping 32757
            bytes; Corruption: missing start of fragmented record(1)
            >
            > May 20 08:29:42 sn-m01 bash[6833]: debug
            2020-05-20T08:29:42.602+0000 7fbcc46f7ec0  3 rocksdb:
            [db/db_impl_open.cc:518] db/002769.log: dropping 23263
            bytes; Corruption: missing start of fragmented record(2)
            >
            > May 20 08:29:42 sn-m01 bash[6833]: debug
            2020-05-20T08:29:42.602+0000 7fbcc46f7ec0  4 rocksdb:
            [db/db_impl.cc:390] Shutdown: canceling all background work
            >
            > May 20 08:29:42 sn-m01 bash[6833]: debug
            2020-05-20T08:29:42.602+0000 7fbcc46f7ec0  4 rocksdb:
            [db/db_impl.cc:563] Shutdown complete
            >
            > May 20 08:29:42 sn-m01 bash[6833]: debug
            2020-05-20T08:29:42.602+0000 7fbcc46f7ec0 -1 rocksdb:
            Corruption: error in middle of record
            >
            > May 20 08:29:42 sn-m01 bash[6833]: debug
            2020-05-20T08:29:42.602+0000 7fbcc46f7ec0 -1
            bluestore(/var/lib/ceph/osd/ceph-0) _open_db erroring
            opening db:
            >
            > May 20 08:29:42 sn-m01 bash[6833]: debug
            2020-05-20T08:29:42.602+0000 7fbcc46f7ec0  1
            bdev(0x558a28dd0700 /var/lib/ceph/osd/ceph-0/block) close
            >
            > May 20 08:29:42 sn-m01 bash[6833]: debug
            2020-05-20T08:29:42.870+0000 7fbcc46f7ec0  1
            bdev(0x558a28dd0000 /var/lib/ceph/osd/ceph-0/block) close
            >
            > May 20 08:29:43 sn-m01 bash[6833]: debug
            2020-05-20T08:29:43.118+0000 7fbcc46f7ec0 -1 osd.0 0
            OSD:init: unable to mount object store
            >
            > May 20 08:29:43 sn-m01 bash[6833]: debug
            2020-05-20T08:29:43.118+0000 7fbcc46f7ec0 -1  ** ERROR:
            osd init failed: (5) Input/output error
            >
            >
            >
            > Have I hit a bug, or is there something I can do to try
            and fix these OSDs?
            >
            >
            >
            > Thanks
            > _______________________________________________
            > ceph-users mailing list -- ceph-users@xxxxxxx
            > To unsubscribe send an email to ceph-users-leave@xxxxxxx
