Hi Ashley
The command to reset the flag for ALL OSDs is:
ceph config set osd bluefs_preextend_wal_files false
And for just an individual OSD:
ceph config set osd.5 bluefs_preextend_wal_files false
And to remove it from an individual one (so you just have the global one
left):
ceph config rm osd.5 bluefs_preextend_wal_files
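To sanity-check what an OSD will actually pick up before touching anything,
something like the following should work (adjust the OSD id to suit):
ceph config get osd.5 bluefs_preextend_wal_files
ceph config show osd.5 bluefs_preextend_wal_files
ceph config dump | grep bluefs_preextend_wal_files
The first reads the config database, the second reports what the running
daemon sees (via the mgr), and the dump lists any global or per-OSD
overrides in one go.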
BUT: I can't stress enough how important it is to take down only ONE OSD
AT A TIME, and not to take any others down until that one is properly
back up (replaced and backfilled if necessary). *Rebooting nodes without
doing this may very well cause irretrievable data loss, no matter how
long it has been since you reset that parameter.* This all seems to have
worked for me, but you should get expert advice.
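In case it's useful, a per-OSD restart cycle might look roughly like this
(just a sketch, not a recommendation - whether you set noout, and the exact
systemd unit name, depend on your deployment):
ceph osd set noout
systemctl restart ceph-osd@5
ceph -s   # wait for HEALTH_OK / all PGs active+clean before the next OSD
ceph osd unset noout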
Regards, Chris
On 23/05/2020 16:32, Ashley Merrick wrote:
Hello,
Great news! Can you confirm the exact command you used to inject the
value so I can replicate your exact steps?
I will do that and then leave it a good couple of days before trying a
reboot, to make sure the WAL is completely flushed.
Thanks
Ashley
---- On Sat, 23 May 2020 23:20:45 +0800 *chris.palmer@xxxxxxxxx* wrote ----
Status update:
We seem to have success. I followed the steps below. Only one more OSD
(on node3) failed to restart, showing the same WAL corruption messages.
After replacing that & backfilling I could then restart it. So we have a
healthy cluster with restartable OSDs again, with
bluefs_preextend_wal_files=false until it's deemed safe to re-enable it.
Many thanks Igor!
Regards, Chris
On 23/05/2020 11:06, Chris Palmer wrote:
> Hi Ashley
>
> Setting bluefs_preextend_wal_files to false should stop any further
> corruption of the WAL (subject to the small risk of doing this while
> the OSD is active). Over time WAL blocks will be recycled and
> overwritten with new good blocks, so the extent of the corruption may
> decrease or even disappear entirely. However you can't tell whether
> this has happened. But leaving each OSD running for a while may
> decrease the chances of having to recreate it.
>
> Having tried changing the parameter on one, then another, I've taken
> the risk of resetting it on all (running) OSDs, and nothing untoward
> seems to have happened. I have removed and recreated both failed OSDs
> (both on the node that was rebooted). They are in different crush
> device classes, so I know that they are used by discrete sets of pgs.
> osd.9 has been recreated, backfilled, and stopped/started without
> issue. osd.2 has been recreated and is currently backfilling. When
> that has finished I will restart osd.2 and expect that the restart
> will not find any corruption.
>
> Following that I will cycle through all other OSDs, stopping and
> starting each in turn. If one fails to restart, I will replace it,
> wait until it backfills, then stop/start it.
>
> Do be aware that you can set the parameter globally (for all OSDs)
> and/or individually. I made sure the global setting was in place
> before creating new OSDs. (There might be other ways to achieve this
> on the command line when creating a new one.)
>
> Hope that's clear. But once again, please don't take this as advice on
> what you should do. That should come from the experts!
>
> Regards, Chris
>
> On 23/05/2020 10:03, Ashley Merrick wrote:
>> Hello Chris,
>>
>> Great to hear, a few questions.
>>
>> Once you have injected bluefs_preextend_wal_files to false, are
>> you just rebuilding the OSDs that failed? Or are you going through
>> and rebuilding every OSD, even the working ones?
>>
>> Or does setting the bluefs_preextend_wal_files value to false and
>> leaving the OSD running fix the WAL automatically?
>>
>> Thanks
>>
>>
>> ---- On Sat, 23 May 2020 15:53:42 +0800 *Chris Palmer <chris.palmer@xxxxxxxxx>* wrote ----
>>
>> Hi Ashley
>>
>> Igor has done a great job of tracking down the problem, and we
>> have finally shown evidence of the type of corruption it would
>> produce in one of my WALs. Our feeling at the moment is that the
>> problem can be worked around by setting bluefs_preextend_wal_files
>> to false on affected OSDs while they are running (but see below),
>> although Igor does note that there is a small risk in doing this.
>> I've agreed a plan of action based on this route, recreating the
>> failed OSDs, and then cycling through the others until all are
>> healthy. I've started this now, and so far it looks promising,
>> although of course I have to wait for recovery/rebalancing. This
>> is the fastest route to recovery, although there are other options.
>>
>> I'll post as it progresses. The good news seems to be that there
>> shouldn't be any actual data corruption or loss, providing that
>> this can be done before OSDs are taken down (other than as part of
>> this process). My understanding is that there will be some degree
>> of performance penalty until the root cause is fixed in the next
>> release and preextending can be turned back on. However it does
>> seem like I can get back to a stable/safe position without waiting
>> for a software release.
>>
>> I'm just working through this at the moment though, so please
>> don't take the above as any form of recommendation. It is
>> important not to try to restart OSDs in the meantime, though. I'm
>> sure Igor will publish some more expert recommendations in due
>> course...
>>
>> Regards, Chris
>>
>>
>> On 23/05/2020 06:54, Ashley Merrick wrote:
>>
>>
>> Thanks Igor,
>>
Do you have any idea of an ETA or plan for people that are
running 15.2.2 to be able to patch / fix the issue?

I had a read of the ticket, and it seems the corruption is
happening but the WAL is not read until OSD restart, so I
imagine we will need some form of fix / patch we can apply to
a running OSD before we then restart the OSD, as a normal OSD
upgrade will require the OSD to restart to apply the code,
resulting in a corrupt OSD.
>>
>> Thanks
>>
>>
>> ---- On Sat, 23 May 2020 00:12:59 +0800 *Igor Fedotov <ifedotov@xxxxxxx>* wrote ----
>>
>> Status update:
>>
>> Finally we have the first patch to fix the issue in master:
>> https://github.com/ceph/ceph/pull/35201
>>
>> And the ticket has been updated with root cause analysis:
>> https://tracker.ceph.com/issues/45613
>>
>> On 5/21/2020 2:07 PM, Igor Fedotov wrote:
>>
>> @Chris - unfortunately it looks like the corruption is permanent,
>> since valid WAL data have presumably been overwritten with other
>> data. Hence I don't know any way to recover - perhaps you can try
>> cutting the WAL file off, which will allow the OSD to start, with
>> some of the latest ops lost. One can use the exported BlueFS as a
>> drop-in replacement for the regular DB volume, but I'm not aware of
>> the details.
>>
>> And the above are just speculations - I can't say for sure if it
>> helps...
>>
>> I can't explain why the WAL doesn't have a zero block in your case,
>> though. There is a small chance this is a different issue. Just in
>> case - could you please search for 32K zero blocks over the whole
>> file? And do the same for another OSD?
>>
>>
>> Thanks,
>>
>> Igor
>>
>> > Short update on the issue:
>> >
>> > Finally we're able to reproduce the issue in master (not
>> > octopus), investigating further...
>> >
>> > @Chris - to make sure you're facing the same issue, could you
>> > please check the content of the broken file. To do so:
>> >
>> > 1) Run "ceph-bluestore-tool --path <path-to-osd> --out-dir
>> > <target dir> --command bluefs-export"
>> >
>> > This will export the bluefs files to <target dir>.
>> >
>> > 2) Check the content of file db.wal/002040.log at offset 0x470000.
>> >
>> > This will presumably contain 32K of zero bytes. Is this
>> > the case?
>> >
>> >
>> > No hurry as I'm just making sure symptoms in Octopus are
>> > the same...
>> >
>> >
>> > Thanks,
>> >
>> > Igor
>> >
>> > On 5/20/2020 5:24 PM, Igor Fedotov wrote:
>> >> Chris,
>> >>
>> >> got them, thanks!
>> >>
>> >> Investigating....
>> >>
>> >>
>> >> Thanks,
>> >>
>> >> Igor
>> >>
>> >> On 5/20/2020 5:23 PM, Chris Palmer wrote:
>> >>> Hi Igor
>> >>> I've sent you these directly as they're a bit chunky.
>> >>> Let me know if you haven't got them.
>> >>> Thx, Chris
>> >>>
>> >>> On 20/05/2020 14:43, Igor Fedotov wrote:
>> >>>> Hi Chris,
>> >>>>
>> >>>> could you please share the full log prior to the first failure?
>> >>>>
>> >>>> Also, if possible, please set debug-bluestore / debug-bluefs to
>> >>>> 20 and collect another one for a failed OSD startup.
>> >>>>
>> >>>>
>> >>>> Thanks,
>> >>>>
>> >>>> Igor
>> >>>>
>> >>>>
>> >>>> On 5/20/2020 4:39 PM, Chris Palmer wrote:
>> >>>>> I'm getting similar errors after rebooting a node. The cluster
>> >>>>> was upgraded 15.2.1 -> 15.2.2 yesterday. No problems after
>> >>>>> rebooting during the upgrade.
>> >>>>>
>> >>>>> On the node I just rebooted, 2/4 OSDs won't restart. Similar
>> >>>>> logs from both. Logs from one below.
>> >>>>> Neither OSD has compression enabled, although there is a
>> >>>>> compression-related error in the log.
>> >>>>> Both are replicated x3. One has data on HDD with separate
>> >>>>> WAL/DB on an NVMe partition, the other has everything on an
>> >>>>> NVMe partition only.
>> >>>>>
>> >>>>> Feeling kinda nervous here - advice welcomed!!
>> >>>>>
>> >>>>> Thx, Chris
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> 2020-05-20T13:14:00.837+0100 7f2e0d273700 3 rocksdb: [table/block_based_table_reader.cc:1117] Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch: expected 0, got 3423870535 in db/000304.sst offset 18446744073709551615 size 18446744073709551615
>> >>>>> 2020-05-20T13:14:00.841+0100 7f2e1957ee00 4 rocksdb: [db/version_set.cc:3757] Recovered from manifest file:db/MANIFEST-000312 succeeded,manifest_file_number is 312, next_file_number is 314, last_sequence is 22320582, log_number is 309,prev_log_number is 0,max_column_family is 0,min_log_number_to_keep is 0
>> >>>>>
>> >>>>> 2020-05-20T13:14:00.841+0100 7f2e1957ee00 4 rocksdb: [db/version_set.cc:3766] Column family [default] (ID 0), log number is 309
>> >>>>>
>> >>>>> 2020-05-20T13:14:00.841+0100 7f2e1957ee00 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1589976840843199, "job": 1, "event": "recovery_started", "log_files": [313]}
>> >>>>> 2020-05-20T13:14:00.841+0100 7f2e1957ee00 4 rocksdb: [db/db_impl_open.cc:583] Recovering log #313 mode 0
>> >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 3 rocksdb: [db/db_impl_open.cc:518] db.wal/000313.log: dropping 9044 bytes; Corruption: error in middle of record
>> >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 3 rocksdb: [db/db_impl_open.cc:518] db.wal/000313.log: dropping 86 bytes; Corruption: missing start of fragmented record(2)
>> >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 4 rocksdb: [db/db_impl.cc:390] Shutdown: canceling all background work
>> >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 4 rocksdb: [db/db_impl.cc:563] Shutdown complete
>> >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 -1 rocksdb: Corruption: error in middle of record
>> >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 -1 bluestore(/var/lib/ceph/osd/ceph-9) _open_db erroring opening db:
>> >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 1 bluefs umount
>> >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 1 fbmap_alloc 0x55daf2b3a900 shutdown
>> >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 1 bdev(0x55daf3838700 /var/lib/ceph/osd/ceph-9/block) close
>> >>>>> 2020-05-20T13:14:01.093+0100 7f2e1957ee00 1 bdev(0x55daf3838000 /var/lib/ceph/osd/ceph-9/block) close
>> >>>>> 2020-05-20T13:14:01.341+0100 7f2e1957ee00 -1 osd.9 0 OSD:init: unable to mount object store
>> >>>>> 2020-05-20T13:14:01.341+0100 7f2e1957ee00 -1 ** ERROR: osd init failed: (5) Input/output error
>> >>>
>>
>>
>>
>>
>>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx