Re: 15.2.2 Upgrade - Corruption: error in middle of record


 



Hello,

Great news. Can you confirm the exact command you used to inject the value, so I can replicate your exact steps? I will do that and then leave it a good couple of days before trying a reboot, to make sure the WAL is completely flushed.

Thanks
Ashley

---- On Sat, 23 May 2020 23:20:45 +0800 chris.palmer@xxxxxxxxx wrote ----

Status update:

We seem to have success. I followed the steps below. Only one more OSD 
(on node3) failed to restart, showing the same WAL corruption messages. 
After replacing that & backfilling I could then restart it. So we have a 
healthy cluster with restartable OSDs again, with 
bluefs_preextend_wal_files=false until it's deemed safe to re-enable it.
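
For reference, one way to confirm the value afterwards is something 
like the following (osd.9 is just an example id; exact command forms 
may vary by release):

ceph config get osd.9 bluefs_preextend_wal_files
ceph tell osd.9 config get bluefs_preextend_wal_files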

Many thanks Igor!

Regards, Chris

On 23/05/2020 11:06, Chris Palmer wrote:
> Hi Ashley
>
> Setting bluefs_preextend_wal_files to false should stop any further 
> corruption of the WAL (subject to the small risk of doing this while 
> the OSD is active). Over time WAL blocks will be recycled and 
> overwritten with new good blocks, so the extent of the corruption may 
> decrease or even disappear entirely. However, you can't tell whether 
> this has happened, so leaving each OSD running for a while may still 
> reduce the chances of having to recreate it.
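>
> For example, something along these lines should change it on a 
> running OSD (illustrative only - not necessarily the exact command 
> anyone here used):
>
> ceph tell osd.2 injectargs '--bluefs_preextend_wal_files=false'
> # or for every running OSD at once:
> ceph tell osd.* injectargs '--bluefs_preextend_wal_files=false'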
>
> Having tried changing the parameter on one, then another, I've taken 
> the risk of resetting it on all (running) OSDs, and nothing untoward 
> seems to have happened. I have removed and recreated both failed OSDs 
> (both on the node that was rebooted). They are in different crush 
> device classes so I know that they are used by discrete sets of pgs. 
> osd.9 has been recreated, backfilled, and stopped/started without 
> issue. osd.2 has been recreated and is currently backfilling. When 
> that has finished I will restart osd.2 and expect that the restart 
> will not find any corruption.
>
> Following that I will cycle through all other OSDs, stopping and 
> starting each in turn. If one fails to restart, I will replace it, 
> wait until it backfills, then stop/start it.
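>
> As an illustration only (osd.5 is just an example id, and these are 
> not necessarily the exact commands I'm using), one such cycle might 
> look roughly like:
>
> ceph osd set noout
> systemctl restart ceph-osd@5
> ceph -s     # wait until HEALTH_OK / all PGs active+clean again
> ceph osd unset noout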
>
> Do be aware that you can set the parameter globally (for all OSDs) 
> and/or individually. I made sure the global setting was in place 
> before creating new OSDs. (There might be other ways to achieve this 
> on the command line for creating a new one).
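>
> For example (osd.9 here just as an illustration):
>
> ceph config set osd   bluefs_preextend_wal_files false   # all OSDs
> ceph config set osd.9 bluefs_preextend_wal_files false   # one OSD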
>
> Hope that's clear. But once again, please don't take this as advice on 
> what you should do. That should come from the experts!
>
> Regards, Chris
>
> On 23/05/2020 10:03, Ashley Merrick wrote:
>> Hello Chris,
>>
>> Great to hear, few questions.
>>
>> Once you have injected bluefs_preextend_wal_files to false, are 
>> you just rebuilding the OSDs that failed? Or are you going through 
>> and rebuilding every OSD, even the working ones?
>>
>> Or does setting the bluefs_preextend_wal_files value to false and 
>> leaving the OSD running fix the WAL automatically?
>>
>> Thanks
>>
>>
>> ---- On Sat, 23 May 2020 15:53:42 +0800 Chris Palmer 
>> <chris.palmer@xxxxxxxxx> wrote ----
>>
>>     Hi Ashley
>>
>>     Igor has done a great job of tracking down the problem, and we
>>     have finally shown evidence of the type of corruption it would
>>     produce in one of my WALs. Our feeling at the moment is that the
>>     problem can be worked around by setting bluefs_preextend_wal_files to
>>     false on affected OSDs while they are running (but see below),
>>     although Igor does note that there is a small risk in doing this.
>>     I've agreed a plan of action based on this route, recreating the
>>     failed OSDs, and then cycling through the others until all are
>>     healthy. I've started this now, and so far it looks promising,
>>     although of course I have to wait for recovery/rebalancing. This
>>     is the fastest route to recovery, although there are other options.
>>
>>     I'll post as it progresses. The good news seems to be that there
>>     shouldn't be any actual data corruption or loss, providing that
>>     this can be done before OSDs are taken down (other than as part of
>>     this process). My understanding is that there will be some degree of
>>     performance penalty until the root cause is fixed in the next
>>     release and preextending can be turned back on. However it does
>>     seem like I can get back to a stable/safe position without waiting
>>     for a software release.
>>
>>     I'm just working through this at the moment though, so please
>>     don't take the above as any form of recommendation. It is
>>     important not to restart OSDs in the meantime, though. I'm
>>     sure Igor will publish some more expert recommendations in due
>>     course...
>>
>>     Regards, Chris
>>
>>
>>     On 23/05/2020 06:54, Ashley Merrick wrote:
>>
>>
>>         Thanks Igor,
>>
>>         Do you have any idea of an ETA or plan for people that are
>>         running 15.2.2 to be able to patch / fix the issue?
>>
>>         I had a read of the ticket and it seems the corruption is
>>         happening but the WAL is not read till OSD restart, so I
>>         imagine we will need some form of fix / patch we can apply to
>>         a running OSD before we then restart the OSD, as a normal OSD
>>         upgrade will require the OSD to restart to apply the code
>>         resulting in a corrupt OSD.
>>
>>         Thanks
>>
>>
>>         ---- On Sat, 23 May 2020 00:12:59 +0800 Igor Fedotov
>>         <ifedotov@xxxxxxx> wrote ----
>>
>>             Status update:
>>
>>             Finally we have the first patch to fix the issue in master:
>>             https://github.com/ceph/ceph/pull/35201
>>
>>             And the ticket has been updated with a root cause analysis:
>>             https://tracker.ceph.com/issues/45613
>>
>>             On 5/21/2020 2:07 PM, Igor Fedotov wrote:
>>
>>             @Chris - unfortunately it looks like the corruption is
>>             permanent, since valid WAL data have presumably been
>>             overwritten with other data. Hence I don't know any way to
>>             recover - perhaps you can try cutting the WAL file off,
>>             which should allow the OSD to start, with some of the
>>             latest ops lost. One can use the exported BlueFS as a
>>             drop-in replacement for a regular DB volume, but I'm not
>>             aware of the details.
>>
>>             And the above is just speculation - I can't say for sure
>>             if it helps...
>>
>>             I can't explain why the WAL doesn't have a zero block in
>>             your case, though. There is little chance this is a
>>             different issue. Just in case - could you please search
>>             for 32K zero blocks over the whole file? And the same for
>>             another OSD?
>>
>>
>>             Thanks,
>>
>>             Igor
>>
>>             > Short update on the issue:
>>             >
>>             > Finally we're able to reproduce the issue in master (not
>>             Octopus),
>>             > investigating further...
>>             >
>>             > @Chris - to make sure you're facing the same issue could
>>             you please
>>             > check the content of the broken file. To do so:
>>             >
>>             > 1) run "ceph-bluestore-tool --path <path-to-osd>
>>             --out-dir <target
>>             > dir> --command bluefs-export"
>>             >
>>             > This will export bluefs files to <target dir>
>>             >
>>             > 2) Check the content for file db.wal/002040.log at
>>             offset 0x470000
>>             >
>>             > This will presumably contain 32K of zero bytes. Is this
>>             the case?
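>>             >
>>             > For example, with xxd (assuming it is available; any
>>             > hex viewer will do):
>>             >
>>             > xxd -s 0x470000 -l 32768 <target dir>/db.wal/002040.log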
>>             >
>>             >
>>             > No hurry as I'm just making sure symptoms in Octopus are
>>             the same...
>>             >
>>             >
>>             > Thanks,
>>             >
>>             > Igor
>>             >
>>             > On 5/20/2020 5:24 PM, Igor Fedotov wrote:
>>             >> Chris,
>>             >>
>>             >> got them, thanks!
>>             >>
>>             >> Investigating....
>>             >>
>>             >>
>>             >> Thanks,
>>             >>
>>             >> Igor
>>             >>
>>             >> On 5/20/2020 5:23 PM, Chris Palmer wrote:
>>             >>> Hi Igor
>>             >>> I've sent you these directly as they're a bit chunky.
>>             Let me know if
>>             >>> you haven't got them.
>>             >>> Thx, Chris
>>             >>>
>>             >>> On 20/05/2020 14:43, Igor Fedotov wrote:
>>             >>>> Hi Chris,
>>             >>>>
>>             >>>> could you please share the full log prior to the
>>             first failure?
>>             >>>>
>>             >>>> Also if possible please set debug-bluestore/debug
>>             bluefs to 20 and
>>             >>>> collect another one for failed OSD startup.
>>             >>>>
>>             >>>>
>>             >>>> Thanks,
>>             >>>>
>>             >>>> Igor
>>             >>>>
>>             >>>>
>>             >>>> On 5/20/2020 4:39 PM, Chris Palmer wrote:
>>             >>>>> I'm getting similar errors after rebooting a node.
>>             Cluster was
>>             >>>>> upgraded 15.2.1 -> 15.2.2 yesterday. No problems
>>             after rebooting
>>             >>>>> during upgrade.
>>             >>>>>
>>             >>>>> On the node I just rebooted, 2/4 OSDs won't restart.
>>             Similar logs
>>             >>>>> from both. Logs from one below.
>>             >>>>> Neither OSD has compression enabled, although 
>>             there is a
>>             >>>>> compression-related error in the log.
>>             >>>>> Both are replicated x3. One has data on HDD &
>>             separate WAL/DB on
>>             >>>>> NVMe partition, the other is everything on NVMe
>>             partition only.
>>             >>>>>
>>             >>>>> Feeling kinda nervous here - advice welcomed!!
>>             >>>>>
>>             >>>>> Thx, Chris
>>             >>>>>
>>             >>>>>
>>             >>>>>
>>             >>>>> 2020-05-20T13:14:00.837+0100 7f2e0d273700  3 rocksdb:
>>             >>>>> [table/block_based_table_reader.cc:1117] Encountered
>>             error while
>>             >>>>> reading data from compression dictionary block
>>             Corruption: block
>>             >>>>> checksum mismatch: expected 0, got 3423870535  in
>>             db/000304.sst
>>             >>>>> offset 18446744073709551615 size 18446744073709551615
>>             >>>>> 2020-05-20T13:14:00.841+0100 7f2e1957ee00  4 rocksdb:
>>             >>>>> [db/version_set.cc:3757] Recovered from manifest
>>             >>>>> file:db/MANIFEST-000312
>>             succeeded,manifest_file_number is 312,
>>             >>>>> next_file_number is 314, last_sequence is 22320582,
>>             log_number is
>>             >>>>> 309,prev_log_number is 0,max_column_family is
>>             >>>>> 0,min_log_number_to_keep is 0
>>             >>>>>
>>             >>>>> 2020-05-20T13:14:00.841+0100 7f2e1957ee00  4 rocksdb:
>>             >>>>> [db/version_set.cc:3766] Column family [default] (ID
>>             0), log
>>             >>>>> number is 309
>>             >>>>>
>>             >>>>> 2020-05-20T13:14:00.841+0100 7f2e1957ee00  4
>>             rocksdb: EVENT_LOG_v1
>>             >>>>> {"time_micros": 1589976840843199, "job": 1, "event":
>>             >>>>> "recovery_started", "log_files": [313]}
>>             >>>>> 2020-05-20T13:14:00.841+0100 7f2e1957ee00  4 rocksdb:
>>             >>>>> [db/db_impl_open.cc:583] Recovering log #313 mode 0
>>             >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00  3 rocksdb:
>>             >>>>> [db/db_impl_open.cc:518] db.wal/000313.log: dropping
>>             9044 bytes;
>>             >>>>> Corruption: error in middle of record
>>             >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00  3 rocksdb:
>>             >>>>> [db/db_impl_open.cc:518] db.wal/000313.log: dropping
>>             86 bytes;
>>             >>>>> Corruption: missing start of fragmented record(2)
>>             >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00  4 rocksdb:
>>             >>>>> [db/db_impl.cc:390] Shutdown: canceling all
>>             background work
>>             >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00  4 rocksdb:
>>             >>>>> [db/db_impl.cc:563] Shutdown complete
>>             >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 -1
>>             rocksdb: Corruption:
>>             >>>>> error in middle of record
>>             >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 -1
>>             >>>>> bluestore(/var/lib/ceph/osd/ceph-9) _open_db
>>             erroring opening db:
>>             >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00  1 bluefs
>>             umount
>>             >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00  1
>>             fbmap_alloc
>>             >>>>> 0x55daf2b3a900 shutdown
>>             >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00  1
>>             bdev(0x55daf3838700
>>             >>>>> /var/lib/ceph/osd/ceph-9/block) close
>>             >>>>> 2020-05-20T13:14:01.093+0100 7f2e1957ee00  1
>>             bdev(0x55daf3838000
>>             >>>>> /var/lib/ceph/osd/ceph-9/block) close
>>             >>>>> 2020-05-20T13:14:01.341+0100 7f2e1957ee00 -1 osd.9 0
>>             OSD:init:
>>             >>>>> unable to mount object store
>>             >>>>> 2020-05-20T13:14:01.341+0100 7f2e1957ee00 -1
>>             ESC[0;31m ** ERROR:
>>             >>>>> osd init failed: (5) Input/output errorESC[0m
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx





