Hi Ashley
The command to reset the flag for ALL OSDs is:
ceph config set osd bluefs_preextend_wal_files false
And for just an individual OSD:
ceph config set osd.5 bluefs_preextend_wal_files false
And to remove it from an individual one (so you just have the global one
left):
ceph config rm osd.5 bluefs_preextend_wal_files
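To sanity-check what an OSD will actually pick up before touching anything,
something like the following should work (adjust the OSD id to suit):
ceph config get osd.5 bluefs_preextend_wal_files
ceph config show osd.5 bluefs_preextend_wal_files
ceph config dump | grep bluefs_preextend_wal_files
The first reads the config database, the second reports what the running
daemon sees (via the mgr), and the dump lists any global or per-OSD
overrides in one go.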
BUT: I can't stress enough how important it is to take down only ONE OSD
AT A TIME, and not to take any others down until that one is properly
back up (replaced and backfilled if necessary). *Rebooting nodes without
doing this may very well cause irretrievable data loss, no matter how
long it has been since you reset that parameter.* This all seems to have
worked for me, but you should get expert advice.
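In case it's useful, a per-OSD restart cycle might look roughly like this
(just a sketch, not a recommendation - whether you set noout, and the exact
systemd unit name, depend on your deployment):
ceph osd set noout
systemctl restart ceph-osd@5
ceph -s   # wait for HEALTH_OK / all PGs active+clean before the next OSD
ceph osd unset noout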
Regards, Chris
On 23/05/2020 16:32, Ashley Merrick wrote:
Hello,
Great news! Can you confirm the exact command you used to inject the
value so I can replicate your exact steps?
I will do that and then leave it a good couple of days before trying a
reboot, to make sure the WAL is completely flushed.
Thanks
Ashley
---- On Sat, 23 May 2020 23:20:45 +0800 *chris.palmer@xxxxxxxxx* wrote ----
Status update:
We seem to have success. I followed the steps below. Only one more OSD
(on node3) failed to restart, showing the same WAL corruption messages.
After replacing that & backfilling I could then restart it. So we have a
healthy cluster with restartable OSDs again, with
bluefs_preextend_wal_files=false until it's deemed safe to re-enable it.
Many thanks Igor!
Regards, Chris
On 23/05/2020 11:06, Chris Palmer wrote:
> Hi Ashley
>
> Setting bluefs_preextend_wal_files to false should stop any further
> corruption of the WAL (subject to the small risk of doing this while
> the OSD is active). Over time WAL blocks will be recycled and
> overwritten with new good blocks, so the extent of the corruption may
> decrease or even disappear entirely. However you can't tell whether
> this has happened. But leaving each OSD running for a while may
> decrease the chances of having to recreate it.
>
> Having tried changing the parameter on one, then another, I've taken
> the risk of resetting it on all (running) OSDs, and nothing untoward
> seems to have happened. I have removed and recreated both failed OSDs
> (both on the node that was rebooted). They are in different crush
> device classes, so I know that they are used by discrete sets of pgs.
> osd.9 has been recreated, backfilled, and stopped/started without
> issue. osd.2 has been recreated and is currently backfilling. When
> that has finished I will restart osd.2 and expect that the restart
> will not find any corruption.
>
> Following that I will cycle through all other OSDs, stopping and
> starting each in turn. If one fails to restart, I will replace it,
> wait until it backfills, then stop/start it.
>
> Do be aware that you can set the parameter globally (for all OSDs)
> and/or individually. I made sure the global setting was in place
> before creating new OSDs. (There might be other ways to achieve this
> on the command line when creating a new one.)
>
> Hope that's clear. But once again, please don't take this as advice on
> what you should do. That should come from the experts!
>
> Regards, Chris
>
> On 23/05/2020 10:03, Ashley Merrick wrote:
>> Hello Chris,
>>
>> Great to hear, a few questions.
>>
>> Once you have injected bluefs_preextend_wal_files to false, are
>> you just rebuilding the OSDs that failed? Or are you going through
>> and rebuilding every OSD, even the working ones?
>>
>> Or does setting the bluefs_preextend_wal_files value to false and
>> leaving the OSD running fix the WAL automatically?
>>
>> Thanks
>>
>>
>> ---- On Sat, 23 May 2020 15:53:42 +0800 *Chris Palmer <chris.palmer@xxxxxxxxx>* wrote ----
>>
>> Hi Ashley
>>
>> Igor has done a great job of tracking down the problem, and we
>> have finally shown evidence of the type of corruption it would
>> produce in one of my WALs. Our feeling at the moment is that the
>> problem can be worked around by setting bluefs_preextend_wal_files
>> to false on affected OSDs while they are running (but see below),
>> although Igor does note that there is a small risk in doing this.
>> I've agreed a plan of action based on this route, recreating the
>> failed OSDs, and then cycling through the others until all are
>> healthy. I've started this now, and so far it looks promising,
>> although of course I have to wait for recovery/rebalancing. This
>> is the fastest route to recovery, although there are other options.
>>
>> I'll post as it progresses. The good news seems to be that there
>> shouldn't be any actual data corruption or loss, providing that
>> this can be done before OSDs are taken down (other than as part of
>> this process). My understanding is that there will be some degree
>> of performance penalty until the root cause is fixed in the next
>> release and preextending can be turned back on. However it does
>> seem like I can get back to a stable/safe position without waiting
>> for a software release.
>>
>> I'm just working through this at the moment though, so please
>> don't take the above as any form of recommendation. It is
>> important not to try to restart OSDs in the meantime, though. I'm
>> sure Igor will publish some more expert recommendations in due
>> course...
>>
>> Regards, Chris
>>
>>
>> On 23/05/2020 06:54, Ashley Merrick wrote:
>>
>>
>> Thanks Igor,
>>
Do you have any idea of an ETA or plan for people that are
running 15.2.2 to be able to patch / fix the issue?

I had a read of the ticket, and it seems the corruption is
happening but the WAL is not read until OSD restart, so I
imagine we will need some form of fix / patch we can apply to
a running OSD before we then restart the OSD, as a normal OSD
upgrade will require the OSD to restart to apply the code,
resulting in a corrupt OSD.
>>
>> Thanks
>>
>>
>> ---- On Sat, 23 May 2020 00:12:59 +0800 *Igor Fedotov <ifedotov@xxxxxxx>* wrote ----
>>
>> Status update:
>>
>> Finally we have the first patch to fix the issue in master:
>> https://github.com/ceph/ceph/pull/35201
>>
>> And the ticket has been updated with root cause analysis:
>> https://tracker.ceph.com/issues/45613
>>
>> On 5/21/2020 2:07 PM, Igor Fedotov wrote:
>>
>> @Chris - unfortunately it looks like the corruption is permanent,
>> since valid WAL data have presumably been overwritten with other
>> data. Hence I don't know any way to recover - perhaps you can try
>> cutting the WAL file off, which will allow the OSD to start, with
>> some of the latest ops lost. One can use the exported BlueFS as a
>> drop-in replacement for the regular DB volume, but I'm not aware of
>> the details.
>>
>> And the above are just speculations - I can't say for sure if it
>> helps...
>>
>> I can't explain why the WAL doesn't have a zero block in your case,
>> though. There is a small chance this is a different issue. Just in
>> case - could you please search for 32K zero blocks over the whole
>> file? And do the same for another OSD?
>>
>>
>> Thanks,
>>
>> Igor
>>
>> > Short update on the issue:
>> >
>> > Finally we're able to reproduce the issue in master (not
>> > octopus), investigating further...
>> >
>> > @Chris - to make sure you're facing the same issue, could you
>> > please check the content of the broken file. To do so:
>> >
>> > 1) Run "ceph-bluestore-tool --path <path-to-osd> --out-dir
>> > <target dir> --command bluefs-export"
>> >
>> > This will export the bluefs files to <target dir>.
>> >
>> > 2) Check the content of file db.wal/002040.log at offset 0x470000.
>> >
>> > This will presumably contain 32K of zero bytes. Is this
>> > the case?
>> >
>> >
>> > No hurry as I'm just making sure symptoms in Octopus are
>> > the same...
>> >
>> >
>> > Thanks,
>> >
>> > Igor
>> >
>> > On 5/20/2020 5:24 PM, Igor Fedotov wrote:
>> >> Chris,
>> >>
>> >> got them, thanks!
>> >>
>> >> Investigating....
>> >>
>> >>
>> >> Thanks,
>> >>
>> >> Igor
>> >>
>> >> On 5/20/2020 5:23 PM, Chris Palmer wrote:
>> >>> Hi Igor
>> >>> I've sent you these directly as they're a bit chunky.
>> >>> Let me know if you haven't got them.
>> >>> Thx, Chris
>> >>>
>> >>> On 20/05/2020 14:43, Igor Fedotov wrote:
>> >>>> Hi Chris,
>> >>>>
>> >>>> could you please share the full log prior to the first failure?
>> >>>>
>> >>>> Also, if possible, please set debug-bluestore / debug-bluefs to
>> >>>> 20 and collect another one for a failed OSD startup.
>> >>>>
>> >>>>
>> >>>> Thanks,
>> >>>>
>> >>>> Igor
>> >>>>
>> >>>>
>> >>>> On 5/20/2020 4:39 PM, Chris Palmer wrote:
>> >>>>> I'm getting similar errors after rebooting a node. The cluster
>> >>>>> was upgraded 15.2.1 -> 15.2.2 yesterday. No problems after
>> >>>>> rebooting during the upgrade.
>> >>>>>
>> >>>>> On the node I just rebooted, 2/4 OSDs won't restart. Similar
>> >>>>> logs from both. Logs from one below.
>> >>>>> Neither OSD has compression enabled, although there is a
>> >>>>> compression-related error in the log.
>> >>>>> Both are replicated x3. One has data on HDD with separate
>> >>>>> WAL/DB on an NVMe partition, the other has everything on an
>> >>>>> NVMe partition only.
>> >>>>>
>> >>>>> Feeling kinda nervous here - advice welcomed!!
>> >>>>>
>> >>>>> Thx, Chris
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> 2020-05-20T13:14:00.837+0100 7f2e0d273700 3 rocksdb: [table/block_based_table_reader.cc:1117] Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch: expected 0, got 3423870535 in db/000304.sst offset 18446744073709551615 size 18446744073709551615
>> >>>>> 2020-05-20T13:14:00.841+0100 7f2e1957ee00 4 rocksdb: [db/version_set.cc:3757] Recovered from manifest file:db/MANIFEST-000312 succeeded,manifest_file_number is 312, next_file_number is 314, last_sequence is 22320582, log_number is 309,prev_log_number is 0,max_column_family is 0,min_log_number_to_keep is 0
>> >>>>>
>> >>>>> 2020-05-20T13:14:00.841+0100 7f2e1957ee00 4 rocksdb: [db/version_set.cc:3766] Column family [default] (ID 0), log number is 309
>> >>>>>
>> >>>>> 2020-05-20T13:14:00.841+0100 7f2e1957ee00 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1589976840843199, "job": 1, "event": "recovery_started", "log_files": [313]}
>> >>>>> 2020-05-20T13:14:00.841+0100 7f2e1957ee00 4 rocksdb: [db/db_impl_open.cc:583] Recovering log #313 mode 0
>> >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 3 rocksdb: [db/db_impl_open.cc:518] db.wal/000313.log: dropping 9044 bytes; Corruption: error in middle of record
>> >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 3 rocksdb: [db/db_impl_open.cc:518] db.wal/000313.log: dropping 86 bytes; Corruption: missing start of fragmented record(2)
>> >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 4 rocksdb: [db/db_impl.cc:390] Shutdown: canceling all background work
>> >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 4 rocksdb: [db/db_impl.cc:563] Shutdown complete
>> >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 -1 rocksdb: Corruption: error in middle of record
>> >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 -1 bluestore(/var/lib/ceph/osd/ceph-9) _open_db erroring opening db:
>> >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 1 bluefs umount
>> >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 1 fbmap_alloc 0x55daf2b3a900 shutdown
>> >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 1 bdev(0x55daf3838700 /var/lib/ceph/osd/ceph-9/block) close
>> >>>>> 2020-05-20T13:14:01.093+0100 7f2e1957ee00 1 bdev(0x55daf3838000 /var/lib/ceph/osd/ceph-9/block) close
>> >>>>> 2020-05-20T13:14:01.341+0100 7f2e1957ee00 -1 osd.9 0 OSD:init: unable to mount object store
>> >>>>> 2020-05-20T13:14:01.341+0100 7f2e1957ee00 -1 ** ERROR: osd init failed: (5) Input/output error
>> >>>
>>
>>
>>
>>
>>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx