Hello,

Great news. Can you confirm the exact command you used to inject the value, so I can replicate your exact steps? I will do that and then leave it a good couple of days before trying a reboot, to make sure the WAL is completely flushed.

Thanks
Ashley

---- On Sat, 23 May 2020 23:20:45 +0800 chris.palmer@xxxxxxxxx wrote ----

Status update: We seem to have success. I followed the steps below. Only one more OSD (on node3) failed to restart, showing the same WAL corruption messages. After replacing that & backfilling I could then restart it. So we have a healthy cluster with restartable OSDs again, with bluefs_preextend_wal_files=false until it's deemed safe to re-enable it.

Many thanks Igor!

Regards, Chris

On 23/05/2020 11:06, Chris Palmer wrote:
> Hi Ashley
>
> Setting bluefs_preextend_wal_files to false should stop any further corruption of the WAL (subject to the small risk of doing this while the OSD is active). Over time WAL blocks will be recycled and overwritten with new good blocks, so the extent of the corruption may decrease or even be eliminated. However you can't tell whether this has happened. But leaving each OSD running for a while may decrease the chances of having to recreate it.
>
> Having tried changing the parameter on one OSD, then another, I've taken the risk of resetting it on all (running) OSDs, and nothing untoward seems to have happened. I have removed and recreated both failed OSDs (both on the node that was rebooted). They are in different crush device classes, so I know that they are used by discrete sets of pgs. osd.9 has been recreated, backfilled, and stopped/started without issue. osd.2 has been recreated and is currently backfilling. When that has finished I will restart osd.2 and expect that the restart will not find any corruption.
>
> Following that I will cycle through all other OSDs, stopping and starting each in turn. If one fails to restart, I will replace it, wait until it backfills, then stop/start it.
>
> Do be aware that you can set the parameter globally (for all OSDs) and/or individually. I made sure the global setting was in place before creating new OSDs. (There might be other ways to achieve this on the command line when creating a new one.)
>
> Hope that's clear. But once again, please don't take this as advice on what you should do. That should come from the experts!
>
> Regards, Chris
>
> On 23/05/2020 10:03, Ashley Merrick wrote:
>> Hello Chris,
>>
>> Great to hear. A few questions:
>>
>> Once you have injected bluefs_preextend_wal_files to false, are you just rebuilding the OSDs that failed? Or are you going through and rebuilding every OSD, even the working ones?
>>
>> Or does setting the bluefs_preextend_wal_files value to false and leaving the OSD running fix the WAL automatically?
>>
>> Thanks
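[Aside: a minimal sketch of how the global setting and the runtime injection discussed above might look on a recent Ceph release; these are not confirmed as the exact commands used in this thread, and syntax may vary:

    ceph config set osd bluefs_preextend_wal_files false              # persist for all OSDs in the mon config database
    ceph tell osd.* injectargs '--bluefs_preextend_wal_files=false'   # push into already-running OSDs without a restart
    ceph config get osd bluefs_preextend_wal_files                    # check the stored value

OSDs created afterwards should then pick the value up from the config database at first start.]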
>> ---- On Sat, 23 May 2020 15:53:42 +0800 *Chris Palmer <chris.palmer@xxxxxxxxx>* wrote ----
>>
>> Hi Ashley
>>
>> Igor has done a great job of tracking down the problem, and we have finally shown evidence of the type of corruption it would produce in one of my WALs. Our feeling at the moment is that the problem can be worked around by setting bluefs_preextend_wal_files to false on affected OSDs while they are running (but see below), although Igor does note that there is a small risk in doing this. I've agreed a plan of action based on this route: recreating the failed OSDs, and then cycling through the others until all are healthy. I've started this now, and so far it looks promising, although of course I have to wait for recovery/rebalancing. This is the fastest route to recovery, although there are other options.
>>
>> I'll post as it progresses. The good news seems to be that there shouldn't be any actual data corruption or loss, provided that this can be done before OSDs are taken down (other than as part of this process). My understanding is that there will be some degree of performance penalty until the root cause is fixed in the next release and preextending can be turned back on. However, it does seem like I can get back to a stable/safe position without waiting for a software release.
>>
>> I'm just working through this at the moment though, so please don't take the above as any form of recommendation. It is important not to try to restart OSDs in the meantime, though. I'm sure Igor will publish some more expert recommendations in due course...
>>
>> Regards, Chris
>>
>> On 23/05/2020 06:54, Ashley Merrick wrote:
>>
>> Thanks Igor,
>>
>> Do you have any idea of an ETA or plan for people that are running 15.2.2 to be able to patch/fix the issue?
>>
>> I had a read of the ticket, and it seems the corruption is happening but the WAL is not read until OSD restart, so I imagine we will need some form of fix/patch we can apply to a running OSD before we then restart it, as a normal OSD upgrade would require the OSD to restart to apply the code, resulting in a corrupt OSD.
>>
>> Thanks
>>
>> ---- On Sat, 23 May 2020 00:12:59 +0800 *Igor Fedotov <ifedotov@xxxxxxx>* wrote ----
>>
>> Status update:
>>
>> Finally we have the first patch to fix the issue in master: https://github.com/ceph/ceph/pull/35201
>>
>> And the ticket has been updated with a root cause analysis: https://tracker.ceph.com/issues/45613
>>
>> On 5/21/2020 2:07 PM, Igor Fedotov wrote:
>>
>> @Chris - unfortunately it looks like the corruption is permanent, since valid WAL data have presumably been overwritten with other data. Hence I don't know of any way to recover - perhaps you can try cutting the WAL file off, which should allow the OSD to start, with some of the latest ops lost. One can use the exported BlueFS as a drop-in replacement for the regular DB volume, but I'm not aware of the details.
>>
>> And the above are just speculations, I can't say for sure if it helps...
>>
>> I can't explain why the WAL doesn't have a zero block in your case, though. There is a small chance this is a different issue. Just in case - could you please search for 32K zero blocks over the whole file? And do the same for another OSD?
>>
>> Thanks,
>>
>> Igor
>>
>> > Short update on the issue:
>> >
>> > Finally we're able to reproduce the issue in master (not Octopus); investigating further...
>> >
>> > @Chris - to make sure you're facing the same issue, could you please check the content of the broken file. To do so:
>> >
>> > 1) Run "ceph-bluestore-tool --path <path-to-osd> --out-dir <target dir> --command bluefs-export"
>> >
>> > This will export the BlueFS files to <target dir>.
>> >
>> > 2) Check the content of file db.wal/002040.log at offset 0x470000
>> >
>> > This will presumably contain 32K of zero bytes. Is this the case?
>> >
>> > No hurry as I'm just making sure the symptoms in Octopus are the same...
>> >
>> > Thanks,
>> >
>> > Igor
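[Aside: a minimal sketch of one way to perform these checks on the exported file, assuming bash; the path is hypothetical (whatever was passed to --out-dir above) and these commands are not taken from the thread:

    WAL=/path/to/target-dir/db.wal/002040.log   # hypothetical: the directory passed to --out-dir above

    # Dump 32K starting at offset 0x470000; an all-zero run collapses to a single
    # "00000000 ..." row followed by "*" in hexdump -C output.
    dd if="$WAL" bs=1 skip=$((0x470000)) count=32768 2>/dev/null | hexdump -C | head

    # Scan the whole file for 32K-aligned blocks that are entirely zero.
    for ((off=0; off<$(stat -c%s "$WAL"); off+=32768)); do
        cmp -s <(dd if="$WAL" bs=32768 skip=$((off/32768)) count=1 2>/dev/null) \
               <(head -c 32768 /dev/zero) && printf 'all-zero 32K block at offset 0x%x\n' "$off"
    done
]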
>> > On 5/20/2020 5:24 PM, Igor Fedotov wrote:
>> >> Chris,
>> >>
>> >> got them, thanks!
>> >>
>> >> Investigating....
>> >>
>> >> Thanks,
>> >>
>> >> Igor
>> >>
>> >> On 5/20/2020 5:23 PM, Chris Palmer wrote:
>> >>> Hi Igor
>> >>> I've sent you these directly as they're a bit chunky. Let me know if you haven't got them.
>> >>> Thx, Chris
>> >>>
>> >>> On 20/05/2020 14:43, Igor Fedotov wrote:
>> >>>> Hi Chris,
>> >>>>
>> >>>> could you please share the full log prior to the first failure?
>> >>>>
>> >>>> Also, if possible, please set debug-bluestore/debug-bluefs to 20 and collect another log for the failed OSD startup.
>> >>>>
>> >>>> Thanks,
>> >>>>
>> >>>> Igor
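[Aside: a rough sketch of how those debug levels might be raised for an OSD that fails to start (injectargs only works against a daemon that is already running); osd.9 is just the id from the log below, and these are not commands quoted from the thread:

    # Persist higher debug levels so they apply on the next startup attempt.
    ceph config set osd.9 debug_bluestore 20/20
    ceph config set osd.9 debug_bluefs 20/20

    # Or equivalently in ceph.conf on that host:
    # [osd.9]
    #     debug bluestore = 20/20
    #     debug bluefs = 20/20
]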
>> >>>> On 5/20/2020 4:39 PM, Chris Palmer wrote:
>> >>>>> I'm getting similar errors after rebooting a node. The cluster was upgraded 15.2.1 -> 15.2.2 yesterday. No problems after rebooting during the upgrade.
>> >>>>>
>> >>>>> On the node I just rebooted, 2/4 OSDs won't restart. Similar logs from both; logs from one below.
>> >>>>> Neither OSD has compression enabled, although there is a compression-related error in the log.
>> >>>>> Both are replicated x3. One has data on HDD & separate WAL/DB on an NVMe partition; the other is everything on an NVMe partition only.
>> >>>>>
>> >>>>> Feeling kinda nervous here - advice welcomed!!
>> >>>>>
>> >>>>> Thx, Chris
>> >>>>>
>> >>>>> 2020-05-20T13:14:00.837+0100 7f2e0d273700 3 rocksdb: [table/block_based_table_reader.cc:1117] Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch: expected 0, got 3423870535 in db/000304.sst offset 18446744073709551615 size 18446744073709551615
>> >>>>> 2020-05-20T13:14:00.841+0100 7f2e1957ee00 4 rocksdb: [db/version_set.cc:3757] Recovered from manifest file:db/MANIFEST-000312 succeeded,manifest_file_number is 312, next_file_number is 314, last_sequence is 22320582, log_number is 309,prev_log_number is 0,max_column_family is 0,min_log_number_to_keep is 0
>> >>>>> 2020-05-20T13:14:00.841+0100 7f2e1957ee00 4 rocksdb: [db/version_set.cc:3766] Column family [default] (ID 0), log number is 309
>> >>>>> 2020-05-20T13:14:00.841+0100 7f2e1957ee00 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1589976840843199, "job": 1, "event": "recovery_started", "log_files": [313]}
>> >>>>> 2020-05-20T13:14:00.841+0100 7f2e1957ee00 4 rocksdb: [db/db_impl_open.cc:583] Recovering log #313 mode 0
>> >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 3 rocksdb: [db/db_impl_open.cc:518] db.wal/000313.log: dropping 9044 bytes; Corruption: error in middle of record
>> >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 3 rocksdb: [db/db_impl_open.cc:518] db.wal/000313.log: dropping 86 bytes; Corruption: missing start of fragmented record(2)
>> >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 4 rocksdb: [db/db_impl.cc:390] Shutdown: canceling all background work
>> >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 4 rocksdb: [db/db_impl.cc:563] Shutdown complete
>> >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 -1 rocksdb: Corruption: error in middle of record
>> >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 -1 bluestore(/var/lib/ceph/osd/ceph-9) _open_db erroring opening db:
>> >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 1 bluefs umount
>> >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 1 fbmap_alloc 0x55daf2b3a900 shutdown
>> >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 1 bdev(0x55daf3838700 /var/lib/ceph/osd/ceph-9/block) close
>> >>>>> 2020-05-20T13:14:01.093+0100 7f2e1957ee00 1 bdev(0x55daf3838000 /var/lib/ceph/osd/ceph-9/block) close
>> >>>>> 2020-05-20T13:14:01.341+0100 7f2e1957ee00 -1 osd.9 0 OSD:init: unable to mount object store
>> >>>>> 2020-05-20T13:14:01.341+0100 7f2e1957ee00 -1 ** ERROR: osd init failed: (5) Input/output error

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx