Re: 15.2.2 Upgrade - Corruption: error in middle of record

Ashley Merrick <singapore@xxxxxxxxxxxxxx> · Sat, 23 May 2020 17:03:19 +0800

Hello Chris,

Great to hear. A few questions:

Once you have set bluefs_preextend_wal_files to false, are you just rebuilding the OSDs that failed? Or are you going through and rebuilding every OSD, even the working ones?

Or does setting the bluefs_preextend_wal_files value to false and leaving the OSD running fix the WAL automatically?

Thanks

---- On Sat, 23 May 2020 15:53:42 +0800 Chris Palmer <chris.palmer@xxxxxxxxx> wrote ----

Hi Ashley

Igor has done a great job of tracking down the problem, and we have
finally shown evidence of the type of corruption it would produce in
one of my WALs. Our feeling at the moment is that the problem can be
worked around by setting bluefs_preextend_wal_files to false on affected
OSDs while they are running (but see below), although Igor does note
that there is a small risk in doing this. I've agreed a plan of
action based on this route: recreating the failed OSDs, and then
cycling through the others until all are healthy. I've started this
now, and so far it looks promising, although of course I have to
wait for recovery/rebalancing. This is the fastest route to
recovery, although there are other options.

I'll post as it progresses. The good news seems to be that there
shouldn't be any actual data corruption or loss, provided that this
can be done before OSDs are taken down (other than as part of this
process). My understanding is that there will be some degree of
performance penalty until the root cause is fixed in the next
release and preextending can be turned back on. However, it does seem
like I can get back to a stable/safe position without waiting for a
software release.

I'm just working through this at the moment though, so please don't
take the above as any form of recommendation. It is important not to
try to restart OSDs in the meantime, though. I'm sure Igor will
publish some more expert recommendations in due course...

 Regards, Chris

On 23/05/2020 06:54, Ashley Merrick wrote:

Thanks Igor,

Do you have any idea of an ETA or plan for people who are
running 15.2.2 to be able to patch/fix the issue?

I had a read of the ticket, and it seems the corruption is
happening but the WAL is not read until the OSD restarts. So I
imagine we will need some form of fix/patch that we can apply to
a running OSD before we restart it, as a normal OSD upgrade
requires an OSD restart to apply the new code, which would
result in a corrupt OSD.

Thanks

---- On Sat, 23 May 2020 00:12:59 +0800 Igor Fedotov <ifedotov@xxxxxxx> wrote ----

Status update: 

 Finally we have the first patch to fix the issue in master:
 https://github.com/ceph/ceph/pull/35201

 And the ticket has been updated with root cause analysis:
 https://tracker.ceph.com/issues/45613

On 5/21/2020 2:07 PM, Igor Fedotov wrote:

 @Chris - unfortunately it looks like the corruption is permanent since
 valid WAL data are presumably overwritten with other data. Hence I
 don't know of any way to recover - perhaps you can try cutting the
 WAL file off, which will allow the OSD to start, with some of the
 latest ops lost. One can use the exported BlueFS as a drop-in
 replacement for the regular DB volume, but I'm not aware of the details.

 And the above are just speculations, can't say for sure if it helps...

 I can't explain why the WAL doesn't have a zero block in your case though.
 There is a small chance this is a different issue. Just in case - could
 you please search for 32K zero blocks over the whole file? And the same
 for another OSD?
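
A rough sketch of the zero-block search asked for above, in Python, for anyone who wants to script it. The function names are made up for illustration. The scan only looks at 32K-aligned blocks, which is an assumption: the offset 0x470000 quoted further down is 32K-aligned, but a zero run starting at an unaligned offset would be missed. It only reads a file already exported with bluefs-export and does not touch the OSD.

```python
BLOCK_SIZE = 32 * 1024  # 32K, the zero-run size mentioned in this thread


def is_zero_block(path, offset, size=BLOCK_SIZE):
    """Return True if `size` bytes starting at `offset` exist and are all zero."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(size)
    # A short read (offset near or past end of file) does not count as a match.
    return len(data) == size and data.count(0) == size


def find_zero_blocks(path, size=BLOCK_SIZE):
    """Return the offsets of all size-aligned blocks that are entirely zero.

    Assumption: only aligned blocks are checked; a short trailing block
    at the end of the file is ignored.
    """
    offsets = []
    with open(path, "rb") as f:
        offset = 0
        while True:
            data = f.read(size)
            if len(data) < size:
                break
            if data.count(0) == size:
                offsets.append(offset)
            offset += size
    return offsets
```

For example, is_zero_block(path, 0x470000) checks the single offset from the quoted instructions, and find_zero_blocks(path) covers the whole file, where path points at the exported .log file under the target dir.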

 Thanks, 

 Igor 

 > Short update on the issue: 
 > 
 > Finally we're able to reproduce the issue in master (not octopus),
 > investigating further..
 >
 > @Chris - to make sure you're facing the same issue could you please
 > check the content of the broken file. To do so:
 > 
 > 1) run "ceph-bluestore-tool --path <path-to-osd> --out-dir <target dir>
 > --command bluefs-export"
 > 
 > This will export bluefs files to <target dir> 
 > 
 > 2) Check the content for file db.wal/002040.log at offset 0x470000
 >
 > This will presumably contain 32K of zero bytes. Is this the case?
 > 
 > 
 > No hurry as I'm just making sure symptoms in Octopus are the same...
 > 
 > 
 > Thanks, 
 > 
 > Igor 
 > 
 > On 5/20/2020 5:24 PM, Igor Fedotov wrote: 
 >> Chris, 
 >> 
 >> got them, thanks! 
 >> 
 >> Investigating.... 
 >> 
 >> 
 >> Thanks, 
 >> 
 >> Igor 
 >> 
 >> On 5/20/2020 5:23 PM, Chris Palmer wrote: 
 >>> Hi Igor 
 >>> I've sent you these directly as they're a bit chunky. Let me know if
 >>> you haven't got them.
 >>> Thx, Chris 
 >>> 
 >>> On 20/05/2020 14:43, Igor Fedotov wrote: 
 >>>> Hi Chris,
 >>>> 
 >>>> could you please share the full log prior to the first failure?
 >>>>
 >>>> Also if possible please set debug-bluestore/debug-bluefs to 20 and
 >>>> collect another one for failed OSD startup.
 >>>> 
 >>>> 
 >>>> Thanks, 
 >>>> 
 >>>> Igor 
 >>>> 
 >>>> 
 >>>> On 5/20/2020 4:39 PM, Chris Palmer wrote: 
 >>>>> I'm getting similar errors after rebooting a node. Cluster was
 >>>>> upgraded 15.2.1 -> 15.2.2 yesterday. No problems after rebooting
 >>>>> during upgrade.
 >>>>>
 >>>>> On the node I just rebooted, 2/4 OSDs won't restart. Similar logs
 >>>>> from both. Logs from one below.
 >>>>> Neither OSD has compression enabled, although there is a
 >>>>> compression-related error in the log.
 >>>>> Both are replicated x3. One has data on HDD & separate WAL/DB on
 >>>>> NVMe partition, the other is everything on NVMe partition only.
 >>>>>
 >>>>> Feeling kinda nervous here - advice welcomed!!
 >>>>> 
 >>>>> Thx, Chris 
 >>>>> 
 >>>>> 
 >>>>> 
 >>>>> 2020-05-20T13:14:00.837+0100 7f2e0d273700  3 rocksdb: [table/block_based_table_reader.cc:1117] Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch: expected 0, got 3423870535  in db/000304.sst offset 18446744073709551615 size 18446744073709551615
 >>>>> 2020-05-20T13:14:00.841+0100 7f2e1957ee00  4 rocksdb: [db/version_set.cc:3757] Recovered from manifest succeeded,manifest_file_number is 312, next_file_number is 314, last_sequence is 22320582, log_number is 309,prev_log_number is 0,max_column_family is 0,min_log_number_to_keep is 0
 >>>>>
 >>>>> 2020-05-20T13:14:00.841+0100 7f2e1957ee00  4 rocksdb: [db/version_set.cc:3766] Column family [default] (ID 0), log number is 309
 >>>>>
 >>>>> 2020-05-20T13:14:00.841+0100 7f2e1957ee00  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1589976840843199, "job": 1, "event": "recovery_started", "log_files": [313]}
 >>>>> 2020-05-20T13:14:00.841+0100 7f2e1957ee00  4 rocksdb: [db/db_impl_open.cc:583] Recovering log #313 mode 0
 >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00  3 rocksdb: [db/db_impl_open.cc:518] db.wal/000313.log: dropping 9044 bytes; Corruption: error in middle of record
 >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00  3 rocksdb: [db/db_impl_open.cc:518] db.wal/000313.log: dropping 86 bytes; Corruption: missing start of fragmented record(2)
 >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00  4 rocksdb: [db/db_impl.cc:390] Shutdown: canceling all background work
 >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00  4 rocksdb: [db/db_impl.cc:563] Shutdown complete
 >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 -1 rocksdb: Corruption: error in middle of record
 >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00 -1 bluestore(/var/lib/ceph/osd/ceph-9) _open_db erroring opening db:
 >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00  1 bluefs umount
 >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00  1 fbmap_alloc 0x55daf2b3a900 shutdown
 >>>>> 2020-05-20T13:14:00.937+0100 7f2e1957ee00  1 bdev(0x55daf3838700 /var/lib/ceph/osd/ceph-9/block) close
 >>>>> 2020-05-20T13:14:01.093+0100 7f2e1957ee00  1 bdev(0x55daf3838000 /var/lib/ceph/osd/ceph-9/block) close
 >>>>> 2020-05-20T13:14:01.341+0100 7f2e1957ee00 -1 osd.9 0 OSD:init: unable to mount object store
 >>>>> 2020-05-20T13:14:01.341+0100 7f2e1957ee00 -1 ** ERROR: osd init failed: (5) Input/output error
 >>>>> _______________________________________________
 >>>>> ceph-users mailing list -- ceph-users@xxxxxxx
 >>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
 >>> 