Re: 15.2.2 Upgrade - Corruption: error in middle of record

Ashley Merrick <singapore@xxxxxxxxxxxxxx> · Wed, 20 May 2020 21:01:02 +0800

I attached the log, but it was too big and got held for moderation.

Here it is in a pastebin: https://pastebin.pl/view/69b2beb9

I have cut the log to start from the point of the original upgrade.
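
For anyone wanting to pull the RocksDB WAL corruption lines out of a log this size, here is a minimal sketch. It assumes the line shape seen in the excerpt quoted further down (`db/NNNNNN.log: dropping N bytes; Corruption: <reason>`). The function name and the regular expression are illustrative assumptions, not part of Ceph or RocksDB.

```python
import re
from collections import defaultdict

# Assumed line shape, copied from the OSD log excerpt in this thread:
#   ... rocksdb: [db/db_impl_open.cc:518] db/002769.log: dropping 537526 bytes; Corruption: error in middle of record
DROP_RE = re.compile(
    r"rocksdb: \[[^\]]+\] (?P<wal>\S+\.log): dropping (?P<nbytes>\d+) bytes; "
    r"Corruption: (?P<reason>.+?)\s*$"
)


def summarize_wal_drops(lines):
    """Group 'dropping N bytes' lines by (WAL file, reason).

    Returns a dict mapping (wal_file, reason) -> [line_count, total_bytes].
    Lines that do not match the assumed shape are ignored.
    """
    summary = defaultdict(lambda: [0, 0])
    for line in lines:
        match = DROP_RE.search(line)
        if match is None:
            continue
        entry = summary[(match.group("wal"), match.group("reason"))]
        entry[0] += 1                            # number of matching lines
        entry[1] += int(match.group("nbytes"))   # bytes RocksDB reported dropping
    return dict(summary)
```

It takes any iterable of lines, so passing an open file object for a saved copy of the log should work. It only summarizes what RocksDB already reported; it does not inspect or repair the WAL itself.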

Thanks

---- On Wed, 20 May 2020 20:55:51 +0800 Igor Fedotov <ifedotov@xxxxxxx> wrote ----

Dan, thanks for the info. Good to know. 

The failed QA run in the ticket uses snappy, though. 

And in fact anything that writes to process memory can introduce data 
corruption in a similar manner. 

So I will keep that in mind, but IMO the relation to compression is still not 
evident... 

Kind regards, 

Igor 

On 5/20/2020 3:32 PM, Dan van der Ster wrote: 
> lz4? It's not obviously related, but I've seen it involved in really 
> non-obvious ways: https://tracker.ceph.com/issues/39525 
> 
> -- dan 
> 
> On Wed, May 20, 2020 at 2:27 PM Ashley Merrick <singapore@xxxxxxxxxxxxxx> wrote: 
>> Thanks. FYI, the OSDs that went down back two pools: an erasure-code Meta pool (RBD) and the CephFS Meta pool. The CephFS pool does have compression enabled (I noticed compression mentioned in the Ceph tracker). 
>> 
>> 
>> 
>> Thanks 
>> 
>> 
>> 
>> 
>> 
>> ---- On Wed, 20 May 2020 20:17:33 +0800 Igor Fedotov <ifedotov@xxxxxxx> wrote ---- 
>> 
>> 
>> 
>> Hi Ashley, 
>> 
>> Looks like this is a regression. Neha observed similar error(s) during 
>> her QA run, see https://tracker.ceph.com/issues/45613 
>> 
>> 
>> Please preserve the broken OSDs for a while if possible; I'll likely come 
>> back to you for more information to troubleshoot. 
>> 
>> 
>> Thanks, 
>> 
>> Igor 
>> 
>> On 5/20/2020 1:26 PM, Ashley Merrick wrote: 
>> 
>>> From reading online it looked like a dead-end error, so I recreated the 3 OSDs on that node, and they are now working fine after a reboot. 
>>> 
>>> 
>>> 
>>> However, I restarted the next server, which also has 3 OSDs, and one of them is now facing the same issue. 
>>> 
>>> 
>>> 
>>> Let me know if you need any more logs. 
>>> 
>>> 
>>> 
>>> Thanks 
>>> 
>>> 
>>> 
>>> ---- On Wed, 20 May 2020 17:02:31 +0800 Ashley Merrick <singapore@xxxxxxxxxxxxxx> wrote ---- 
>>> 
>>> 
>>> I just upgraded a cephadm cluster from 15.2.1 to 15.2.2. 
>>> 
>>> 
>>> 
>>> The upgrade itself went fine. However, after restarting one node that has 3 OSDs for ecmeta, two of the 3 OSDs now won't boot, failing with the following error: 
>>> 
>>> 
>>> 
>>> May 20 08:29:42 sn-m01 bash[6833]: debug 2020-05-20T08:29:42.598+0000 7fbcc46f7ec0  4 rocksdb: [db/version_set.cc:3757] Recovered from manifest file:db/MANIFEST-002768 succeeded,manifest_file_number is 2768, next_file_number is 2775, last_sequence is 188026749, log_number is 2767,prev_log_number is 0,max_column_family is 0,min_log_number_to_keep is 0 
>>> 
>>> May 20 08:29:42 sn-m01 bash[6833]: debug 2020-05-20T08:29:42.598+0000 7fbcc46f7ec0  4 rocksdb: [db/version_set.cc:3766] Column family [default] (ID 0), log number is 2767 
>>> 
>>> May 20 08:29:42 sn-m01 bash[6833]: debug 2020-05-20T08:29:42.598+0000 7fbcc46f7ec0  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1589963382599157, "job": 1, "event": "recovery_started", "log_files": [2769]} 
>>> 
>>> May 20 08:29:42 sn-m01 bash[6833]: debug 2020-05-20T08:29:42.598+0000 7fbcc46f7ec0  4 rocksdb: [db/db_impl_open.cc:583] Recovering log #2769 mode 0 
>>> 
>>> May 20 08:29:42 sn-m01 bash[6833]: debug 2020-05-20T08:29:42.598+0000 7fbcc46f7ec0  3 rocksdb: [db/db_impl_open.cc:518] db/002769.log: dropping 537526 bytes; Corruption: error in middle of record 
>>> 
>>> May 20 08:29:42 sn-m01 bash[6833]: debug 2020-05-20T08:29:42.598+0000 7fbcc46f7ec0  3 rocksdb: [db/db_impl_open.cc:518] db/002769.log: dropping 32757 bytes; Corruption: missing start of fragmented record(1) 
>>> 
>>> May 20 08:29:42 sn-m01 bash[6833]: debug 2020-05-20T08:29:42.602+0000 7fbcc46f7ec0  3 rocksdb: [db/db_impl_open.cc:518] db/002769.log: dropping 32757 bytes; Corruption: missing start of fragmented record(1) 
>>> 
>>> May 20 08:29:42 sn-m01 bash[6833]: debug 2020-05-20T08:29:42.602+0000 7fbcc46f7ec0  3 rocksdb: [db/db_impl_open.cc:518] db/002769.log: dropping 32757 bytes; Corruption: missing start of fragmented record(1) 
>>> 
>>> May 20 08:29:42 sn-m01 bash[6833]: debug 2020-05-20T08:29:42.602+0000 7fbcc46f7ec0  3 rocksdb: [db/db_impl_open.cc:518] db/002769.log: dropping 32757 bytes; Corruption: missing start of fragmented record(1) 
>>> 
>>> May 20 08:29:42 sn-m01 bash[6833]: debug 2020-05-20T08:29:42.602+0000 7fbcc46f7ec0  3 rocksdb: [db/db_impl_open.cc:518] db/002769.log: dropping 32757 bytes; Corruption: missing start of fragmented record(1) 
>>> 
>>> May 20 08:29:42 sn-m01 bash[6833]: debug 2020-05-20T08:29:42.602+0000 7fbcc46f7ec0  3 rocksdb: [db/db_impl_open.cc:518] db/002769.log: dropping 32757 bytes; Corruption: missing start of fragmented record(1) 
>>> 
>>> May 20 08:29:42 sn-m01 bash[6833]: debug 2020-05-20T08:29:42.602+0000 7fbcc46f7ec0  3 rocksdb: [db/db_impl_open.cc:518] db/002769.log: dropping 23263 bytes; Corruption: missing start of fragmented record(2) 
>>> 
>>> May 20 08:29:42 sn-m01 bash[6833]: debug 2020-05-20T08:29:42.602+0000 7fbcc46f7ec0  4 rocksdb: [db/db_impl.cc:390] Shutdown: canceling all background work 
>>> 
>>> May 20 08:29:42 sn-m01 bash[6833]: debug 2020-05-20T08:29:42.602+0000 7fbcc46f7ec0  4 rocksdb: [db/db_impl.cc:563] Shutdown complete 
>>> 
>>> May 20 08:29:42 sn-m01 bash[6833]: debug 2020-05-20T08:29:42.602+0000 7fbcc46f7ec0 -1 rocksdb: Corruption: error in middle of record 
>>> 
>>> May 20 08:29:42 sn-m01 bash[6833]: debug 2020-05-20T08:29:42.602+0000 7fbcc46f7ec0 -1 bluestore(/var/lib/ceph/osd/ceph-0) _open_db erroring opening db: 
>>> 
>>> May 20 08:29:42 sn-m01 bash[6833]: debug 2020-05-20T08:29:42.602+0000 7fbcc46f7ec0  1 bdev(0x558a28dd0700 /var/lib/ceph/osd/ceph-0/block) close 
>>> 
>>> May 20 08:29:42 sn-m01 bash[6833]: debug 2020-05-20T08:29:42.870+0000 7fbcc46f7ec0  1 bdev(0x558a28dd0000 /var/lib/ceph/osd/ceph-0/block) close 
>>> 
>>> May 20 08:29:43 sn-m01 bash[6833]: debug 2020-05-20T08:29:43.118+0000 7fbcc46f7ec0 -1 osd.0 0 OSD:init: unable to mount object store 
>>> 
>>> May 20 08:29:43 sn-m01 bash[6833]: debug 2020-05-20T08:29:43.118+0000 7fbcc46f7ec0 -1  ** ERROR: osd init failed: (5) Input/output error 
>>> 
>>> 
>>> 
>>> Have I hit a bug, or is there something I can do to try to fix these OSDs? 
>>> 
>>> 
>>> 
>>> Thanks 
>>> _______________________________________________ 
>>> ceph-users mailing list -- ceph-users@xxxxxxx 
>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx 