Re: corrupt OSD: BlueFS.cc: 828: FAILED assert

Hi Igor,

Many thanks for the quick reply.

Your advice concurs with my own thoughts: given the damage, it's
probably safest to wipe the OSDs and start over.
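
For the record, here's a rough sketch of how I plan to recreate them
(assuming ceph-volume lvm and whole-disk OSDs; osd.193 and /dev/sdc
below are just taken from my earlier example, adjust per OSD):

# systemctl stop ceph-osd@193
# ceph osd out 193
# ceph osd purge 193 --yes-i-really-mean-it
# ceph-volume lvm zap /dev/sdc --destroy
# ceph-volume lvm create --bluestore --data /dev/sdc

i.e. stop the daemon, remove the OSD from the cluster, wipe the old
BlueStore data and partitions from the disk, then create a fresh
BlueStore OSD on the clean device.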

thanks again,

Jake


On 05/07/18 14:28, Igor Fedotov wrote:
> Hi Jake,
> 
> IMO it doesn't make sense to recover from this drive/data, as the damage
> coverage looks pretty wide.
> 
> By modifying the BlueFS code you can bypass that specific assertion, but
> most probably BlueFS and the other BlueStore metadata are inconsistent
> and unrecoverable at this point. Given that you have valid replicated
> data, it's much simpler just to start these OSDs over.
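> 
> If you want to double-check first, something like this should confirm
> whether the cluster still needs anything from a given OSD (osd id 193
> used here purely as an example):
> 
> # ceph osd safe-to-destroy 193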
> 
> 
> Thanks,
> 
> Igor
> 
> 
> On 7/5/2018 3:58 PM, Jake Grimmett wrote:
>> Dear All,
>>
>> I have a Mimic (13.2.0) cluster, which, due to a bad disk controller,
>> corrupted three BlueStore OSDs on one node.
>>
>> Unfortunately these three OSDs crash when they try to start.
>>
>> # systemctl start ceph-osd@193
>> (snip)
>> /BlueFS.cc: 828: FAILED assert(r != q->second->file_map.end())
>>
>> Full log here: http://p.ip.fi/yFYn
>>
>> "ceph-bluestore-tool repair" also crashes, with a similar error in
>> BlueFS.cc
>>
>> # ceph-bluestore-tool repair --dev /dev/sdc2 --path
>> /var/lib/ceph/osd/ceph-193
>> (snip)
>> /BlueFS.cc: 828: FAILED assert(r != q->second->file_map.end())
>>
>> Full log here: http://p.ip.fi/l_Q_
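>>
>> I presume a read-only fsck would hit the same assert, so I haven't
>> pursued it, but for completeness:
>>
>> # ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-193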
>>
>> This command works OK:
>>
>> # ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-193
>> inferring bluefs devices from bluestore path
>> {
>>      "/var/lib/ceph/osd/ceph-193/block": {
>>          "osd_uuid": "90b25336-9932-4e0b-a16b-51159568c398",
>>          "size": 8001457295360,
>>          "btime": "2017-12-08 15:46:40.034495",
>>          "description": "main",
>>          "bluefs": "1",
>>          "ceph_fsid": "f035ee98-abfd-4496-b903-a403b29c828f",
>>          "kv_backend": "rocksdb",
>>          "magic": "ceph osd volume v026",
>>          "mkfs_done": "yes",
>>          "ready": "ready",
>>          "whoami": "193"
>>      }
>> }
>>
>> # lsblk | grep sdc
>> sdc       8:32   0   7.3T  0 disk
>> ├─sdc1    8:33   0   100M  0 part  /var/lib/ceph/osd/ceph-193
>> └─sdc2    8:34   0   7.3T  0 part
>>
>> Since the OSDs failed, the cluster has rebalanced, though I still have
>> ceph HEALTH_ERR:
>> 95 scrub errors; Possible data damage: 11 pgs inconsistent
>>
>> Manual scrubs are not started by the OSD daemons (reported elsewhere, see
>>    "ceph pg scrub" does not start)
>>
>> Looking at the old logs from the bad OSDs, I see ~3500 entries, all
>> similar to:
>>
>>      -9> 2018-07-04 14:42:34.744 7f9ef0bbb1c0  2 rocksdb:
>> [/root/ceph-build/ceph-13.2.0/src/rocksdb/db/version_set.cc:1330] Unable
>> to load table properties for file 43530 --- Corruption: bad block
>> contents���5b
>>
>> There is a much smaller number of crc errors, similar to:
>>
>>     -2> 2018-07-02 12:58:07.702 7fd3649eb1c0 -1
>> bluestore(/var/lib/ceph/osd/ceph-425) _verify_csum bad crc32c/0x1000
>> checksum at blob offset 0x0, got 0xff625379, expected 0x75b558bc, device
>> location [0xf5a66e0000~1000], logical extent 0x0~1000, object
>> #-1:2c691ffb:::osdmap.176500:0#
>>
>> I'm inclined to wipe these three OSDs and start again, but am happy to
>> try any suggestions for repair.
>>
>> thanks for any suggestions,
>>
>> Jake
> 


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



