Hi Igor,

Many thanks for the quick reply. Your advice concurs with my own thoughts:
given the damage, it is probably safest to wipe the OSDs and start over.

thanks again,

Jake

On 05/07/18 14:28, Igor Fedotov wrote:
> Hi Jake,
>
> IMO it doesn't make sense to recover from this drive/data, as the damage
> coverage looks pretty wide.
>
> By modifying the BlueFS code you can bypass that specific assertion, but
> BlueFS and the rest of the BlueStore metadata are most probably inconsistent
> and unrecoverable at this point. Given that you have valid replicated data,
> it's much simpler just to start these OSDs over.
>
> Thanks,
>
> Igor
>
> On 7/5/2018 3:58 PM, Jake Grimmett wrote:
>> Dear All,
>>
>> I have a Mimic (13.2.0) cluster which, due to a bad disk controller, has
>> corrupted three BlueStore OSDs on one node.
>>
>> Unfortunately these three OSDs crash when they try to start:
>>
>> systemctl start ceph-osd@193
>> (snip)
>> /BlueFS.cc: 828: FAILED assert(r != q->second->file_map.end())
>>
>> Full log here: http://p.ip.fi/yFYn
>>
>> "ceph-bluestore-tool repair" also crashes, with a similar error in
>> BlueFS.cc:
>>
>> # ceph-bluestore-tool repair --dev /dev/sdc2 --path /var/lib/ceph/osd/ceph-193
>> (snip)
>> /BlueFS.cc: 828: FAILED assert(r != q->second->file_map.end())
>>
>> Full log here: http://p.ip.fi/l_Q_
>>
>> This command works OK:
>>
>> # ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-193
>> inferring bluefs devices from bluestore path
>> {
>>     "/var/lib/ceph/osd/ceph-193/block": {
>>         "osd_uuid": "90b25336-9932-4e0b-a16b-51159568c398",
>>         "size": 8001457295360,
>>         "btime": "2017-12-08 15:46:40.034495",
>>         "description": "main",
>>         "bluefs": "1",
>>         "ceph_fsid": "f035ee98-abfd-4496-b903-a403b29c828f",
>>         "kv_backend": "rocksdb",
>>         "magic": "ceph osd volume v026",
>>         "mkfs_done": "yes",
>>         "ready": "ready",
>>         "whoami": "193"
>>     }
>> }
>>
>> # lsblk | grep sdc
>> sdc      8:32   0  7.3T  0 disk
>> ├─sdc1   8:33   0  100M  0 part /var/lib/ceph/osd/ceph-193
>> └─sdc2   8:34   0  7.3T  0 part
>>
>> Since the OSDs failed, the cluster has rebalanced, though I still have
>> ceph HEALTH_ERR:
>>
>> 95 scrub errors; Possible data damage: 11 pgs inconsistent
>>
>> Manual scrubs are not started by the OSD daemons (reported elsewhere, see
>> the thread '"ceph pg scrub" does not start').
>>
>> Looking at the old logs of the bad OSDs, I see ~3500 entries similar to:
>>
>>     -9> 2018-07-04 14:42:34.744 7f9ef0bbb1c0  2 rocksdb:
>> [/root/ceph-build/ceph-13.2.0/src/rocksdb/db/version_set.cc:1330] Unable
>> to load table properties for file 43530 --- Corruption: bad block
>> contents
>>
>> There is a much smaller number of crc errors, similar to:
>>
>>     -2> 2018-07-02 12:58:07.702 7fd3649eb1c0 -1
>> bluestore(/var/lib/ceph/osd/ceph-425) _verify_csum bad crc32c/0x1000
>> checksum at blob offset 0x0, got 0xff625379, expected 0x75b558bc, device
>> location [0xf5a66e0000~1000], logical extent 0x0~1000, object
>> #-1:2c691ffb:::osdmap.176500:0#
>>
>> I'm inclined to wipe these three OSDs and start again, but am happy to
>> try suggestions to repair.
>>
>> thanks for any suggestions,
>>
>> Jake
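
For reference, the assertion Igor suggests bypassing fires during BlueFS
journal replay, and the failed expression matches the OP_DIR_UNLINK branch of
BlueFS::_replay(). A rough, untested sketch of that kind of modification
(the exact surrounding code in 13.2.0 may differ) would be to turn the abort
into a skip:

    // src/os/bluestore/BlueFS.cc, BlueFS::_replay(), OP_DIR_UNLINK branch
    // (sketch only -- dirname/filename are decoded from the journal entry
    //  just above this point in the existing code)
    map<string,DirRef>::iterator q = dir_map.find(dirname);
    assert(q != dir_map.end());
    map<string,FileRef>::iterator r = q->second->file_map.find(filename);
    if (r == q->second->file_map.end()) {
      // upstream code asserts here (the BlueFS.cc:828 failure above);
      // skipping the op lets journal replay continue, but the metadata
      // inconsistency that triggered it remains
      derr << __func__ << " op_dir_unlink " << dirname << "/" << filename
           << " not found in dir, skipping" << dendl;
      break;
    }
    assert(r->second->refs > 0);
    --r->second->refs;
    q->second->file_map.erase(r);

Even if replay then completes, the rest of the BlueFS/BlueStore metadata is,
as Igor says, most probably still inconsistent, so a patch like this is only
useful for inspecting the OSD, not for putting it back into service;
redeploying the OSDs and letting the replicated data backfill remains the
simpler and safer route.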