corrupt OSD: BlueFS.cc: 828: FAILED assert

Dear All,

I have a Mimic (13.2.0) cluster in which a bad disk controller
corrupted three BlueStore OSDs on one node.

Unfortunately, these three OSDs crash when they try to start:

systemctl start ceph-osd@193
(snip)
/BlueFS.cc: 828: FAILED assert(r != q->second->file_map.end())

Full log here: http://p.ip.fi/yFYn

"ceph-bluestore-tool repair" also crashes, with the same assert in BlueFS.cc:

# ceph-bluestore-tool repair --dev /dev/sdc2 --path /var/lib/ceph/osd/ceph-193
(snip)
/BlueFS.cc: 828: FAILED assert(r != q->second->file_map.end())

Full log here: http://p.ip.fi/l_Q_

This command works OK:

# ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-193
inferring bluefs devices from bluestore path
{
    "/var/lib/ceph/osd/ceph-193/block": {
        "osd_uuid": "90b25336-9932-4e0b-a16b-51159568c398",
        "size": 8001457295360,
        "btime": "2017-12-08 15:46:40.034495",
        "description": "main",
        "bluefs": "1",
        "ceph_fsid": "f035ee98-abfd-4496-b903-a403b29c828f",
        "kv_backend": "rocksdb",
        "magic": "ceph osd volume v026",
        "mkfs_done": "yes",
        "ready": "ready",
        "whoami": "193"
    }
}

# lsblk | grep sdc
sdc       8:32   0   7.3T  0 disk
├─sdc1    8:33   0   100M  0 part  /var/lib/ceph/osd/ceph-193
└─sdc2    8:34   0   7.3T  0 part
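
As a quick sanity check that the label describes the same partition that lsblk shows, the "size" field can be converted to TiB. This is only a minimal sketch: the function names are hypothetical, and it assumes the show-label output is one informational line followed by a single JSON object, as in the output above.

```python
import json


def parse_show_label(output):
    """Parse show-label output, skipping any text before the first '{'.

    Assumption: everything from the first '{' onward is one JSON object.
    """
    start = output.index("{")
    return json.loads(output[start:])


def size_tib(label):
    """Return the label's 'size' field (bytes) in TiB."""
    return int(label["size"]) / 2**40
```

With the label above, 8001457295360 bytes is about 7.28 TiB, which agrees with the 7.3T that lsblk reports for sdc2.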

Since the OSDs failed, the cluster has rebalanced, but ceph still reports
HEALTH_ERR:
95 scrub errors; Possible data damage: 11 pgs inconsistent

Manual scrubs are not started by the OSD daemons (reported elsewhere, see
the thread '"ceph pg scrub" does not start').

Looking at the old logs, I see ~3500 entries in the logs of the bad
OSDs, all similar to:

    -9> 2018-07-04 14:42:34.744 7f9ef0bbb1c0  2 rocksdb: [/root/ceph-build/ceph-13.2.0/src/rocksdb/db/version_set.cc:1330] Unable to load table properties for file 43530 --- Corruption: bad block contents

There is a much smaller number of CRC errors, similar to:

2> 2018-07-02 12:58:07.702 7fd3649eb1c0 -1 bluestore(/var/lib/ceph/osd/ceph-425) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0xff625379, expected 0x75b558bc, device location [0xf5a66e0000~1000], logical extent 0x0~1000, object #-1:2c691ffb:::osdmap.176500:0#
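
To tally the two kinds of entries quoted above in an OSD log, something like the following could be used. It is a rough sketch, not a Ceph tool: the function name and the marker labels are hypothetical, and the substrings are simply copied from the two excerpts, so they may need adjusting for other releases.

```python
# Substrings copied from the log excerpts above (assumed to appear
# unbroken on a single line in the real log file).
MARKERS = {
    "rocksdb_bad_block": "Corruption: bad block",
    "bluestore_bad_crc": "_verify_csum bad crc32c",
}


def count_corruption_lines(lines):
    """Count the lines containing each marker substring.

    `lines` is any iterable of strings, e.g. an open log file.
    """
    counts = {name: 0 for name in MARKERS}
    for line in lines:
        for name, marker in MARKERS.items():
            if marker in line:
                counts[name] += 1
    return counts
```

It can be run against a log by passing an open file, for example `count_corruption_lines(open(path, errors="replace"))`, where `path` is the OSD log file and `errors="replace"` guards against the stray binary bytes that show up in the corrupted entries.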

I'm inclined to wipe these three OSDs and start again, but am happy to
try suggestions to repair them.

Thanks for any suggestions,

Jake
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com