Hello all, could someone please help me recover from a recent failure of all OSDs in a cache tier pool?

My Ceph cluster has an ordinary replica-2 pool with a writeback cache tier over it (also replica 2), backed by two 500GB SSD OSDs. Both cache OSDs were created with the standard ceph-deploy tool and have two partitions: one for the journal and one XFS data partition. The target_max_bytes parameter for the cache pool was set to 70% of the size of a single SSD to avoid overflow. This configuration worked fine for years.

Recently, however, for some unknown reason, while exporting a large 300GB raw RBD image with the 'rbd export' command, both cache OSDs filled up to 100% and crashed. In an attempt to flush all the data from the cache to the underlying pool and avoid further damage, I switched the cache pool into 'forward' mode and restarted both cache OSDs. They ran for a few minutes, segfaulted again, and now do not start at all.

Debugging the crash errors, I found that they are related to decoding attributes. When I checked random object files and directories on the affected OSDs with 'getfattr -d', I discovered that NO extended attributes exist anymore. So I suspect that, between the filesystems filling up to 100% and the OSD daemons being restarted several times, XFS was somehow corrupted and lost the extended attributes that Ceph requires to operate.

The question is: is it possible to somehow recover the attributes, or to flush the cached data back to the cold storage pool? Could someone please advise or help recover the data?

Regards,
Dmit
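
P.S. For reference, the commands involved were roughly the following. Pool, image, and path names are placeholders here, not the exact ones from my cluster:

    # cache pool sizing: ~70% of one 500GB SSD
    ceph osd pool set cache-pool target_max_bytes 350000000000

    # the export during which both cache OSDs filled up
    rbd export rbd-pool/big-image /backup/big-image.raw

    # after the crash: switch the cache tier to forward mode
    ceph osd tier cache-mode cache-pool forward

    # checking extended attributes on an object file inside an OSD data dir
    getfattr -d /var/lib/ceph/osd/ceph-2/current/<pgid>_head/<object-file>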