Salvage CephFS after lost PG

Hi all,


I'm looking for some suggestions on how to do something inappropriate. 


In a nutshell, I've lost the WAL/DB for three bluestore OSDs on a small cluster and, as a result of those three OSDs going offline, I've lost a placement group (7.a7). How I achieved this feat is an embarrassing mistake, which I don't think has bearing on my question.


The OSDs were created a few months ago with ceph-deploy:

/usr/local/bin/ceph-deploy --overwrite-conf osd create --bluestore --data /dev/vdc1 --block-db /dev/vdf1 ceph-a


With the 3 OSDs out, I'm sitting at OSD_BACKFILLFULL.


First, PG 7.a7 belongs to the data pool rather than the metadata pool. If I run "cephfs-data-scan pg_files / 7.a7", I get a list of 4149 files/objects, but then it hangs. I don't understand why this would hang if only the data pool is impacted (since pg_files only operates on the metadata pool?).
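

(For reference, I believe this is the standard way to confirm which pool a PG belongs to and where it currently maps; I haven't pasted the output here, and pool 7 should be my data pool's ID.)

# list pools with their numeric IDs (pool 7 should be the CephFS data pool)
ceph osd pool ls detail
# show which OSDs PG 7.a7 maps to / is acting on
ceph pg map 7.a7
# list PGs stuck inactive, which should include 7.a7
ceph pg dump_stuck inactive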


The ceph-log shows:

cluster [WRN] slow request 30.894832 seconds old, received at 2019-01-20 18:00:12.555398: client_request(client.25017730:218006 lookup #0x10001c8ce15/000001 2019-01-20 18:00:12.550421 caller_uid=0, caller_gid=0{}) currently failed to rdlock, waiting


Is the hang perhaps related to the OSD_BACKFILLFULL? If so, I could add some completely new OSDs to fix that problem. I have held off doing that for now, as it will trigger a lot of data movement which might turn out to be unnecessary.
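

For what it's worth, my plan was to first check the current ratios and per-OSD utilisation, and possibly nudge the backfillfull threshold up a little as a stop-gap instead of adding OSDs straight away (the 0.92 below is just an example value, not something I've applied yet):

# show the configured full/backfillfull/nearfull ratios and per-OSD utilisation
ceph osd dump | grep ratio
ceph osd df
# temporarily raise the backfillfull threshold (example value only)
ceph osd set-backfillfull-ratio 0.92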


Or is the hang indeed related to the missing PG?
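

To try to tell the two apart, I was going to look at what the MDS is actually blocked on via its admin socket (mds.ceph-a is just a guess at my daemon's name; substitute the active MDS):

# on the active MDS host: which client requests are stuck, and on what
ceph daemon mds.ceph-a dump_ops_in_flight
# whether the MDS has outstanding OSD operations (e.g. reads/writes against 7.a7)
ceph daemon mds.ceph-a objecter_requests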


Second, if I try to copy files out of the CephFS filesystem, I get a few hundred files and then that too hangs. None of the files I'm attempting to copy are listed in the pg_files output (although, since pg_files hangs, perhaps it hadn't got to those files yet). Again, shouldn't I be able to access files which are not associated with the missing data-pool PG?
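

In case it matters, my understanding is that (with the default file layout) a file's data objects are named <inode-in-hex>.<object-index>, so I can check whether a particular stuck file actually lands in 7.a7 along these lines (the path and the pool name cephfs_data are placeholders for my setup):

# inode of a file that hangs on copy (path is an example)
ino=$(stat -c %i /mnt/cephfs/some/stuck/file)
# map that file's first data object to a PG
ceph osd map cephfs_data "$(printf '%x' "$ino").00000000"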


Lastly, I want to know whether there is some way to recreate the WAL/DB while leaving the OSD data intact, and/or to fool one of the OSDs into thinking everything is OK so that it can serve up the data it holds for the missing PG.
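

The closest tooling I've found is ceph-bluestore-tool, but as far as I can tell attaching a new DB only gives the OSD an empty RocksDB, so it doesn't bring back the metadata that lived on the lost device (and it may refuse to run at all while the original block.db is missing). I'd love to be told otherwise. Roughly what I was looking at, with placeholder paths/devices, and noting that bluefs-bdev-new-db may depend on the exact mimic point release:

# inspect what the OSD's data device thinks its companion devices are
ceph-bluestore-tool show-label --dev /dev/vdc1
# consistency check of what is left on the main device (OSD must be stopped)
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-N
# attach a brand-new (empty) DB device; this does NOT recover the old DB contents
# (/dev/vdf2 is a placeholder)
ceph-bluestore-tool bluefs-bdev-new-db --path /var/lib/ceph/osd/ceph-N --dev-target /dev/vdf2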


From reading the mailing list and documentation, I know that this is not a "safe" operation:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021713.html

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-January/024268.html


However, my current status is an unusable CephFS and limited access to the data. I'd like to get as much data off it as possible; then I expect I'll have to recreate it. Between the backups I have and what I can salvage from the cluster, I should hopefully have most of what I need.


I know what I *should* have done, but now I'm at this point, I know I'm asking for something which would never be required on a properly-run cluster.


If it really is not possible to get the (possibly corrupt) PG back again, can I get the cluster back so the remainder of the files are accessible?
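

My current understanding, and please correct me if this is wrong, is that if the PG's contents really are gone, the way forward is to tell the cluster the dead OSDs aren't coming back and then recreate 7.a7 as an empty PG, accepting that anything in it is lost:

# for each of the three dead OSDs, declare it permanently lost
ceph osd lost <osd-id> --yes-i-really-mean-it
# recreate the lost PG as empty (destroys any chance of recovering its contents;
# newer releases may also require --yes-i-really-mean-it)
ceph osd force-create-pg 7.a7

After that, I assume I'd still need some combination of an MDS scrub and cleanup of the files whose objects lived in that PG.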


Currently running mimic 13.2.4 on all nodes.


Status:

$ ceph health detail - https://gist.github.com/kawaja/f59d231179b3186748eca19aae26bcd4

$ ceph fs get main - https://gist.github.com/kawaja/a7ab0b285d53dee6a950a4310be4fa5a


Any advice on where I could go from here would be greatly appreciated.


thanks,

rik.

