Re: Salvage CEPHFS after lost PG

If you have a backfillfull OSD, no PGs will be able to migrate to it. 
The better option is to just add hard drives, because at least one of 
your OSDs is too full.

I know you can set the backfillfull ratios with commands like these:

ceph tell osd.* injectargs '--mon_osd_full_ratio=0.970000'
ceph tell osd.* injectargs '--mon_osd_backfillfull_ratio=0.950000'

ceph tell osd.* injectargs '--mon_osd_full_ratio=0.950000'
ceph tell osd.* injectargs '--mon_osd_backfillfull_ratio=0.900000'

Or maybe decrease the weight of the full OSD: check the OSDs with 
'ceph osd status' and make sure your nodes have an even distribution 
of the storage.
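
To see which OSD is the problem and to nudge data off it, something 
like this should do (a sketch; osd.12 is just a placeholder for 
whichever OSD shows up as backfillfull):

ceph osd df tree                # per-OSD utilisation, weight and reweight
ceph osd reweight osd.12 0.90   # temporarily lower the reweight of the full OSD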

-----Original Message-----
From: Rik [mailto:rik@xxxxxxxxxx] 
Sent: Sunday 20 January 2019 8:47
To: ceph-users@xxxxxxxxxxxxxx
Subject:  Salvage CEPHFS after lost PG

Hi all,




I'm looking for some suggestions on how to do something inappropriate. 




In a nutshell, I've lost the WAL/DB for three bluestore OSDs on a small 
cluster and, as a result of those three OSDs going offline, I've lost a 
placement group (7.a7). How I achieved this feat is an embarrassing 
mistake, which I don't think has bearing on my question.




The OSDs were created a few months ago with ceph-deploy:

/usr/local/bin/ceph-deploy --overwrite-conf osd create --bluestore 
--data /dev/vdc1 --block-db /dev/vdf1 ceph-a




With the 3 OSDs out, I'm sitting at OSD_BACKFILLFULL.




First, PG 7.a7 belongs to the data pool rather than the metadata pool. 
If I run "cephfs-data-scan pg_files / 7.a7", I get a list of 4149 
files/objects and then it hangs. I don't understand why this would 
hang if it's only the data pool which is impacted (since pg_files 
only operates on the metadata pool?).
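
For what it's worth, I think I can check whether a given file has its 
first object in 7.a7 by mapping the object by hand. A rough sketch 
(the mount point and path are placeholders, "cephfs_data" is only my 
guess at the data pool name, and 10001c8ce15 is the inode from the log 
below):

printf '%x\n' $(stat -c %i /mnt/cephfs/path/to/file)   # file inode in hex
ceph osd map cephfs_data 10001c8ce15.00000000          # PG and OSDs for the first object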




The ceph-log shows:

cluster [WRN] slow request 30.894832 seconds old, received at 
2019-01-20 18:00:12.555398: client_request(client.25017730:218006 
lookup #0x10001c8ce15/000001 2019-01-20 18:00:12.550421 caller_uid=0, 
caller_gid=0{}) currently failed to rdlock, waiting




Is the hang perhaps related to the OSD_BACKFILLFULL? If so, I could add 
some completely new OSDs to fix that problem. I have held off doing that 
for now as that will trigger a whole lot of data movement which might be 
unnecessary.
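
As a possible stopgap before adding OSDs, I gather I could also raise 
the cluster-wide ratios a little so backfill can proceed. A sketch 
with arbitrary placeholder values, which I haven't tried yet:

ceph osd set-nearfull-ratio 0.90
ceph osd set-backfillfull-ratio 0.92
ceph osd dump | grep ratio      # confirm the ratios recorded in the OSDMap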




Or is the hang indeed related to the missing PG?




Second, if I try to copy files out of the CEPHFS filesystem, I get a few 
hundred files and then it too hangs. None of the files I'm attempting 
to copy are listed in the pg_files output (although, since pg_files 
hangs, perhaps it hadn't got to those files yet). Again, shouldn't I be 
able to access files which are not associated with the missing data 
pool PG?




Lastly, I want to know if there is some way to recreate the WAL/DB while 
leaving the OSD data intact and/or fool one of the OSDs into thinking 
everything is OK, allowing it to serve up the data it has in the missing 
PG.




From reading the mailing list and documentation, I know that this is not 
a "safe" operation:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021713.html

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-January/024268.html




However, my current status indicates an unusable CEPHFS and limited 
access to the data. I'd like to get as much data off it as possible and 
then I expect to have to recreate it. With a combination of the backups 
I have and what I can salvage from the cluster, I should hopefully have 
most of what I need.




I know what I *should* have done, but now that I'm at this point, I 
know I'm asking for something which would never be required on a 
properly-run cluster.




If it really is not possible to get the (possibly corrupt) PG back 
again, can I get the cluster back so the remainder of the files are 
accessible?
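
(From reading around, my understanding is that the true last resort, 
once I have completely given up on the data in 7.a7, is to recreate 
the PG empty so the cluster can go healthy again, along the lines of 
"ceph osd force-create-pg 7.a7", possibly with a confirmation flag; I 
haven't verified the exact syntax on mimic and would only do this 
after exhausting every other option.)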




Currently running mimic 13.2.4 on all nodes.




Status:

$ ceph health detail - 
https://gist.github.com/kawaja/f59d231179b3186748eca19aae26bcd4

$ ceph fs get main - 
https://gist.github.com/kawaja/a7ab0b285d53dee6a950a4310be4fa5a




Any advice on where I could go from here would be greatly appreciated.




thanks,

rik.


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



