Re: Salvage CEPHFS after lost PG

Thanks Marc,

When I next have physical access to the cluster, I’ll add some more OSDs. Would that cause the hanging though?

No takers on the bluestore salvage?

thanks,
rik.

On 20 Jan 2019, at 20:36, Marc Roos <M.Roos@xxxxxxxxxxxxxxxxx> wrote:


If you have a backfillfull OSD, no PGs will be able to migrate.
It is better to just add hard drives, because at least one of your
OSDs is too full.

I know you can set the backfillfull ratios with commands like these:
ceph tell osd.* injectargs '--mon_osd_full_ratio=0.970000'
ceph tell osd.* injectargs '--mon_osd_backfillfull_ratio=0.950000'

ceph tell osd.* injectargs '--mon_osd_full_ratio=0.950000'
ceph tell osd.* injectargs '--mon_osd_backfillfull_ratio=0.900000'
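
I think on Luminous and later the same thresholds can also be raised
cluster-wide on the mons (values below are only examples, and raising
them only buys temporary headroom, it does not fix a full OSD):

ceph osd set-backfillfull-ratio 0.95
ceph osd set-full-ratio 0.97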

Or maybe decrease the weight of the full OSD. Check the OSDs with
'ceph osd status' and make sure your nodes have an even distribution
of the storage.
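
For example, something like this ('3' is just a placeholder for the id
of whichever OSD is too full):

ceph osd df tree
ceph osd reweight 3 0.90

'ceph osd crush reweight osd.3 <weight>' is the more permanent variant,
but either way expect some data movement.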

-----Original Message-----
From: Rik [mailto:rik@xxxxxxxxxx]
Sent: zondag 20 januari 2019 8:47
To: ceph-users@xxxxxxxxxxxxxx
Subject: Salvage CEPHFS after lost PG

Hi all,

I'm looking for some suggestions on how to do something inappropriate.

In a nutshell, I've lost the WAL/DB for three bluestore OSDs on a small
cluster and, as a result of those three OSDs going offline, I've lost a
placement group (7.a7). How I achieved this feat is an embarrassing
mistake, which I don't think has bearing on my question.

The OSDs were created a few months ago with ceph-deploy:

/usr/local/bin/ceph-deploy --overwrite-conf osd create --bluestore
--data /dev/vdc1 --block-db /dev/vdf1 ceph-a
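
For what it's worth, the bluestore label on each device can still be
read with something like:

ceph-bluestore-tool show-label --dev /dev/vdc1
ceph-bluestore-tool show-label --dev /dev/vdf1

which should at least show what is still readable on each device and
which role (main data vs block.db) it played for the OSD.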

With the 3 OSDs out, I'm sitting at OSD_BACKFILLFULL.

First, PG 7.a7 belongs to the data pool rather than the metadata pool,
and if I run "cephfs-data-scan pg_files / 7.a7" I get a list of 4149
files/objects, but then it hangs. I don't understand why this would
hang if it's only the data pool which is impacted (since pg_files only
operates on the metadata pool?).

The ceph-log shows:

cluster [WRN] slow request 30.894832 seconds old, received at
2019-01-20 18:00:12.555398: client_request(client.25017730:218006
lookup #0x10001c8ce15/000001 2019-01-20 18:00:12.550421 caller_uid=0,
caller_gid=0{}) currently failed to rdlock, waiting

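One way to see what such a request is actually blocked on is the MDS
admin socket (the mds name below is a placeholder), e.g.:

ceph daemon mds.<name> dump_ops_in_flight
ceph daemon mds.<name> objecter_requests

The second command should show whether the MDS itself has OSD
operations stuck against a down or full PG.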

Is the hang perhaps related to the OSD_BACKFILLFULL? If so, I could add
some completely new OSDs to fix that problem. I have held off doing that
for now as that will trigger a whole lot of data movement which might be
unnecessary.
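
If it comes to that, my understanding is that the immediate data
movement can be held off by setting the norebalance/nobackfill flags
before adding the OSDs, and unsetting them later:

ceph osd set norebalance
ceph osd set nobackfill
(add the new OSDs)
ceph osd unset nobackfill
ceph osd unset norebalance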

Or is the hang indeed related to the missing PG?

Second, if I try to copy files out of the CEPHFS filesystem, I get a
few hundred files and then it too hangs. None of the files I'm
attempting to copy are listed in the pg_files output (although since
pg_files hangs, perhaps it hadn't got to those files yet). Again,
shouldn't I be able to access files which are not associated with the
missing data pool PG?
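
As a sanity check on individual files, it should be possible to work
out which PG a file's first object lands in from its inode number (the
first object is named <inode-in-hex>.00000000). With placeholder path
and pool name:

ino=$(stat -c %i /mnt/cephfs/some/file)
printf -v obj '%x.00000000' "$ino"
ceph osd map cephfs_data "$obj"

If that reports pg 7.a7 for one of the hanging files, the hang would
at least be explained.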

Lastly, I want to know if there is some way to recreate the WAL/DB while
leaving the OSD data intact and/or fool one of the OSDs into thinking
everything is OK, allowing it to serve up the data it has in the missing
PG.
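
One thing I can at least check is whether the main data device is
still readable at all without its DB, e.g.:

ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-<id>

I'd expect this to fail if the RocksDB metadata only lived on the lost
device, but the error output might say more about what is left.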

From reading the mailing list and documentation, I know that this is not
a "safe" operation:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021713.html

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-January/024268.html

However, my current status indicates an unusable CEPHFS and limited
access to the data. I'd like to get as much data off it as possible and
then I expect to have to recreate it. With a combination of the backups
I have and what I can salvage from the cluster, I should hopefully have
most of what I need.

I know what I *should* have done, but now that I'm at this point, I
know I'm asking for something which would never be required on a
properly-run cluster.

If it really is not possible to get the (possibly corrupt) PG back
again, can I get the cluster back so the remainder of the files are
accessible?
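
My understanding is that, if the PG really is unrecoverable, the usual
last resort is to accept the loss and recreate it empty so I/O against
it stops blocking, something like:

ceph osd force-create-pg 7.a7

(newer releases may also want a --yes-i-really-mean-it flag), with
whatever data lived in 7.a7 then gone for good. I'd obviously prefer
to avoid that if there is any chance of salvage.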

Currently running mimic 13.2.4 on all nodes.

Status:

$ ceph health detail -
https://gist.github.com/kawaja/f59d231179b3186748eca19aae26bcd4

$ ceph fs get main -
https://gist.github.com/kawaja/a7ab0b285d53dee6a950a4310be4fa5a

Any advice on where I could go from here would be greatly appreciated.

thanks,

rik.


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
