We're looking for some assistance recovering data from a failed Ceph cluster, or some help determining whether it is even possible to recover any data.

Background:

- We were using Ceph with Proxmox following the instructions Proxmox provides (https://pve.proxmox.com/wiki/Ceph_Server), which seem fairly close to the Ceph recommendations except that the storage is on the same physical systems the virtual machines run on.
- Some of our Proxmox nodes use ZFS, and there is a rare bug where ZFS + Proxmox clustering can result in Proxmox hanging indefinitely.
- We were using HA on our Proxmox nodes, which means that when they hang they are automatically rebooted (hard).
- Hard reboots are bad for file systems.
- Hard reboots mean that Ceph tries to recover, which means more systems hitting the bug, followed by more restarts and general mayhem.

We first ran into issues overnight, and at some point during the process the file system on one of the OSDs was corrupted. We managed to stabilize the systems; however, we've not been able to recover the critical data from the pool (about 5-10% of it).

Current cluster health:

    cluster 537a3e12-95d8-48c3-9e82-91abbfdf62e0
     health HEALTH_WARN
            5 pgs degraded
            8 pgs down
            48 pgs incomplete
            3 pgs recovering
            1 pgs recovery_wait
            76 pgs stale
            5 pgs stuck degraded
            48 pgs stuck inactive
            76 pgs stuck stale
            53 pgs stuck unclean
            5 pgs stuck undersized
            5 pgs undersized
            74 requests are blocked > 32 sec
            recovery 14656/6951979 objects degraded (0.211%)
            recovery 20585/6951979 objects misplaced (0.296%)
            recovery 5/3348270 unfound (0.000%)
     monmap e7: 7 mons at {0=10.11.0.126:6789/0,1=10.11.0.125:6789/0,2=10.11.0.124:6789/0,3=10.11.0.123:6789/0,4=10.11.0.122:6789/0,5=10.11.0.119:6789/0,6=10.11.0.121:6789/0}
            election epoch 482, quorum 0,1,2,3,4,5,6 5,6,4,3,2,1,0
     osdmap e15746: 16 osds: 16 up, 16 in; 5 remapped pgs
      pgmap v10200890: 3072 pgs, 3 pools, 12914 GB data, 3269 kobjects
            26923 GB used, 23327 GB / 50250 GB avail
            14656/6951979 objects degraded (0.211%)
            20585/6951979 objects misplaced (0.296%)
            5/3348270 unfound (0.000%)
                2943 active+clean
                  76 stale+active+clean
                  40 incomplete
                   8 down+incomplete
                   3 active+recovering+undersized+degraded+remapped
                   1 active+recovery_wait+undersized+degraded+remapped
                   1 active+undersized+degraded+remapped

There are two RBDs we are looking to recover (out of about 130), totalling about 200 GB of data. Those RBDs do not appear to be using any of the PGs which are incomplete or down, but they do seem to use ones which are stale+active+clean, so any read from the mapped RBD blocks indefinitely.

We were looking at http://ceph.com/community/incomplete-pgs-oh-my/ as a means of recovering the incomplete PGs, since the complete copies do seem to be on the corrupted OSD and most or all of them were exportable without issue; however, I'm not sure if this is the correct way to go or if I should be looking at something else. Rough versions of the commands we've been working with are below.
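To work out which PGs each image touches, we did something along these lines; the pool name, image name and object prefix below are placeholders, not our real ones:

    # Find the object name prefix for the image (names here are examples)
    rbd info rbd/vm-disk-to-recover
    #   ... block_name_prefix: rbd_data.1a2b3c4d5e6f ...

    # Map each of the image's objects to its PG and acting OSDs.
    # Listing a pool this size is slow, and may itself block on stuck PGs.
    rados -p rbd ls | grep '^rbd_data.1a2b3c4d5e6f' | while read obj; do
        ceph osd map rbd "$obj"
    done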
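To see which PGs are actually stuck and what they are waiting for, commands along these lines (the PG ID is just an example):

    # List the PGs that are stuck stale / inactive
    ceph pg dump_stuck stale
    ceph pg dump_stuck inactive

    # Ask a specific PG for its state, peering history and the OSDs it wants
    ceph pg 2.1ab query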
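The export/import procedure from that blog post, as we understood it, looks roughly like the following. The OSD numbers, PG ID and paths are examples only, and each OSD has to be stopped while ceph-objectstore-tool runs against its store. This is the part I'd most like a sanity check on before we touch the live cluster again:

    # On the OSD that still holds a complete copy of the PG (OSD stopped)
    ceph-objectstore-tool --op export --pgid 2.1ab \
        --data-path /var/lib/ceph/osd/ceph-3 \
        --journal-path /var/lib/ceph/osd/ceph-3/journal \
        --file /root/pg-exports/2.1ab.export

    # On the OSD that should own the PG but reports it incomplete (OSD stopped):
    # remove the broken copy, then import the exported one
    ceph-objectstore-tool --op remove --pgid 2.1ab \
        --data-path /var/lib/ceph/osd/ceph-5 \
        --journal-path /var/lib/ceph/osd/ceph-5/journal
    ceph-objectstore-tool --op import \
        --data-path /var/lib/ceph/osd/ceph-5 \
        --journal-path /var/lib/ceph/osd/ceph-5/journal \
        --file /root/pg-exports/2.1ab.export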