Hello community,
hello ceph developers,
My name is Ralf, and I work as an IT consultant. In this particular case I am supporting a German customer running a 2-node Ceph cluster.
This customer is struggling with a disastrous situation: a full pool of RBD data (about 12 TB of valid production data) is lost. Details follow below (A: The facts; B: Things already done).
I urgently need answers to the following questions, being well aware that any procedure (if it works out) will take time and money.
We will solve this problem once there is light showing the right way. So if you can point out any path in that direction, I'd love to hear from you.
For the community, I am willing and keen to document the outcome for anyone unlucky enough to face a comparable situation in the future.
That said:
- Is there any realistic chance to reconstruct the data?
- A filesystem data-recovery tool (here: for XFS) is able to restore lost+found folders/objects from the involved OSDs.
Is ceph-objectstore-tool a valid tool to export -> import these folders into a new pool? In case there is no way to get them back into a cluster as a well-defined structure, I became aware of the tool rbd_restore:
http://ceph.com/planet/ceph-recover-a-rbd-image-from-a-dead-cluster/#more-6738
Is this a viable path to reconstruct an RBD image from the recovered objects (all present as filesystem objects in subpaths of the recovery disk)?
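To make that question more concrete, this is roughly the kind of reassembly I have in mind (a minimal sketch only, assuming a format-2 image with the default 4 MB object size, and assuming the recovered object files have been renamed/normalized to <prefix>.<hex-index>; the image id, paths and file names below are placeholders, not our real ones):

PREFIX=rbd_data.1234567890ab      # placeholder image id
OBJ_SIZE=$((4 * 1024 * 1024))     # default RBD object size (order 22)
OUT=/mnt/recovery/restored.img    # placeholder output path

for f in /mnt/recovery/objects/${PREFIX}.*; do
    idx_hex=${f##*.}              # last dot-separated field = object index (hex)
    idx=$((16#$idx_hex))          # hex -> decimal
    # write each recovered object at its offset inside the raw image
    dd if="$f" of="$OUT" bs=$OBJ_SIZE seek=$idx conv=notrunc 2>/dev/null
done

The result would be a raw image that could be loop-mounted for a sanity check and, if intact, pushed into a new pool with "rbd import".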
Again, any help is appreciated very much
best regards
Ralf
PS: I will be in IRC on #ceph (dwsadmin)
A) The facts
The cluster: ceph (v10.2.3), state: healthy
State of the rbd pool in question: gone, all PGs have been deleted on the underlying OSDs
Cluster structure:
- 3 server nodes (64 GB RAM, Opteron CPUs)
- 2 servers acting as monitor and OSD node, 1 server acting as monitor
- 2 OSD nodes (15 OSDs each, spinning disks), journals: partly on SSD partitions, partly on SATA partitions
- used only for RBD
- crushmap: takes care that the rbd pool's data is stored in the storage buckets (pool size: 2); storage host1 and host2 hold the replicas
B) Things already done
We analysed the situation and tried to make sure not to lose any bits on the underlying OSD disks:
- Set cluster flags: ceph osd set noout, nodeep-scrub, noscrub (exact commands are sketched at the end of this section);
  the cluster state then changed, as expected, to HEALTH_WARN
- Shut down all involved OSDs (as seen from the crushmap) via: systemctl stop ceph-osd@<osd-id>
- Got and installed a professional data-recovery tool that handles XFS filesystems (on that node the 3Ware controller does not support JBOD, so the disks run in RAID0 mode)
- Dropped in new physical disks (node1: 2x 8 TB SATA) to copy out the lost+found objects from the OSDs
- Made a backup of all other objects of the Ceph cluster
Of course, since we are talking about roughly 12 TB of data chunks, backup and recovery take an awfully long time ...
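For reference, the freeze was done with commands along these lines (generic sketch; <osd-id> stands for each involved OSD id):

# prevent rebalancing and scrubbing while we work on the disks
ceph osd set noout
ceph osd set noscrub
ceph osd set nodeep-scrub

# stop the involved OSD daemons, repeated per OSD id
systemctl stop ceph-osd@<osd-id>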
C) References found
- Incomplete PGs — OH MY! -> https://ceph.com/community/incomplete-pgs-oh-my/#comments
- Recovering incomplete PGs -> http://ceph-users.ceph.narkive.com/lwDkR2fZ/recovering-incomplete-pgs-with-ceph-objectstore-tool
- ceph-users: Recover unfound objects from crashed OSD's underlying filesystem -> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-February/007637.html
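Based on these references, the export/import pattern I would consider looks roughly like this (a sketch only, assuming the PG directories still exist or can be restored on the OSD's filestore; <id>, <pgid> and the paths are placeholders):

# on the source OSD (daemon stopped): export one PG
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> \
    --journal-path /var/lib/ceph/osd/ceph-<id>/journal \
    --pgid <pgid> --op export --file /mnt/recovery/<pgid>.export

# on the target OSD (daemon stopped): import it again
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> \
    --journal-path /var/lib/ceph/osd/ceph-<id>/journal \
    --op import --file /mnt/recovery/<pgid>.export

My open question remains whether this helps at all when the PG directories themselves are gone and only lost+found objects are left.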
Reference
=========
# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             1
NUMA node(s):          2
Vendor ID:             AuthenticAMD
CPU family:            21
Model:                 1
Model name:            AMD Opteron(TM) Processor 6272
Stepping:              2
CPU MHz:               1400.000
CPU max MHz:           2100.0000
CPU min MHz:           1400.0000
BogoMIPS:              4199.99
Virtualization:        AMD-V
NUMA node0 CPU(s):     0-7
NUMA node1 CPU(s):     8-15

# free
              total        used        free      shared  buff/cache   available
Mem:       65956972      751600      315672        1528    64889700    64383492
Swap:      16777212           0    16777212

# tw-cli show
Ctl   Model    (V)Ports  Drives  Units  NotOpt  RRate  VRate  BBU
------------------------------------------------------------------------
c2    9750-4i  16        16      16     1       1      1      OK

Enclosure  Slots  Drives  Fans  TSUnits  PSUnits  Alarms
--------------------------------------------------------------
/c2/e0     16     16      5     1        2        1

# ceph --version
ceph version 10.2.3-247-g0c83eb3 (0c83eb355e989fb6ed38a3b82f9705fd5d700e89)
# ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-12        0 host xxxsrv1
 -1       xx room server-room
 -2       xx rack rack-daywalker
 -4 29.16936 storage data
 -6 14.29945     host xxxsrv1-data
  9  1.70000         osd.9        down  1.00000  1.00000
 18  1.79990         osd.18       down  1.00000  1.00000
 19  1.79990         osd.19       down  1.00000  1.00000
 22  1.79990         osd.22       down  1.00000  1.00000
  1  1.79990         osd.1        down  1.00000  1.00000
  0  1.79990         osd.0        down  1.00000  1.00000
 12  1.79999         osd.12       down  1.00000  1.00000
 25  1.79999         osd.25       down  1.00000  1.00000
 -7 14.86990     host xxxsrv2-data
  3  1.79999         osd.3        up    1.00000  1.00000
 11  1.79999         osd.11       up    1.00000  1.00000
 13  1.79999         osd.13       up    1.00000  1.00000
  4  1.79999         osd.4        up    1.00000  1.00000
 20  1.79999         osd.20       up    1.00000  1.00000
 21  1.79999         osd.21       up    1.00000  1.00000
 23  2.26999         osd.23       up    1.00000  1.00000
 24  1.79999         osd.24       up    1.00000  1.00000
 -5 14.49991 storage archive
 -8  8.99994     host xxxsrv1-archive
  7  0.89998         osd.7        down  1.00000  1.00000
  8  0.89998         osd.8        down  1.00000  1.00000
 10  3.59999         osd.10       down  1.00000  1.00000
 26  3.59999         osd.26       down  1.00000  1.00000
 -9  5.49997     host xxxsrv2-archive
  5  0.89999         osd.5        up    1.00000  1.00000
  2  3.50000         osd.2        up    1.00000  1.00000
  6  0.89998         osd.6        up    1.00000  1.00000
 17  0.20000         osd.17       up    1.00000  1.00000

# ceph osd crush rule dump vdi-data
{
    "rule_id": 3,
    "rule_name": "vdi-data",
    "ruleset": 3,
    "type": 1,
    "min_size": 1,
    "max_size": 10,
    "steps": [
        {
            "op": "take",
            "item": -4,
            "item_name": "data"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}