Re: Dead pool recovery - Nightmare

> On 27 October 2016 at 11:23, Ralf Zerres <ralf.zerres@xxxxxxxxxxx> wrote:
> 
> 
> Hello community,
> hello ceph developers,
>  
> My name is Ralf and I work as an IT consultant. In this particular case I am
> supporting a German customer running a 2-node Ceph cluster.
> 
> This customer is struggling with a disastrous situation in which a full pool of
> RBD data (about 12 TB of valid production data) has been lost.
> Details follow below (A: The facts; B: Things already done).
>  
> I urgently need answers to the following questions, and I am aware that any
> procedure (if it works out) will take time and money.
> We will solve this problem once we can see the right way forward. So, if you
> could point out any path, I'd love to hear from you.
> For the community, I'm willing and keen to document the process for anyone
> unlucky enough to face a comparable situation in the future.
> That said:
>  
> - Is there any realistic chance to reconstruct the data?

That depends on the case, see my questions below.

> - A filesystem data recovery tool (here: XFS) is able to restore lost+found
> folders/objects from the involved OSDs.
>   Is ceph-objectstore-tool a valid tool to export and then import these folders
> into a new pool?
> - If there is no way to get the data back into the cluster as a well-defined
> structure, I became aware of the tool rbd_restore.
>   http://ceph.com/planet/ceph-recover-a-rbd-image-from-a-dead-cluster/#more-6738
>   Is this a viable path to reconstruct an RBD image from the recovered objects
> (all present as filesystem objects in subdirectories of the recovery disk)?
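
ceph-objectstore-tool works on an OSD's object store (with the OSD stopped),
not on loose files that an XFS recovery tool has dumped into lost+found, so it
only helps if the PG directories still exist on the OSDs. If they do, the usual
pattern is roughly the following (double-check the options against the man page
of your 10.2.x build; the paths are the default Filestore locations and the
/backup path is just an example):

# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> \
    --journal-path /var/lib/ceph/osd/ceph-<id>/journal \
    --pgid <pgid> --op export --file /backup/<pgid>.export

and on the OSD you want to import into (again with that OSD stopped):

# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> \
    --journal-path /var/lib/ceph/osd/ceph-<id>/journal \
    --op import --file /backup/<pgid>.export

If all you have are recovered rbd_data.* object files on the recovery disk, the
rbd_restore approach from the blog post you linked is the more realistic path:
each object holds one fixed-size chunk of the image (4 MiB by default) and its
hex suffix is the chunk index, so the objects can be written back at the right
offsets of a sparse image file.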
>  
> Again, any help is very much appreciated.
>  
> best regards
> Ralf
>  
> PS: I will be on IRC in #ceph (dwsadmin)
>  
>  
> A) The facts
>  
> The cluster: ceph (v10.2.3), state: healthy
> State of the rbd pool in question: gone; all PGs have been deleted on the underlying

What do you mean by gone? Did somebody remove the pool from the system? If Ceph says HEALTH_OK, it seems that was the case.

# ceph osd dump|grep pool
# ceph -s

Can you post the output of both commands?

> OSDs.
> Cluster structure:
> - 3 server nodes (64 GB RAM, Opteron CPUs)
> - 2 servers acting as monitor and OSD nodes, 1 server acting as monitor only
> - 2 OSD nodes (15 OSDs each, spinning disks); journals partly on SSD
> partitions, partly on SATA partitions
> - used only for RBD
> - crushmap: takes care of storing rbd pool data in the storage buckets (pool
> size: 2); storage hosts 1 and 2 hold the replicas
>  

size = 2 is always a bad thing; please never do this again. Always run with size = 3.
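
Once you have a healthy pool again, raising the replication level is a single
command per pool, for example (the pool name is a placeholder):

# ceph osd pool set <poolname> size 3
# ceph osd pool set <poolname> min_size 2

min_size 2 blocks I/O to a PG when fewer than two copies are available, so a
write can never be acknowledged by a single copy only.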

> The cluster itself is in HEALTH_OK state.
>  
> B) Things already done
>  
> We analysed the situation and tried to make sure we do not lose any bits on the
> underlying OSD disks.
>  
> - Set cluster flags: ceph osd set noout, nodeep-scrub, noscrub;
>   the cluster state then changed, as expected, to HEALTH_WARN
> - Shut down all involved OSDs (as seen from the crushmap): systemctl stop
> ceph-osd@<osd-id>
> - Obtained and installed a professional data recovery tool that handles XFS
> filesystems (on the node, the 3ware controller does not support JBOD, so the
> disks run in RAID0 mode)
> - Dropped in new physical disks (node1: 2x 8 TB SATA) to copy out lost+found
> objects from the OSDs
> - Made a backup of all other objects of the Ceph cluster
>  
> Of course, since we are talking about roughly 12 TB of data chunks, backup and
> recovery take an awfully long time ...
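
For the record (since you want to document this for others): those flags are
set and cleared one at a time, i.e.

# ceph osd set noout
# ceph osd set noscrub
# ceph osd set nodeep-scrub

and later reverted with "ceph osd unset <flag>" once the recovery work is done.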
>  
>  
> C) References found
> - Incomplete PGs — OH MY! -> https://ceph.com/community/incomplete-pgs-oh-my/
> https://ceph.com/community/incomplete-pgs-oh-my/#comments
> - Recovering incomplete PGs ->
> http://ceph-users.ceph.narkive.com/lwDkR2fZ/recovering-incomplete-pgs-with-ceph-objectstore-tool
> - ceph-users: Recover unfound objects from crashed OSD's underlying filesystem
> -> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-February/007637.html
>  
>  
> Reference
> =========
>  
> # lscpu
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                16
> On-line CPU(s) list:   0-15
> Thread(s) per core:    2
> Core(s) per socket:    8
> Socket(s):             1
> NUMA node(s):          2
> Vendor ID:             AuthenticAMD
> CPU family:            21
> Model:                 1
> Model name:            AMD Opteron(TM) Processor 6272
> Stepping:              2
> CPU MHz:               1400.000
> CPU max MHz:           2100.0000
> CPU min MHz:           1400.0000
> BogoMIPS:              4199.99
> Virtualization:        AMD-V
> NUMA node0 CPU(s):     0-7
> NUMA node1 CPU(s):     8-15
>  
> # free
>               total        used        free      shared  buff/cache   available
> Mem:       65956972      751600      315672        1528    64889700    64383492
> Swap:      16777212           0    16777212
> 
> # tw_cli show
> 
> Ctl   Model        (V)Ports  Drives   Units   NotOpt  RRate   VRate  BBU
> ------------------------------------------------------------------------
> c2    9750-4i      16        16       16      1       1       1      OK
> 
> Enclosure     Slots  Drives  Fans  TSUnits  PSUnits  Alarms
> --------------------------------------------------------------
> /c2/e0        16     16      5     1        2        1
>  
> # ceph --version
> ceph version 10.2.3-247-g0c83eb3 (0c83eb355e989fb6ed38a3b82f9705fd5d700e89)
>  
> # ceph osd tree
> ID  WEIGHT   TYPE NAME                         UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -12        0 host xxxsrv1
>  -1 xx room server-room
>  -2 xx     rack rack-daywalker
>  -4 29.16936         storage data
>  -6 14.29945             host xxxsrv1-data
>   9  1.70000                 osd.9                down  1.00000          1.00000
>  18  1.79990                 osd.18               down  1.00000          1.00000
>  19  1.79990                 osd.19               down  1.00000          1.00000
>  22  1.79990                 osd.22               down  1.00000          1.00000
>   1  1.79990                 osd.1                down  1.00000          1.00000
>   0  1.79990                 osd.0                down  1.00000          1.00000
>  12  1.79999                 osd.12               down  1.00000          1.00000
>  25  1.79999                 osd.25               down  1.00000          1.00000
>  -7 14.86990             host xxxsrv2-data
>   3  1.79999                 osd.3                  up  1.00000          1.00000
>  11  1.79999                 osd.11                 up  1.00000          1.00000
>  13  1.79999                 osd.13                 up  1.00000          1.00000
>   4  1.79999                 osd.4                  up  1.00000          1.00000
>  20  1.79999                 osd.20                 up  1.00000          1.00000
>  21  1.79999                 osd.21                 up  1.00000          1.00000
>  23  2.26999                 osd.23                 up  1.00000          1.00000
>  24  1.79999                 osd.24                 up  1.00000          1.00000
>  -5 14.49991         storage archive
>  -8  8.99994             host xxxsrv1-archive
>   7  0.89998                 osd.7                down  1.00000          1.00000
>   8  0.89998                 osd.8                down  1.00000          1.00000
>  10  3.59999                 osd.10               down  1.00000          1.00000
>  26  3.59999                 osd.26               down  1.00000          1.00000
>  -9  5.49997             host xxxsrv2-archive
>   5  0.89999                 osd.5                  up  1.00000          1.00000
>   2  3.50000                 osd.2                  up  1.00000          1.00000
>   6  0.89998                 osd.6                  up  1.00000          1.00000
>  17  0.20000                 osd.17                 up  1.00000          1.00000
>  
> #  ceph osd crush rule dump vdi-data
> {
>     "rule_id": 3,
>     "rule_name": "vdi-data",
>     "ruleset": 3,
>     "type": 1,
>     "min_size": 1,
>     "max_size": 10,
>     "steps": [
>         {
>             "op": "take",
>             "item": -4,
>             "item_name": "data"
>         },
>         {
>             "op": "chooseleaf_firstn",
>             "num": 0,
>             "type": "host"
>         },
>         {
>             "op": "emit"
>         }
>     ]
> }
>  
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



