> Wido den Hollander <wido@xxxxxxxx> wrote on 27 October 2016 at 12:37:
>
> Bringing this back to the list
>
> > On 27 October 2016 at 12:08, Ralf Zerres <ralf.zerres@xxxxxxxxxxx> wrote:
> >
> > > Wido den Hollander <wido@xxxxxxxx> wrote on 27 October 2016 at 11:51:
> > >
> > > > On 27 October 2016 at 11:46, Ralf Zerres <hostmaster@xxxxxxxxxxx> wrote:
> > > >
> > > > Here we go ...
> > > >
> > > > > Wido den Hollander <wido@xxxxxxxx> wrote on 27 October 2016 at 11:35:
> > > > >
> > > > > > On 27 October 2016 at 11:23, Ralf Zerres <ralf.zerres@xxxxxxxxxxx> wrote:
> > > > > >
> > > > > > Hello community,
> > > > > > hello ceph developers,
> > > > > >
> > > > > > My name is Ralf, working as an IT consultant. In this particular case I
> > > > > > support a German customer running a 2-node Ceph cluster.
> > > > > >
> > > > > > This customer is struggling with a disastrous situation: a full pool of
> > > > > > rbd data (about 12 TB of valid production data) is lost.
> > > > > > Details follow below (A: The facts; B: Things already done).
> > > > > >
> > > > > > I urgently need answers to the following questions, being aware that any
> > > > > > procedure (if it works out) will take time and money.
> > > > > > We will solve this problem once we can see the right way to go. So if you
> > > > > > can point out any path, I'd love to hear from you.
> > > > > > For the community I'm willing and keen to document it for anyone unlucky
> > > > > > enough to face a comparable situation in the future.
> > > > > > That said:
> > > > > >
> > > > > > - Is there any realistic chance to reconstruct the data?
> > > > >
> > > > > That depends on the case, see my questions below.
> > > > >
> > > > > > - A filesystem data-recovery tool (here: XFS) is able to restore lost+found
> > > > > > folders/objects from the involved OSDs.
> > > > > > Is ceph-objectstore-tool a valid tool to export -> import these folders
> > > > > > into a new pool?
> > > > > > - If there is no way to get the data back into a cluster as a well-defined
> > > > > > structure, I became aware of the tool rbd_restore.
> > > > > > http://ceph.com/planet/ceph-recover-a-rbd-image-from-a-dead-cluster/#more-6738
> > > > > > Is this a viable path to reconstruct an rbd image from the recovered
> > > > > > objects (all present as filesystem objects in subdirectories of the
> > > > > > recovery disk)?
> > > > > >
> > > > > > Again, any help is appreciated very much
> > > > > >
> > > > > > best regards
> > > > > > Ralf
> > > > > >
> > > > > > PS: I will be in IRC on #ceph (dwsadmin)
> > > > > >
> > > > > >
> > > > > > A) The facts
> > > > > >
> > > > > > The cluster: ceph (v10.2.3), state: healthy
> > > > > > State of rbd-pool in question: gone, all PGs are deleted on the underlying
> > > > >
> > > > > What do you mean by "gone"? Did somebody remove the pool from the system?
> > > > > If Ceph says HEALTH_OK it seems that was the case.
> > > > >
> > > > > # ceph osd dump|grep pool
> > > > > # ceph -s
> > > > >
> > > > > Can you post the output of both commands?
> > > >
> > > > ok: I did stop the monitor and the relevant OSDs on xxxsrv1 (because we are
> > > > getting the blocks out with XFS recovery)
> > > >
> > > > # ceph -s
> > > >     cluster 3d9571c0-b86c-4b6c-85b6-dc0a7aa8923b
> > > >      health HEALTH_WARN
> > > >             2376 pgs degraded
> > > >             2376 pgs stuck unclean
> > > >             2376 pgs undersized
> > > >             recovery 1136266/2272532 objects degraded (50.000%)
> > > >             16/29 in osds are down
> > > >             noout,noscrub,nodeep-scrub flag(s) set
> > > >             1 mons down, quorum 1,2 xxxsrv2,xxxsrv3
> > > >      monmap e21: 3 mons at {xxxsrv1=ip:6789/0,xxxsrv2=ip:6789/0,xxxsrv3=ip:6789/0}
> > > >             election epoch 1667830, quorum 1,2 dwssrv2,dwssrv3
> > > >       fsmap e109117: 0/0/1 up
> > > >      osdmap e107820: 29 osds: 13 up, 29 in; 2376 remapped pgs
> > > >             flags noout,noscrub,nodeep-scrub
> > > >       pgmap v48473784: 2376 pgs, 6 pools, 4421 GB data, 1109 kobjects
> > > >             8855 GB used, 37888 GB / 46827 GB avail
> > > >             1136266/2272532 objects degraded (50.000%)
> > > >                 2376 active+undersized+degraded
> > > >
> > > > # ceph osd dump
> > > > pool 0 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 192 pgp_num 192 last_change 58521 crash_replay_interval 45 min_read_recency_for_promote 1 min_write_recency_for_promote 1 stripe_width 0
> > > > pool 1 'metadata' replicated size 2 min_size 1 crush_ruleset 1 object_hash rjenkins pg_num 192 pgp_num 192 last_change 58522 min_read_recency_for_promote 1 min_write_recency_for_promote 1 stripe_width 0
> > >
> > > The pool *rbd* is missing here. This has been deleted by somebody or some
> > > application, but the fact is that it is no longer there.
> > >
> > > The simple fact now is that the data is gone, really gone. I hope you have
> > > some good backups, since Ceph no longer has your data. There is NO way to
> > > get this back.
> > >
> > > For future reference, you can set the 'mon_allow_pool_delete' setting to
> > > 'false' in the [mon] section in ceph.conf to prevent pool deletion from
> > > happening and/or set the nodelete flag on a pool:
> > >
> > > # ceph osd pool set rbd nodelete true
> > >
> > > This is an additional safeguard against removing a pool.
> > >
> > > But in your situation now, the pool rbd is gone. It was removed by somebody
> > > and not by accident by Ceph itself.
> > >
> > > Sorry to bring you this bad news, but it's just not there anymore.
> > >
> > > Wido
> >
> > really gone data. YES
> > And it wasn't a malfunction of Ceph. YES
>
> That wasn't clear from the first e-mail you sent.

sorry for that.
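Side note for the archive: a minimal sketch of the two safeguards Wido describes
above, as I understand them. The ceph.conf fragment belongs on the monitor nodes
(it assumes a [mon] section is present), and the loop is just one way to apply
the nodelete flag to every pool the cluster currently reports:

[mon]
mon_allow_pool_delete = false

# for pool in $(ceph osd pool ls); do ceph osd pool set "$pool" nodelete true; done

The monitors have to pick up the changed option (e.g. via a restart) before it
protects anything; the nodelete flag takes effect immediately.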
> > I don't want to and can't discuss who deleted the pool and how
> > (ceph osd pool delete <poolname> <poolname> --yes-i-really-really-mean-it).
> > The data was deleted structurally; it might still be recoverable on the
> > underlying filesystem of the involved OSDs.
> > The question is: if an XFS restore program (I found and bought
> > 'r-explerer-pro') is able to restore the PG folders (<PG-ID>.<int> ends up
> > under $LostFiles/$Group<int>/$Folder<int>), is there any way to make this
> > data valuable again?
>
> Maybe, if you find those objects you might be partially able to restore a
> block device, but the chances are slim. Even if you are missing just a few
> objects you could have a broken filesystem which will not mount anymore.
>
> Wido

ok. but slim is better than nothing.
The question is which steps are needed now.
I am already trying to restore as much data as possible, since this takes a very
long time: 1) scan, 2) restore to a filesystem on a new disk.
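For what it's worth, below is a minimal sketch of how the rbd_restore approach
referenced above could stitch one image back together from recovered object
files. Everything specific in it is hypothetical: the block-name prefix, the
output path, the assumption of format-2 objects with the default 4 MiB object
size, and the assumption that the OSD filename suffixes have already been
stripped so the files are named rbd_data.<id>.<16-hex-digit number>.

#!/bin/bash
# Hypothetical sketch: reassemble one rbd image from recovered object files.
OBJ_SIZE=$((4 * 1024 * 1024))           # default rbd object size (order 22)
PREFIX="rbd_data.123456789abcdef"       # hypothetical block_name_prefix of the image
OUT="/mnt/recovery/restored-image.raw"  # raw image to build; unwritten ranges stay sparse

for f in "${PREFIX}".*; do
    [ -e "$f" ] || continue             # skip if the glob matched nothing
    idx_hex=${f##*.}                    # 16-digit hex object number from the filename
    idx=$((16#$idx_hex))                # decimal position of the object in the image
    # place the object at its offset without truncating what is already there
    dd if="$f" of="$OUT" bs="$OBJ_SIZE" seek="$idx" conv=notrunc
done

Any object that was not recovered simply stays a hole of zeros in the raw file,
which is exactly why Wido warns that the filesystem inside may not mount even if
only a few objects are missing; the result should first be attached read-only
(losetup -r) and checked before anything relies on it.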
> > > > pool 13 'archive' replicated size 2 min_size 1 crush_ruleset 4 object_hash rjenkins pg_num 256 pgp_num 256 last_change 92699 min_read_recency_for_promote 1 min_write_recency_for_promote 1 stripe_width 0
> > > > pool 16 'production' replicated size 2 min_size 1 crush_ruleset 3 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 85051 lfor 85050 flags hashpspool min_write_recency_for_promote 1 stripe_width 0
> > > >
> > > > > > OSDs
> > > > > >
> > > > > > Cluster structure:
> > > > > > - 3 server nodes (64 GB RAM, Opteron CPUs)
> > > > > > - 2 servers acting as monitor and OSD node, 1 server acting as monitor only
> > > > > > - 2 OSD nodes (15 OSDs each, spinning disks), journals partly on SSD
> > > > > > partitions, partly on SATA partitions
> > > > > > - used for rbd only
> > > > > > - crushmap: takes care to store rbd-pool data in the storage buckets (pool
> > > > > > size: 2); storage host1 and host2 take the replicas
> > > > >
> > > > > size = 2 is always a bad thing, please, never do this again. Always run
> > > > > with size = 3.
> > > > >
> > > > > > The cluster itself is in HEALTH_OK state.
> > > > > >
> > > > > > B) Things already done
> > > > > >
> > > > > > We analysed the situation and tried to make sure not to lose any bits on
> > > > > > the underlying OSD disks:
> > > > > >
> > > > > > - Cluster activity: ceph osd set noout, noscrub, nodeep-scrub;
> > > > > > the cluster state then changed, as expected, to HEALTH_WARN
> > > > > > - shut down all involved OSDs (as seen from the crushmap):
> > > > > > systemctl stop ceph-osd@<osd-id>
> > > > > > - got and installed a professional data-recovery tool that handles XFS
> > > > > > filesystems (on the node the 3Ware controller does not support JBOD, so
> > > > > > the disks run in RAID0 mode)
> > > > > > - dropped in new physical disks (node1: 2x 8TB SATA) to copy out the
> > > > > > lost+found objects from the OSDs
> > > > > > - made a backup of all other objects of the Ceph cluster
> > > > > >
> > > > > > Of course, since we are talking about roughly 12 TB of data, backup and
> > > > > > recovery take an awfully long time ....
> > > > > >
> > > > > > C) References found
> > > > > > - Incomplete PGs — OH MY!
> > > > > > -> https://ceph.com/community/incomplete-pgs-oh-my/
> > > > > >    https://ceph.com/community/incomplete-pgs-oh-my/#comments
> > > > > > - Recovering incomplete PGs
> > > > > > -> http://ceph-users.ceph.narkive.com/lwDkR2fZ/recovering-incomplete-pgs-with-ceph-objectstore-tool
> > > > > > - ceph-users: Recover unfound objects from crashed OSD's underlying filesystem
> > > > > > -> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-February/007637.html
> > > > > >
> > > > > >
> > > > > > Reference
> > > > > > =========
> > > > > >
> > > > > > # lscpu
> > > > > > Architecture:          x86_64
> > > > > > CPU op-mode(s):        32-bit, 64-bit
> > > > > > Byte Order:            Little Endian
> > > > > > CPU(s):                16
> > > > > > On-line CPU(s) list:   0-15
> > > > > > Thread(s) per core:    2
> > > > > > Core(s) per socket:    8
> > > > > > Socket(s):             1
> > > > > > NUMA node(s):          2
> > > > > > Vendor ID:             AuthenticAMD
> > > > > > CPU family:            21
> > > > > > Model:                 1
> > > > > > Model name:            AMD Opteron(TM) Processor 6272
> > > > > > Stepping:              2
> > > > > > CPU MHz:               1400.000
> > > > > > CPU max MHz:           2100.0000
> > > > > > CPU min MHz:           1400.0000
> > > > > > BogoMIPS:              4199.99
> > > > > > Virtualization:        AMD-V
> > > > > > NUMA node0 CPU(s):     0-7
> > > > > > NUMA node1 CPU(s):     8-15
> > > > > >
> > > > > > # free
> > > > > >               total      used     free   shared  buff/cache  available
> > > > > > Mem:       65956972    751600   315672     1528    64889700   64383492
> > > > > > Swap:      16777212         0 16777212
> > > > > >
> > > > > > # tw-cli show
> > > > > >
> > > > > > Ctl  Model    (V)Ports  Drives  Units  NotOpt  RRate  VRate  BBU
> > > > > > ------------------------------------------------------------------------
> > > > > > c2   9750-4i  16        16      16     1       1      1      OK
> > > > > >
> > > > > > Enclosure  Slots  Drives  Fans  TSUnits  PSUnits  Alarms
> > > > > > --------------------------------------------------------------
> > > > > > /c2/e0     16     16      5     1        2        1
> > > > > >
> > > > > > # ceph --version
> > > > > > ceph version 10.2.3-247-g0c83eb3 (0c83eb355e989fb6ed38a3b82f9705fd5d700e89)
> > > > > >
> > > > > > # ceph osd tree
> > > > > > ID  WEIGHT    TYPE NAME               UP/DOWN  REWEIGHT  PRIMARY-AFFINITY
> > > > > > -12         0 host xxxsrv1
> > > > > >  -1        xx room server-room
> > > > > >  -2        xx rack rack-daywalker
> > > > > >  -4  29.16936 storage data
> > > > > >  -6  14.29945     host xxxsrv1-data
> > > > > >   9   1.70000         osd.9              down   1.00000           1.00000
> > > > > >  18   1.79990         osd.18             down   1.00000           1.00000
> > > > > >  19   1.79990         osd.19             down   1.00000           1.00000
> > > > > >  22   1.79990         osd.22             down   1.00000           1.00000
> > > > > >   1   1.79990         osd.1              down   1.00000           1.00000
> > > > > >   0   1.79990         osd.0              down   1.00000           1.00000
> > > > > >  12   1.79999         osd.12             down   1.00000           1.00000
> > > > > >  25   1.79999         osd.25             down   1.00000           1.00000
> > > > > >  -7  14.86990     host xxxsrv2-data
> > > > > >   3   1.79999         osd.3                up   1.00000           1.00000
> > > > > >  11   1.79999         osd.11               up   1.00000           1.00000
> > > > > >  13   1.79999         osd.13               up   1.00000           1.00000
> > > > > >   4   1.79999         osd.4                up   1.00000           1.00000
> > > > > >  20   1.79999         osd.20               up   1.00000           1.00000
> > > > > >  21   1.79999         osd.21               up   1.00000           1.00000
> > > > > >  23   2.26999         osd.23               up   1.00000           1.00000
> > > > > >  24   1.79999         osd.24               up   1.00000           1.00000
> > > > > >  -5  14.49991 storage archive
> > > > > >  -8   8.99994     host xxxsrv1-archive
> > > > > >   7   0.89998         osd.7              down   1.00000           1.00000
> > > > > >   8   0.89998         osd.8              down   1.00000           1.00000
> > > > > >  10   3.59999         osd.10             down   1.00000           1.00000
> > > > > >  26   3.59999         osd.26             down   1.00000           1.00000
> > > > > >  -9   5.49997     host xxxsrv2-archive
> > > > > >   5   0.89999         osd.5                up   1.00000           1.00000
> > > > > >   2   3.50000         osd.2                up   1.00000           1.00000
> > > > > >   6   0.89998         osd.6                up   1.00000           1.00000
> > > > > >  17   0.20000         osd.17               up   1.00000           1.00000
> > > > > >
> > > > > > # ceph osd crush rule dump vdi-data
> > > > > > {
> > > > > >     "rule_id": 3,
> > > > > >     "rule_name": "vdi-data",
> > > > > >     "ruleset": 3,
> > > > > >     "type": 1,
> > > > > >     "min_size": 1,
> > > > > >     "max_size": 10,
> > > > > >     "steps": [
> > > > > >         {
> > > > > >             "op": "take",
> > > > > >             "item": -4,
> > > > > >             "item_name": "data"
> > > > > >         },
> > > > > >         {
> > > > > >             "op": "chooseleaf_firstn",
> > > > > >             "num": 0,
> > > > > >             "type": "host"
> > > > > >         },
> > > > > >         {
> > > > > >             "op": "emit"
> > > > > >         }
> > > > > >     ]
> > > > > > }
> > > > > >
> > > > > > _______________________________________________
> > > > > > ceph-users mailing list
> > > > > > ceph-users@xxxxxxxxxxxxxx
> > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
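Coming back to the ceph-objectstore-tool question from the top of the thread: it
can only export PGs that still exist in an OSD's object store, so once a pool's
PGs have been removed there is nothing left for it to export. For completeness,
a hedged sketch of the export/import round trip it supports on stopped filestore
OSDs; the data/journal paths are the stock defaults, <pgid> is a placeholder, and
osd.9/osd.24 are just two ids from the tree above used as examples:

# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-9 --journal-path /var/lib/ceph/osd/ceph-9/journal --pgid <pgid> --op export --file /mnt/recovery/<pgid>.export
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-24 --journal-path /var/lib/ceph/osd/ceph-24/journal --op import --file /mnt/recovery/<pgid>.export

In the situation described here, where the PGs were already deleted, the only raw
material left is whatever the XFS recovery pulls back into lost+found, which is
why the object-level reassembly sketched further up is the more realistic route.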
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com