Re: Question on cephfs recovery tools

Did you try to identify which processes were accessing the filesystem, using fuser or lsof, and then kill them?
If not, you should do that first.
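
For example (just a sketch; /cephfs is the mount point used later in this
thread, and MNT is only a shell variable for it):

    MNT=/cephfs
    # show the processes with files open on the mounted filesystem
    lsof $MNT
    fuser -vm $MNT
    # terminate them (try SIGTERM before resorting to SIGKILL)
    fuser -k -TERM -m $MNT
    # then try a normal unmount before falling back to a lazy one
    umount $MNT || umount -l $MNT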

Shinobu

----- Original Message -----
From: "Goncalo Borges" <goncalo@xxxxxxxxxxxxxxxxxxx>
To: skinjo@xxxxxxxxxx
Sent: Wednesday, September 9, 2015 5:04:23 PM
Subject: Re:  Question on cephfs recovery tools

Hi Shinobu

> Did you unmount the filesystem using the following?
>
>   umount -l

Yes!
Goncalo

>
> Shinobu
>
> On Wed, Sep 9, 2015 at 4:31 PM, Goncalo Borges 
> <goncalo@xxxxxxxxxxxxxxxxxxx <mailto:goncalo@xxxxxxxxxxxxxxxxxxx>> wrote:
>
>     Dear Ceph / CephFS gurus...
>
>     Bear with me while I give you a bit of context. The questions
>     will appear at the end.
>
>     1) I am currently running ceph 9.0.3, which I have installed to
>     test the cephfs recovery tools.
>
>     2) I've created a situation where I deliberately lost some data
>     and metadata (see annex 1 after the main email).
>
>     3) I've stopped the mds and waited to see how the cluster reacted.
>     After some time, as expected, the cluster reported an ERROR state,
>     with a lot of PGs degraded and stuck:
>
>         # ceph -s
>             cluster 8465c6a6-5eb4-4cdf-8845-0de552d0a738
>              health HEALTH_ERR
>                     174 pgs degraded
>                     48 pgs stale
>                     174 pgs stuck degraded
>                     41 pgs stuck inactive
>                     48 pgs stuck stale
>                     238 pgs stuck unclean
>                     174 pgs stuck undersized
>                     174 pgs undersized
>                     recovery 22366/463263 objects degraded (4.828%)
>                     recovery 8190/463263 objects misplaced (1.768%)
>                     too many PGs per OSD (388 > max 300)
>                     mds rank 0 has failed
>                     mds cluster is degraded
>              monmap e1: 3 mons at
>         {mon1=X.X.X.X:6789/0,mon2=Y.Y.Y.Y:6789/0,mon3=Z.Z.Z.Z:6789/0}
>                     election epoch 10, quorum 0,1,2 mon1,mon3,mon2
>              mdsmap e24: 0/1/1 up, 1 failed
>              osdmap e544: 21 osds: 15 up, 15 in; 87 remapped pgs
>               pgmap v25699: 2048 pgs, 2 pools, 602 GB data, 150 kobjects
>                     1715 GB used, 40027 GB / 41743 GB avail
>                     22366/463263 objects degraded (4.828%)
>                     8190/463263 objects misplaced (1.768%)
>                         1799 active+clean
>                          110 active+undersized+degraded
>                           60 active+remapped
>                           37 stale+undersized+degraded+peered
>                           23 active+undersized+degraded+remapped
>                           11 stale+active+clean
>                            4 undersized+degraded+peered
>                            4 active
>
>     4) I've unmounted the cephfs clients ('umount -l' worked for me this
>     time, but I have already had situations where 'umount' would simply
>     hang, and the only viable solution was to reboot the client).
>
>     5) I've recovered the ceph cluster by (details of the recovery
>     operations are in annex 2 after the main email; a condensed sketch
>     follows this list):
>     - declaring the osds lost
>     - removing the osds from the crush map
>     - letting the cluster stabilize and letting all the recovery I/O finish
>     - identifying stuck PGs
>     - checking if they existed and, if not, recreating them.
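>
>     For reference, the OSD-removal part of that recovery condenses to
>     something like this (just a sketch of what annex 2 does, for the six
>     OSDs I destroyed):
>
>         for o in 6 13 15 19 23 27; do
>             ceph osd lost $o --yes-i-really-mean-it   # accept the data loss
>             ceph osd crush remove osd.$o              # drop it from CRUSH
>             ceph osd down $o
>             ceph osd rm $o                            # remove from the osdmap
>             ceph auth del osd.$o                      # remove its auth key
>         done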
>
>
>     6) I've restarted the MDS. Initially, the mds cluster was considered
>     degraded, but after a short time that message disappeared. The WARNING
>     status was then just because of "too many PGs per OSD (409 > max 300)":
>
>         # ceph -s
>             cluster 8465c6a6-5eb4-4cdf-8845-0de552d0a738
>              health HEALTH_WARN
>                     too many PGs per OSD (409 > max 300)
>                     mds cluster is degraded
>              monmap e1: 3 mons at
>         {mon1=X.X.X.X:6789/0,mon2=Y.Y.Y.Y:6789/0,mon3=Z.Z.Z.Z:6789/0}
>                     election epoch 10, quorum 0,1,2 mon1,mon3,mon2
>              mdsmap e27: 1/1/1 up {0=rccephmds=up:reconnect}
>              osdmap e614: 15 osds: 15 up, 15 in
>               pgmap v27304: 2048 pgs, 2 pools, 586 GB data, 146 kobjects
>                     1761 GB used, 39981 GB / 41743 GB avail
>                         2048 active+clean
>           client io 4151 kB/s rd, 1 op/s
>
>         (wait some time)
>
>         # ceph -s
>             cluster 8465c6a6-5eb4-4cdf-8845-0de552d0a738
>              health HEALTH_WARN
>                     too many PGs per OSD (409 > max 300)
>              monmap e1: 3 mons at
>         {mon1=X.X.X.X:6789/0,mon2=Y.Y.Y.Y:6789/0,mon3=Z.Z.Z.Z:6789/0}
>                     election epoch 10, quorum 0,1,2 mon1,mon3,mon2
>              mdsmap e29: 1/1/1 up {0=rccephmds=up:active}
>              osdmap e614: 15 osds: 15 up, 15 in
>               pgmap v30442: 2048 pgs, 2 pools, 586 GB data, 146 kobjects
>                     1761 GB used, 39981 GB / 41743 GB avail
>                         2048 active+clean
>
>     7) I was able to mount the cephfs filesystem on a client. When I
>     tried to read a file whose objects were partly lost, I got holes in
>     part of the file (compare with the same operation in annex 1):
>
>         # od /cephfs/goncalo/5Gbytes_029.txt | head
>         0000000 000000 000000 000000 000000 000000 000000 000000 000000
>         *
>         2000000 176665 053717 015710 124465 047254 102011 065275 123534
>         2000020 015727 131070 075673 176566 047511 154343 146334 006111
>         2000040 050506 102172 172362 121464 003532 005427 137554 137111
>         2000060 071444 052477 123364 127652 043562 144163 170405 026422
>         2000100 050316 117337 042573 171037 150704 071144 066344 116653
>         2000120 076041 041546 030235 055204 016253 136063 046012 066200
>         2000140 171626 123573 065351 032357 171326 132673 012213 016046
>         2000160 022034 160053 156107 141471 162551 124615 102247 125502
>
>
>     Finally the questions:
>
>     a./ In a situation like the one described above, how can we safely
>     terminate cephfs on the clients? I have had situations where umount
>     simply hangs and there is no real way to unblock it short of
>     rebooting the client. With hundreds of clients, I would like to
>     avoid that.
>
>     b./ I was expecting to have lost metadata, since I've wiped the OSDs
>     where the metadata for the /cephfs/goncalo/5Gbytes_029.txt file was
>     stored. I was a bit surprised that '/cephfs/goncalo/5Gbytes_029.txt'
>     was still properly referenced, without me having to run any recovery
>     tool. What am I missing?
>
>     c./ After recovering the cluster, I thought I was in a cephfs
>     situation where I had:
>         c.1 files with holes (because of lost PGs and objects in the
>     data pool)
>         c.2 files without metadata (because of lost PGs and objects in
>     the metadata pool)
>         c.3 metadata without associated files (because of lost PGs and
>     objects in the data pool)
>     I've tried to run the recovery tools, but I have several doubts
>     which I did not find addressed in the documentation:
>         - Is there a specific order / a specific way to run the tools
>     for the c.1, c.2 and c.3 cases I mentioned?
>
>     d./ Since I was testing, I simply ran the following sequence, but I
>     am not sure what the commands are doing, nor whether the sequence is
>     correct. I think an example use case should be documented. In
>     particular, cephfs-data-scan did not return any output or
>     information, so I am not sure whether anything happened at all.
>
>         # cephfs-table-tool 0 reset session
>         {
>             "0": {
>                 "data": {},
>                 "result": 0
>             }
>         }
>
>         # cephfs-table-tool 0 reset snap
>         {
>             "result": 0
>         }
>
>         # cephfs-table-tool 0 reset inode
>         {
>             "0": {
>                 "data": {},
>                 "result": 0
>             }
>         }
>
>         # cephfs-journal-tool --rank=0 journal reset
>         old journal was 4194304~22381701
>         new journal start will be 29360128 (2784123 bytes past old end)
>         writing journal head
>         writing EResetJournal entry
>         done
>
>         # cephfs-data-scan init
>
>         # cephfs-data-scan scan_extents cephfs_dt
>         # cephfs-data-scan scan_inodes cephfs_dt
>
>         # cephfs-data-scan scan_extents --force-pool cephfs_mt
>         (doesn't seem to work)
>
>     e./ After running the cephfs tools, everything seemed to be in
>     exactly the same state. No visible changes or errors at the
>     filesystem level. So, at this point I am not sure what to conclude...
>
>
>     Thank you in advance for your responses
>     Cheers
>     Goncalo
>
>
>     # #####################
>     # ANNEX 1: GENERATE DATA LOSS #
>     # #####################
>
>     1) Check a file
>     # ls -l /cephfs/goncalo/5Gbytes_029.txt
>     -rw-r--r-- 1 root root 5368709120 Sep  8 03:55
>     /cephfs/goncalo/5Gbytes_029.txt
>
>     --- * ---
>
>     2) See its contents
>     # od /cephfs/goncalo/5Gbytes_029.txt |  head
>     0000000 150343 117016 156040 100553 154377 174521 137643 047440
>     0000020 006310 013157 064422 136662 145623 116101 137007 031237
>     0000040 111570 010104 103540 126335 014632 053445 006114 047003
>     0000060 123201 170045 042771 036561 152363 017716 000405 053556
>     0000100 102524 106517 066114 071112 144366 011405 074170 032621
>     0000120 047761 177217 103414 000774 174320 122332 110323 065706
>     0000140 042467 035356 132363 067446 145351 155277 177533 062050
>     0000160 016303 030741 066567 043517 172655 176016 017304 033342
>     0000200 177440 130510 163707 060513 055027 107702 023012 130435
>     0000220 022342 011762 035372 044033 152230 043424 004062 177461
>
>     --- * ---
>
>     3) Get its inode, and convert it to HEX
>     # ls -li /cephfs/goncalo/5Gbytes_029.txt
>     1099511627812 -rw-r--r-- 1 root root 5368709120 Sep  8 03:55
>     /cephfs/goncalo/5Gbytes_029.txt
>
>     (1099511627812)_base10 = (10000000024)_base16
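>
>     A quick way to do this conversion in the shell (just a sketch):
>
>         # decimal inode number -> hex prefix used in RADOS object names
>         printf '%x\n' 1099511627812                    # -> 10000000024
>         # the first object backing the file is <hex inode>.00000000
>         echo "$(printf '%x' 1099511627812).00000000"   # -> 10000000024.00000000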
>
>     --- * ---
>
>     4) Get the osd pool details
>     # ceph osd pool ls detail
>     pool 1 'cephfs_dt' replicated size 3 min_size 2 crush_ruleset 0
>     object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 196
>     flags hashpspool crash_replay_interval 45 stripe_width 0
>     pool 2 'cephfs_mt' replicated size 3 min_size 2 crush_ruleset 0
>     object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 182
>     flags hashpspool stripe_width 0
>
>     --- * ---
>
>     5) Get the file / PG / OSD mapping
>
>     # ceph osd map cephfs_dt 10000000024.00000000
>     osdmap e479 pool 'cephfs_dt' (1) object '10000000024.00000000' ->
>     pg 1.c18fbb6f (1.36f) -> up ([19,15,6], p19) acting ([19,15,6], p19)
>     # ceph osd map cephfs_mt 10000000024.00000000
>     osdmap e479 pool 'cephfs_mt' (2) object '10000000024.00000000' ->
>     pg 2.c18fbb6f (2.36f) -> up ([27,23,13], p27) acting ([27,23,13], p27)
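>
>     As a sanity check (a sketch, using plain shell arithmetic and
>     'rados ls'): with pg_num = 1024, the pg id is just the object hash
>     masked to 10 bits, and 'rados ls' lists every object backing the
>     file, not only the first one:
>
>         # 0xc18fbb6f & (1024 - 1) = 0x36f, hence pg 1.36f in cephfs_dt
>         printf '1.%x\n' $(( 0xc18fbb6f & (1024 - 1) ))   # -> 1.36f
>         # list all objects whose names start with the hex inode number
>         rados -p cephfs_dt ls | grep '^10000000024\.'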
>
>     --- * ---
>
>     6) Kill the relevant osd daemons, unmount the osd partitions and
>     delete the partitions
>
>     [root@server1 ~]# for o in 6; do
>         dev=`df /var/lib/ceph/osd/ceph-$o | tail -n 1 | awk '{print $1}'`;
>         /etc/init.d/ceph stop osd.$o; umount /var/lib/ceph/osd/ceph-$o;
>         parted -s ${dev::8} rm 1; parted -s ${dev::8} rm 2; partprobe;
>     done
>     [root@server2 ~]# for o in 13 15; do
>         dev=`df /var/lib/ceph/osd/ceph-$o | tail -n 1 | awk '{print $1}'`;
>         /etc/init.d/ceph stop osd.$o; umount /var/lib/ceph/osd/ceph-$o;
>         parted -s ${dev::8} rm 1; parted -s ${dev::8} rm 2; partprobe;
>     done
>     [root@server3 ~]# for o in 19 23; do
>         dev=`df /var/lib/ceph/osd/ceph-$o | tail -n 1 | awk '{print $1}'`;
>         /etc/init.d/ceph stop osd.$o; umount /var/lib/ceph/osd/ceph-$o;
>         parted -s ${dev::8} rm 1; parted -s ${dev::8} rm 2; partprobe;
>     done
>     [root@server4 ~]# for o in 27; do
>         dev=`df /var/lib/ceph/osd/ceph-$o | tail -n 1 | awk '{print $1}'`;
>         /etc/init.d/ceph stop osd.$o; umount /var/lib/ceph/osd/ceph-$o;
>         parted -s ${dev::8} rm 1; parted -s ${dev::8} rm 2; partprobe;
>     done
>
>
>     # #######################
>     # ANNEX 2: RECOVER CEPH CLUSTER #
>     # #######################
>
>     1) Declare OSDs lost
>
>     # for o in 6 13 15 19 23 27; do ceph osd lost $o --yes-i-really-mean-it; done
>     marked osd lost in epoch 480
>     marked osd lost in epoch 482
>     marked osd lost in epoch 487
>     marked osd lost in epoch 483
>     marked osd lost in epoch 489
>     marked osd lost in epoch 485
>
>     --- * ---
>
>     2) Remove OSDs from CRUSH map
>
>     # for o in 6 13 15 19 23 27; do
>         ceph osd crush remove osd.$o; ceph osd down $o;
>         ceph osd rm $o; ceph auth del osd.$o;
>     done
>     removed item id 6 name 'osd.6' from crush map
>     osd.6 is already down.
>     removed osd.6
>     updated
>     removed item id 13 name 'osd.13' from crush map
>     osd.13 is already down.
>     removed osd.13
>     updated
>     removed item id 15 name 'osd.15' from crush map
>     osd.15 is already down.
>     removed osd.15
>     updated
>     removed item id 19 name 'osd.19' from crush map
>     osd.19 is already down.
>     removed osd.19
>     updated
>     removed item id 23 name 'osd.23' from crush map
>     osd.23 is already down.
>     removed osd.23
>     updated
>     removed item id 27 name 'osd.27' from crush map
>     osd.27 is already down.
>     removed osd.27
>     updated
>
>     --- * ---
>
>     3) Give the cluster time to react, and let the recovery I/O finish.
>
>     --- * ---
>
>     4) Check which PGs are still stale
>
>     # ceph pg dump_stuck stale
>     ok
>     pg_stat    state    up    up_primary    acting acting_primary
>     1.23    stale+undersized+degraded+peered    [23]    23 [23]    23
>     2.38b    stale+undersized+degraded+peered    [23]    23 [23]    23
>     (...)
>
>     --- * ---
>
>     5) Try to query those stale PGs
>
>     # for pg in `ceph pg dump_stuck stale | grep ^[12] | awk '{print $1}'`; do
>         ceph pg $pg query;
>     done
>     ok
>     Error ENOENT: i don't have pgid 1.23
>     Error ENOENT: i don't have pgid 2.38b
>     (...)
>
>     --- * ---
>
>     6) Recreate the missing PGs
>
>     # for pg in `ceph pg dump_stuck stale | grep ^[12] | awk '{print $1}'`; do
>         ceph pg force_create_pg $pg;
>     done
>     ok
>     pg 1.23 now creating, ok
>     pg 2.38b now creating, ok
>     (...)
>
>     --- * ---
>
>     7) At this point, for the PGs to leave the 'creating' status, I had
>     to restart all remaining OSDs. Otherwise those PGs would have stayed
>     in the creating state forever.
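>
>     A sketch of that restart step, run on each OSD host (assuming the
>     sysvinit script used in annex 1 restarts every locally configured OSD
>     daemon when given just the daemon type):
>
>         # restart all OSD daemons defined on this host
>         /etc/init.d/ceph restart osd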
>
>
>
>
>     -- 
>     Goncalo Borges
>     Research Computing
>     ARC Centre of Excellence for Particle Physics at the Terascale
>     School of Physics A28 | University of Sydney, NSW  2006
>     T:+61 2 93511937 <tel:%2B61%202%2093511937>
>
>
>     _______________________________________________
>     ceph-users mailing list
>     ceph-users@xxxxxxxxxxxxxx <mailto:ceph-users@xxxxxxxxxxxxxx>
>     http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
> -- 
> Email:
> - shinobu@xxxxxxxxx <mailto:shinobu@xxxxxxxxx>
> Blog:
>  - Life with Distributed Computational System based on OpenSource 
> <http://i-shinobu.hatenablog.com/>

-- 
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW  2006
T: +61 2 93511937

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


