Re: Question on cephfs recovery tools

Goncalo Borges <goncalo@xxxxxxxxxxxxxxxxxxx> · Thu, 10 Sep 2015 10:41:39 +1000

Hey Shinobu

Thanks for the replies.

     a./ In a situation like the one described above, how can we safely
     terminate cephfs on the clients? I have had situations where
     umount simply hangs and there is no real way to unblock the
     situation unless I reboot the client. If we have hundreds of
     clients, I would like to avoid that.
Use "lsof" to find the processes accessing the filesystem; it shows their
process IDs. Then kill each of those processes using:

  kill -9 <pid>

But you **must** first make sure that it is OK to kill that process.
You have to be careful whenever you kill any process.
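
As an illustration of the advice above, here is a minimal sketch that only prints what would be sent to each process instead of killing anything. The helper name plan_kill is hypothetical, and trying SIGTERM before SIGKILL is my own suggestion rather than something stated in this thread. It assumes the PIDs come from "fuser -m <mountpoint>", which writes PIDs to stdout.

```shell
# plan_kill: read whitespace-separated PIDs on stdin and print, without
# running anything, the signals one could send. Non-numeric tokens are skipped.
plan_kill() {
    tr -s ' \t' '\n' | while read -r pid; do
        case "$pid" in
            ''|*[!0-9]*) continue ;;   # ignore blanks and anything that is not a PID
        esac
        echo "kill -TERM $pid"
        echo "kill -KILL $pid"
    done
}

# Example with two made-up PIDs and one junk token:
printf ' 1234  5678 abc\n' | plan_kill

# Hypothetical real usage (/cephfs is the mount point used in this thread):
#   fuser -m /cephfs 2>/dev/null | plan_kill
```

Reviewing the printed plan before running any of it leaves room to check that each process is safe to kill. Note that on a hung mount, lsof or fuser may themselves block.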

Sure.  Thanks for the advice.

     b./ I was expecting to have lost metadata information since I
     destroyed the OSDs where the metadata for the
     /cephfs/goncalo/5Gbytes_029.txt file was stored. I was a bit
     surprised that '/cephfs/goncalo/5Gbytes_029.txt' was still properly
     referenced, without me having to run any recovery tool. What am I
     missing?

     c./ After recovering the cluster, I thought I was in a cephfs
     situation where I had
         c.1 files with holes (because of lost PGs and objects in the
     data pool)
         c.2 files without metadata (because of lost PGs and objects in
     the metadata pool)
         c.3 metadata without associated files (because of lost PGs and
     objects in the data pool)
     I've tried to run the recovery tools, but I have several doubts
     that I did not find addressed in the documentation
         - Is there a specific order / a way to run the tools for the
     c.1, c.2 and c.3 cases I mentioned?
Which recovery tools do you mean?
How did you use them?
I'm just assuming the ones in d) -;

Am I right?

If so, why did you use that tool?

The tools and the order of execution I used were the ones mentioned
in my point d./ below. However, I am not really sure if what I did was
correct. The tools did not provide any output, nor have I seen any
meaningful change when comparing files in the filesystem before and
after their execution. So I am a bit in the dark about what the tools
do. I think the tools should log what they are doing so that the
admin understands what is going on. At the end, they should give a
summary of what they fixed and what they did not.

Another thing that puzzled me is what I reported in point b./: I was
able to list /cephfs/goncalo/5Gbytes_029.txt after I had recovered the
Ceph cluster, restarted the mds and remounted the client, without having
to run any recovery tools. Please be aware that I generated the original
problem myself by destroying the 3 OSDs (my cluster is configured
with 3 replicas) where the metadata for this file was stored. I do not
understand why the metadata information for the file was still available.

    3) Get its inode, and convert it to HEX

    # ls -li /cephfs/goncalo/5Gbytes_029.txt
    1099511627812 -rw-r--r-- 1 root root 5368709120 Sep  8 03:55
    /cephfs/goncalo/5Gbytes_029.txt

    (1099511627812)_base10 = (10000000024)_base16

    --- * ---

    5) Get the file / PG / OSD mapping

    # ceph osd map cephfs_dt 10000000024.00000000
    osdmap e479 pool 'cephfs_dt' (1) object '10000000024.00000000' ->
    pg 1.c18fbb6f (1.36f) -> up ([19,15,6], p19) acting ([19,15,6], p19)
    # ceph osd map cephfs_mt 10000000024.00000000
    osdmap e479 pool 'cephfs_mt' (2) object '10000000024.00000000' ->
    pg 2.c18fbb6f (2.36f) -> up ([27,23,13], p27) acting ([27,23,13], p27)

    --- * ---

    6) Kill the relevant osd daemons, umount the osd partition and
    delete the partitions

    [root@server1 ~]# for o in 6; do dev=`df /var/lib/ceph/osd/ceph-$o
    | tail -n 1 | awk '{print $1}'`; /etc/init.d/ceph stop osd.$o;
    umount /var/lib/ceph/osd/ceph-$o; parted -s ${dev::8} rm 1; parted
    -s  ${dev::8} rm 2; partprobe; done

    [root@server2 ~]# for o in 13 15; do dev=`df
    /var/lib/ceph/osd/ceph-$o | tail -n 1 | awk '{print $1}'`;
    /etc/init.d/ceph stop osd.$o; umount /var/lib/ceph/osd/ceph-$o;
    parted -s ${dev::8} rm 1; parted -s  ${dev::8} rm 2; partprobe; done

    [root@server3 ~]# for o in 19 23; do dev=`df
    /var/lib/ceph/osd/ceph-$o | tail -n 1 | awk '{print $1}'`;
    /etc/init.d/ceph stop osd.$o; umount /var/lib/ceph/osd/ceph-$o;
    parted -s ${dev::8} rm 1; parted -s  ${dev::8} rm 2; partprobe; done

    [root@server4 ~]# for o in 27; do dev=`df
    /var/lib/ceph/osd/ceph-$o | tail -n 1 | awk '{print $1}'`;
    /etc/init.d/ceph stop osd.$o; umount /var/lib/ceph/osd/ceph-$o;
    parted -s ${dev::8} rm 1; parted -s  ${dev::8} rm 2; partprobe; done

     d./ Since I was testing, I simply ran the following sequence, but I
     am not sure what the commands are doing, nor if the sequence is
     correct. I think an example use case should be documented.
     In particular, cephfs-data-scan did not return any output or
     information, so I am not sure if anything happened at all.

         # cephfs-table-tool 0 reset session
         {
             "0": {
                 "data": {},
                 "result": 0
             }
         }

         # cephfs-table-tool 0 reset snap
         {
             "result": 0
         }

         # cephfs-table-tool 0 reset inode
         {
             "0": {
                 "data": {},
                 "result": 0
             }
         }

         # cephfs-journal-tool --rank=0 journal reset
         old journal was 4194304~22381701
         new journal start will be 29360128 (2784123 bytes past old end)
         writing journal head
         writing EResetJournal entry
         done

         # cephfs-data-scan init

         # cephfs-data-scan scan_extents cephfs_dt
         # cephfs-data-scan scan_inodes cephfs_dt

         # cephfs-data-scan scan_extents --force-pool cephfs_mt
         (doesn't seem to work)

     e./ After running the cephfs tools, everything seemed to be in
     exactly the same state. No visible changes or errors at the
     filesystem level. So, at this point I am not sure what to conclude...
Anyway, just let me know if your ceph cluster is in production or not.
I do hope not -;

Nope, it is not in production. But we intend to have something in
production soon, and this is just a way to prepare myself for the DC
cases which I am certain will occur at some point.

Cheers
Goncalo

Shinobu

----- Original Message -----
From: goncalo@xxxxxxxxxxxxxxxxxxx
To: "Shinobu Kinjo" <skinjo@xxxxxxxxxx>
Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
Sent: Wednesday, September 9, 2015 9:50:30 PM
Subject: Re:  Question on cephfs recovery tools

Hi Shinobu

I did check that page but I do not think that in its current state it
helps much.

If you look at my email, I did try the operations documented there but
nothing substantial really happened. The tools do not produce any
output so I am not sure what they did, if they did anything at all.
From the documentation it is also not obvious in which situations we
should use the tools, and if there is a particular order to run them.

The reason for my email is to get some clarification on that.

Cheers

Quoting Shinobu Kinjo <skinjo@xxxxxxxxxx>:

Anyhow this page would help you:

     http://ceph.com/docs/master/cephfs/disaster-recovery/

Shinobu

----- Original Message -----
From: "Shinobu Kinjo" <skinjo@xxxxxxxxxx>
To: "Goncalo Borges" <goncalo@xxxxxxxxxxxxxxxxxxx>
Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
Sent: Wednesday, September 9, 2015 5:28:38 PM
Subject: Re:  Question on cephfs recovery tools

Did you try to identify which processes were accessing the
filesystem using fuser or lsof, and then kill them?
If not, you should do that first.

Shinobu

----- Original Message -----
From: "Goncalo Borges" <goncalo@xxxxxxxxxxxxxxxxxxx>
To: skinjo@xxxxxxxxxx
Sent: Wednesday, September 9, 2015 5:04:23 PM
Subject: Re:  Question on cephfs recovery tools

Hi Shinobu

Did you unmount the filesystem using:

   umount -l
Yes!
Goncalo

Shinobu

On Wed, Sep 9, 2015 at 4:31 PM, Goncalo Borges
<goncalo@xxxxxxxxxxxxxxxxxxx> wrote:

     Dear Ceph / CephFS gurus...

     Bear with me while I give you a bit of context. Questions
     will appear at the end.

     1) I am currently running ceph 9.0.3, which I have installed to
     test the cephfs recovery tools.

     2) I've created a situation where I've deliberately
     lost some data and metadata (check annex 1 after the main email).

     3) I've stopped the mds, and waited to check how the cluster
     reacts. After some time, as expected, the cluster reports an ERROR
     state, with a lot of PGs degraded and stuck:

         # ceph -s
             cluster 8465c6a6-5eb4-4cdf-8845-0de552d0a738
              health HEALTH_ERR
                     174 pgs degraded
                     48 pgs stale
                     174 pgs stuck degraded
                     41 pgs stuck inactive
                     48 pgs stuck stale
                     238 pgs stuck unclean
                     174 pgs stuck undersized
                     174 pgs undersized
                     recovery 22366/463263 objects degraded (4.828%)
                     recovery 8190/463263 objects misplaced (1.768%)
                     too many PGs per OSD (388 > max 300)
                     mds rank 0 has failed
                     mds cluster is degraded
              monmap e1: 3 mons at
         {mon1=X.X.X.X:6789/0,mon2=Y.Y.Y.Y:6789/0,mon3=Z.Z.Z.Z:6789/0}
                     election epoch 10, quorum 0,1,2 mon1,mon3,mon2
              mdsmap e24: 0/1/1 up, 1 failed
              osdmap e544: 21 osds: 15 up, 15 in; 87 remapped pgs
               pgmap v25699: 2048 pgs, 2 pools, 602 GB data, 150 kobjects
                     1715 GB used, 40027 GB / 41743 GB avail
                     22366/463263 objects degraded (4.828%)
                     8190/463263 objects misplaced (1.768%)
                         1799 active+clean
                          110 active+undersized+degraded
                           60 active+remapped
                           37 stale+undersized+degraded+peered
                           23 active+undersized+degraded+remapped
                           11 stale+active+clean
                            4 undersized+degraded+peered
                            4 active
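
A quick sanity check on the output above: the per-state PG counts should add up to the 2048 PGs reported in the pgmap line. A minimal sketch follows, assuming only the "<count> <state>" lines are fed in; the helper name sum_pg_states is hypothetical.

```shell
# sum_pg_states: add up the leading counts of lines that look like
# "<count> <pg state>", e.g. "1799 active+clean". Other lines are ignored.
sum_pg_states() {
    awk 'NF == 2 && $1 ~ /^[0-9]+$/ && $2 ~ /^[a-z+]+$/ { total += $1 }
         END { print total + 0 }'
}

# The state lines are copied from the "ceph -s" output in this email.
sum_pg_states <<'EOF'
1799 active+clean
 110 active+undersized+degraded
  60 active+remapped
  37 stale+undersized+degraded+peered
  23 active+undersized+degraded+remapped
  11 stale+active+clean
   4 undersized+degraded+peered
   4 active
EOF
```

For the numbers above this prints 2048, matching "2048 pgs" in the pgmap line.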

     4) I've umounted the cephfs clients ('umount -l' worked for me
     this time, but I have already had situations where 'umount' would
     simply hang, and the only viable solution was to reboot the client).

     5) I've recovered the ceph cluster by (details on the recovery
     operations are in annex 2 after the main email):
     - declaring the osds lost
     - removing the osds from the crush map
     - letting the cluster stabilize and letting all the recovery I/O finish
     - identifying stuck PGs
     - checking if they existed, and recreating them if not.

     6) I've restarted the MDS. Initially, the mds cluster was
     considered degraded, but after a short time that
     message disappeared. The WARNING status was just because of "too
     many PGs per OSD (409 > max 300)"

         # ceph -s
             cluster 8465c6a6-5eb4-4cdf-8845-0de552d0a738
              health HEALTH_WARN
                     too many PGs per OSD (409 > max 300)
                     mds cluster is degraded
              monmap e1: 3 mons at
         {mon1=X.X.X.X:6789/0,mon2=Y.Y.Y.Y:6789/0,mon3=Z.Z.Z.Z:6789/0}
                     election epoch 10, quorum 0,1,2 mon1,mon3,mon2
              mdsmap e27: 1/1/1 up {0=rccephmds=up:reconnect}
              osdmap e614: 15 osds: 15 up, 15 in
               pgmap v27304: 2048 pgs, 2 pools, 586 GB data, 146 kobjects
                     1761 GB used, 39981 GB / 41743 GB avail
                         2048 active+clean
           client io 4151 kB/s rd, 1 op/s

         (wait some time)

         # ceph -s
             cluster 8465c6a6-5eb4-4cdf-8845-0de552d0a738
              health HEALTH_WARN
                     too many PGs per OSD (409 > max 300)
              monmap e1: 3 mons at
         {mon1=X.X.X.X:6789/0,mon2=Y.Y.Y.Y:6789/0,mon3=Z.Z.Z.Z:6789/0}
                     election epoch 10, quorum 0,1,2 mon1,mon3,mon2
              mdsmap e29: 1/1/1 up {0=rccephmds=up:active}
              osdmap e614: 15 osds: 15 up, 15 in
               pgmap v30442: 2048 pgs, 2 pools, 586 GB data, 146 kobjects
                     1761 GB used, 39981 GB / 41743 GB avail
                         2048 active+clean
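
The 409 in the remaining warning can be reproduced with simple arithmetic. This is a sketch under the assumption (my reading, not something stated in this thread) that the figure is the total number of PG replicas divided by the number of "in" OSDs. The inputs come from this email: 2 pools, pg_num 1024 and size 3 each (annex 1, step 4), and 15 OSDs in the osdmap line. The helper name pgs_per_osd is hypothetical.

```shell
# pgs_per_osd POOLS PG_NUM SIZE OSDS
# Average number of PG replicas per OSD, using integer division.
pgs_per_osd() {
    echo $(( $1 * $2 * $3 / $4 ))
}

# 2 pools x 1024 PGs x 3 replicas spread over 15 OSDs
pgs_per_osd 2 1024 3 15
```

This prints 409 (6144 / 15 = 409.6, truncated), which matches the value in the health message.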

     7) I was able to mount the cephfs filesystem on a client. When I
     tried to read a file that had lost some of its objects, I got holes
     in part of the file (compare with the same operation in annex 1)

         # od /cephfs/goncalo/5Gbytes_029.txt | head
         0000000 000000 000000 000000 000000 000000 000000 000000 000000
         *
         2000000 176665 053717 015710 124465 047254 102011 065275 123534
         2000020 015727 131070 075673 176566 047511 154343 146334 006111
         2000040 050506 102172 172362 121464 003532 005427 137554 137111
         2000060 071444 052477 123364 127652 043562 144163 170405 026422
         2000100 050316 117337 042573 171037 150704 071144 066344 116653
         2000120 076041 041546 030235 055204 016253 136063 046012 066200
         2000140 171626 123573 065351 032357 171326 132673 012213 016046
         2000160 022034 160053 156107 141471 162551 124615 102247 125502
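
To read the od output above: od prints offsets in octal by default, and the "*" line stands for repeated identical lines. A minimal sketch to convert such an offset to a byte count follows; the helper name od_offset_to_bytes is hypothetical.

```shell
# od_offset_to_bytes OCTAL_OFFSET
# Convert an offset as printed by od (octal by default) to decimal bytes.
# Prefixing a 0 makes printf parse the value as an octal constant.
od_offset_to_bytes() {
    printf '%d\n' "0$1"
}

# First offset with non-zero data in the output above
od_offset_to_bytes 2000000
```

This prints 524288, so the zero-filled region at the start of the file covers the first 524288 bytes (512 KiB).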

     Finally the questions:

     a./ In a situation like the one described above, how can we safely
     terminate cephfs on the clients? I have had situations where
     umount simply hangs and there is no real way to unblock the
     situation unless I reboot the client. If we have hundreds of
     clients, I would like to avoid that.

     b./ I was expecting to have lost metadata information since I
     destroyed the OSDs where the metadata for the
     /cephfs/goncalo/5Gbytes_029.txt file was stored. I was a bit
     surprised that '/cephfs/goncalo/5Gbytes_029.txt' was still properly
     referenced, without me having to run any recovery tool. What am I
     missing?

     c./ After recovering the cluster, I thought I was in a cephfs
     situation where I had
         c.1 files with holes (because of lost PGs and objects in the
     data pool)
         c.2 files without metadata (because of lost PGs and objects in
     the metadata pool)
         c.3 metadata without associated files (because of lost PGs and
     objects in the data pool)
     I've tried to run the recovery tools, but I have several doubts
     that I did not find addressed in the documentation
         - Is there a specific order / a way to run the tools for the
     c.1, c.2 and c.3 cases I mentioned?

     d./ Since I was testing, I simply ran the following sequence, but I
     am not sure what the commands are doing, nor if the sequence is
     correct. I think an example use case should be documented.
     In particular, cephfs-data-scan did not return any output or
     information, so I am not sure if anything happened at all.

         # cephfs-table-tool 0 reset session
         {
             "0": {
                 "data": {},
                 "result": 0
             }
         }

         # cephfs-table-tool 0 reset snap
         {
             "result": 0
         }

         # cephfs-table-tool 0 reset inode
         {
             "0": {
                 "data": {},
                 "result": 0
             }
         }

         # cephfs-journal-tool --rank=0 journal reset
         old journal was 4194304~22381701
         new journal start will be 29360128 (2784123 bytes past old end)
         writing journal head
         writing EResetJournal entry
         done

         # cephfs-data-scan init

         # cephfs-data-scan scan_extents cephfs_dt
         # cephfs-data-scan scan_inodes cephfs_dt

         # cephfs-data-scan scan_extents --force-pool cephfs_mt
         (doesn't seem to work)

     e./ After running the cephfs tools, everything seemed to be in
     exactly the same state. No visible changes or errors at the
     filesystem level. So, at this point I am not sure what to conclude...
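
On the question of ordering, here is a dry-run sketch that only prints a candidate sequence and executes nothing. The table, journal reset and data-scan commands are the ones from point d./ in the same order. The two leading cephfs-journal-tool steps (journal export and recover_dentries) are taken from my reading of the disaster-recovery page linked in this thread and are an assumption for ceph 9.0.3, so check them against the documentation for your version. The helper name recovery_plan and the file name backup.bin are illustrative.

```shell
# recovery_plan DATA_POOL
# Print, in order, the commands of one possible recovery run. Dry run only.
recovery_plan() {
    pool=$1
    cat <<EOF
cephfs-journal-tool journal export backup.bin
cephfs-journal-tool event recover_dentries summary
cephfs-table-tool 0 reset session
cephfs-table-tool 0 reset snap
cephfs-table-tool 0 reset inode
cephfs-journal-tool --rank=0 journal reset
cephfs-data-scan init
cephfs-data-scan scan_extents $pool
cephfs-data-scan scan_inodes $pool
EOF
}

# cephfs_dt is the data pool used in this thread
recovery_plan cephfs_dt
```

The point of the ordering is that the journal is saved and replayed into the metadata store before it is reset, and that scan_extents runs before scan_inodes.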

     Thank you in advance for your responses
     Cheers
     Goncalo

     # #####################
     # ANNEX 1: GENERATE DATA LOSS #
     # #####################

     1) Check a file
     # ls -l /cephfs/goncalo/5Gbytes_029.txt
     -rw-r--r-- 1 root root 5368709120 Sep  8 03:55
     /cephfs/goncalo/5Gbytes_029.txt

     --- * ---

     2) See its contents
     # od /cephfs/goncalo/5Gbytes_029.txt |  head
     0000000 150343 117016 156040 100553 154377 174521 137643 047440
     0000020 006310 013157 064422 136662 145623 116101 137007 031237
     0000040 111570 010104 103540 126335 014632 053445 006114 047003
     0000060 123201 170045 042771 036561 152363 017716 000405 053556
     0000100 102524 106517 066114 071112 144366 011405 074170 032621
     0000120 047761 177217 103414 000774 174320 122332 110323 065706
     0000140 042467 035356 132363 067446 145351 155277 177533 062050
     0000160 016303 030741 066567 043517 172655 176016 017304 033342
     0000200 177440 130510 163707 060513 055027 107702 023012 130435
     0000220 022342 011762 035372 044033 152230 043424 004062 177461

     --- * ---

     3) Get its inode, and convert it to HEX
     # ls -li /cephfs/goncalo/5Gbytes_029.txt
     1099511627812 -rw-r--r-- 1 root root 5368709120 Sep  8 03:55
     /cephfs/goncalo/5Gbytes_029.txt

     (1099511627812)_base10 = (10000000024)_base16
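
The conversion above can be done directly in the shell. A minimal sketch follows, assuming the object naming seen in step 5 of this annex: the inode in hex, a dot, then the object index as 8 hex digits. The helper name inode_to_object is hypothetical.

```shell
# inode_to_object INODE [INDEX]
# Print the name of the INDEX-th object (default 0) of a file,
# given its decimal inode number as shown by "ls -li".
inode_to_object() {
    printf '%x.%08x\n' "$1" "${2:-0}"
}

inode_to_object 1099511627812
```

For the inode above this prints 10000000024.00000000, the object name passed to "ceph osd map" in step 5.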

     --- * ---

     4) Get the osd pool details
     # ceph osd pool ls detail
     pool 1 'cephfs_dt' replicated size 3 min_size 2 crush_ruleset 0
     object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 196
     flags hashpspool crash_replay_interval 45 stripe_width 0
     pool 2 'cephfs_mt' replicated size 3 min_size 2 crush_ruleset 0
     object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 182
     flags hashpspool stripe_width 0

     --- * ---

     5) Get the file / PG / OSD mapping

     # ceph osd map cephfs_dt 10000000024.00000000
     osdmap e479 pool 'cephfs_dt' (1) object '10000000024.00000000' ->
     pg 1.c18fbb6f (1.36f) -> up ([19,15,6], p19) acting ([19,15,6], p19)
     # ceph osd map cephfs_mt 10000000024.00000000
     osdmap e479 pool 'cephfs_mt' (2) object '10000000024.00000000' ->
     pg 2.c18fbb6f (2.36f) -> up ([27,23,13], p27) acting ([27,23,13], p27)
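
In the output above, the short PG id in parentheses follows from the full hash. A sketch, assuming pg_num is a power of two (1024 for both pools here), in which case the PG id is the hash masked with pg_num - 1. The helper name pg_seed_to_pgid is hypothetical.

```shell
# pg_seed_to_pgid HASH_HEX PG_NUM
# Reduce an object hash (hex, no 0x prefix) to a PG id, in hex.
# Only valid as written when PG_NUM is a power of two.
pg_seed_to_pgid() {
    printf '%x\n' $(( 0x$1 & ($2 - 1) ))
}

pg_seed_to_pgid c18fbb6f 1024
```

This prints 36f, matching 1.36f and 2.36f above. The hash is the same in both pools because it depends only on the object name; the pool id is the prefix.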

     --- * ---

     6) Kill the relevant osd daemons, umount the osd partition and
     delete the partitions

     [root@server1 ~]# for o in 6; do dev=`df /var/lib/ceph/osd/ceph-$o
     | tail -n 1 | awk '{print $1}'`; /etc/init.d/ceph stop osd.$o;
     umount /var/lib/ceph/osd/ceph-$o; parted -s ${dev::8} rm 1; parted
     -s  ${dev::8} rm 2; partprobe; done
     [root@server2 ~]# for o in 13 15; do dev=`df
     /var/lib/ceph/osd/ceph-$o | tail -n 1 | awk '{print $1}'`;
     /etc/init.d/ceph stop osd.$o; umount /var/lib/ceph/osd/ceph-$o;
     parted -s ${dev::8} rm 1; parted -s  ${dev::8} rm 2; partprobe; done
     [root@server3 ~]# for o in 19 23; do dev=`df
     /var/lib/ceph/osd/ceph-$o | tail -n 1 | awk '{print $1}'`;
     /etc/init.d/ceph stop osd.$o; umount /var/lib/ceph/osd/ceph-$o;
     parted -s ${dev::8} rm 1; parted -s  ${dev::8} rm 2; partprobe; done
     [root@server4 ~]# for o in 27; do dev=`df
     /var/lib/ceph/osd/ceph-$o | tail -n 1 | awk '{print $1}'`;
     /etc/init.d/ceph stop osd.$o; umount /var/lib/ceph/osd/ceph-$o;
     parted -s ${dev::8} rm 1; parted -s  ${dev::8} rm 2; partprobe; done

     # #######################
     # ANNEX 2: RECOVER CEPH CLUSTER #
     # #######################

     1) Declare the OSDs lost

     # for o in 6 13 15 19 23 27;do ceph osd lost $o
     --yes-i-really-mean-it; done
     marked osd lost in epoch 480
     marked osd lost in epoch 482
     marked osd lost in epoch 487
     marked osd lost in epoch 483
     marked osd lost in epoch 489
     marked osd lost in epoch 485

     --- * ---

     2) Remove OSDs from CRUSH map

     # for o in 6 13 15 19 23 27;do ceph osd crush remove osd.$o; ceph
     osd down $o; ceph osd rm $o; ceph auth del osd.$o; done
     removed item id 6 name 'osd.6' from crush map
     osd.6 is already down.
     removed osd.6
     updated
     removed item id 13 name 'osd.13' from crush map
     osd.13 is already down.
     removed osd.13
     updated
     removed item id 15 name 'osd.15' from crush map
     osd.15 is already down.
     removed osd.15
     updated
     removed item id 19 name 'osd.19' from crush map
     osd.19 is already down.
     removed osd.19
     updated
     removed item id 23 name 'osd.23' from crush map
     osd.23 is already down.
     removed osd.23
     updated
     removed item id 27 name 'osd.27' from crush map
     osd.27 is already down.
     removed osd.27
     updated

     --- * ---

     3) Give the cluster time to react, and the recovery I/O time to finish.

     --- * ---

     4) Check which PGs are still stale

     # ceph pg dump_stuck stale
     ok
     pg_stat    state    up    up_primary    acting acting_primary
     1.23    stale+undersized+degraded+peered    [23]    23 [23]    23
     2.38b    stale+undersized+degraded+peered    [23]    23 [23]    23
     (...)

     --- * ---

     5) Try to query those stale PGs

     # for pg in `ceph pg dump_stuck stale | grep ^[12]  | awk '{print
     $1}'`; do ceph pg $pg query; done
     ok
     Error ENOENT: i don't have pgid 1.23
     Error ENOENT: i don't have pgid 2.38b
     (...)
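
The "grep ^[12]" filter used above depends on the pool ids being 1 and 2, and the unquoted pattern can be expanded by the shell if a matching file name exists. A slightly more defensive sketch that picks PG ids by shape instead; the helper name stale_pgids is hypothetical, and the sample input is copied from step 4.

```shell
# stale_pgids: read "ceph pg dump_stuck stale" output on stdin and print
# the ids (<pool>.<hex>) of the PGs whose state contains "stale".
stale_pgids() {
    awk '$1 ~ /^[0-9]+\.[0-9a-f]+$/ && $2 ~ /stale/ { print $1 }'
}

stale_pgids <<'EOF'
ok
pg_stat    state    up    up_primary    acting acting_primary
1.23    stale+undersized+degraded+peered    [23]    23 [23]    23
2.38b    stale+undersized+degraded+peered    [23]    23 [23]    23
EOF

# Hypothetical usage against a live cluster:
#   ceph pg dump_stuck stale | stale_pgids | while read -r pg; do
#       ceph pg "$pg" query
#   done
```

With the sample input this prints 1.23 and 2.38b, one per line.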

     --- * ---

     6) Create the non-existent PGs

     # for pg in `ceph pg dump_stuck stale | grep ^[12]  | awk '{print
     $1}'`; do ceph pg force_create_pg $pg; done
     ok
     pg 1.23 now creating, ok
     pg 2.38b now creating, ok
     (...)

     --- * ---

     7) At this point, for the PGs to leave the 'creating' status, I
     had to restart all remaining OSDs. Otherwise those PGs were in the
     creating state forever.

     --
     Goncalo Borges
     Research Computing
     ARC Centre of Excellence for Particle Physics at the Terascale
     School of Physics A28 | University of Sydney, NSW  2006
     T: +61 2 93511937

     _______________________________________________
     ceph-users mailing list
     ceph-users@xxxxxxxxxxxxxx
     http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Email:
  - shinobu@xxxxxxxxx
Blog:
  - Life with Distributed Computational System based on OpenSource
    <http://i-shinobu.hatenablog.com/>