Did you try to identify what kind of processes were accessing the filesystem, using fuser or lsof, and then kill them? If not, you should do that first.

Shinobu

----- Original Message -----
From: "Goncalo Borges" <goncalo@xxxxxxxxxxxxxxxxxxx>
To: skinjo@xxxxxxxxxx
Sent: Wednesday, September 9, 2015 5:04:23 PM
Subject: Re: Question on cephfs recovery tools

Hi Shinobu

> Did you unmount the filesystem using?
>
> umount -l

Yes!
Goncalo

>
> Shinobu
>
> On Wed, Sep 9, 2015 at 4:31 PM, Goncalo Borges
> <goncalo@xxxxxxxxxxxxxxxxxxx> wrote:
>
> Dear Ceph / CephFS gurus...
>
> Bear with me a bit while I give you some context. The questions
> will appear at the end.
>
> 1) I am currently running ceph 9.0.3, which I installed to test
> the cephfs recovery tools.
>
> 2) I've created a situation where I've deliberately lost some
> data and metadata (check annex 1 after the main email).
>
> 3) I've stopped the mds and waited to check how the cluster
> reacts. After some time, as expected, the cluster reports an
> ERROR state, with a lot of PGs degraded and stuck:
>
> # ceph -s
>     cluster 8465c6a6-5eb4-4cdf-8845-0de552d0a738
>      health HEALTH_ERR
>             174 pgs degraded
>             48 pgs stale
>             174 pgs stuck degraded
>             41 pgs stuck inactive
>             48 pgs stuck stale
>             238 pgs stuck unclean
>             174 pgs stuck undersized
>             174 pgs undersized
>             recovery 22366/463263 objects degraded (4.828%)
>             recovery 8190/463263 objects misplaced (1.768%)
>             too many PGs per OSD (388 > max 300)
>             mds rank 0 has failed
>             mds cluster is degraded
>      monmap e1: 3 mons at {mon1=X.X.X.X:6789/0,mon2=Y.Y.Y.Y:6789/0,mon3=Z.Z.Z.Z:6789/0}
>             election epoch 10, quorum 0,1,2 mon1,mon3,mon2
>      mdsmap e24: 0/1/1 up, 1 failed
>      osdmap e544: 21 osds: 15 up, 15 in; 87 remapped pgs
>       pgmap v25699: 2048 pgs, 2 pools, 602 GB data, 150 kobjects
>             1715 GB used, 40027 GB / 41743 GB avail
>             22366/463263 objects degraded (4.828%)
>             8190/463263 objects misplaced (1.768%)
>                 1799 active+clean
>                  110 active+undersized+degraded
>                   60 active+remapped
>                   37 stale+undersized+degraded+peered
>                   23 active+undersized+degraded+remapped
>                   11 stale+active+clean
>                    4 undersized+degraded+peered
>                    4 active
>
> 4) I've unmounted the cephfs clients ('umount -l' worked for me
> this time, but I have already had situations where 'umount' would
> simply hang, and the only viable solution was to reboot the
> client).
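>
> (When 'umount' hangs, I assume the place to start before rebooting
> would be to find and kill whatever still has files open under the
> mount point, roughly as below -- /cephfs is just where I mount it,
> and I have not confirmed that this actually unblocks a hung
> CephFS client:
>
> # lsof /cephfs          <- list processes with open files on the mount
> # fuser -vm /cephfs     <- same information via fuser
> # fuser -km /cephfs     <- kill those processes
> # umount /cephfs
> )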
>
> 5) I've recovered the ceph cluster by (details on the recovery
> operations are in annex 2 after the main email):
> - declaring the osds lost
> - removing the osds from the crush map
> - letting the cluster stabilize and letting all the recovery I/O finish
> - identifying stuck PGs
> - checking if they existed, and if not, recreating them.
>
> 6) I've restarted the MDS. Initially, the mds cluster was
> considered degraded, but after a short time that message
> disappeared. The WARNING status was then just because of "too
> many PGs per OSD (409 > max 300)":
>
> # ceph -s
>     cluster 8465c6a6-5eb4-4cdf-8845-0de552d0a738
>      health HEALTH_WARN
>             too many PGs per OSD (409 > max 300)
>             mds cluster is degraded
>      monmap e1: 3 mons at {mon1=X.X.X.X:6789/0,mon2=Y.Y.Y.Y:6789/0,mon3=Z.Z.Z.Z:6789/0}
>             election epoch 10, quorum 0,1,2 mon1,mon3,mon2
>      mdsmap e27: 1/1/1 up {0=rccephmds=up:reconnect}
>      osdmap e614: 15 osds: 15 up, 15 in
>       pgmap v27304: 2048 pgs, 2 pools, 586 GB data, 146 kobjects
>             1761 GB used, 39981 GB / 41743 GB avail
>                 2048 active+clean
>   client io 4151 kB/s rd, 1 op/s
>
> (wait some time)
>
> # ceph -s
>     cluster 8465c6a6-5eb4-4cdf-8845-0de552d0a738
>      health HEALTH_WARN
>             too many PGs per OSD (409 > max 300)
>      monmap e1: 3 mons at {mon1=X.X.X.X:6789/0,mon2=Y.Y.Y.Y:6789/0,mon3=Z.Z.Z.Z:6789/0}
>             election epoch 10, quorum 0,1,2 mon1,mon3,mon2
>      mdsmap e29: 1/1/1 up {0=rccephmds=up:active}
>      osdmap e614: 15 osds: 15 up, 15 in
>       pgmap v30442: 2048 pgs, 2 pools, 586 GB data, 146 kobjects
>             1761 GB used, 39981 GB / 41743 GB avail
>                 2048 active+clean
>
> 7) I was able to mount the cephfs filesystem in a client. When I
> tried to read a file made of some lost objects, I got holes in
> part of the file (compare with the same operation in annex 1):
>
> # od /cephfs/goncalo/5Gbytes_029.txt | head
> 0000000 000000 000000 000000 000000 000000 000000 000000 000000
> *
> 2000000 176665 053717 015710 124465 047254 102011 065275 123534
> 2000020 015727 131070 075673 176566 047511 154343 146334 006111
> 2000040 050506 102172 172362 121464 003532 005427 137554 137111
> 2000060 071444 052477 123364 127652 043562 144163 170405 026422
> 2000100 050316 117337 042573 171037 150704 071144 066344 116653
> 2000120 076041 041546 030235 055204 016253 136063 046012 066200
> 2000140 171626 123573 065351 032357 171326 132673 012213 016046
> 2000160 022034 160053 156107 141471 162551 124615 102247 125502
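>
> (To see exactly which stripe objects of that file survived, I assume
> one can list the data-pool objects by the file's hex inode prefix,
> along these lines -- I did not keep that output, though:
>
> # rados -p cephfs_dt ls | grep '^10000000024\.' | sort
> )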
>
>
> Finally, the questions:
>
> a./ In a situation like the one described above, how can we safely
> terminate cephfs in the clients? I have had situations where
> umount simply hangs and there is no real way to unblock the
> situation unless I reboot the client. If we have hundreds of
> clients, I would like to avoid that.
>
> b./ I was expecting to have lost metadata information, since I've
> cleaned OSDs where metadata information was stored for the
> /cephfs/goncalo/5Gbytes_029.txt file. I was a bit surprised that
> '/cephfs/goncalo/5Gbytes_029.txt' was still properly referenced,
> without me having to run any recovery tool. What am I missing?
>
> c./ After recovering the cluster, I thought I was in a cephfs
> situation where I had
>     c.1 files with holes (because of lost PGs and objects in the
>         data pool)
>     c.2 files without metadata (because of lost PGs and objects in
>         the metadata pool)
>     c.3 metadata without associated files (because of lost PGs and
>         objects in the data pool)
> I've tried to run the recovery tools, but I have several doubts
> which I did not find described in the documentation:
> - Is there a specific order / a way to run the tools for the
>   c.1, c.2 and c.3 cases I mentioned?
>
> d./ Since I was testing, I simply ran the following sequence, but I
> am not sure what the commands are doing, nor whether the sequence
> is correct. I think an example use case should be documented.
> In particular, cephfs-data-scan did not return any output or
> information, so I am not sure whether anything happened at all.
>
> # cephfs-table-tool 0 reset session
> {
>     "0": {
>         "data": {},
>         "result": 0
>     }
> }
>
> # cephfs-table-tool 0 reset snap
> {
>     "result": 0
> }
>
> # cephfs-table-tool 0 reset inode
> {
>     "0": {
>         "data": {},
>         "result": 0
>     }
> }
>
> # cephfs-journal-tool --rank=0 journal reset
> old journal was 4194304~22381701
> new journal start will be 29360128 (2784123 bytes past old end)
> writing journal head
> writing EResetJournal entry
> done
>
> # cephfs-data-scan init
>
> # cephfs-data-scan scan_extents cephfs_dt
> # cephfs-data-scan scan_inodes cephfs_dt
>
> # cephfs-data-scan scan_extents --force-pool cephfs_mt
> (doesn't seem to work)
>
> e./ After running the cephfs tools, everything seemed to be in
> exactly the same state. No visible changes or errors at the
> filesystem level. So, at this point I am not sure what to
> conclude...
>
>
> Thank you in advance for your responses
> Cheers
> Goncalo
>
>
> # #####################
> # ANNEX 1: GENERATE DATA LOSS #
> # #####################
>
> 1) Check a file
>
> # ls -l /cephfs/goncalo/5Gbytes_029.txt
> -rw-r--r-- 1 root root 5368709120 Sep 8 03:55 /cephfs/goncalo/5Gbytes_029.txt
>
> --- * ---
>
> 2) See its contents
>
> # od /cephfs/goncalo/5Gbytes_029.txt | head
> 0000000 150343 117016 156040 100553 154377 174521 137643 047440
> 0000020 006310 013157 064422 136662 145623 116101 137007 031237
> 0000040 111570 010104 103540 126335 014632 053445 006114 047003
> 0000060 123201 170045 042771 036561 152363 017716 000405 053556
> 0000100 102524 106517 066114 071112 144366 011405 074170 032621
> 0000120 047761 177217 103414 000774 174320 122332 110323 065706
> 0000140 042467 035356 132363 067446 145351 155277 177533 062050
> 0000160 016303 030741 066567 043517 172655 176016 017304 033342
> 0000200 177440 130510 163707 060513 055027 107702 023012 130435
> 0000220 022342 011762 035372 044033 152230 043424 004062 177461
>
> --- * ---
>
> 3) Get its inode, and convert it to hex
>
> # ls -li /cephfs/goncalo/5Gbytes_029.txt
> 1099511627812 -rw-r--r-- 1 root root 5368709120 Sep 8 03:55 /cephfs/goncalo/5Gbytes_029.txt
>
> (1099511627812)_base10 = (10000000024)_base16
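>
> (The conversion can be done directly in the shell:
>
> # printf '%x\n' 1099511627812
> 10000000024
> )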
>
> --- * ---
>
> 4) Get the osd pool details
>
> # ceph osd pool ls detail
> pool 1 'cephfs_dt' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 196 flags hashpspool crash_replay_interval 45 stripe_width 0
> pool 2 'cephfs_mt' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 182 flags hashpspool stripe_width 0
>
> --- * ---
>
> 5) Get the file / PG / OSD mapping
>
> # ceph osd map cephfs_dt 10000000024.00000000
> osdmap e479 pool 'cephfs_dt' (1) object '10000000024.00000000' -> pg 1.c18fbb6f (1.36f) -> up ([19,15,6], p19) acting ([19,15,6], p19)
> # ceph osd map cephfs_mt 10000000024.00000000
> osdmap e479 pool 'cephfs_mt' (2) object '10000000024.00000000' -> pg 2.c18fbb6f (2.36f) -> up ([27,23,13], p27) acting ([27,23,13], p27)
>
> --- * ---
>
> 6) Kill the relevant osd daemons, umount the osd partitions and
> delete the partitions
>
> [root@server1 ~]# for o in 6; do dev=`df /var/lib/ceph/osd/ceph-$o | tail -n 1 | awk '{print $1}'`; /etc/init.d/ceph stop osd.$o; umount /var/lib/ceph/osd/ceph-$o; parted -s ${dev::8} rm 1; parted -s ${dev::8} rm 2; partprobe; done
> [root@server2 ~]# for o in 13 15; do dev=`df /var/lib/ceph/osd/ceph-$o | tail -n 1 | awk '{print $1}'`; /etc/init.d/ceph stop osd.$o; umount /var/lib/ceph/osd/ceph-$o; parted -s ${dev::8} rm 1; parted -s ${dev::8} rm 2; partprobe; done
> [root@server3 ~]# for o in 19 23; do dev=`df /var/lib/ceph/osd/ceph-$o | tail -n 1 | awk '{print $1}'`; /etc/init.d/ceph stop osd.$o; umount /var/lib/ceph/osd/ceph-$o; parted -s ${dev::8} rm 1; parted -s ${dev::8} rm 2; partprobe; done
> [root@server4 ~]# for o in 27; do dev=`df /var/lib/ceph/osd/ceph-$o | tail -n 1 | awk '{print $1}'`; /etc/init.d/ceph stop osd.$o; umount /var/lib/ceph/osd/ceph-$o; parted -s ${dev::8} rm 1; parted -s ${dev::8} rm 2; partprobe; done
>
>
> # #######################
> # ANNEX 2: RECOVER CEPH CLUSTER #
> # #######################
>
> 1) Declare the OSDs lost
>
> # for o in 6 13 15 19 23 27; do ceph osd lost $o --yes-i-really-mean-it; done
> marked osd lost in epoch 480
> marked osd lost in epoch 482
> marked osd lost in epoch 487
> marked osd lost in epoch 483
> marked osd lost in epoch 489
> marked osd lost in epoch 485
>
> --- * ---
>
> 2) Remove the OSDs from the CRUSH map
>
> # for o in 6 13 15 19 23 27; do ceph osd crush remove osd.$o; ceph osd down $o; ceph osd rm $o; ceph auth del osd.$o; done
> removed item id 6 name 'osd.6' from crush map
> osd.6 is already down.
> removed osd.6
> updated
> removed item id 13 name 'osd.13' from crush map
> osd.13 is already down.
> removed osd.13
> updated
> removed item id 15 name 'osd.15' from crush map
> osd.15 is already down.
> removed osd.15
> updated
> removed item id 19 name 'osd.19' from crush map
> osd.19 is already down.
> removed osd.19
> updated
> removed item id 23 name 'osd.23' from crush map
> osd.23 is already down.
> removed osd.23
> updated
> removed item id 27 name 'osd.27' from crush map
> osd.27 is already down.
> removed osd.27
> updated
>
> --- * ---
>
> 3) Give the cluster time to react, and the recovery I/O time to finish.
>
> --- * ---
>
> 4) Check which PGs are still stale
>
> # ceph pg dump_stuck stale
> ok
> pg_stat  state                             up    up_primary  acting  acting_primary
> 1.23     stale+undersized+degraded+peered  [23]  23          [23]    23
> 2.38b    stale+undersized+degraded+peered  [23]  23          [23]    23
> (...)
>
> --- * ---
>
> 5) Try to query those stale PGs
>
> # for pg in `ceph pg dump_stuck stale | grep ^[12] | awk '{print $1}'`; do ceph pg $pg query; done
> ok
> Error ENOENT: i don't have pgid 1.23
> Error ENOENT: i don't have pgid 2.38b
> (...)
>
> --- * ---
>
> 6) Create the non-existing PGs
>
> # for pg in `ceph pg dump_stuck stale | grep ^[12] | awk '{print $1}'`; do ceph pg force_create_pg $pg; done
> ok
> pg 1.23 now creating, ok
> pg 2.38b now creating, ok
> (...)
>
> --- * ---
>
> 7) At this point, for the PGs to leave the 'creating' state, I had
> to restart all remaining OSDs. Otherwise those PGs stayed in the
> creating state forever.
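>
> (What I did here, roughly, was restart the surviving OSDs with the
> same init script used above and then watch until nothing was left
> in 'creating' -- something along these lines, with the osd ids
> being whatever is still up on each host; the grep check is just my
> assumption of a quick way to verify:
>
> [root@serverN ~]# for o in <surviving osd ids on this host>; do /etc/init.d/ceph restart osd.$o; done
> # ceph pg dump | grep creating
> )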
> > > > > -- > Goncalo Borges > Research Computing > ARC Centre of Excellence for Particle Physics at the Terascale > School of Physics A28 | University of Sydney, NSW 2006 > T:+61 2 93511937 <tel:%2B61%202%2093511937> > > > _______________________________________________ > ceph-users mailing list > ceph-users@xxxxxxxxxxxxxx <mailto:ceph-users@xxxxxxxxxxxxxx> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > > -- > Email: > - shinobu@xxxxxxxxx <mailto:shinobu@xxxxxxxxx> > Blog: > - Life with Distributed Computational System based on OpenSource > <http://i-shinobu.hatenablog.com/> -- Goncalo Borges Research Computing ARC Centre of Excellence for Particle Physics at the Terascale School of Physics A28 | University of Sydney, NSW 2006 T: +61 2 93511937 _______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com