>> c./ After recovering the cluster, I thought I was in a cephfs situation where
>> I had
>> c.1 files with holes (because of lost PGs and objects in the data pool)
>> c.2 files without metadata (because of lost PGs and objects in the
>> metadata pool)
>
> What does "files without metadata" mean? Do you mean their objects
> were in the data pool but they didn't appear in your filesystem mount?
>
>> c.3 metadata without associated files (because of lost PGs and objects
>> in the data pool)
>
> So you mean you had files with the expected size but zero data, right?
>
>> I've tried to run the recovery tools, but I have several doubts which I did
>> not find described in the documentation
>> - Is there a specific order / a way to run the tools for the c.1, c.2
>> and c.3 cases I mentioned?

I'm still trying to understand what you are trying to say in your original
message, but I have not been able to follow you yet.
Can you summarize it like this:

 1. What the current status is.
    e.g.: working, but not as expected.
 2. What you think (or guess) is wrong with your cluster.
    e.g.: broken metadata, broken data, or whatever you suspect now.
 3. What exactly you did, briefly (not bla bla bla...).
 4. What you really want to do (briefly)?

Otherwise there will be a bunch of back-and-forth messages.

Shinobu

----- Original Message -----
From: "John Spray" <jspray@xxxxxxxxxx>
To: "Goncalo Borges" <goncalo@xxxxxxxxxxxxxxxxxxx>
Cc: ceph-users@xxxxxxxxxxxxxx
Sent: Thursday, September 10, 2015 8:49:46 PM
Subject: Re: Question on cephfs recovery tools

On Wed, Sep 9, 2015 at 2:31 AM, Goncalo Borges
<goncalo@xxxxxxxxxxxxxxxxxxx> wrote:
> Dear Ceph / CephFS gurus...
>
> Bear with me while I give you a bit of context. Questions will appear
> at the end.
>
> 1) I am currently running ceph 9.0.3 and I have installed it to test the
> cephfs recovery tools.
>
> 2) I've created a situation where I've deliberately lost some
> data and metadata (check annex 1 after the main email).

You're only *maybe* losing metadata here, as your procedure is
targeting OSDs that contain data, and just hoping that those OSDs also
contain some metadata.

> 3) I've stopped the mds, and waited to check how the cluster reacts. After
> some time, as expected, the cluster reported an ERROR state, with a lot of PGs
> degraded and stuck
>
> # ceph -s
>     cluster 8465c6a6-5eb4-4cdf-8845-0de552d0a738
>      health HEALTH_ERR
>             174 pgs degraded
>             48 pgs stale
>             174 pgs stuck degraded
>             41 pgs stuck inactive
>             48 pgs stuck stale
>             238 pgs stuck unclean
>             174 pgs stuck undersized
>             174 pgs undersized
>             recovery 22366/463263 objects degraded (4.828%)
>             recovery 8190/463263 objects misplaced (1.768%)
>             too many PGs per OSD (388 > max 300)
>             mds rank 0 has failed
>             mds cluster is degraded
>      monmap e1: 3 mons at
> {mon1=X.X.X.X:6789/0,mon2=Y.Y.Y.Y:6789/0,mon3=Z.Z.Z.Z:6789/0}
>             election epoch 10, quorum 0,1,2 mon1,mon3,mon2
>      mdsmap e24: 0/1/1 up, 1 failed
>      osdmap e544: 21 osds: 15 up, 15 in; 87 remapped pgs
>       pgmap v25699: 2048 pgs, 2 pools, 602 GB data, 150 kobjects
>             1715 GB used, 40027 GB / 41743 GB avail
>             22366/463263 objects degraded (4.828%)
>             8190/463263 objects misplaced (1.768%)
>                 1799 active+clean
>                  110 active+undersized+degraded
>                   60 active+remapped
>                   37 stale+undersized+degraded+peered
>                   23 active+undersized+degraded+remapped
>                   11 stale+active+clean
>                    4 undersized+degraded+peered
>                    4 active
>
> 4) I've unmounted the cephfs clients ('umount -l' worked for me this time, but
> I have already had situations where 'umount' would simply hang, and the only
> viable solution was to reboot the client).
>
> 5) I've recovered the ceph cluster by (details on the recovery operations are
> in annex 2 after the main email):
>     - declaring the osds lost
>     - removing the osds from the crush map
>     - letting the cluster stabilize and letting all the recovery I/O finish
>     - identifying stuck PGs
>     - checking if they existed, and if not, recreating them.
>
> 6) I've restarted the MDS. Initially, the mds cluster was considered
> degraded, but after a short amount of time that message disappeared. The
> WARNING status was just because of "too many PGs per OSD (409 > max 300)"
>
> # ceph -s
>     cluster 8465c6a6-5eb4-4cdf-8845-0de552d0a738
>      health HEALTH_WARN
>             too many PGs per OSD (409 > max 300)
>             mds cluster is degraded
>      monmap e1: 3 mons at
> {mon1=X.X.X.X:6789/0,mon2=Y.Y.Y.Y:6789/0,mon3=Z.Z.Z.Z:6789/0}
>             election epoch 10, quorum 0,1,2 mon1,mon3,mon2
>      mdsmap e27: 1/1/1 up {0=rccephmds=up:reconnect}
>      osdmap e614: 15 osds: 15 up, 15 in
>       pgmap v27304: 2048 pgs, 2 pools, 586 GB data, 146 kobjects
>             1761 GB used, 39981 GB / 41743 GB avail
>                 2048 active+clean
>   client io 4151 kB/s rd, 1 op/s
>
> (wait some time)
>
> # ceph -s
>     cluster 8465c6a6-5eb4-4cdf-8845-0de552d0a738
>      health HEALTH_WARN
>             too many PGs per OSD (409 > max 300)
>      monmap e1: 3 mons at
> {mon1=X.X.X.X:6789/0,mon2=Y.Y.Y.Y:6789/0,mon3=Z.Z.Z.Z:6789/0}
>             election epoch 10, quorum 0,1,2 mon1,mon3,mon2
>      mdsmap e29: 1/1/1 up {0=rccephmds=up:active}
>      osdmap e614: 15 osds: 15 up, 15 in
>       pgmap v30442: 2048 pgs, 2 pools, 586 GB data, 146 kobjects
>             1761 GB used, 39981 GB / 41743 GB avail
>                 2048 active+clean
>
> 7) I was able to mount the cephfs filesystem in a client. When I tried to
> read a file made of some lost objects, I got holes in part of the file
> (compare with the same operation in annex 1)
>
> # od /cephfs/goncalo/5Gbytes_029.txt | head
> 0000000 000000 000000 000000 000000 000000 000000 000000 000000
> *
> 2000000 176665 053717 015710 124465 047254 102011 065275 123534
> 2000020 015727 131070 075673 176566 047511 154343 146334 006111
> 2000040 050506 102172 172362 121464 003532 005427 137554 137111
> 2000060 071444 052477 123364 127652 043562 144163 170405 026422
> 2000100 050316 117337 042573 171037 150704 071144 066344 116653
> 2000120 076041 041546 030235 055204 016253 136063 046012 066200
> 2000140 171626 123573 065351 032357 171326 132673 012213 016046
> 2000160 022034 160053 156107 141471 162551 124615 102247 125502
>

Yes, this is all expected behaviour at present. Missing objects are
how CephFS represents a sparse file, so these will now look like zero
regions in the file.

> Finally the questions:
>
> a./ Under a situation like the one described above, how can we safely
> terminate cephfs in the clients? I have had situations where umount simply
> hangs and there is no real way to unblock the situation unless I reboot the
> client. If we have hundreds of clients, I would like to avoid that.

In your procedure, the umount problems have nothing to do with
corruption. It's (sometimes) hanging because the MDS is offline. If
the client has dirty metadata, it may not be able to flush it until
the MDS is online -- there's no general way to "abort" this without
breaking userspace semantics. Similar case:
http://tracker.ceph.com/issues/9477

Rebooting the machine is actually correct, as it ensures that we can
kill the filesystem mount at the same time as any application
processes using it, and therefore not break the filesystem semantics
from the point of view of those applications.

All that said, from a practical point of view we probably do need some
slightly nicer abort hooks that allow admins to "break the rules" in
crazy situations.

> b./ I was expecting to have lost metadata information since I've cleaned OSDs
> where metadata information was stored for the
> /cephfs/goncalo/5Gbytes_029.txt file. I was a bit surprised that
> '/cephfs/goncalo/5Gbytes_029.txt' was still properly referenced, without me
> having to run any recovery tool. What am I missing?

I would guess that when you deleted 6/21 of your OSDs, you just
happened not to hit any metadata journal objects. The journal
replayed, the MDS came back online, and your metadata was back in
cache.

My automated tests for damage cases are much more clinical, and target
specific metadata objects:
https://github.com/ceph/ceph-qa-suite/blob/master/tasks/cephfs/test_data_scan.py

At some stage I expect we'll add tests that e.g. nuke a quarter of the
metadata PGs or something.

> c./ After recovering the cluster, I thought I was in a cephfs situation where
> I had
> c.1 files with holes (because of lost PGs and objects in the data pool)
> c.2 files without metadata (because of lost PGs and objects in the
> metadata pool)

What does "files without metadata" mean? Do you mean their objects
were in the data pool but they didn't appear in your filesystem mount?

> c.3 metadata without associated files (because of lost PGs and objects
> in the data pool)

So you mean you had files with the expected size but zero data, right?
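
For the c.1 case, one way to see exactly where the holes are is to probe a
file's backing objects directly with rados. The following is only a rough
sketch, assuming the default file layout (4 MB objects, data objects named
<inode-hex>.<stripe-index>) and the cephfs_dt data pool used in the annexes
below; "rados stat" exits non-zero for objects that no longer exist:

  #!/bin/bash
  # List which data objects of a CephFS file are missing from the data pool.
  # Assumes the default layout (4 MB objects) and a data pool named cephfs_dt.
  # Note: object ranges that were never written would also show as missing.
  file="$1"                      # e.g. /cephfs/goncalo/5Gbytes_029.txt
  pool="cephfs_dt"
  objsize=$((4 * 1024 * 1024))   # default CephFS object size

  inode=$(stat -c %i "$file")
  size=$(stat -c %s "$file")
  hexino=$(printf '%x' "$inode")
  nobj=$(( (size + objsize - 1) / objsize ))

  echo "inode $inode (0x$hexino), expecting $nobj objects"
  for ((i = 0; i < nobj; i++)); do
      obj=$(printf '%s.%08x' "$hexino" "$i")
      if ! rados -p "$pool" stat "$obj" >/dev/null 2>&1; then
          echo "missing: $obj (bytes $((i * objsize))-$(( (i + 1) * objsize - 1 )))"
      fi
  done

For the 5 GB file above that works out to 1280 objects, so expect the loop to
take a little while on large files.
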
> I've tried to run the recovery tools, but I have several doubts which I did
> not find described in the documentation
> - Is there a specific order / a way to run the tools for the c.1, c.2
> and c.3 cases I mentioned?

Right now your best reference might be the test code (linked above).

These tools are not finished yet, and I doubt we will write user
documentation until they're more complete (probably in Jewel). Even
then, the tools are designed to enable expert support intervention in
disasters, not to provide a general "wizard" for fixing filesystems
(yet) -- ideally we would always specifically identify what was broken
in a filesystem before starting to use the (potentially dangerous)
tools that modify metadata.

Sorry if that all sounds a bit scary, but when it comes to disaster
recovery it's better to be conservative than to promise too much.

> d./ Since I was testing, I simply ran the following sequence, but I am not
> sure what the commands are doing, nor if the sequence is correct. I think
> an example use case should be documented. Especially, cephfs-data-scan did
> not return any output or information, so I am not sure if anything
> happened at all.

cephfs-data-scan is a bit "unixy" at the moment in that it will return
nothing if there are no errors (although you can always do an "echo $?"
afterwards to check it returned zero). At some point this will get
more verbose and return a "dry run" report on any issues it finds,
before going ahead and attempting to fix them.

Also, the post-infernalis pgls code will enable progress reporting, so
there will be a progress indicator in cephfs-data-scan to indicate
where it is in the (long) process of scanning a large filesystem.

Right now you can pass "--debug-mds=10" or so to get more spew from it.

>
> # cephfs-table-tool 0 reset session
> {
>     "0": {
>         "data": {},
>         "result": 0
>     }
> }
>
> # cephfs-table-tool 0 reset snap
> {
>     "result": 0
> }
>
> # cephfs-table-tool 0 reset inode
> {
>     "0": {
>         "data": {},
>         "result": 0
>     }
> }
>
> # cephfs-journal-tool --rank=0 journal reset
> old journal was 4194304~22381701
> new journal start will be 29360128 (2784123 bytes past old end)
> writing journal head
> writing EResetJournal entry
> done
>
> # cephfs-data-scan init
>
> # cephfs-data-scan scan_extents cephfs_dt
> # cephfs-data-scan scan_inodes cephfs_dt
>
> # cephfs-data-scan scan_extents --force-pool cephfs_mt (doesn't seem to
> work)

I don't know what "doesn't seem to work" means -- can you be more
specific about the error?

> e./ After running the cephfs tools, everything seemed to be in exactly the
> same state. No visible changes or errors at the filesystem level. So, at
> this point I am not sure what to conclude...

It's pretty early days for these tools, and it's not clear that the
metadata was damaged in ways that the tools currently know how to fix.
You're probably not going to get too far without finding gaps that we
already know about[1], but please do report bugs for any cases that
cause the tools to crash or otherwise behave badly.

Cheers,
John

1. http://tracker.ceph.com/projects/cephfs/issues?utf8=%E2%9C%93&set_filter=1&f%5B%5D=status_id&op%5Bstatus_id%5D=o&f%5B%5D=category_id&op%5Bcategory_id%5D=%3D&v%5Bcategory_id%5D%5B%5D=80&f%5B%5D=&c%5B%5D=project&c%5B%5D=tracker&c%5B%5D=status&c%5B%5D=priority&c%5B%5D=subject&c%5B%5D=assigned_to&c%5B%5D=updated_on&c%5B%5D=category&c%5B%5D=fixed_version&c%5B%5D=cf_3&group_by=
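
One practical consequence of the "echo $?" point above: since the tools print
nothing on success, a tiny wrapper that surfaces each step's exit status (and
turns on the extra debug output mentioned above) makes it easier to tell
whether anything ran at all. This is only a sketch of the cephfs-data-scan
part of the sequence quoted above, assuming MDS rank 0 and a data pool named
cephfs_dt; it is not a recommended recovery recipe, and the (destructive)
cephfs-table-tool and cephfs-journal-tool resets are deliberately left out:

  #!/bin/bash
  # Sketch: run the cephfs-data-scan steps from the message above, report each
  # step's exit status, and stop at the first failure. Assumes the MDS is
  # stopped and the data pool is named cephfs_dt.
  pool="cephfs_dt"

  run() {
      echo "== running: $*"
      "$@" --debug-mds=10            # optional extra verbosity, as suggested above
      local rc=$?
      echo "== exit status: $rc"
      [ "$rc" -eq 0 ] || exit "$rc"  # do not continue past a failing step
  }

  run cephfs-data-scan init
  run cephfs-data-scan scan_extents "$pool"
  run cephfs-data-scan scan_inodes "$pool"
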
>
>
> Thank you in advance for your responses
> Cheers
> Goncalo
>
>
> # #####################
> # ANNEX 1: GENERATE DATA LOSS #
> # #####################
>
> 1) Check a file
> # ls -l /cephfs/goncalo/5Gbytes_029.txt
> -rw-r--r-- 1 root root 5368709120 Sep 8 03:55
> /cephfs/goncalo/5Gbytes_029.txt
>
> --- * ---
>
> 2) See its contents
> # od /cephfs/goncalo/5Gbytes_029.txt | head
> 0000000 150343 117016 156040 100553 154377 174521 137643 047440
> 0000020 006310 013157 064422 136662 145623 116101 137007 031237
> 0000040 111570 010104 103540 126335 014632 053445 006114 047003
> 0000060 123201 170045 042771 036561 152363 017716 000405 053556
> 0000100 102524 106517 066114 071112 144366 011405 074170 032621
> 0000120 047761 177217 103414 000774 174320 122332 110323 065706
> 0000140 042467 035356 132363 067446 145351 155277 177533 062050
> 0000160 016303 030741 066567 043517 172655 176016 017304 033342
> 0000200 177440 130510 163707 060513 055027 107702 023012 130435
> 0000220 022342 011762 035372 044033 152230 043424 004062 177461
>
> --- * ---
>
> 3) Get its inode, and convert it to HEX
> # ls -li /cephfs/goncalo/5Gbytes_029.txt
> 1099511627812 -rw-r--r-- 1 root root 5368709120 Sep 8 03:55
> /cephfs/goncalo/5Gbytes_029.txt
>
> (1099511627812)_base10 = (10000000024)_base16
>
> --- * ---
>
> 4) Get the osd pool details
> # ceph osd pool ls detail
> pool 1 'cephfs_dt' replicated size 3 min_size 2 crush_ruleset 0 object_hash
> rjenkins pg_num 1024 pgp_num 1024 last_change 196 flags hashpspool
> crash_replay_interval 45 stripe_width 0
> pool 2 'cephfs_mt' replicated size 3 min_size 2 crush_ruleset 0 object_hash
> rjenkins pg_num 1024 pgp_num 1024 last_change 182 flags hashpspool
> stripe_width 0
>
> --- * ---
>
> 5) Get the file / PG / OSD mapping
>
> # ceph osd map cephfs_dt 10000000024.00000000
> osdmap e479 pool 'cephfs_dt' (1) object '10000000024.00000000' -> pg
> 1.c18fbb6f (1.36f) -> up ([19,15,6], p19) acting ([19,15,6], p19)
> # ceph osd map cephfs_mt 10000000024.00000000
> osdmap e479 pool 'cephfs_mt' (2) object '10000000024.00000000' -> pg
> 2.c18fbb6f (2.36f) -> up ([27,23,13], p27) acting ([27,23,13], p27)
>
> --- * ---
>
> 6) Kill the relevant osd daemons, unmount the osd partitions and delete the
> partitions
>
> [root@server1 ~]# for o in 6; do dev=`df /var/lib/ceph/osd/ceph-$o | tail -n
> 1 | awk '{print $1}'`; /etc/init.d/ceph stop osd.$o; umount
> /var/lib/ceph/osd/ceph-$o; parted -s ${dev::8} rm 1; parted -s ${dev::8} rm
> 2; partprobe; done
> [root@server2 ~]# for o in 13 15; do dev=`df /var/lib/ceph/osd/ceph-$o |
> tail -n 1 | awk '{print $1}'`; /etc/init.d/ceph stop osd.$o; umount
> /var/lib/ceph/osd/ceph-$o; parted -s ${dev::8} rm 1; parted -s ${dev::8} rm
> 2; partprobe; done
> [root@server3 ~]# for o in 19 23; do dev=`df /var/lib/ceph/osd/ceph-$o |
> tail -n 1 | awk '{print $1}'`; /etc/init.d/ceph stop osd.$o; umount
> /var/lib/ceph/osd/ceph-$o; parted -s ${dev::8} rm 1; parted -s ${dev::8} rm
> 2; partprobe; done
> [root@server4 ~]# for o in 27; do dev=`df /var/lib/ceph/osd/ceph-$o | tail
> -n 1 | awk '{print $1}'`; /etc/init.d/ceph stop osd.$o; umount
> /var/lib/ceph/osd/ceph-$o; parted -s ${dev::8} rm 1; parted -s ${dev::8} rm
> 2; partprobe; done
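
As a side note, steps 3-5 of this annex can be strung together into a small
helper. A rough sketch, assuming the pool names used above and the default
naming scheme where the first data object of a file is <inode-hex>.00000000;
note that "ceph osd map" only computes where an object of that name would be
placed, it does not check that the object actually exists:

  #!/bin/bash
  # Sketch of annex 1, steps 3-5: file -> inode -> first object name -> PG/OSDs.
  # Assumes the pool names used above (cephfs_dt / cephfs_mt) and the default
  # object naming scheme (<inode-hex>.00000000 for the first data object).
  file="$1"                              # e.g. /cephfs/goncalo/5Gbytes_029.txt

  inode=$(stat -c %i "$file")
  obj=$(printf '%x.00000000' "$inode")   # 1099511627812 -> 10000000024.00000000

  echo "file:   $file"
  echo "inode:  $inode"
  echo "object: $obj"
  for pool in cephfs_dt cephfs_mt; do
      # only computes the placement; it does not verify the object exists
      ceph osd map "$pool" "$obj"
  done

That is the same mapping used in step 6 to decide which OSDs to stop.
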
parted -s ${dev::8} rm > 2; partprobe; done > > > # ####################### > # ANNEX 2: RECOVER CEPH CLUSTER # > # ####################### > > 1) Declare OSDS losts > > # for o in 6 13 15 19 23 27;do ceph osd lost $o --yes-i-really-mean-it; done > marked osd lost in epoch 480 > marked osd lost in epoch 482 > marked osd lost in epoch 487 > marked osd lost in epoch 483 > marked osd lost in epoch 489 > marked osd lost in epoch 485 > > --- * --- > > 2) Remove OSDs from CRUSH map > > # for o in 6 13 15 19 23 27;do ceph osd crush remove osd.$o; ceph osd down > $o; ceph osd rm $o; ceph auth del osd.$o; done > removed item id 6 name 'osd.6' from crush map > osd.6 is already down. > removed osd.6 > updated > removed item id 13 name 'osd.13' from crush map > osd.13 is already down. > removed osd.13 > updated > removed item id 15 name 'osd.15' from crush map > osd.15 is already down. > removed osd.15 > updated > removed item id 19 name 'osd.19' from crush map > osd.19 is already down. > removed osd.19 > updated > removed item id 23 name 'osd.23' from crush map > osd.23 is already down. > removed osd.23 > updated > removed item id 27 name 'osd.27' from crush map > osd.27 is already down. > removed osd.27 > updated > > --- * --- > > 3) Give time to the cluster react, and to the recover I/O to finish. > > --- * --- > > 4) Check which PGS are still stale > > # ceph pg dump_stuck stale > ok > pg_stat state up up_primary acting acting_primary > 1.23 stale+undersized+degraded+peered [23] 23 [23] 23 > 2.38b stale+undersized+degraded+peered [23] 23 [23] 23 > (...) > > --- * --- > > 5) Try to query those stale PGs > > # for pg in `ceph pg dump_stuck stale | grep ^[12] | awk '{print $1}'`; do > ceph pg $pg query; done > ok > Error ENOENT: i don't have pgid 1.23 > Error ENOENT: i don't have pgid 2.38b > (...) > > --- * --- > > 6) Create the non existing PGs > > # for pg in `ceph pg dump_stuck stale | grep ^[12] | awk '{print $1}'`; do > ceph pg force_create_pg $pg; done > ok > pg 1.23 now creating, ok > pg 2.38b now creating, ok > (...) > > --- * --- > > 7) At this point, for the PGs to leave the 'creating' status, I had to > restart all remaining OSDs. Otherwise those PGs were in the creating state > forever. > > > > > -- > Goncalo Borges > Research Computing > ARC Centre of Excellence for Particle Physics at the Terascale > School of Physics A28 | University of Sydney, NSW 2006 > T: +61 2 93511937 > > > _______________________________________________ > ceph-users mailing list > ceph-users@xxxxxxxxxxxxxx > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > _______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com _______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com