Re: Question on cephfs recovery tools

Goncalo Borges <goncalo@xxxxxxxxxxxxxxxxxxx> · Tue, 15 Sep 2015 13:34:45 +1000

    Hello John...

    Thank you for the replies. I do have some comments in line.

Bear with me while I give you a bit of context. Questions will appear
at the end.

1) I am currently running ceph 9.0.3, and I have installed it to test the
cephfs recovery tools.

2) I've created a situation where I've deliberately lost some
data and metadata (check annex 1 after the main email).

      You're only *maybe* losing metadata here, as your procedure is
targeting OSDs that contain data, and just hoping that those OSDs also
contain some metadata.

    My procedure was aiming to hit servers with both data and metadata.
    That is why I look up the PG / OSD mapping for the file's inode in
    both the data and metadata pools, and then destroy those OSDs.

      5) Get the file / PG / OSD mapping

# ceph osd map cephfs_dt 10000000024.00000000
osdmap e479 pool 'cephfs_dt' (1) object '10000000024.00000000' -> pg 1.c18fbb6f (1.36f) -> up ([19,15,6], p19) acting ([19,15,6], p19)

# ceph osd map cephfs_mt 10000000024.00000000
osdmap e479 pool 'cephfs_mt' (2) object '10000000024.00000000' -> pg 2.c18fbb6f (2.36f) -> up ([27,23,13], p27) acting ([27,23,13], p27)

    Please note that I've destroyed OSDs 6, 13, 15, 19, 23 and 27. If
    this procedure is not hitting the file metadata, then the problem
    may be that I do not understand how the metadata is stored and
    mapped to OSDs.
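
One possible explanation, offered as an assumption and not confirmed anywhere in this thread: CephFS keeps a file's inode inside the dentry of its parent directory, stored as an omap key on the directory's object in the metadata pool. If so, no object named 10000000024.00000000 needs to exist in cephfs_mt, and "ceph osd map" will still compute a placement for a name with no object behind it. A minimal sketch of looking up the directory object instead; the helper name dirfrag_object is hypothetical, and it assumes an unfragmented directory (fragment 00000000):

```shell
# Hypothetical helper: name of the metadata-pool object that would hold the
# dentries of a directory, given the directory's decimal inode number.
# Assumption: the directory is not fragmented, so the fragment id is 00000000.
dirfrag_object() {
    printf '%x.%08x\n' "$1" 0
}

# Possible use on a client / admin node (not run here):
#   dir_ino=$(stat -c %i /cephfs/goncalo)
#   obj=$(dirfrag_object "$dir_ino")
#   ceph osd map cephfs_mt "$obj"
#   rados -p cephfs_mt listomapkeys "$obj"   # keys such as 5Gbytes_029.txt_head would be expected
```

If this holds, the OSDs worth targeting for a metadata-loss test are the ones reported for the directory object, not for the file's own inode.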

        Finally the questions:

a./ In a situation like the one described above, how can we safely terminate
cephfs on the clients? I have had situations where umount simply hangs and
there is no real way to unblock the situation unless I reboot the client. If
we have hundreds of clients, I would like to avoid that.

      In your procedure, the umount problems have nothing to do with
corruption.  It's (sometimes) hanging because the MDS is offline.  If
the client has dirty metadata, it may not be able to flush it until
the MDS is online -- there's no general way to "abort" this without
breaking userspace semantics.  Similar case:
http://tracker.ceph.com/issues/9477

Rebooting the machine is actually correct, as it ensures that we can
kill the filesystem mount at the same time as any application
processes using it, and therefore not break the filesystem semantics
from the point of view of those applications.

All that said, from a practical point of view we probably do need some
slightly nicer abort hooks that allow admins to "break the rules" in
crazy situations.

    From my experience, I do think that this will eventually be needed
    by any admin at some point. 

        b./ I was expecting to have lost metadata information, since I wiped the OSDs
where the metadata for the /cephfs/goncalo/5Gbytes_029.txt file was stored.
I was a bit surprised that '/cephfs/goncalo/5Gbytes_029.txt' was still
properly referenced, without me having to run any recovery tool. What am I
missing?

      I would guess that when you deleted 6/21 of your OSDs, you just
happened not to hit any metadata journal objects.  The journal
replayed, the MDS came back online, and your metadata was back in
cache.

    I do understand your explanation about the journal. So, on the
    assumption that no objects related to the journal were destroyed,
    the system was able to replay the operations logged in the journal
    and reconstruct the lost metadata. Can we actually tune the size of
    the journal?

    Just as a side comment, the journal seems corrupted anyway:

    # cephfs-journal-tool journal inspect
      2015-09-14 17:20:15.708622 7f7e2d2ec8c0 -1 Bad entry start ptr (0x1c00000) at 0x196dae0
      Overall journal integrity: DAMAGED
      Corrupt regions:
        0x196dae0-ffffffffffffffff

    # cephfs-journal-tool event get summary
      2015-09-14 17:22:54.235848 7f8cf306b8c0 -1 Bad entry start ptr (0x1c00000) at 0x196dae0
      Events by type:
        OPEN: 46
        SESSION: 13
        SUBTREEMAP: 16
        UPDATE: 13157
      Errors: 0

    Nevertheless, at this point I just decided to reset the journal:

    # cephfs-journal-tool journal reset
      old journal was 4194304~22470246
      new journal start will be 29360128 (2695578 bytes past old end)
      writing journal head
      writing EResetJournal entry
      done

    # cephfs-journal-tool journal inspect
      Overall journal integrity: OK
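
The numbers printed by the reset can be checked by hand. The sketch below reads "4194304~22470246" as start~length and assumes the new start is the old end rounded up to a 4194304-byte boundary. That rounding rule is only an inference that happens to fit both resets shown in this message, and the function names are made up for illustration:

```shell
# Parse a journal extent printed as "<start>~<length>" and print where it ends.
journal_end() {
    start=${1%%\~*}      # text before the '~'
    length=${1##*\~}     # text after the '~'
    echo $(( start + length ))
}

# Round an offset up to the next multiple of the object size.
# Assumption: 4194304 bytes, chosen because it matches the tool's output.
next_boundary() {
    obj=${2:-4194304}
    echo $(( ( ($1 + obj - 1) / obj ) * obj ))
}

old_end=$(journal_end "4194304~22470246")
new_start=$(next_boundary "$old_end")
echo "old end: $old_end, new start: $new_start, gap: $(( new_start - old_end ))"
```

This prints "old end: 26664550, new start: 29360128, gap: 2695578", which agrees with the "2695578 bytes past old end" reported by the tool.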

        I've tried to run the recovery tools, but I have several doubts which I did
not find addressed in the documentation:
    - Is there a specific order / a way to run the tools for the c.1, c.2
and c.3 cases I mentioned?

      Right now your best reference might be the test code (linked above).
These tools are not finished yet, and I doubt we will write user
documentation until they're more complete (probably in Jewel).  Even
then, the tools are designed to enable expert support intervention in
disasters, not to provide a general "wizard" for fixing filesystems
(yet) -- ideally we would always specifically identify what was broken
in a filesystem before starting to use the (potentially dangerous)
tools that modify metadata.

Sorry if that all sounds a bit scary, but when it comes to disaster
recovery it's better to be conservative than to promise too much.

    I understand that. But you want us to test these tools, right?! So
    we should understand how we can do it.

    I would not mind having a set of good guidelines sent via this
    mailing list. Since this email exchange is already long, I will
    start a new thread going exactly to that point.

        d./ Since I was testing, I simply ran the following sequence, but I am not
sure what the commands are doing, nor whether the sequence is correct. I think
an example use case should be documented. In particular, cephfs-data-scan did
not return any output or information, so I am not sure if anything
happened at all.

      cephfs-data-scan is a bit "unixy" at the moment in that it will return
nothing if there are no errors (although you can always do an "echo $?"
afterwards to check it returned zero).

At some point this will get more verbose and return a "dry run" report
on any issues it finds, before going ahead and attempting to fix
them.  Also, the post-infernalis pgls code will enable progress
reporting, so there will be a progress indicator in cephfs-data-scan
to indicate where it is in the (long) process of scanning a large
filesystem.  Right now you can pass "--debug-mds=10" or so to get more
spew from it.
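
Since the tool stays silent on success, the exit-status check can be wrapped once and reused for each step. A minimal sketch; run_step is a made-up helper name, and the commented usage only repeats commands already quoted in this thread:

```shell
# Run a command, then report its exit status, since a silent tool
# gives no other hint about whether it succeeded.
run_step() {
    "$@"
    rc=$?
    if [ "$rc" -eq 0 ]; then
        echo "OK: $*"
    else
        echo "FAILED (exit $rc): $*" >&2
    fi
    return "$rc"
}

# Intended use (not run here):
#   run_step cephfs-data-scan scan_extents cephfs_dt &&
#   run_step cephfs-data-scan scan_inodes cephfs_dt
# For more output, "--debug-mds=10" can be added as mentioned above;
# where exactly it goes on the command line is an assumption:
#   run_step cephfs-data-scan scan_inodes cephfs_dt --debug-mds=10
```

Chaining the steps with && stops the sequence at the first failing command, which seems the conservative choice for tools that modify metadata.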

    That is actually already something good to know.

        # cephfs-table-tool 0 reset session
{
    "0": {
        "data": {},
        "result": 0
    }
}

# cephfs-table-tool 0 reset snap
{
    "result": 0
}

# cephfs-table-tool 0 reset inode
{
    "0": {
        "data": {},
        "result": 0
    }
}

# cephfs-journal-tool --rank=0 journal reset
old journal was 4194304~22381701
new journal start will be 29360128 (2784123 bytes past old end)
writing journal head
writing EResetJournal entry
done

# cephfs-data-scan init

# cephfs-data-scan scan_extents cephfs_dt
# cephfs-data-scan scan_inodes cephfs_dt

# cephfs-data-scan scan_extents --force-pool cephfs_mt (doesn't seem to work)

      I don't know what "doesn't seem to work" means -- can you be more
specific about the error?

    Sorry for not being specific. I think I was doing a pointless
    operation: running the 'cephfs-data-scan scan_extents' command on
    the metadata pool, which I now understand does not make sense.

        e./ After running the cephfs tools, everything seemed to be in exactly the same
state. No visible changes or errors at the filesystem level. So, at this
point I am not sure what to conclude...

      It's pretty early days for these tools, and it's not clear that the
metadata was damaged in ways that the tools currently know how to fix.
You're probably not going to get too far without finding gaps that we
already know about[1], but please do report bugs for any cases that
cause the tools to crash or otherwise behave badly.

    Thank you for the pedagogic approach.

    Cheers

    Goncalo

      Cheers,
John

1. http://tracker.ceph.com/projects/cephfs/issues?utf8=%E2%9C%93&set_filter=1&f%5B%5D=status_id&op%5Bstatus_id%5D=o&f%5B%5D=category_id&op%5Bcategory_id%5D=%3D&v%5Bcategory_id%5D%5B%5D=80&f%5B%5D=&c%5B%5D=project&c%5B%5D=tracker&c%5B%5D=status&c%5B%5D=priority&c%5B%5D=subject&c%5B%5D=assigned_to&c%5B%5D=updated_on&c%5B%5D=category&c%5B%5D=fixed_version&c%5B%5D=cf_3&group_by=

Thank you in advance for your responses.
Cheers
Goncalo

# #####################
# ANNEX 1: GENERATE DATA LOSS #
# #####################

1) Check a file
# ls -l /cephfs/goncalo/5Gbytes_029.txt
-rw-r--r-- 1 root root 5368709120 Sep  8 03:55 /cephfs/goncalo/5Gbytes_029.txt

--- * ---

2) See its contents
# od /cephfs/goncalo/5Gbytes_029.txt |  head
0000000 150343 117016 156040 100553 154377 174521 137643 047440
0000020 006310 013157 064422 136662 145623 116101 137007 031237
0000040 111570 010104 103540 126335 014632 053445 006114 047003
0000060 123201 170045 042771 036561 152363 017716 000405 053556
0000100 102524 106517 066114 071112 144366 011405 074170 032621
0000120 047761 177217 103414 000774 174320 122332 110323 065706
0000140 042467 035356 132363 067446 145351 155277 177533 062050
0000160 016303 030741 066567 043517 172655 176016 017304 033342
0000200 177440 130510 163707 060513 055027 107702 023012 130435
0000220 022342 011762 035372 044033 152230 043424 004062 177461

--- * ---

3) Get its inode, and convert it to HEX
# ls -li /cephfs/goncalo/5Gbytes_029.txt
1099511627812 -rw-r--r-- 1 root root 5368709120 Sep  8 03:55 /cephfs/goncalo/5Gbytes_029.txt

(1099511627812)_base10 = (10000000024)_base16
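
The conversion can be done directly in the shell, and the same arithmetic shows how many data objects the file spans. A rough sketch: the helper names are invented, and the 4194304-byte object size is an assumed default layout, not something read from this cluster:

```shell
# Name of a file's data object: <inode in hex>.<object index as 8 hex digits>.
data_object() {
    printf '%x.%08x\n' "$1" "${2:-0}"
}

# Number of objects needed for a file of the given size in bytes.
# Assumption: 4194304-byte objects with no striping across objects.
object_count() {
    obj=${2:-4194304}
    echo $(( ($1 + obj - 1) / obj ))
}

data_object 1099511627812 0      # first object of the file
object_count 5368709120          # objects for the 5 GiB file
data_object 1099511627812 1279   # last object under the same assumption
```

Under that assumption the file is spread over 1280 objects, so mapping only 10000000024.00000000 identifies the OSDs holding the first 4 MiB of the file.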

--- * ---

4) Get the osd pool details
# ceph osd pool ls detail
pool 1 'cephfs_dt' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 196 flags hashpspool crash_replay_interval 45 stripe_width 0
pool 2 'cephfs_mt' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 182 flags hashpspool stripe_width 0

--- * ---

5) Get the file / PG / OSD mapping

# ceph osd map cephfs_dt 10000000024.00000000
osdmap e479 pool 'cephfs_dt' (1) object '10000000024.00000000' -> pg 1.c18fbb6f (1.36f) -> up ([19,15,6], p19) acting ([19,15,6], p19)
# ceph osd map cephfs_mt 10000000024.00000000
osdmap e479 pool 'cephfs_mt' (2) object '10000000024.00000000' -> pg 2.c18fbb6f (2.36f) -> up ([27,23,13], p27) acting ([27,23,13], p27)
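
To avoid copying OSD ids by hand, the acting set can be pulled out of that output. A small sketch that relies only on the line format shown above; acting_osds is a made-up helper name:

```shell
# Read `ceph osd map` output on stdin and print the acting OSD ids,
# space-separated, one output line per matching input line.
acting_osds() {
    sed -n 's/.*acting (\[\([0-9,]*\)].*/\1/p' | tr ',' ' '
}

# Sample line copied from the output above.
line="osdmap e479 pool 'cephfs_dt' (1) object '10000000024.00000000' -> pg 1.c18fbb6f (1.36f) -> up ([19,15,6], p19) acting ([19,15,6], p19)"
echo "$line" | acting_osds

# Against a live cluster (not run here):
#   ceph osd map cephfs_mt 10000000024.00000000 | acting_osds
```

For the sample line this prints "19 15 6".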

--- * ---

6) Kill the relevant OSD daemons, unmount the OSD partitions and delete them

[root@server1 ~]# for o in 6; do
    # device backing the OSD mount point, e.g. /dev/sdX1
    dev=$(df /var/lib/ceph/osd/ceph-$o | tail -n 1 | awk '{print $1}')
    /etc/init.d/ceph stop osd.$o
    umount /var/lib/ceph/osd/ceph-$o
    # ${dev::8} strips the partition number; remove partitions 1 and 2
    parted -s ${dev::8} rm 1; parted -s ${dev::8} rm 2
    partprobe
done
[root@server2 ~]# for o in 13 15; do
    dev=$(df /var/lib/ceph/osd/ceph-$o | tail -n 1 | awk '{print $1}')
    /etc/init.d/ceph stop osd.$o
    umount /var/lib/ceph/osd/ceph-$o
    parted -s ${dev::8} rm 1; parted -s ${dev::8} rm 2
    partprobe
done
[root@server3 ~]# for o in 19 23; do
    dev=$(df /var/lib/ceph/osd/ceph-$o | tail -n 1 | awk '{print $1}')
    /etc/init.d/ceph stop osd.$o
    umount /var/lib/ceph/osd/ceph-$o
    parted -s ${dev::8} rm 1; parted -s ${dev::8} rm 2
    partprobe
done
[root@server4 ~]# for o in 27; do
    dev=$(df /var/lib/ceph/osd/ceph-$o | tail -n 1 | awk '{print $1}')
    /etc/init.d/ceph stop osd.$o
    umount /var/lib/ceph/osd/ceph-$o
    parted -s ${dev::8} rm 1; parted -s ${dev::8} rm 2
    partprobe
done

# #######################
# ANNEX 2: RECOVER CEPH CLUSTER #
# #######################

1) Declare the OSDs lost

# for o in 6 13 15 19 23 27;do ceph osd lost $o --yes-i-really-mean-it; done
marked osd lost in epoch 480
marked osd lost in epoch 482
marked osd lost in epoch 487
marked osd lost in epoch 483
marked osd lost in epoch 489
marked osd lost in epoch 485

--- * ---

2) Remove OSDs from CRUSH map

# for o in 6 13 15 19 23 27; do ceph osd crush remove osd.$o; ceph osd down $o; ceph osd rm $o; ceph auth del osd.$o; done
removed item id 6 name 'osd.6' from crush map
osd.6 is already down.
removed osd.6
updated
removed item id 13 name 'osd.13' from crush map
osd.13 is already down.
removed osd.13
updated
removed item id 15 name 'osd.15' from crush map
osd.15 is already down.
removed osd.15
updated
removed item id 19 name 'osd.19' from crush map
osd.19 is already down.
removed osd.19
updated
removed item id 23 name 'osd.23' from crush map
osd.23 is already down.
removed osd.23
updated
removed item id 27 name 'osd.27' from crush map
osd.27 is already down.
removed osd.27
updated

--- * ---

3) Give the cluster time to react, and let the recovery I/O finish.

--- * ---

4) Check which PGs are still stale

# ceph pg dump_stuck stale
ok
pg_stat    state    up    up_primary    acting    acting_primary
1.23    stale+undersized+degraded+peered    [23]    23    [23]    23
2.38b    stale+undersized+degraded+peered    [23]    23    [23]    23
(...)
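
The loops in the following steps pick PG ids out of this output with a grep tied to pool ids 1 and 2. A slightly more general filter, sketched here on the assumption that a PG id always looks like <pool number>.<hex digits>; stale_pgids is a made-up name:

```shell
# Print only the PG ids (<pool>.<hex seed>) found in the first column of
# `ceph pg dump_stuck` output, skipping the "ok" line and the column header.
stale_pgids() {
    awk '$1 ~ /^[0-9]+\.[0-9a-f]+$/ { print $1 }'
}

# Against a live cluster (not run here):
#   for pg in $(ceph pg dump_stuck stale | stale_pgids); do ceph pg $pg query; done
```

Matching on the shape of the id keeps the filter working when the affected pools are numbered differently.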

--- * ---

5) Try to query those stale PGs

# for pg in `ceph pg dump_stuck stale | grep '^[12]' | awk '{print $1}'`; do ceph pg $pg query; done
ok
Error ENOENT: i don't have pgid 1.23
Error ENOENT: i don't have pgid 2.38b
(...)

--- * ---

6) Create the non-existent PGs

# for pg in `ceph pg dump_stuck stale | grep '^[12]' | awk '{print $1}'`; do ceph pg force_create_pg $pg; done
ok
pg 1.23 now creating, ok
pg 2.38b now creating, ok
(...)

--- * ---

7) At this point, for the PGs to leave the 'creating' state, I had to
restart all remaining OSDs. Otherwise those PGs stayed in the creating state
forever.

--
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW  2006
T: +61 2 93511937

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com