Re: Upgrade from 12.2.1 to 12.2.2 broke my CephFS

On 12/11/2017 04:05 PM, Yan, Zheng wrote:
On Mon, Dec 11, 2017 at 10:13 PM, Tobias Prousa <tobias.prousa@xxxxxxxxx> wrote:
Hi there,

I'm running a Ceph cluster for some libvirt VMs and a CephFS providing /home
to ~20 desktop machines. There are 4 hosts running 4 MONs, 4 MGRs, 3 MDSs (1
active, 2 standby) and 28 OSDs in total. This cluster has been up and running
since the days of Bobtail (yes, including CephFS).

Now, with the update from 12.2.1 to 12.2.2 last Friday afternoon, I restarted
the MONs, MGRs and OSDs as usual. RBD is running just fine. But after trying to
restart the MDSs, they tried replaying the journal, then fell back to standby, and
the FS was in state "damaged". I finally got them working again after I did a good
portion of what's described here:

http://docs.ceph.com/docs/master/cephfs/disaster-recovery/
What commands did you run? You need to run the following commands:

cephfs-journal-tool event recover_dentries summary
cephfs-journal-tool journal reset
cephfs-table-tool all reset session
These are essentially the first commands I executed, in exactly this order. Additionally, I did a:

ceph fs reset cephfs --yes-i-really-mean-it

That was the moment when I was able to restart the MDSs for the first time, back on Friday, IIRC.


Now, when all clients are shut down, I can start an MDS; it will replay and become
active. I can then mount CephFS on a client and access my files and
folders. But as I bring up more clients, the MDS will first report damaged
metadata (probably due to some damaged paths, which I could live with) and
then fail with an assert:

/build/ceph-12.2.2/src/mds/MDCache.cc: 258: FAILED
assert(inode_map.count(in->vino()) == 0)

I tried doing an online CephFS scrub like this:

ceph daemon mds.a scrub_path / recursive repair

This runs for a couple of hours, always finding exactly 10001 damages of
type "backtrace" and reporting that it is fixing loads of erroneously
free-marked inodes, until the MDS crashes. When I rerun the scrub after having
killed all clients and restarted the MDSs, the same thing repeats: it finds exactly
those 10001 damages and begins fixing exactly the same free-marked
inodes all over again.
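For what it's worth, if the MDS damage table commands are available on 12.2.2, listing the recorded damage entries should at least show the inode numbers involved. A sketch, assuming the mds.a name from above and that the tell interface accepts this form (I'm not certain of the exact syntax on this release):

# list the damage entries the MDS has recorded so far
ceph tell mds.a damage ls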
Find the max inode number among these free-marked inodes, then use
cephfs-table-tool to remove the inode numbers that are smaller than that
max. You can remove a little more, just in case. Before doing
this, you should stop the MDS and run "cephfs-table-tool all reset
session".

If everything goes right, the MDS will no longer trigger the assertion.
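A rough sketch of how that might map onto cephfs-table-tool, assuming the take_inos subcommand exists in this release (a guess on my part; check cephfs-table-tool --help) and that <max_ino> stands for the highest free-marked inode number, plus a little margin:

# with all MDSs stopped
cephfs-table-tool all reset session
# inspect the inode table to pick <max_ino> (assumes the 'show' subcommand)
cephfs-table-tool all show inode
# mark all inode numbers up to <max_ino> as used, i.e. remove them from the free list
cephfs-table-tool all take_inos <max_ino>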
Any hint on how to find the max inode number? And do I understand correctly that I should remove every free-marked inode number except the biggest one, which has to stay?

And how do I remove those inodes using cephfs-table-tool?


Btw., the CephFS has about 3 million objects in the metadata pool. The data pool has
about 30 million objects, ~2.5 TB * 3 replicas.

What I tried next was keeping the MDSs down and running:

cephfs-data-scan scan_extents <data pool>
cephfs-data-scan scan_inodes <data pool>
cephfs-data-scan scan_links

As this is described as taking "a very long time", it is what I initially
skipped from the disaster-recovery tips. Right now I'm still on the first step, with 6
workers on a single host busy doing cephfs-data-scan scan_extents. ceph -s
shows me client io of 20 kB/s (!!!). If that's the real scan speed, this is going
to take ages.
Any way to tell how long this is going to take? Could I speed things up by
running more workers on multiple hosts simultaneously (see the sketch below)?
Should I abort it, as I don't actually have the problem of lost files? Maybe
running cephfs-data-scan scan_links would better suit my issue, or do
scan_extents/scan_inodes HAVE to be run and finished first?
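If I read the disaster-recovery page correctly, scan_extents/scan_inodes can be split across several workers with the --worker_n/--worker_m options, with all scan_extents workers finishing before scan_inodes starts. Roughly like this for, say, 8 workers spread over multiple hosts (assuming those options are present in 12.2.2):

# worker 0 of 8 (host A)
cephfs-data-scan scan_extents --worker_n 0 --worker_m 8 <data pool>
# worker 1 of 8 (host B)
cephfs-data-scan scan_extents --worker_n 1 --worker_m 8 <data pool>
# ... and so on, up to --worker_n 7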

I have to get this cluster up and running again as soon as possible. Any
help is highly appreciated. If there is anything I can help with, e.g. further
information, feel free to ask. I'll try to hang around on #ceph (nick
topro/topro_/topro__). FYI, I'm in the Central European Time Zone (UTC+1).

Thank you so much!

Best regards,
Tobi

-- 
-----------------------------------------------------------
Dipl.-Inf. (FH) Tobias Prousa
Head of Development, Data Loggers

CAETEC GmbH
Industriestr. 1
D-82140 Olching
www.caetec.de

Limited liability company (GmbH)
Registered office: Olching
Commercial register: Amtsgericht München, HRB 183929
Managing directors: Stephan Bacher, Andreas Wocke

Tel.: +49 (0)8142 / 50 13 60
Fax.: +49 (0)8142 / 50 13 69

eMail: tobias.prousa@xxxxxxxxx
Web:   http://www.caetec.de
------------------------------------------------------------
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
