Re: Cannot mount cephfs after some disaster recovery

On Tue, Mar 1, 2016 at 3:51 AM, 10000 <10000@xxxxxxxxxxxxx> wrote:
> Hi,
>     I meet a trouble on mount the cephfs after doing some disaster recovery
> introducing by official
> document(http://docs.ceph.com/docs/master/cephfs/disaster-recovery).
>
>     Now when I try to mount the cephfs, I get "mount error 5 = Input/output
> error".
>     When I run "ceph -s" on the cluster, it prints this:
>      cluster 15935dde-1d19-486e-9e1c-67414f9927f6
>      health HEALTH_OK
>      monmap e1: 4 mons at
> {HK-IDC1-10-1-72-151=172.17.17.151:6789/0,HK-IDC1-10-1-72-152=172.17.17.152:6789/0,HK-IDC1-10-1-72-153=172.17.17.153:6789/0,HK-IDC1-10-1-72-160=10.1.72.160:6789/0}
>             election epoch 528, quorum 0,1,2,3
> HK-IDC1-10-1-72-160,HK-IDC1-10-1-72-151,HK-IDC1-10-1-72-152,HK-IDC1-10-1-72-153
>      mdsmap e21038: 1/1/0 up {0=HK-IDC1-10-1-72-160=up:active}
>      osdmap e10536: 108 osds: 108 up, 108 in
>             flags sortbitwise
>       pgmap v424957: 6564 pgs, 3 pools, 3863 GB data, 67643 kobjects
>             8726 GB used, 181 TB / 189 TB avail
>                 6560 active+clean
>                    3 active+clean+scrubbing+deep
>                    1 active+clean+scrubbing
>
>      It seems there should be "1/1/1 up" in the mdsmap instead of "1/1/0
> up", and I really don't know what the last number means.

As Zheng has said, that last number is the "max_mds" setting.  It's a
little bit weird the way that "fs reset" leaves it at zero, but it
shouldn't be causing any problems (you still have one active MDS
daemon here).
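
If you want to tidy that up anyway, something along these lines should set
max_mds back to 1 (a hedged sketch: the exact syntax varies a little between
Ceph releases, and "cephfs" here is the filesystem name reported by your
"ceph fs ls" output):

    ceph fs set cephfs max_mds 1
    ceph fs get cephfs | grep max_mds    # confirm the new value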

>      And the filesystem does still show up when I run "ceph fs ls", which
> prints this:
>
>      name: cephfs, metadata pool: cephfs_metadata, data pools: [cephfs_data
> ]
>
>      I have tried my best to Google this problem but found nothing. I still
> want to know whether I can bring the cephfs back. Does anyone have ideas?
>
>      Oh, I did the disaster recovery because at first I was getting "mdsmap
> e21012: 0/1/1 up, 1 up:standby, 1 damaged". To bring the fs back to work, I
> ran the "JOURNAL TRUNCATION", "MDS TABLE WIPES" and "MDS MAP RESET" steps.
> However, I think the metadata for most files should still be stored on the
> OSDs (in the metadata pool, in RADOS). I just want to get those files back.

*Before* trying to run any disaster recovery tools, you must diagnose
what is actually wrong with the filesystem.  It is too late for that
now, but I will say it anyway as a reminder for anyone else reading
this mailing list.

1. Go look in your logs to see what caused the MDS to go damaged: the
cluster log should indicate when it happened, and then you can look at
your MDS daemon logs to see what was going on at that time (see the
example commands after this list).
2. Go look in your logs to see what is happening now when you try to
mount and get EIO.  The client logs, the MDS logs, or both should
contain some clues.
3. Hopefully you followed the first section on the page you linked and
made a backup of your journal.  Keep that backup somewhere safe for
the moment.
4. Keep a careful log of which commands you run from this point onwards.
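
As a rough sketch of steps 1-3 (the log paths and MDS name below are guesses
based on a default installation and the hostnames in your "ceph -s" output;
adjust them for your setup):

    # 1 and 2: find when the rank was marked damaged, then read the MDS log
    # around that time, and again while reproducing the failed mount
    grep -i damaged /var/log/ceph/ceph.log
    less /var/log/ceph/ceph-mds.HK-IDC1-10-1-72-160.log

    # 3: the journal backup described at the top of the disaster recovery page
    cephfs-journal-tool journal export backup.bin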

John
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


