Re: Re: Fwd: how to fix the mds damaged issue

Shinobu Kinjo <shinobu.kj@xxxxxxxxx> · Tue, 5 Jul 2016 07:21:50 +0900

Reproduce with 'debug mds = 20' and 'debug ms = 20'.
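
For reference, a minimal sketch of how these two settings might look in ceph.conf on the MDS node. Putting them under [mds] and using /etc/ceph/ceph.conf are assumptions about this setup, not something stated in the thread. The MDS needs a restart to pick them up from startup, and level 20 is very verbose, so it is worth reverting once the failure has been captured.

```ini
# /etc/ceph/ceph.conf on the MDS node (path and section are assumptions)
[mds]
debug mds = 20
debug ms = 20
```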

 shinobu

On Mon, Jul 4, 2016 at 9:42 PM, Lihang <li.hang@xxxxxxx> wrote:
Thank you very much for your advice. The command "ceph mds repaired 0" worked fine in my cluster: the cluster state returned to HEALTH_OK and CephFS is back to normal as well. However, the monitor and MDS log files only record the replay and recovery process and do not point to anything abnormal, and I do not have the logs from the time the issue occurred, so I have not found the root cause. I will try to reproduce the issue. Thank you very much again!

fisher

-----Original Message-----
From: John Spray [mailto:jspray@xxxxxxxxxx]
Sent: July 4, 2016 17:49
To: lihang 12398 (RD)
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: Re: Fwd: how to fix the mds damaged issue

On Sun, Jul 3, 2016 at 8:06 AM, Lihang <li.hang@xxxxxxx> wrote:

> root@BoreNode2:~# ceph -v
> ceph version 10.2.0
>
> From: lihang 12398 (RD)
> Sent: July 3, 2016 14:47
> To: ceph-users@xxxxxxxxxxxxxx
> Cc: Ceph Development; 'ukernel@xxxxxxxxx'; zhengbin 08747 (RD); xusangdi 11976 (RD)
> Subject: how to fix the mds damaged issue
>

> Hi, the MDS in my Ceph cluster is damaged and the cluster is degraded
> after our machine room suddenly lost power. Since then the cluster has
> been in "HEALTH_ERR" and cannot recover on its own, even after I
> rebooted the storage nodes and restarted the Ceph cluster. I also used
> the following commands to try to remove the damaged MDS, but the removal
> failed and the issue still exists. The other two MDS daemons are in
> standby state. Can anyone tell me how to fix this issue and find out
> what happened in my cluster?

>

> The process I used to remove the damaged MDS on my storage node was as follows:
>
> 1>  Run the "stop ceph-mds-all" command on the damaged MDS node
> 2>  ceph mds rmfailed 0 --yes-i-really-mean-it

rmfailed is not something you want to use in these circumstances.

> 3>  root@BoreNode2:~# ceph mds rm 0
> mds gid 0 dne
>

> The detailed status of my cluster is as follows:
>
> root@BoreNode2:~# ceph -s
>   cluster 98edd275-5df7-414f-a202-c3d4570f251c
>      health HEALTH_ERR
>             mds rank 0 is damaged
>             mds cluster is degraded
>      monmap e1: 3 mons at {BoreNode2=172.16.65.141:6789/0,BoreNode3=172.16.65.142:6789/0,BoreNode4=172.16.65.143:6789/0}
>             election epoch 1010, quorum 0,1,2 BoreNode2,BoreNode3,BoreNode4
>       fsmap e168: 0/1/1 up, 3 up:standby, 1 damaged
>      osdmap e338: 8 osds: 8 up, 8 in
>             flags sortbitwise
>       pgmap v17073: 1560 pgs, 5 pools, 218 kB data, 32 objects
>             423 MB used, 3018 GB / 3018 GB avail
>                 1560 active+clean

When an MDS rank is marked as damaged, that means something invalid was found when reading from the pool storing metadata objects.  The next step is to find out what that was.  Look in the MDS log and in ceph.log from the time when it went damaged, to find the most specific error message you can.
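
One way to hunt for that message, as a minimal sketch. It assumes the default log directory /var/log/ceph; the function name and the keyword list are illustrative choices, not anything Ceph defines.

```shell
#!/bin/sh
# Hypothetical helper: print lines that look like errors from the given logs,
# prefixed with file name and line number.
# The keyword list is a guess at useful terms, not an official Ceph list.
find_mds_errors() {
    grep -HinE 'damaged|corrupt|error|fail' "$@"
}

# Example usage on an MDS or monitor node (default paths are assumed):
#   find_mds_errors /var/log/ceph/ceph-mds.*.log /var/log/ceph/ceph.log
```

Narrowing the output to the time window around the power loss should make the most specific message easier to spot.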

If you do not have the logs and want to have the MDS try operating again (to reproduce whatever condition caused it to be marked damaged), you can enable it by using "ceph mds repaired 0", then start the daemon and see how it is failing.
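
The sequence above could be sketched roughly as follows. The function name and the CEPH override variable are hypothetical conveniences for a dry run; only "ceph mds repaired", "ceph mds stat" and "ceph health detail" are real commands here.

```shell
#!/bin/sh
# CEPH can be overridden (for a dry run, say); it defaults to the real CLI.
CEPH="${CEPH:-ceph}"

# Hypothetical helper: clear the damaged flag on a rank, then show status.
retry_rank() {
    rank="$1"
    # Clear the damaged flag so an MDS daemon may take the rank again.
    $CEPH mds repaired "$rank" || return 1
    # Show the MDS map summary and the detailed cluster health.
    $CEPH mds stat
    $CEPH health detail
}

# Example: retry_rank 0
```

If the daemons were stopped with "stop ceph-mds-all" as in the original report, the matching "start ceph-mds-all" would presumably bring them back on that system. Watching the MDS log while the rank replays should then show how it fails, if it does.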

John

> root@BoreNode2:~# ceph mds dump
> dumped fsmap epoch 168
> fs_name TudouFS
> epoch   156
> flags   0
> created 2016-04-02 02:48:11.150539
> modified        2016-04-03 03:04:57.347064
> tableserver     0
> root    0
> session_timeout 60
> session_autoclose       300
> max_file_size   1099511627776
> last_failure    0
> last_failure_osd_epoch  83
> compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=file layout v2}
> max_mds 1
> in      0
> up      {}
> failed
> damaged 0
> stopped
> data_pools      4
> metadata_pool   3
> inline_data     disabled
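
The "damaged" line in a dump like the one above can be pulled out mechanically. A small sketch, assuming the plain-text layout shown in this thread for 10.2.0 (other releases may format the dump differently); the function name is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read "ceph mds dump" text on stdin and print each
# rank listed on the "damaged" line, one per line (nothing if it is empty).
damaged_ranks() {
    awk '$1 == "damaged" { for (i = 2; i <= NF; i++) print $i }'
}

# Example: ceph mds dump | damaged_ranks
```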

>


> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 
Email:
shinobu@xxxxxxxxx
shinobu@xxxxxxxxxx
