I don't think the log messages you're showing are the actual cause of the failure. The log file should have a proper stack trace (with specific function references and probably a listed assert failure); can you find that?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Tue, Aug 26, 2014 at 9:11 AM, MinhTien MinhTien <tientienminh080590 at gmail.com> wrote:
> Hi all,
>
> I have a cluster of 2 nodes on CentOS 6.5 with ceph 0.67.10 (2 replicas).
>
> When I added the 3rd node to the Ceph cluster, Ceph started rebalancing.
>
> I have 3 MDSs, one on each node; the MDS process dies after a while with a stack trace:
>
> -------------------------------------------------------------------------------
> 2014-08-26 17:08:34.362901 7f1c2c704700  1 -- 10.20.0.21:6800/22154 <== osd.10 10.20.0.21:6802/15917 1 ==== osd_op_reply(230 100000003f6.00000000 [tmapup 0~0] ondisk = 0) v4 ==== 119+0+0 (1770421071 0 0) 0x2aece00 con 0x2aa4200
>    -54> 2014-08-26 17:08:34.362942 7f1c2c704700  1 -- 10.20.0.21:6800/22154 <== osd.55 10.20.0.23:6800/2407 10 ==== osd_op_reply(263 1000000048a.00000000 [getxattr] ack = -2 (No such file or directory)) v4 ==== 119+0+0 (3908997833 0 0) 0x1e63000 con 0x1e7aaa0
>    -53> 2014-08-26 17:08:34.363001 7f1c2c704700  5 mds.0.log submit_entry 427629603~1541 : EUpdate purge_stray truncate [metablob 100, 2 dirs]
>    -52> 2014-08-26 17:08:34.363022 7f1c2c704700  1 -- 10.20.0.21:6800/22154 <== osd.37 10.20.0.22:6898/11994 6 ==== osd_op_reply(226 1.00000000 [tmapput 0~7664] ondisk = 0) v4 ==== 109+0+0 (1007110430 0 0) 0x1e64800 con 0x1e7a7e0
>    -51> 2014-08-26 17:08:34.363092 7f1c2c704700  5 mds.0.log _expired segment 293601899 2548 events
>    -50> 2014-08-26 17:08:34.363117 7f1c2c704700  1 -- 10.20.0.21:6800/22154 <== osd.17 10.20.0.21:6941/17572 9 ==== osd_op_reply(264 10000000489.00000000 [getxattr] ack = -2 (No such file or directory)) v4 ==== 119+0+0 (1979034473 0 0) 0x1e62200 con 0x1e7b180
>    -49> 2014-08-26 17:08:34.363177 7f1c2c704700  5 mds.0.log submit_entry 427631148~1541 : EUpdate purge_stray truncate [metablob 100, 2 dirs]
>    -48> 2014-08-26 17:08:34.363197 7f1c2c704700  1 -- 10.20.0.21:6800/22154 <== osd.1 10.20.0.21:6872/13227 6 ==== osd_op_reply(265 10000000491.00000000 [getxattr] ack = -2 (No such file or directory)) v4 ==== 119+0+0 (1231782695 0 0) 0x1e63400 con 0x1e7ac00
>    -47> 2014-08-26 17:08:34.363255 7f1c2c704700  5 mds.0.log submit_entry 427632693~1541 : EUpdate purge_stray truncate [metablob 100, 2 dirs]
>    -46> 2014-08-26 17:08:34.363274 7f1c2c704700  1 -- 10.20.0.21:6800/22154 <== osd.11 10.20.0.21:6884/7018 5 ==== osd_op_reply(266 1000000047d.00000000 [getxattr] ack = -2 (No such file or directory)) v4 ==== 119+0+0 (2737916920 0 0) 0x1e61e00 con 0x1e7bc80
> -------------------------------------------------------------------------------
>
> I tried restarting the MDSs, but after a few seconds in the "active" state they switch to "laggy or crashed". I have a lot of important data on it.
> I do not want to use the command:
> ceph mds newfs <metadata pool id> <data pool id> --yes-i-really-mean-it
>
> :(
>
> Tien Bui.
>
> --
> Bui Minh Tien
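
For reference, a rough way to dig out the backtrace Greg is asking about, assuming the default log location under /var/log/ceph/ (adjust the path and daemon names for your setup):

    # search the MDS logs for the assert/backtrace block that follows a crash
    grep -A 40 -E 'FAILED assert|Caught signal|ceph version' /var/log/ceph/ceph-mds.*.log | less

    # optionally raise MDS log verbosity in ceph.conf on the MDS nodes before
    # the next restart, so the crash is captured with more context
    [mds]
        debug mds = 20
        debug ms = 1

If the daemon is asserting, the dump usually begins with a "ceph version ..." line followed by numbered stack frames; that block is what needs to be posted to the list.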