Hi,
I tried running with 2 MDS daemons and re-ran the MR example, but the MDS still crashed. I configured Ceph to send its logs to syslog; here is the error I found while going through them:
Apr 25 13:54:33 varunc4-virtual-machine ceph-mds: 2013-04-25 13:54:33.207001 bf148b40 0 -- 10.72.148.209:6800/3568 >> 10.72.148.217:0/3429858484 pipe(0xd49e700 sd=1447 :6800 s=2 pgs=2 cs=1 l=0).fault, server, going to standby
Apr 25 13:54:35 varunc4-virtual-machine ceph-mds: 2013-04-25 13:54:35.564033 bf54cb40 0 -- 10.72.148.209:6800/3568 >> 10.72.148.209:0/36310026 pipe(0xa3e9e00 sd=1449 :6800 s=2 pgs=2 cs=1 l=0).fault, server, going to standby
Apr 25 13:54:35 varunc4-virtual-machine ceph-mds: 2013-04-25 13:54:35.571412 bf34ab40 0 -- 10.72.148.209:6800/3568 >> 10.72.148.209:0/1543769098 pipe(0xd49e540 sd=1448 :6800 s=2 pgs=2 cs=1 l=0).fault, server, going to standby
Apr 25 13:54:36 varunc4-virtual-machine ceph-mds: 2013-04-25 13:54:36.029255 bf74eb40 0 -- 10.72.148.209:6800/3568 >> 10.72.148.217:0/346914060 pipe(0xa3e9c40 sd=1450 :6800 s=2 pgs=2 cs=1 l=0).fault, server, going to standby
Apr 25 13:54:36 varunc4-virtual-machine ceph-mds: 2013-04-25 13:54:36.043133 bf950b40 0 -- 10.72.148.209:6800/3568 >> 10.72.148.217:0/3560078604 pipe(0xa3e9a80 sd=1451 :6800 s=2 pgs=2 cs=1 l=0).fault, server, going to standby
Apr 25 13:54:36 varunc4-virtual-machine ceph-mds: 2013-04-25 13:54:36.182188 bff8cb40 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread bff8cb40 time 2013-04-25 13:54:36.053392#012common/Thread.cc: 110: FAILED assert(ret == 0)#012#012 ceph version 0.58-500-gaf3b163 (af3b16349a49a8aee401e27c1b71fd704b31297c)#012 1: (Thread::create(unsigned int)+0xdc) [0x843866c]#012 2: (Pipe::start_writer()+0x4e) [0x84d837e]#012 3: (Pipe::accept()+0x4955) [0x84ee625]#012 4: (Pipe::reader()+0x1758) [0x84f10b8]#012 5: (Pipe::Reader::entry()+0x1e) [0x84f2dee]#012 6: (Thread::_entry_func(void*)+0xf) [0x843833f]#012 7: (()+0x6d4c) [0xb7784d4c]#012 8: (clone()+0x5e) [0xb7106ace]#012 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Apr 25 13:54:36 varunc4-virtual-machine ceph-mds: --- begin dump of recent events ---
Apr 25 13:54:36 varunc4-virtual-machine ceph-mds: -9999> 2013-04-25 13:52:26.560836 b4812b40 1 -- 10.72.148.209:6800/3568 <== client.22266 10.72.148.217:0/421228409 9 ==== client_request(client.22266:7 lookup #100000003f8/varunc) v1 ==== 120+0+0 (551676947 0 0) 0xd4e3380 con 0xa4b0700
Apr 25 13:54:36 varunc4-virtual-machine ceph-mds: -9998> 2013-04-25 13:52:26.560856 b4812b40 4 mds.0.server handle_client_request client_request(client.22266:7 lookup #100000003f8/varunc) v1
Apr 25 13:54:36 varunc4-virtual-machine ceph-mds: -9997> 2013-04-25 13:52:26.560894 b4812b40 1 -- 10.72.148.209:6800/3568 --> 10.72.148.217:0/421228409 -- client_reply(???:7 = 0 Success) v1 -- ?+0 0xd4e31c0 con 0xa4b0700
Apr 25 13:54:36 varunc4-virtual-machine ceph-mds: -9996> 2013-04-25 13:52:26.562127 b4812b40 1 -- 10.72.148.209:6800/3568 <== client.22266 10.72.148.217:0/421228409 10 ==== client_request(client.22266:8 lookup #10000002bc3/.staging) v1 ==== 122+0+0 (1321999201 0 0) 0xd4e31c0 con 0xa4b0700
Apr 25 13:54:36 varunc4-virtual-machine ceph-mds: -9995> 2013-04-25 13:52:26.562147 b4812b40 4 mds.0.server handle_client_request client_request(client.22266:8 lookup #10000002bc3/.staging) v1
Apr 25 13:54:36 varunc4-virtual-machine ceph-mds: -9994> 2013-04-25 13:52:26.562185 b4812b40 1 -- 10.72.148.209:6800/3568 --> 10.72.148.217:0/421228409 -- client_reply(???:8 = 0 Success) v1 -- ?+0 0xd4e3380 con 0xa4b0700
Apr 25 13:54:36 varunc4-virtual-machine ceph-mds: -9993> 2013-04-25 13:52:26.563104 b4812b40 1 -- 10.72.148.209:6800/3568 <== client.22266 10.72.148.217:0/421228409 11 ==== client_request(client.22266:9 lookup #10000002bc4/job_201304241607_0004) v1 ==== 135+0+0 (2278942053 0 0) 0xd4e3380 con 0xa4b0700
Apr 25 13:54:36 varunc4-virtual-machine ceph-mds: -9992> 2013-04-25 13:52:26.563125 b4812b40 4 mds.0.server handle_client_request client_request(client.22266:9 lookup #10000002bc4/job_201304241607_0004) v1
Apr 25 13:54:36 varunc4-virtual-machine ceph-mds: -9991> 2013-04-25 13:52:26.563167 b4812b40 1 -- 10.72.148.209:6800/3568 --> 10.72.148.217:0/421228409 -- client_reply(???:9 = 0 Success) v1 -- ?+0 0xd4e31c0 con 0xa4b0700
Apr 25 13:54:36 varunc4-virtual-machine ceph-mds: -9990> 2013-04-25 13:52:26.565095 b4812b40 1 -- 10.72.148.209:6800/3568 <== client.22266 10.72.148.217:0/421228409 12 ==== client_request(client.22266:10 lookup #100000043f8/job.split) v1 ==== 123+0+0 (3692973820 0 0) 0xd4e31c0 con 0xa4b0700
Apr 25 13:54:36 varunc4-virtual-machine ceph-mds: -9989> 2013-04-25 13:52:26.565115 b4812b40 4 mds.0.server handle_client_request client_request(client.22266:10 lookup #100000043f8/job.split) v1
Apr 25 13:54:36 varunc4-virtual-machine ceph-mds: -9988> 2013-04-25 13:52:26.565160 b4812b40 1 -- 10.72.148.209:6800/3568 --> 10.72.148.217:0/421228409 -- client_reply(???:10 = 0 Success) v1 -- ?+0 0xd4e3380 con 0xa4b0700
Apr 25 13:54:36 varunc4-virtual-machine ceph-mds: -9987> 2013-04-25 13:52:26.566273 b4812b40 1 -- 10.72.148.209:6800/3568 <== client.22266 10.72.148.217:0/421228409 13 ==== client_caps(update ino 100000043fa 165811 seq 1 caps=pAsLsXsFscr dirty=- wanted=pFscr follows 0 size 470551/0 ts 1 mtime 2013-04-25 13:31:20.151178) v2 ==== 180+0+0 (1828676688 0 0) 0xc341000 con 0xa4b0700
Apr 25 13:54:36 varunc4-virtual-machine ceph-mds: -9986> 2013-04-25 13:52:26.566363 b4812b40 1 -- 10.72.148.209:6800/3568 <== client.22266 10.72.148.217:0/421228409 14 ==== client_request(client.22266:11 getattr pAsxLsxXsxFsxcrwbal #100000043fa) v1 ==== 114+0+0 (1693703406 0 0) 0xd4e3380 con 0xa4b0700
Apr 25 13:54:36 varunc4-virtual-machine ceph-mds: -9985> 2013-04-25 13:52:26.566383 b4812b40 4 mds.0.server handle_client_request client_request(client.22266:11 getattr pAsxLsxXsxFsxcrwbal #100000043fa) v1
Apr 25 13:54:36 varunc4-virtual-machine ceph-mds: -9984> 2013-04-25 13:52:26.566423 b4812b40 1 -- 10.72.148.209:6800/3568 --> 10.72.148.217:0/421228409 -- client_reply(???:11 = 0 Success) v1 -- ?+0 0xd4e31c0 con 0xa4b0700
Apr 25 13:54:36 varunc4-virtual-machine ceph-mds: -9983> 2013-04-25 13:52:26.568112 b4812b40 1 -- 10.72.148.209:6800/3568 <== client.22266 10.72.148.217:0/421228409 15 ==== client_request(client.22266:12 getattr pAsxLsxXsxFsxcrwbal #100000043fa) v1 ==== 114+0+0 (2040589323 0 0) 0xd4e31c0 con 0xa4b0700
Apr 25 13:54:36 varunc4-virtual-machine ceph-mds: -9982> 2013-04-25 13:52:26.568135 b4812b40 4 mds.0.server handle_client_request client_request(client.22266:12 getattr pAsxLsxXsxFsxcrwbal #100000043fa) v1
Apr 25 13:54:36 varunc4-virtual-machine ceph-mds: -9981> 2013-04-25 13:52:26.568176 b4812b40 1 -- 10.72.148.209:6800/3568 --> 10.72.148.217:0/421228409 -- client_reply(???:12 = 0 Success) v1 -- ?+0 0xd4e3380 con 0xa4b0700
Apr 25 13:54:36 varunc4-virtual-machine ceph-mds: -9980> 2013-04-25 13:52:26.727649 b4812b40 1 -- 10.72.148.209:6800/3568 <== client.22264 10.72.148.209:0/1833401918 16 ==== client_request(client.22264:13 lookup #1/user) v1 ==== 118+0+0 (1985232403 0 0) 0xd4e3000 con 0xa4b0900
Apr 25 13:54:36 varunc4-virtual-machine ceph-mds: -9979> 2013-04-25 13:52:26.727682 b4812b40 4 mds.0.server handle_client_request client_request(client.22264:13 lookup #1/user) v1
There are thousands of similar lines, but no other failures. I can send the full syslog file if you want it. Once the first MDS crashes, the second one takes over, but eventually it crashes too:
Apr 25 14:17:24 varunc5-virtual-machine ceph-mds: 2013-04-25 14:17:24.999679 bed24b40 0 -- 10.72.148.217:6803/28375 >> 10.72.148.209:0/553350901 pipe(0xbc221c0 sd=1452 :6803 s=2 pgs=2 cs=1 l=0).fault, server, going to standby
Apr 25 14:17:25 varunc5-virtual-machine ceph-mds: 2013-04-25 14:17:25.007104 beb22b40 0 -- 10.72.148.217:6803/28375 >> 10.72.148.209:0/414480117 pipe(0xbc22380 sd=1451 :6803 s=2 pgs=2 cs=1 l=0).fault, server, going to standby
Apr 25 14:17:26 varunc5-virtual-machine ceph-mds: 2013-04-25 14:17:26.866577 bf32ab40 0 -- 10.72.148.217:6803/28375 >> 10.72.148.217:0/4145749993 pipe(0xbc0dc40 sd=1455 :6803 s=2 pgs=2 cs=1 l=0).fault, server, going to standby
Apr 25 14:17:26 varunc5-virtual-machine ceph-mds: 2013-04-25 14:17:26.876829 bf52cb40 0 -- 10.72.148.217:6803/28375 >> 10.72.148.217:0/2687836137 pipe(0xbc0de00 sd=1456 :6803 s=2 pgs=2 cs=1 l=0).fault, server, going to standby
Apr 25 14:17:30 varunc5-virtual-machine ceph-mds: 2013-04-25 14:17:30.046497 b3ff1b40 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread b3ff1b40 time 2013-04-25 14:17:29.871249#012common/Thread.cc: 110: FAILED assert(ret == 0)#012#012 ceph version 0.58-500-gaf3b163 (af3b16349a49a8aee401e27c1b71fd704b31297c)#012 1: (Thread::create(unsigned int)+0xdc) [0x843866c]#012 2: (Pipe::start_reader()+0x7d) [0x84d82bd]#012 3: (SimpleMessenger::add_accept_pipe(int)+0x95) [0x8432535]#012 4: (Accepter::entry()+0x21a) [0x84a7f7a]#012 5: (Thread::_entry_func(void*)+0xf) [0x843833f]#012 6: (()+0x6d4c) [0xb7764d4c]#012 7: (clone()+0x5e) [0xb70e6ace]#012 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Apr 25 14:17:30 varunc5-virtual-machine ceph-mds: --- begin dump of recent events ---
Apr 25 14:17:30 varunc5-virtual-machine ceph-mds: -10000> 2013-04-25 14:15:12.356518 e7c6b40 2 -- 10.72.148.217:6803/28375 >> 10.72.148.209:0/4068633961 pipe(0xbacd8c0 sd=1316 :6803 s=2 pgs=2 cs=1 l=0).reader couldn't read tag, Success
Apr 25 14:17:30 varunc5-virtual-machine ceph-mds: -9999> 2013-04-25 14:15:12.356543 e7c6b40 2 -- 10.72.148.217:6803/28375 >> 10.72.148.209:0/4068633961 pipe(0xbacd8c0 sd=1316 :6803 s=2 pgs=2 cs=1 l=0).fault 0: Success
Apr 25 14:17:30 varunc5-virtual-machine ceph-mds: -9998> 2013-04-25 14:15:12.356561 e7c6b40 0 -- 10.72.148.217:6803/28375 >> 10.72.148.209:0/4068633961 pipe(0xbacd8c0 sd=1316 :6803 s=2 pgs=2 cs=1 l=0).fault, server, going to standby
Apr 25 14:17:30 varunc5-virtual-machine ceph-mds: -9997> 2013-04-25 14:15:12.356654 b47f2b40 1 -- 10.72.148.217:6803/28375 <== client.23686 10.72.148.209:0/1333423465 9 ==== client_caps(update ino 1 186036 seq 2 caps=p dirty=- wanted=- follows 0 size 0/0 ts 1 mtime 2013-04-16 14:05:25.007509) v2 ==== 180+0+0 (1907592120 0 0) 0xbac0d80 con 0xbac5700
Apr 25 14:17:30 varunc5-virtual-machine ceph-mds: -9996> 2013-04-25 14:15:12.356704 b47f2b40 1 -- 10.72.148.217:6803/28375 <== client.23686 10.72.148.209:0/1333423465 10 ==== client_caps(update ino 100000003ee 186070 seq 2 caps=p dirty=- wanted=- follows 0 size 0/0 ts 1 mtime 2013-04-23 16:51:03.067472) v2 ==== 180+0+0 (1693437802 0 0) 0xbacf900 con 0xbac5700
Apr 25 14:17:30 varunc5-virtual-machine ceph-mds: -9995> 2013-04-25 14:15:12.356736 b47f2b40 1 -- 10.72.148.217:6803/28375 <== client.23686 10.72.148.209:0/1333423465 11 ==== client_caps(update ino 100000003ef 186071 seq 2 caps=p dirty=- wanted=- follows 0 size 0/0 ts 1 mtime 2013-04-25 13:31:20.516544) v2 ==== 180+0+0 (93328228 0 0) 0xbacfd80 con 0xbac5700
Apr 25 14:17:30 varunc5-virtual-machine ceph-mds: -9994> 2013-04-25 14:15:12.356765 b47f2b40 1 -- 10.72.148.217:6803/28375 <== client.23686 10.72.148.209:0/1333423465 12 ==== client_caps(update ino 100000047e9 186072 seq 2 caps=p dirty=- wanted=- follows 0 size 0/0 ts 1 mtime 2013-04-25 13:31:23.160509) v2 ==== 180+0+0 (2422112513 0 0) 0xbad4000 con 0xbac5700
Apr 25 14:17:30 varunc5-virtual-machine ceph-mds: -9993> 2013-04-25 14:15:12.356795 b47f2b40 1 -- 10.72.148.217:6803/28375 <== client.23686 10.72.148.209:0/1333423465 13 ==== client_caps(update ino 10000004bd8 186073 seq 1 caps=p dirty=- wanted=- follows 0 size 0/0 ts 1 mtime 2013-04-25 13:31:23.160509) v2 ==== 180+0+0 (1874277852 0 0) 0xbad4b40 con 0xbac5700
Apr 25 14:17:30 varunc5-virtual-machine ceph-mds: -9992> 2013-04-25 14:15:12.356831 b47f2b40 1 -- 10.72.148.217:6803/28375 <== client.23686 10.72.148.209:0/1333423465 14 ==== client_session(request_close) v1 ==== 28+0+0 (1103585433 0 0) 0xba59780 con 0xbac5700
Apr 25 14:17:30 varunc5-virtual-machine ceph-mds: -9991> 2013-04-25 14:15:12.356850 b47f2b40 3 mds.0.server handle_client_session client_session(request_close) v1 from client.23686
Apr 25 14:17:30 varunc5-virtual-machine ceph-mds: -9990> 2013-04-25 14:15:12.356868 b47f2b40 5 mds.0.log submit_entry 324931553~194 : ESession client.23686 10.72.148.209:0/1333423465 close cmapv 58762
Apr 25 14:17:30 varunc5-virtual-machine ceph-mds: -9989> 2013-04-25 14:15:12.356916 b47f2b40 1 -- 10.72.148.217:6803/28375 --> 10.72.148.217:6800/2986 -- osd_op(mds.0.17:3021 200.0000004d [write 1970145~198] 1.91f11973 e79) v4 -- ?+0 0xb976480 con 0x9ae0700
Apr 25 14:17:30 varunc5-virtual-machine ceph-mds: -9988> 2013-04-25 14:15:12.358168 b47f2b40 1 -- 10.72.148.217:6803/28375 <== client.23693 10.72.148.209:0/1216572876 20 ==== client_caps(update ino 10000002787 186107 seq 1 caps=pAsLsXsFscr dirty=- wanted=pFscr follows 0 size 2097652/0 ts 1 mtime 2013-04-19 12:31:36.609456) v2 ==== 180+0+0 (2883313689 0 0) 0xbae7900 con 0xbac5000
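A note on reading the two assert lines: syslog has escaped the newlines inside each backtrace as "#012" (octal 012 is a line feed), so the whole trace ends up on one line. Below is a minimal Python sketch for expanding such a line and pulling out the numbered stack frames. The function names are my own, and the frame pattern is only a guess based on the two traces in this mail, so treat it as an illustration rather than a general parser.

```python
import re
from typing import List


def expand_syslog_newlines(line: str) -> str:
    """Turn syslog's '#012' escape (octal 012 = LF) back into real newlines."""
    return line.replace("#012", "\n")


def backtrace_frames(line: str) -> List[str]:
    """Return the numbered frames, e.g. '1: (Thread::create(...)+0xdc) [0x...]'.

    Assumes frames look like '<number>: (<symbol>) [<address>]', which is
    what the two traces quoted in this mail use.
    """
    frames = []
    for part in expand_syslog_newlines(line).split("\n"):
        part = part.strip()
        if re.match(r"\d+: \(", part):
            frames.append(part)
    return frames


if __name__ == "__main__":
    # Shortened copy of the first assert line from the log above.
    sample = (
        "common/Thread.cc: 110: FAILED assert(ret == 0)#012#012"
        " ceph version 0.58-500-gaf3b163#012"
        " 1: (Thread::create(unsigned int)+0xdc) [0x843866c]#012"
        " 2: (Pipe::start_writer()+0x4e) [0x84d837e]#012"
        " NOTE: a copy of the executable is needed to interpret this."
    )
    for frame in backtrace_frames(sample):
        print(frame)
```

Running it on the full lines from syslog should list the frames one per line, which makes it easier to see that the first crash goes through Pipe::start_writer() and the second through Pipe::start_reader(), both ending in Thread::create().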
Regards
Varun