Hi,

Today I started to see new crashes on my cluster; this time both of my MDSes are crashing with a different message.

What did I do?

1. Unpack a tarball with a lot of small files
2. Chown and rename some files
3. Remove the entire directory

The removal of the files didn't go as planned: some directories could not be removed, with the message that the directory is not empty.

Afterwards I see one of my MDSes (which one seems to be random) use 100% CPU and also 100% disk (I thought the MDS didn't do any disk I/O?).

Then the MDSes stop on their own:

10.05.06 13:18:50.023843 7ffcfdd59710 -- 192.168.6.209:6800/27888 >> 192.168.6.211:6800/29237 pipe(0x22bf050 sd=-1 pgs=80 cs=1 l=1).discard_queue
10.05.06 13:18:50.023869 7ffcfdd59710 -- 192.168.6.209:6800/27888 >> 192.168.6.211:6800/29237 pipe(0x22bf050 sd=-1 pgs=80 cs=1 l=1).unregister_pipe - not registered
10.05.06 13:18:50.023952 7ffcfdd59710 -- 192.168.6.209:6800/27888 reaper reaped pipe 0x22bf050 192.168.6.211:6800/29237
10.05.06 13:18:50.024079 7ffcfdd59710 -- 192.168.6.209:6800/27888 reaper deleted pipe 0x22bf050
10.05.06 13:18:50.024097 7ffcfdd59710 -- 192.168.6.209:6800/27888 wait: done.
10.05.06 13:18:50.024113 7ffcfdd59710 -- 192.168.6.209:6800/27888 shutdown complete.
10.05.06 13:18:50.025938 7ffcfdd59710 7ffcfdd59710 stopped.

The lines are almost the same on both hosts; they seem to shut down? But they do not always stop "cleanly": in some cases (I haven't really been able to reproduce it) they segfault and I get the following coredump:

**** ceph01 (mds) ****

warning: Can't read pathname for load map: Input/output error.
Reading symbols from /lib/libpthread.so.0...(no debugging symbols found)...done.
Loaded symbols for /lib/libpthread.so.0
Reading symbols from /lib/libcrypto.so.0.9.8...(no debugging symbols found)...done.
Loaded symbols for /lib/libcrypto.so.0.9.8
Reading symbols from /usr/lib/libstdc++.so.6...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libstdc++.so.6
Reading symbols from /lib/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/libm.so.6
Reading symbols from /lib/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/libgcc_s.so.1
Reading symbols from /lib/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/libz.so.1
Core was generated by `/usr/bin/cmds -i 0 -c /etc/ceph/ceph.conf'.
Program terminated with signal 11, Segmentation fault.
#0 0x0000000000687d48 in Logger::_flush(bool) ()
(gdb) bt
#0 0x0000000000687d48 in Logger::_flush(bool) ()
#1 0x000000000068894e in ?? ()
#2 0x000000000068c839 in SafeTimer::EventWrapper::finish(int) ()
#3 0x000000000068e1d4 in Timer::timer_entry() ()
#4 0x0000000000477b2d in Timer::TimerThread::entry() ()
#5 0x0000000000488b3a in Thread::_entry_func(void*) ()
#6 0x00007f8951a2fa04 in start_thread () from /lib/libpthread.so.0
#7 0x00007f8950c6780d in clone () from /lib/libc.so.6
#8 0x0000000000000000 in ?? ()
(gdb)

**** ceph02 (mds) ****

warning: Can't read pathname for load map: Input/output error.
Reading symbols from /lib/libpthread.so.0...(no debugging symbols found)...done.
Loaded symbols for /lib/libpthread.so.0
Reading symbols from /lib/libcrypto.so.0.9.8...(no debugging symbols found)...done.
Loaded symbols for /lib/libcrypto.so.0.9.8
Reading symbols from /usr/lib/libstdc++.so.6...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libstdc++.so.6
Reading symbols from /lib/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/libm.so.6
Reading symbols from /lib/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/libgcc_s.so.1
Reading symbols from /lib/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/libz.so.1
Core was generated by `/usr/bin/cmds -i 1 -c /etc/ceph/ceph.conf'.
Program terminated with signal 11, Segmentation fault.
#0 0x00000000005c0b47 in CDir::_fetched(ceph::buffer::list&) ()
(gdb) bt
#0 0x00000000005c0b47 in CDir::_fetched(ceph::buffer::list&) ()
#1 0x000000000062f5a5 in Objecter::handle_osd_op_reply(MOSDOpReply*) ()
#2 0x00000000004a0e9d in MDS::_dispatch(Message*) ()
#3 0x00000000004a0f97 in MDS::ms_dispatch(Message*) ()
#4 0x000000000047ef29 in SimpleMessenger::dispatch_entry() ()
#5 0x0000000000477d7c in SimpleMessenger::DispatchThread::entry() ()
#6 0x0000000000488b3a in Thread::_entry_func(void*) ()
#7 0x00007f235278ba04 in start_thread () from /lib/libpthread.so.0
#8 0x00007f23519c380d in clone () from /lib/libc.so.6
#9 0x0000000000000000 in ?? ()
(gdb)

While the MDS is eating 100% CPU and disk, you can clearly see it's waiting for I/O (process state "D"):

root 11041 99.1 84.6 2052804 864932 ? Dsl 13:18 0:57 /usr/bin/cmds -i 1 -c /etc/ceph/ceph.conf

Right now my MDSes are each running in a KVM virtual machine, both on the same physical machine, but that shouldn't be a problem, should it? Performance may well be a lot lower, but the crashes shouldn't be there.

I'll try to get some hardware for the MDSes so I can run them directly on a physical machine.
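
For anyone who wants to try reproducing this, the three steps at the top of this mail (unpack, chown and rename, remove) can be sketched roughly as below. This is a minimal sketch in Python, not the exact commands that were run: the function names, the ".renamed" suffix, and the choice to rename every file (the report only says "some files") are assumptions for illustration. The tree is removed bottom-up, and directories where rmdir fails with "directory not empty" are collected instead of aborting, since that is the symptom described above.

```python
import errno
import os
import tarfile


def unpack(tarball, dest):
    """Step 1: unpack a tarball (ideally one with many small files) into dest."""
    with tarfile.open(tarball) as tar:
        tar.extractall(dest)


def chown_and_rename(root, uid, gid, suffix=".renamed"):
    """Step 2: chown and rename every file under root; returns the file count.

    The suffix is a hypothetical choice; any rename within the same
    directory exercises the same metadata operations.
    """
    count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            old = os.path.join(dirpath, name)
            os.lchown(old, uid, gid)        # does not follow symlinks
            os.rename(old, old + suffix)
            count += 1
    return count


def remove_tree(root):
    """Step 3: remove root bottom-up.

    Returns the directories whose rmdir failed with ENOTEMPTY (or EEXIST,
    which POSIX also allows for a non-empty directory).
    """
    stuck = []
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames:
            os.unlink(os.path.join(dirpath, name))
        for name in dirnames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):        # symlinked dirs are not walked
                os.unlink(path)
        try:
            os.rmdir(dirpath)
        except OSError as exc:
            if exc.errno in (errno.ENOTEMPTY, errno.EEXIST):
                stuck.append(dirpath)
            else:
                raise
    return stuck
```

Calling unpack() with a destination on the Ceph mount, then chown_and_rename() and remove_tree() on the extracted directory, should show whether the symptom comes back: on a healthy filesystem remove_tree() is expected to return an empty list, so any paths it returns are the directories that refused to go away.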
-- 
Kind regards,

Wido den Hollander
Head of System Administration / CSO
Support phone, Netherlands: 0900 9633 (45 cpm)
Support phone, Belgium: 0900 70312 (45 cpm)
Direct phone: (+31) (0)20 50 60 104
Fax: +31 (0)20 50 60 111
E-mail: support@xxxxxxxxxxxx
Website: http://www.pcextreme.nl
Knowledge base: http://support.pcextreme.nl/
Network status: http://nmc.pcextreme.nl