MDS crashing after a lot of random I/O operations

Hi,

Today I started to see new crashes on my cluster. This time both of my
MDSes are crashing, with a different message than before.

What did I do?

1. Unpack a tarball with a lot of small files
2. Chown and rename some files
3. Remove the entire directory
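
The three steps above can be sketched as a small script. This is only a
sketch under assumptions, not the exact commands that were run: the
tarball is replaced by generated small files, the file counts and names
are made up, and churn() is a hypothetical helper. To exercise the MDS,
the root directory would have to be a path on the Ceph mount; by default
the script uses a local temporary directory.

```python
import os
import shutil
import tempfile


def churn(root, n_dirs=20, files_per_dir=50):
    """Create many small files, chown and rename some, then remove the tree.

    Returns (number of files created, whether the tree still exists).
    """
    base = os.path.join(root, "churn-test")  # hypothetical directory name
    created = 0

    # Step 1: stand-in for unpacking a tarball with a lot of small files.
    for d in range(n_dirs):
        dpath = os.path.join(base, "dir%04d" % d)
        os.makedirs(dpath)
        for f in range(files_per_dir):
            with open(os.path.join(dpath, "file%04d" % f), "wb") as fh:
                fh.write(b"x" * 128)
            created += 1

    # Step 2: chown and rename some files (every third one, an arbitrary choice).
    uid, gid = os.getuid(), os.getgid()
    for dirpath, _dirs, files in os.walk(base):
        for name in sorted(files)[::3]:
            old = os.path.join(dirpath, name)
            os.chown(old, uid, gid)  # chown to the current user, so no root is needed
            os.rename(old, old + ".renamed")

    # Step 3: remove the entire directory.
    shutil.rmtree(base)
    return created, os.path.exists(base)


if __name__ == "__main__":
    # Replace the temporary directory with a path on the Ceph mount to test the MDS.
    with tempfile.TemporaryDirectory() as root:
        print(churn(root))
```

If the removal fails the way described below, shutil.rmtree() would be
expected to raise an OSError ("Directory not empty") instead of returning.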

The removal didn't go as planned: some directories can no longer be
removed, and the error says the directory is not empty.

Afterwards I see one of my MDSes (which one seems random) use 100% CPU
and also 100% disk (I thought the MDS didn't do any disk I/O?).

Then the MDSes stop on their own:

10.05.06 13:18:50.023843 7ffcfdd59710 -- 192.168.6.209:6800/27888 >>
192.168.6.211:6800/29237 pipe(0x22bf050 sd=-1 pgs=80 cs=1
l=1).discard_queue
10.05.06 13:18:50.023869 7ffcfdd59710 -- 192.168.6.209:6800/27888 >>
192.168.6.211:6800/29237 pipe(0x22bf050 sd=-1 pgs=80 cs=1
l=1).unregister_pipe - not registered
10.05.06 13:18:50.023952 7ffcfdd59710 -- 192.168.6.209:6800/27888 reaper
reaped pipe 0x22bf050 192.168.6.211:6800/29237
10.05.06 13:18:50.024079 7ffcfdd59710 -- 192.168.6.209:6800/27888 reaper
deleted pipe 0x22bf050
10.05.06 13:18:50.024097 7ffcfdd59710 -- 192.168.6.209:6800/27888 wait:
done.
10.05.06 13:18:50.024113 7ffcfdd59710 -- 192.168.6.209:6800/27888
shutdown complete.
10.05.06 13:18:50.025938 7ffcfdd59710 7ffcfdd59710 stopped.

The lines are almost the same on both hosts; they seem to shut down?

But they do not always stop "cleanly". In some cases (I haven't really
been able to reproduce it) they segfault, and I get the following
backtraces from the core dumps:

**** ceph01 (mds) ****
warning: Can't read pathname for load map: Input/output error.
Reading symbols from /lib/libpthread.so.0...(no debugging symbols
found)...done.
Loaded symbols for /lib/libpthread.so.0
Reading symbols from /lib/libcrypto.so.0.9.8...(no debugging symbols
found)...done.
Loaded symbols for /lib/libcrypto.so.0.9.8
Reading symbols from /usr/lib/libstdc++.so.6...(no debugging symbols
found)...done.
Loaded symbols for /usr/lib/libstdc++.so.6
Reading symbols from /lib/libm.so.6...(no debugging symbols
found)...done.
Loaded symbols for /lib/libm.so.6
Reading symbols from /lib/libgcc_s.so.1...(no debugging symbols
found)...done.
Loaded symbols for /lib/libgcc_s.so.1
Reading symbols from /lib/libc.so.6...(no debugging symbols
found)...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols
found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib/libdl.so.2...(no debugging symbols
found)...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/libz.so.1...(no debugging symbols
found)...done.
Loaded symbols for /lib/libz.so.1
Core was generated by `/usr/bin/cmds -i 0 -c /etc/ceph/ceph.conf'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000000000687d48 in Logger::_flush(bool) ()
(gdb) bt
#0  0x0000000000687d48 in Logger::_flush(bool) ()
#1  0x000000000068894e in ?? ()
#2  0x000000000068c839 in SafeTimer::EventWrapper::finish(int) ()
#3  0x000000000068e1d4 in Timer::timer_entry() ()
#4  0x0000000000477b2d in Timer::TimerThread::entry() ()
#5  0x0000000000488b3a in Thread::_entry_func(void*) ()
#6  0x00007f8951a2fa04 in start_thread () from /lib/libpthread.so.0
#7  0x00007f8950c6780d in clone () from /lib/libc.so.6
#8  0x0000000000000000 in ?? ()
(gdb) 

**** ceph02 (mds) ****
warning: Can't read pathname for load map: Input/output error.
Reading symbols from /lib/libpthread.so.0...(no debugging symbols
found)...done.
Loaded symbols for /lib/libpthread.so.0
Reading symbols from /lib/libcrypto.so.0.9.8...(no debugging symbols
found)...done.
Loaded symbols for /lib/libcrypto.so.0.9.8
Reading symbols from /usr/lib/libstdc++.so.6...(no debugging symbols
found)...done.
Loaded symbols for /usr/lib/libstdc++.so.6
Reading symbols from /lib/libm.so.6...(no debugging symbols
found)...done.
Loaded symbols for /lib/libm.so.6
Reading symbols from /lib/libgcc_s.so.1...(no debugging symbols
found)...done.
Loaded symbols for /lib/libgcc_s.so.1
Reading symbols from /lib/libc.so.6...(no debugging symbols
found)...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols
found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib/libdl.so.2...(no debugging symbols
found)...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/libz.so.1...(no debugging symbols
found)...done.
Loaded symbols for /lib/libz.so.1
Core was generated by `/usr/bin/cmds -i 1 -c /etc/ceph/ceph.conf'.
Program terminated with signal 11, Segmentation fault.
#0  0x00000000005c0b47 in CDir::_fetched(ceph::buffer::list&) ()
(gdb) bt
#0  0x00000000005c0b47 in CDir::_fetched(ceph::buffer::list&) ()
#1  0x000000000062f5a5 in Objecter::handle_osd_op_reply(MOSDOpReply*) ()
#2  0x00000000004a0e9d in MDS::_dispatch(Message*) ()
#3  0x00000000004a0f97 in MDS::ms_dispatch(Message*) ()
#4  0x000000000047ef29 in SimpleMessenger::dispatch_entry() ()
#5  0x0000000000477d7c in SimpleMessenger::DispatchThread::entry() ()
#6  0x0000000000488b3a in Thread::_entry_func(void*) ()
#7  0x00007f235278ba04 in start_thread () from /lib/libpthread.so.0
#8  0x00007f23519c380d in clone () from /lib/libc.so.6
#9  0x0000000000000000 in ?? ()
(gdb) 
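
To compare the two crashes side by side, the frames can be pulled out of
the gdb output with a few lines of Python. This is a rough sketch:
parse_frames() and the regular expression are hypothetical helpers written
for the frame layout shown above (no debugging symbols, so no file or line
information), and the sample is the top of the ceph02 backtrace.

```python
import re

# Matches lines such as:
#   #1  0x000000000062f5a5 in Objecter::handle_osd_op_reply(MOSDOpReply*) ()
#   #7  0x00007f235278ba04 in start_thread () from /lib/libpthread.so.0
FRAME_RE = re.compile(
    r"^#(\d+)\s+(0x[0-9a-f]+) in (.+?) \(.*?\)(?: from (\S+))?$"
)


def parse_frames(bt_text):
    """Return a list of (frame number, function, library or None)."""
    frames = []
    for line in bt_text.splitlines():
        m = FRAME_RE.match(line.strip())
        if m:
            frames.append((int(m.group(1)), m.group(3), m.group(4)))
    return frames


SAMPLE = """\
#0  0x00000000005c0b47 in CDir::_fetched(ceph::buffer::list&) ()
#1  0x000000000062f5a5 in Objecter::handle_osd_op_reply(MOSDOpReply*) ()
#2  0x00000000004a0e9d in MDS::_dispatch(Message*) ()
"""

if __name__ == "__main__":
    for number, function, library in parse_frames(SAMPLE):
        print(number, function, library)
```

Run over both backtraces, this makes it easy to see that the two crashes
are in different places: ceph01 faults in Logger::_flush(bool) on the timer
thread, while ceph02 faults in CDir::_fetched() on the dispatch thread.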

While the MDS is eating 100% CPU and disk, you can clearly see from the
"D" state that it's waiting for I/O:

root     11041 99.1 84.6 2052804 864932 ?      Dsl  13:18
0:57 /usr/bin/cmds -i 1 -c /etc/ceph/ceph.conf
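
The fields in that line can be decoded with a short script. This is a
sketch that assumes the line comes from "ps aux" (the post does not say
which command produced it) and that the two wrapped lines are joined back
into one. decode_ps_line() is a hypothetical helper, and the code
descriptions follow the "PROCESS STATE CODES" section of the Linux ps(1)
man page.

```python
# First character of the STAT column: the process state.
STATE_CODES = {
    "D": "uninterruptible sleep (usually I/O)",
    "R": "running or runnable",
    "S": "interruptible sleep",
    "T": "stopped by job control signal",
    "Z": "defunct (zombie)",
}

# Remaining characters of the STAT column: additional flags.
FLAG_CODES = {
    "<": "high priority",
    "N": "low priority",
    "L": "has pages locked into memory",
    "s": "session leader",
    "l": "multi-threaded",
    "+": "in the foreground process group",
}


def decode_ps_line(line):
    """Decode one line of 'ps aux' output into a dictionary."""
    # Assumed columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
    fields = line.split(None, 10)
    stat = fields[7]
    return {
        "pid": int(fields[1]),
        "cpu_percent": float(fields[2]),
        "mem_percent": float(fields[3]),
        "state": STATE_CODES.get(stat[0], "unknown"),
        "flags": [FLAG_CODES.get(c, "unknown") for c in stat[1:]],
        "command": fields[10],
    }


if __name__ == "__main__":
    line = ("root     11041 99.1 84.6 2052804 864932 ?      Dsl  13:18 "
            "0:57 /usr/bin/cmds -i 1 -c /etc/ceph/ceph.conf")
    print(decode_ps_line(line))
```

For the line above, "Dsl" decodes to uninterruptible sleep, session
leader, multi-threaded. The same line also shows 84.6 in the %MEM column,
so swapping is one possible, unconfirmed, source of the disk activity.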

Right now my MDSes are running in KVM virtual machines, both on the same
physical machine, but that shouldn't be a problem, should it? Performance
could be a lot lower, but the crashes shouldn't happen.

I'll try to get some hardware for the MDS to run them directly on a
physical machine.

-- 
Kind regards,

Wido den Hollander
Head of System Administration / CSO
Support phone, Netherlands: 0900 9633 (45 cpm)
Support phone, Belgium: 0900 70312 (45 cpm)
Direct phone: (+31) (0)20 50 60 104
Fax: +31 (0)20 50 60 111
E-mail: support@xxxxxxxxxxxx
Website: http://www.pcextreme.nl
Knowledge base: http://support.pcextreme.nl/
Network status: http://nmc.pcextreme.nl


