ceph-osd constantly crashing

Hello!
We have a simple setup, as follows:

Debian GNU/Linux 6.0 x64
Linux h08 2.6.32-19-pve #1 SMP Wed May 15 07:32:52 CEST 2013 x86_64 GNU/Linux

ii  ceph                             0.61.2-1~bpo60+1             distributed storage and file system
ii  ceph-common                      0.61.2-1~bpo60+1             common utilities to mount and interact with a ceph storage cluster
ii  ceph-fs-common                   0.61.2-1~bpo60+1             common utilities to mount and interact with a ceph file system
ii  ceph-fuse                        0.61.2-1~bpo60+1             FUSE-based client for the Ceph distributed file system
ii  ceph-mds                         0.61.2-1~bpo60+1             metadata server for the ceph distributed file system
ii  libcephfs1                       0.61.2-1~bpo60+1             Ceph distributed file system client library
ii  libc-bin                         2.11.3-4                     Embedded GNU C Library: Binaries
ii  libc-dev-bin                     2.11.3-4                     Embedded GNU C Library: Development binaries
ii  libc6                            2.11.3-4                     Embedded GNU C Library: Shared libraries
ii  libc6-dev                        2.11.3-4                     Embedded GNU C Library: Development Libraries and Header Files

All daemons are running fine except osd.2, which crashes repeatedly.
All other nodes run the same operating system, and their system environments are nearly identical.

#cat /etc/ceph/ceph.conf
[global]
        pid file = /var/run/ceph/$name.pid
        auth cluster required = none
        auth service required = none
        auth client required = none
        max open files = 65000

[mon]
[mon.0]
        host = h01
        mon addr = 10.1.1.3:6789
[mon.1]
        host = h07
        mon addr = 10.1.1.10:6789
[mon.2]
        host = h08
        mon addr = 10.1.1.11:6789

[mds]
[mds.3]
        host = h09

[mds.4]
        host = h06

[osd]
        osd journal size = 10000
        osd journal = /var/lib/ceph/journal/$cluster-$id/journal
        osd mkfs type = xfs

[osd.0]
        host = h01
        addr = 10.1.1.3
        devs = /dev/sda3
[osd.1]
        host = h07
        addr = 10.1.1.10
        devs = /dev/sda3
[osd.2]
        host = h08
        addr = 10.1.1.11
        devs = /dev/sda3
[osd.3]
        host = h09
        addr = 10.1.1.12
        devs = /dev/sda3

[osd.4]
        host = h06
        addr = 10.1.1.9
        devs = /dev/sda3


~#ceph osd tree

# id    weight  type name       up/down reweight
-1      5       root default
-3      5               rack unknownrack
-2      1                       host h01
0       1                               osd.0   up      1
-4      1                       host h07
1       1                               osd.1   up      1
-5      1                       host h08
2       1                               osd.2   down    0
-6      1                       host h09
3       1                               osd.3   up      1
-7      1                       host h06
4       1                               osd.4   up      1
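
For anyone who wants to check the tree output above mechanically, here is a minimal sketch in Python that lists the OSDs reported as down. It assumes the plain-text column layout shown above (id, weight, name, up/down, reweight); the helper name `down_osds` is hypothetical, and the sample text is a shortened copy of the output in this mail.

```python
# Shortened copy of the `ceph osd tree` output shown above.
OSD_TREE = """\
# id    weight  type name       up/down reweight
-1      5       root default
-3      5               rack unknownrack
-2      1                       host h01
0       1                               osd.0   up      1
-5      1                       host h08
2       1                               osd.2   down    0
-7      1                       host h06
4       1                               osd.4   up      1
"""


def down_osds(tree_text):
    """Return the names of OSDs whose status column reads 'down'."""
    result = []
    for line in tree_text.splitlines():
        fields = line.split()
        # OSD rows have five fields: id, weight, osd.N, up|down, reweight.
        # Root, rack, and host rows have fewer, so they are skipped.
        if len(fields) == 5 and fields[2].startswith("osd.") and fields[3] == "down":
            result.append(fields[2])
    return result


print(down_osds(OSD_TREE))
```

On the sample above this reports only osd.2, which matches what we see on the cluster.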


When it crashes, the ceph-osd process can end up in a zombie state, and then it is not even possible to umount the OSD partition.

gdb shows the following:

#gdb /usr/bin/ceph-osd /core
GNU gdb (GDB) 7.0.1-debian
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/bin/ceph-osd...(no debugging symbols found)...done.
[New Thread 809630]
[New Thread 809628]
[New Thread 809631]
[New Thread 809632]
[New Thread 809633]
[New Thread 809634]
[New Thread 809672]
[New Thread 809629]
[New Thread 809524]
[New Thread 809421]
[New Thread 137559]
[New Thread 809636]
[New Thread 809635]
[New Thread 809677]
[New Thread 809679]
[New Thread 809527]
[New Thread 137560]
[New Thread 809420]
[New Thread 809637]
[New Thread 809685]
[New Thread 809525]
[New Thread 809638]
[New Thread 99663]
[New Thread 809523]
[New Thread 809639]
[New Thread 809522]
[New Thread 809640]
[New Thread 809644]
[New Thread 809641]
[New Thread 809643]
[New Thread 809648]
[New Thread 809668]
[New Thread 809669]
[New Thread 809671]
[New Thread 809676]
[New Thread 809680]
[New Thread 809681]
[New Thread 56075]
[New Thread 809682]
[New Thread 107924]
[New Thread 809683]
[New Thread 108037]
[New Thread 809684]
[New Thread 119704]
[New Thread 809686]
[New Thread 809537]
[New Thread 56073]
[New Thread 85231]
[New Thread 85232]
[New Thread 99661]
[New Thread 809535]
[New Thread 99662]
[New Thread 107922]
[New Thread 119705]
[New Thread 107928]
[New Thread 108035]
[New Thread 809410]
[New Thread 809528]
[New Thread 809530]
[New Thread 809531]
[New Thread 809533]
[New Thread 809536]
[New Thread 809642]
[New Thread 809534]
[New Thread 809411]
[New Thread 809645]
[New Thread 809667]
[New Thread 809670]
[New Thread 809526]
[New Thread 809521]
[New Thread 809532]
[New Thread 809529]

warning: Can't read pathname for load map: Input/output error.
Reading symbols from /lib/libaio.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/libaio.so.1
Reading symbols from /usr/lib/libnss3.so.1d...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libnss3.so.1d
Reading symbols from /usr/lib/libnspr4.so.0d...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libnspr4.so.0d
Reading symbols from /lib/libpthread.so.0...(no debugging symbols found)...done.
Loaded symbols for /lib/libpthread.so.0
Reading symbols from /lib/libuuid.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/libuuid.so.1
Reading symbols from /lib/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/librt.so.1
Reading symbols from /lib/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /usr/lib/libtcmalloc.so.0...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libtcmalloc.so.0
Reading symbols from /usr/lib/libboost_thread.so.1.42.0...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libboost_thread.so.1.42.0
Reading symbols from /usr/lib/libleveldb.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libleveldb.so.1
Reading symbols from /usr/lib/libstdc++.so.6...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libstdc++.so.6
Reading symbols from /lib/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/libm.so.6
Reading symbols from /lib/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/libgcc_s.so.1
Reading symbols from /lib/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /usr/lib/libnssutil3.so.1d...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libnssutil3.so.1d
Reading symbols from /usr/lib/libplc4.so.0d...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libplc4.so.0d
Reading symbols from /usr/lib/libplds4.so.0d...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libplds4.so.0d
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /usr/lib/libunwind.so.7...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libunwind.so.7
Reading symbols from /usr/lib/libsnappy.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libsnappy.so.1
Reading symbols from /usr/lib/nss/libsoftokn3.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/nss/libsoftokn3.so
Reading symbols from /usr/lib/libsqlite3.so.0...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libsqlite3.so.0
Reading symbols from /usr/lib/nss/libfreebl3.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/nss/libfreebl3.so
Reading symbols from /usr/lib/rados-classes/libcls_lock.so...done.
Loaded symbols for /usr/lib/rados-classes/libcls_lock.so
Reading symbols from /usr/lib/libboost_system.so.1.42.0...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libboost_system.so.1.42.0
Reading symbols from /usr/lib/rados-classes/libcls_rgw.so...done.
Loaded symbols for /usr/lib/rados-classes/libcls_rgw.so

warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7ffff87fe000
Core was generated by `/usr/bin/ceph-osd -i 2 --pid-file /var/run/ceph/osd.2.pid -c /etc/ceph/ceph.con'.
Program terminated with signal 6, Aborted.
#0  0x00007f7e994b9ebb in raise () from /lib/libpthread.so.0

(gdb) bt
#0  0x00007f7e994b9ebb in raise () from /lib/libpthread.so.0
#1  0x00000000007a16c7 in ?? ()
#2  <signal handler called>
#3  0x00007f7e97cf21b5 in raise () from /lib/libc.so.6
#4  0x00007f7e97cf4fc0 in abort () from /lib/libc.so.6
#5  0x00007f7e98586dc5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#6  0x00007f7e98585166 in ?? () from /usr/lib/libstdc++.so.6
#7  0x00007f7e98585193 in std::terminate() () from /usr/lib/libstdc++.so.6
#8  0x00007f7e9858528e in __cxa_throw () from /usr/lib/libstdc++.so.6
#9  0x00000000007f9f79 in ceph::__ceph_assert_fail(char const*, char const*, int, char const*) ()
#10 0x0000000000763ca1 in SyncEntryTimeout::finish(int) ()
#11 0x00000000005b828a in Context::complete(int) ()
#12 0x00000000008b3793 in SafeTimer::timer_thread() ()
#13 0x00000000008b595d in SafeTimerThread::entry() ()
#14 0x00007f7e994b18ca in start_thread () from /lib/libpthread.so.0
#15 0x00007f7e97d8fb6d in clone () from /lib/libc.so.6
#16 0x0000000000000000 in ?? ()
(gdb) 
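
Reading the trace from the bottom up, the abort comes from an assertion raised in SyncEntryTimeout::finish(int), which is called from the SafeTimer thread. The symbol name suggests a sync timeout callback, but that is an inference from the names only, since the binary has no debugging symbols. As a rough sketch for going through the other core dumps, the following Python snippet picks out the frame that called `__ceph_assert_fail` from a pasted `bt` listing; the helper name `assert_caller` is hypothetical, and the sample frames are copied from the trace above.

```python
# Frames 8-12 copied from the gdb backtrace above.
BACKTRACE = """\
#8  0x00007f7e9858528e in __cxa_throw () from /usr/lib/libstdc++.so.6
#9  0x00000000007f9f79 in ceph::__ceph_assert_fail(char const*, char const*, int, char const*) ()
#10 0x0000000000763ca1 in SyncEntryTimeout::finish(int) ()
#11 0x00000000005b828a in Context::complete(int) ()
#12 0x00000000008b3793 in SafeTimer::timer_thread() ()
"""


def assert_caller(bt_text):
    """Return the function in the frame just below __ceph_assert_fail, or None."""
    frames = [line for line in bt_text.splitlines() if line.startswith("#")]
    for i, frame in enumerate(frames):
        if "__ceph_assert_fail" in frame and i + 1 < len(frames):
            # Drop the "#N 0x... in " prefix; frames without it yield "".
            _, _, rest = frames[i + 1].partition(" in ")
            # Drop gdb's trailing " ()" argument list and any " from <lib>" suffix.
            name = rest.rsplit(" (", 1)[0]
            return name or None
    return None


print(assert_caller(BACKTRACE))
```

If the other cores all point at the same caller, that would at least confirm that every crash takes the same path.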

The problem affects only osd.2; all other services are running fine. I have plenty of core dumps and can provide them if needed.

Please help us fix this issue. Our cluster is currently in the following state:
#ceph -w 
   health HEALTH_WARN 2 pgs backfilling; 2 pgs degraded; 3 pgs recovering; 39 pgs recovery_wait; 44 pgs stuck unclean; recovery 157580/1744054 degraded (9.035%);  recovering 105 o/s, 7442KB/s; 1 mons down, quorum 0,1 0,1
   monmap e1: 3 mons at {0=10.1.1.3:6789/0,1=10.1.1.10:6789/0,2=10.1.1.11:6789/0}, election epoch 112, quorum 0,1 0,1
   osdmap e200: 6 osds: 4 up, 4 in
    pgmap v1133760: 1208 pgs: 1164 active+clean, 39 active+recovery_wait, 2 active+degraded+backfilling, 3 active+recovering; 88915 MB data, 170 GB used, 573 GB / 744 GB avail; 119KB/s rd, 763KB/s wr, 18op/s; 157580/1744054 degraded (9.035%);  recovering 105 o/s, 7442KB/s
   mdsmap e16: 1/1/1 up {0=4=up:active}, 1 up:standby
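
As a quick consistency check on the status above, here is a small Python sketch that splits the pgmap line into per-state counts and compares them with the reported total. It assumes the "N pgs: count state, count state, ...;" layout printed by this Ceph release; the helper name `pg_states` is hypothetical, and the sample string is a truncated copy of the pgmap line.

```python
import re

# Truncated copy of the pgmap line from the `ceph -w` output above.
PGMAP = ("pgmap v1133760: 1208 pgs: 1164 active+clean, 39 active+recovery_wait, "
         "2 active+degraded+backfilling, 3 active+recovering; 88915 MB data")


def pg_states(line):
    """Return (total_pgs, {state: count}) parsed from a pgmap summary line."""
    match = re.search(r"(\d+) pgs: ([^;]+);", line)
    if match is None:
        raise ValueError("no pgmap summary found")
    states = {}
    for part in match.group(2).split(","):
        count, state = part.split()
        states[state] = int(count)
    return int(match.group(1)), states


total, states = pg_states(PGMAP)
not_clean = total - states.get("active+clean", 0)
print(total, sum(states.values()), not_clean)
```

The per-state counts add up to the 1208 total, and the 44 placement groups that are not active+clean agree with the "44 pgs stuck unclean" in the health line.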

Regards, Artem Silenkov, 2GIS TM.
---
2GIS LLC
http://2gis.ru
a.silenkov@xxxxxxx
gtalk:artem.silenkov@xxxxxxxxx
cell:+79231534853
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
