Hi, I'm sure I'm doing something wrong, I hope someone can enlighten me... I'm running into many issues whenever I restart a ceph server (any ceph server). This is on CentOS 7.2, with ceph-0.94.6-0.el7.x86_64.

First: I have disabled abrt, since I don't need it. But when I restart, I see these logs in the systemd-udevd journal:

Apr 21 18:00:14 ceph4._snip_ python[1109]: detected unhandled Python exception in '/usr/sbin/ceph-disk'
Apr 21 18:00:14 ceph4._snip_ python[1109]: can't communicate with ABRT daemon, is it running? [Errno 2] No such file or directory
Apr 21 18:00:14 ceph4._snip_ python[1174]: detected unhandled Python exception in '/usr/sbin/ceph-disk'
Apr 21 18:00:14 ceph4._snip_ python[1174]: can't communicate with ABRT daemon, is it running? [Errno 2] No such file or directory

How can I debug these exceptions? Could they be related to the OSD hook I'm using to put the SSDs in another root in the crush map? (That hook is a bash script, but it calls a helper Python script of mine which tries to use megacli to identify the SSDs on a non-JBOD controller... a tricky thing.)
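To give an idea of what I mean, the hook is roughly of this shape (a simplified sketch, not my exact script; the helper script name is just illustrative):

    #!/bin/bash
    # Crush location hook, referenced in ceph.conf as something like:
    #   osd crush location hook = /usr/local/bin/crush-location-ssd
    # Ceph calls it with --cluster <name> --id <osd id> --type osd and
    # expects the crush location (key=value pairs) on stdout.
    while [ $# -ge 1 ]; do
        case "$1" in
            --cluster) shift; CLUSTER="$1" ;;
            --id)      shift; ID="$1" ;;
            --type)    shift; TYPE="$1" ;;
        esac
        shift
    done
    # The helper python script queries megacli to figure out whether the
    # device backing this OSD is an SSD (name and interface illustrative).
    if /usr/local/bin/is-osd-on-ssd.py --cluster "$CLUSTER" --id "$ID"; then
        echo "host=$(hostname -s) root=ssd"
    else
        echo "host=$(hostname -s) root=default"
    fi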
Then I see this kind of error for most if not all drives:

Apr 21 18:00:47 ceph4._snip_ systemd-udevd[876]: '/usr/sbin/ceph-disk-activate /dev/sdt1'(err) '2016-04-21 18:00:47.115322 7fc408ff9700 0 -- :/885104093 >> __MON_IP__:6789/0 pipe(0x7fc400008280 sd=6 :0 s=1 pgs=0 cs=0 l=1 c=0x7fc400012670).fault'
Apr 21 18:00:50 ceph4._snip_ systemd-udevd[876]: '/usr/sbin/ceph-disk-activate /dev/sdt1'(err) '2016-04-21 18:00:50.115543 7fc408ef8700 0 -- :/885104093 >> __MON_IP__:6789/0 pipe(0x7fc400000c00 sd=6 :0 s=1 pgs=0 cs=0 l=1 c=0x7fc40000e1d0).fault'
Apr 21 18:00:52 ceph4._snip_ systemd-udevd[876]: '/usr/sbin/ceph-disk-activate /dev/sdt1'(out) 'failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.113 --keyring=/var/lib/ceph/osd/ceph-113/keyring osd crush create-or-move -- 113 1.81 host=ceph4 root=default''
Apr 21 18:00:52 ceph4._snip_ systemd-udevd[876]: '/usr/sbin/ceph-disk-activate /dev/sdt1'(err) 'ceph-disk: Error: ceph osd start failed: Command '['/usr/sbin/service', 'ceph', '--cluster', 'ceph', 'start', 'osd.113']' returned non-zero exit status 1'
Apr 21 18:00:52 ceph4._snip_ systemd-udevd[876]: '/usr/sbin/ceph-disk-activate /dev/sdt1' [1257] exit with return code 1
Apr 21 18:00:52 ceph4._snip_ systemd-udevd[876]: adding watch on '/dev/sdt1'
Apr 21 18:00:52 ceph4._snip_ systemd-udevd[876]: created db file '/run/udev/data/b65:49' for '/devices/pci0000:00/0000:00:07.0/0000:03:00.0/host2/target2:2:6/2:2:6:0/block/sdt/sdt1'
Apr 21 18:00:52 ceph4._snip_ systemd-udevd[876]: passed unknown number of bytes to netlink monitor 0x7f4cec2f3240
Apr 21 18:00:52 ceph4._snip_ systemd-udevd[876]: seq 2553 processed with 0
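Unwrapping that 'failed:' line, the command that times out is the crush update, i.e. a step that has to reach a monitor (which I suppose explains the pipe ... fault lines just above):

    timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf \
        --name=osd.113 --keyring=/var/lib/ceph/osd/ceph-113/keyring \
        osd crush create-or-move -- 113 1.81 host=ceph4 root=default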
Please note that at that point of the boot I think there is still no network, as the interfaces are only brought up later according to the network journal:

Apr 21 18:02:16 ceph4._snip_ network[2904]: Bringing up interface p2p1: [ OK ]
Apr 21 18:02:19 ceph4._snip_ network[2904]: Bringing up interface p2p2: [ OK ]

=> too bad for the OSD startups... I have to say I also disabled NetworkManager and I'm using static network configuration files... but I don't know why the ceph init script would be called before the network is up...?
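(Judging by the journal above, it is actually systemd-udevd, not the init script, that runs ceph-disk-activate, presumably via the udev rules shipped with ceph, and those fire as soon as the disks are detected, network or not. Something like this should list them on CentOS 7; the 95-ceph-osd.rules name is what I expect from the package, to be double-checked:)

    # find the udev rules that call ceph-disk at boot
    grep -l ceph-disk /usr/lib/udev/rules.d/*.rules
    # then look at what they match on and what they run, e.g.:
    cat /usr/lib/udev/rules.d/95-ceph-osd.rules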
But even if I had the network up, I'm having another issue: I'm wondering whether I'm hitting deadlocks somewhere...

Apr 21 18:01:10 ceph4._snip_ systemd-udevd[779]: worker [792] /devices/pci0000:00/0000:00:07.0/0000:03:00.0/host2/target2:2:0/2:2:0:0/block/sdn/sdn2 is taking a long time
(...)
Apr 21 18:01:54 ceph4._snip_ systemd-udevd[792]: '/usr/sbin/ceph-disk activate-journal /dev/sdn2'(err) 'SG_IO: bad/missing sense data, sb[]: 70 00 05 00'
Apr 21 18:01:54 ceph4._snip_ systemd-udevd[792]: '/usr/sbin/ceph-disk activate-journal /dev/sdn2'(err) ' 00 00 00 0b 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00'
Apr 21 18:01:54 ceph4._snip_ systemd-udevd[792]: '/usr/sbin/ceph-disk activate-journal /dev/sdn2'(err) ''
Apr 21 18:01:54 ceph4._snip_ systemd-udevd[792]: '/usr/sbin/ceph-disk activate-journal /dev/sdn2'(out) '=== osd.107 === '
Apr 21 18:01:54 ceph4._snip_ systemd-udevd[792]: '/usr/sbin/ceph-disk activate-journal /dev/sdn2'(err) '2016-04-21 18:01:54.707669 7f95801ac700 0 -- :/2141879112 >> __MON_IP__:6789/0 pipe(0x7f957c05f710 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f957c05bb40).fault'
(...)
Apr 21 18:02:12 ceph4._snip_ systemd-udevd[792]: '/usr/sbin/ceph-disk activate-journal /dev/sdn2'(err) '2016-04-21 18:02:12.709053 7f95801ac700 0 -- :/2141879112 >> __MON_IP__:6789/0 pipe(0x7f9570008280 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f95700056a0).fault'
Apr 21 18:02:16 ceph4._snip_ systemd-udevd[792]: '/usr/sbin/ceph-disk activate-journal /dev/sdn2'(err) 'create-or-move updated item name 'osd.107' weight 1.81 at location {host=ceph4,root=default} to crush map'
Apr 21 18:02:16 ceph4._snip_ systemd-udevd[792]: '/usr/sbin/ceph-disk activate-journal /dev/sdn2'(out) 'Starting Ceph osd.107 on ceph4...'
Apr 21 18:02:16 ceph4._snip_ systemd-udevd[792]: '/usr/sbin/ceph-disk activate-journal /dev/sdn2'(err) 'Running as unit ceph-osd.107.1461254514.449704730.service.'
Apr 21 18:02:16 ceph4._snip_ systemd-udevd[792]: '/usr/sbin/ceph-disk activate-journal /dev/sdn2' [1138] exit with return code 0
Apr 21 18:02:16 ceph4._snip_ systemd-udevd[792]: adding watch on '/dev/sdn2'
Apr 21 18:02:16 ceph4._snip_ systemd-udevd[792]: created db file '/run/udev/data/b8:210' for '/devices/pci0000:00/0000:00:07.0/0000:03:00.0/host2/target2:2:0/2:2:0:0/block/sdn/sdn2'

If I look at that specific OSD's journal, I see:
Apr 21 18:02:16 ceph4._snip_ systemd[3137]: Executing: /bin/bash -c 'ulimit -n 32768; /usr/bin/ceph-osd -i 107 --pid-file /var/run/ceph/osd.107.pid -c /etc/ceph/ceph.conf --cluster ceph -f'
Apr 21 18:02:16 ceph4._snip_ bash[3137]: 2016-04-21 18:02:16.147602 7f109f916880 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-107: (2) No such file or directory
Apr 21 18:02:16 ceph4._snip_ systemd[1]: Child 3137 belongs to ceph-osd.107.1461254514.449704730.service

I'm assuming this just means that the partition was not mounted correctly because ceph-disk failed, and that the ceph OSD daemon died...? After the boot... no OSD is up. And if I run ceph-disk activate-all manually after the node has booted (and given me ssh access, which indeed takes a long time)... everything comes up.

Any idea(s)? Thanks
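PS: for completeness, the manual "recovery" I run once the node is reachable is nothing more than this, plus a couple of sanity checks (the checks are just what I happen to look at, not part of the procedure):

    # re-trigger activation of all prepared OSD data partitions
    ceph-disk activate-all
    # check that the OSD filesystems are mounted again
    mount | grep /var/lib/ceph/osd
    # and that the OSDs have rejoined the cluster
    ceph osd tree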