On 11/12/2013 03:07 PM, Berant Lemmenes wrote:
I just restarted an OSD node and none of the admin sockets showed up on reboot (though it joined the cluster fine and all OSDs are happy). The node is an Ubuntu 12.04.3 system originally deployed via ceph-deploy on dumpling. The only thing that stands out to me is the failure on lock_fsid and the "error converting store" message. Here is a snip from OSD 19 covering a full reboot, starting with the 'shutdown complete' entry and going until all the reconnect messages.
This looks an awful lot like you started another instance of an OSD with the same ID while another was running. I'll walk you through the log lines that point me towards this conclusion. It would still be weird for the admin sockets to vanish because of that, so maybe that's a different issue. Are you able to reproduce the admin socket issue often?
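If you want to poke at it the next time it happens, something along these lines should tell you whether the sockets exist at all and whether the daemon still answers on them. This is just a rough sketch; the paths assume the default run directory and osd.19 on that host:

  # list whatever admin sockets are present on the node
  ls -l /var/run/ceph/

  # ask a daemon for its version over its socket; 'help' lists the other commands
  sudo ceph --admin-daemon /var/run/ceph/ceph-osd.19.asok version

If the socket file simply isn't there while the daemon is up and healthy, that's the interesting case.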
Walking through:
2013-11-12 09:44:00.757576 7fb8a8e24780 1 -- 192.168.200.54:6819/23261 shutdown complete.
Shutdown, check. OSD restarts.
2013-11-12 09:47:05.843425 7f7918e9d780 0 ceph version 0.72 (5832e2603c7db5d40b433d0953408993a9b7c217), process ceph-osd, pid 1734
2013-11-12 09:47:05.892704 7f7918e9d780 1 filestore(/var/lib/ceph/osd/ceph-19) mount detected xfs
2013-11-12 09:47:05.892718 7f7918e9d780 1 filestore(/var/lib/ceph/osd/ceph-19) disabling 'filestore replica fadvise' due to known issues with fadvise(DONTNEED) on xfs
2013-11-12 09:47:05.944312 7f7918e9d780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-19) detect_features: FIEMAP ioctl is supported and appears to work
2013-11-12 09:47:05.944327 7f7918e9d780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-19) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2013-11-12 09:47:05.944743 7f7918e9d780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-19) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2013-11-12 09:47:06.258005 7f7918e9d780 0 filestore(/var/lib/ceph/osd/ceph-19) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2013-11-12 09:47:07.567405 7f7918e9d780 1 journal _open /var/lib/ceph/osd/ceph-19/journal fd 19: 10239344640 bytes, block size 4096 bytes, directio = 1, aio = 1
2013-11-12 09:47:07.570098 7f7918e9d780 1 journal _open /var/lib/ceph/osd/ceph-19/journal fd 19: 10239344640 bytes, block size 4096 bytes, directio = 1, aio = 1
2013-11-12 09:47:07.570352 7f7918e9d780 1 journal close /var/lib/ceph/osd/ceph-19/journal
2013-11-12 09:47:07.571215 7f7918e9d780 1 filestore(/var/lib/ceph/osd/ceph-19) mount detected xfs
2013-11-12 09:47:07.572742 7f7918e9d780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-19) detect_features: FIEMAP ioctl is supported and appears to work
2013-11-12 09:47:07.572750 7f7918e9d780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-19) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2013-11-12 09:47:07.573234 7f7918e9d780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-19) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2013-11-12 09:47:07.574879 7f7918e9d780 0 filestore(/var/lib/ceph/osd/ceph-19) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2013-11-12 09:47:07.577043 7f7918e9d780 1 journal _open /var/lib/ceph/osd/ceph-19/journal fd 23: 10239344640 bytes, block size 4096 bytes, directio = 1, aio = 1
2013-11-12 09:47:07.578649 7f7918e9d780 1 journal _open /var/lib/ceph/osd/ceph-19/journal fd 23: 10239344640 bytes, block size 4096 bytes, directio = 1, aio = 1
2013-11-12 09:47:07.680531 7f7918e9d780 0 <cls> cls/hello/cls_hello.cc:271: loading cls_hello
The OSD is up and running. Then another instance starts; the next lines are the relevant bit that shows it.
2013-11-12 09:47:09.670813 7f8151b5f780 0 ceph version 0.72 (5832e2603c7db5d40b433d0953408993a9b7c217), process ceph-osd, pid 2769
2013-11-12 09:47:09.673789 7f8151b5f780 0 filestore(/var/lib/ceph/osd/ceph-19) lock_fsid failed to lock /var/lib/ceph/osd/ceph-19/fsid, is another ceph-osd still running? (11) Resource temporarily unavailable
This last line tells us that ceph-osd believes another instance is running, so you should first find out whether there's actually another instance being run somewhere, somehow. How did you start these daemons?
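A few things I'd check on that node, as a rough sketch; the '-i 19' job name and the paths are assumptions based on the default layout and upstart on 12.04:

  # how many ceph-osd processes are up, and what upstart thinks is running
  ps aux | grep '[c]eph-osd'
  initctl list | grep ceph-osd

  # which process has the fsid file open
  sudo lsof /var/lib/ceph/osd/ceph-19/fsid

  # try to take the same kind of non-blocking exclusive lock lock_fsid takes;
  # a non-zero exit status means something else still holds it
  sudo flock -n /var/lib/ceph/osd/ceph-19/fsid -c true; echo $?

That should at least tell us whether there really are two instances fighting over the same data dir, or whether something else is holding that lock.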
2013-11-12 09:47:09.673804 7f8151b5f780 -1 filestore(/var/lib/ceph/osd/ceph-19) FileStore::mount: lock_fsid failed
2013-11-12 09:47:09.673919 7f8151b5f780 -1 ** ERROR: error converting store /var/lib/ceph/osd/ceph-19: (16) Device or resource busy
2013-11-12 09:47:14.169305 7f78fd548700 0 -- 10.200.1.54:6802/1734 >> 10.200.1.51:6800/13263 pipe(0x1e48c80 sd=42 :55275 s=2 pgs=5530 cs=1 l=0 c=0x1eae2c0).fault, initiating reconnect
2013-11-12 09:47:14.169444 7f78fd346700 0 -- 10.200.1.54:6802/1734 >> 10.200.1.57:6804/8226 pipe(0xc1ed500 sd=43 :47978 s=2 pgs=16845 cs=1 l=0 c=0x1eae840).fault, initiating reconnect
2013-11-12 09:47:14.169988 7f78fd144700 0 -- 10.200.1.54:6802/1734 >> 10.200.1.59:6810/4862 pipe(0xc1ed280 sd=46 :37094 s=2 pgs=42297 cs=1 l=0 c=0x1eae6e0).fault, initiating reconnect

And here is roughly the same snip from just doing a 'sudo restart ceph-osd-all':

2013-11-12 09:56:36.658014 7f7918e9d780 1 -- 192.168.200.54:6811/1734 shutdown complete.
2013-11-12 09:56:37.556988 7f3793c21780 0 ceph version 0.72 (5832e2603c7db5d40b433d0953408993a9b7c217), process ceph-osd, pid 13723
2013-11-12 09:56:37.559314 7f3793c21780 1 filestore(/var/lib/ceph/osd/ceph-19) mount detected xfs
2013-11-12 09:56:37.559319 7f3793c21780 1 filestore(/var/lib/ceph/osd/ceph-19) disabling 'filestore replica fadvise' due to known issues with fadvise(DONTNEED) on xfs
2013-11-12 09:56:37.561350 7f3793c21780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-19) detect_features: FIEMAP ioctl is supported and appears to work
2013-11-12 09:56:37.561360 7f3793c21780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-19) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2013-11-12 09:56:37.562357 7f3793c21780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-19) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2013-11-12 09:56:37.571030 7f3793c21780 0 filestore(/var/lib/ceph/osd/ceph-19) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2013-11-12 09:56:37.574273 7f3793c21780 1 journal _open /var/lib/ceph/osd/ceph-19/journal fd 23: 10239344640 bytes, block size 4096 bytes, directio = 1, aio = 1
2013-11-12 09:56:37.578189 7f3793c21780 1 journal _open /var/lib/ceph/osd/ceph-19/journal fd 23: 10239344640 bytes, block size 4096 bytes, directio = 1, aio = 1
2013-11-12 09:56:37.578854 7f3793c21780 1 journal close /var/lib/ceph/osd/ceph-19/journal
2013-11-12 09:56:37.579638 7f3793c21780 1 filestore(/var/lib/ceph/osd/ceph-19) mount detected xfs
2013-11-12 09:56:37.581110 7f3793c21780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-19) detect_features: FIEMAP ioctl is supported and appears to work
2013-11-12 09:56:37.581118 7f3793c21780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-19) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2013-11-12 09:56:37.582014 7f3793c21780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-19) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2013-11-12 09:56:37.583365 7f3793c21780 0 filestore(/var/lib/ceph/osd/ceph-19) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2013-11-12 09:56:37.585765 7f3793c21780 1 journal _open /var/lib/ceph/osd/ceph-19/journal fd 24: 10239344640 bytes, block size 4096 bytes, directio = 1, aio = 1
2013-11-12 09:56:37.588281 7f3793c21780 1 journal _open /var/lib/ceph/osd/ceph-19/journal fd 24: 10239344640 bytes, block size 4096 bytes, directio = 1, aio = 1
2013-11-12 09:56:37.589782 7f3793c21780 0 <cls> cls/hello/cls_hello.cc:271: loading cls_hello
2013-11-12 09:56:39.723134 7f377488b700 0 -- 10.200.1.54:6807/13723 >> 10.200.1.56:6806/563 pipe(0xc87ca00 sd=155 :38290 s=1 pgs=17864 cs=2 l=0 c=0xc893160).fault
2013-11-12 09:56:39.728798 7f3775194700 0 -- 10.200.1.54:6807/13723 >> 10.200.1.52:6808/14464 pipe(0xc811000 sd=52 :51030 s=1 pgs=7473 cs=6 l=0 c=0xc7fbb00).fault
2013-11-12 09:56:39.807114 7f37787ca700 0 -- 10.200.1.54:6807/13723 >> 10.200.1.52:6805/14449 pipe(0xc756280 sd=72 :46552 s=1 pgs=10912 cs=96 l=0 c=0xc740420).fault
2013-11-12 09:56:39.852465 7f3778ccf700 0 -- 10.200.1.54:6807/13723 >> 10.200.1.57:6804/8226 pipe(0x2427780 sd=83 :48234 s=1 pgs=17251 cs=128 l=0 c=0x2406dc0).fault
2013-11-12 09:56:39.898327 7f377488b700 0 -- 10.200.1.54:6807/13723 >> 10.200.1.56:6806/563 pipe(0xc87ca00 sd=42 :40942 s=1 pgs=17945 cs=164 l=0 c=0xc893160).fault
2013-11-12 09:56:40.738437 7f3775ea1700 0 -- 10.200.1.54:6807/13723 >> 10.200.1.60:6810/32089 pipe(0xc7c2500 sd=72 :40289 s=2 pgs=33225 cs=109 l=0 c=0xc7fb840).fault with nothing to send, going to standby
2013-11-12 09:56:40.740185 7f376b2fd700 0 -- 10.200.1.54:6807/13723 >> 10.200.1.60:6810/32089 pipe(0xcd66a00 sd=279 :6807 s=0 pgs=0 cs=0 l=0 c=0xc79d000).accept connect_seq 0 vs existing 109 state standby
2013-11-12 09:56:40.740201 7f376b2fd700 0 -- 10.200.1.54:6807/13723 >> 10.200.1.60:6810/32089 pipe(0xcd66a00 sd=279 :6807 s=0 pgs=0 cs=0 l=0 c=0xc79d000).accept peer reset, then tried to connect to us, replacing
2013-11-12 09:56:41.639911 7f376fd47700 0 -- 192.168.200.54:6806/13723 >> 192.168.48.127:0/234188561 pipe(0xcf87a00 sd=127 :6806 s=0 pgs=0 cs=0 l=0 c=0xcb80580).accept peer addr is really 192.168.48.127:0/234188561 (socket is 192.168.48.127:60893/0)
2013-11-12 09:56:44.394952 7f37657a3700 0 -- 10.200.1.54:6807/13723 >> 10.200.1.54:6810/13792 pipe(0xcee7c80 sd=160 :6807 s=0 pgs=0 cs=0 l=0 c=0xd0d7160).accept connect_seq 0 vs existing 0 state connecting
2013-11-12 09:56:59.334100 7f3764396700 0 -- 192.168.200.54:6806/13723 >> 192.168.48.102:0/663636012 pipe(0xdbb9280 sd=197 :6806 s=0 pgs=0 cs=0 l=0 c=0xdbbc000).accept peer addr is really 192.168.48.102:0/663636012 (socket is 192.168.48.102:35496/0)
2013-11-12 09:57:45.805456 7f3764194700 0 -- 192.168.200.54:6806/13723 >> 192.168.48.103:0/1090276439 pipe(0xdbb9000 sd=180 :6806 s=0 pgs=0 cs=0 l=0 c=0xce83dc0).accept peer addr is really 192.168.48.103:0/1090276439 (socket is 192.168.48.103:41220/0)

After the 'restart ceph-osd-all' the admin sockets for all 4 OSDs on this host are present. Let me know if there is additional logging or assistance I can provide to narrow it down.

Thanks,
Berant

On Tue, Nov 12, 2013 at 4:03 AM, Joao Luis <joao.luis@xxxxxxxxxxx> wrote:

On Nov 12, 2013 2:38 AM, "Berant Lemmenes" <berant@xxxxxxxxxxxx> wrote:
>
> I noticed the same behavior on my dumpling cluster. They wouldn't show up after boot, but after a service restart they were there.
>
> I haven't tested a node reboot since I upgraded to emperor today. I'll give it a shot tomorrow.
>
> Thanks,
> Berant
>
> On Nov 11, 2013 9:29 PM, "Peter Matulis" <peter.matulis@xxxxxxxxxxxxx> wrote:
>>
>> After upgrading from Dumpling to Emperor on Ubuntu 12.04 I noticed the
>> admin sockets for each of my monitors were missing although the cluster
>> seemed to continue running fine. There wasn't anything under
>> /var/run/ceph. After restarting the service on each monitor node they
>> reappeared. Anyone?
>>
>> ~pmatulis

Odd behavior. The monitors do remove the admin socket on shutdown and proceed to create it when they start, but as long as they are running it should exist. Have you checked the logs for some error message that could provide more insight on the cause?

-Joao
--
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com