On 11/12/2013 03:07 PM, Berant Lemmenes wrote:
I just restarted an OSD node and none of the admin sockets showed up on reboot (though it joined the cluster fine and all OSDs are happy). The node is an Ubuntu 12.04.3 system originally deployed via ceph-deploy on dumpling. The only thing that stands out to me is the failure on lock_fsid and the "error converting store" message. Here is a snip from OSD 19 covering a full reboot, starting with the 'shutdown complete' entry and going until all the reconnect messages.
This looks an awful lot like you started another instance of an OSD with the same ID while another was running. I'll walk you through the log lines that point me towards this conclusion. It would still be weird for the admin sockets to vanish because of that, so maybe that's a different issue. Are you able to reproduce the admin socket issue often?
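If you want to poke at it the next time it happens, something along these lines should tell you whether the sockets exist at all and whether the daemon still answers on them. This is just a rough sketch; the paths assume the default run directory and osd.19 on that host:

  # list whatever admin sockets are present on the node
  ls -l /var/run/ceph/

  # ask a daemon for its version over its socket; 'help' lists the other commands
  sudo ceph --admin-daemon /var/run/ceph/ceph-osd.19.asok version

If the socket file simply isn't there while the daemon is up and healthy, that's the interesting case.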
Walking through:
2013-11-12 09:44:00.757576 7fb8a8e24780 1 -- 192.168.200.54:6819/23261 shutdown complete.
Shutdown, check. OSD restarts.
2013-11-12 09:47:05.843425 7f7918e9d780 0 ceph version 0.72 (5832e2603c7db5d40b433d0953408993a9b7c217), process ceph-osd, pid 1734
2013-11-12 09:47:05.892704 7f7918e9d780 1 filestore(/var/lib/ceph/osd/ceph-19) mount detected xfs
2013-11-12 09:47:05.892718 7f7918e9d780 1 filestore(/var/lib/ceph/osd/ceph-19) disabling 'filestore replica fadvise' due to known issues with fadvise(DONTNEED) on xfs
2013-11-12 09:47:05.944312 7f7918e9d780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-19) detect_features: FIEMAP ioctl is supported and appears to work
2013-11-12 09:47:05.944327 7f7918e9d780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-19) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2013-11-12 09:47:05.944743 7f7918e9d780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-19) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2013-11-12 09:47:06.258005 7f7918e9d780 0 filestore(/var/lib/ceph/osd/ceph-19) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2013-11-12 09:47:07.567405 7f7918e9d780 1 journal _open /var/lib/ceph/osd/ceph-19/journal fd 19: 10239344640 bytes, block size 4096 bytes, directio = 1, aio = 1
2013-11-12 09:47:07.570098 7f7918e9d780 1 journal _open /var/lib/ceph/osd/ceph-19/journal fd 19: 10239344640 bytes, block size 4096 bytes, directio = 1, aio = 1
2013-11-12 09:47:07.570352 7f7918e9d780 1 journal close /var/lib/ceph/osd/ceph-19/journal
2013-11-12 09:47:07.571215 7f7918e9d780 1 filestore(/var/lib/ceph/osd/ceph-19) mount detected xfs
2013-11-12 09:47:07.572742 7f7918e9d780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-19) detect_features: FIEMAP ioctl is supported and appears to work
2013-11-12 09:47:07.572750 7f7918e9d780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-19) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2013-11-12 09:47:07.573234 7f7918e9d780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-19) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2013-11-12 09:47:07.574879 7f7918e9d780 0 filestore(/var/lib/ceph/osd/ceph-19) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2013-11-12 09:47:07.577043 7f7918e9d780 1 journal _open /var/lib/ceph/osd/ceph-19/journal fd 23: 10239344640 bytes, block size 4096 bytes, directio = 1, aio = 1
2013-11-12 09:47:07.578649 7f7918e9d780 1 journal _open /var/lib/ceph/osd/ceph-19/journal fd 23: 10239344640 bytes, block size 4096 bytes, directio = 1, aio = 1
2013-11-12 09:47:07.680531 7f7918e9d780 0 <cls> cls/hello/cls_hello.cc:271: loading cls_hello
The OSD is up and running. Then another instance starts; the next lines are the relevant bit that shows it.
2013-11-12 09:47:09.670813 7f8151b5f780 0 ceph version 0.72 (5832e2603c7db5d40b433d0953408993a9b7c217), process ceph-osd, pid 2769
2013-11-12 09:47:09.673789 7f8151b5f780 0 filestore(/var/lib/ceph/osd/ceph-19) lock_fsid failed to lock /var/lib/ceph/osd/ceph-19/fsid, is another ceph-osd still running? (11) Resource temporarily unavailable
This last line tells us that ceph-osd believes another instance is running, so you should first find out whether there's actually another instance being run somewhere, somehow. How did you start these daemons?
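A few things I'd check on that node, as a rough sketch; the '-i 19' job name and the paths are assumptions based on the default layout and upstart on 12.04:

  # how many ceph-osd processes are up, and what upstart thinks is running
  ps aux | grep '[c]eph-osd'
  initctl list | grep ceph-osd

  # which process has the fsid file open
  sudo lsof /var/lib/ceph/osd/ceph-19/fsid

  # try to take the same kind of non-blocking exclusive lock lock_fsid takes;
  # a non-zero exit status means something else still holds it
  sudo flock -n /var/lib/ceph/osd/ceph-19/fsid -c true; echo $?

That should at least tell us whether there really are two instances fighting over the same data dir, or whether something else is holding that lock.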
2013-11-12 09:47:09.673804 7f8151b5f780 -1 filestore(/var/lib/ceph/osd/ceph-19) FileStore::mount: lock_fsid failed
2013-11-12 09:47:09.673919 7f8151b5f780 -1 ** ERROR: error converting store /var/lib/ceph/osd/ceph-19: (16) Device or resource busy
2013-11-12 09:47:14.169305 7f78fd548700 0 -- 10.200.1.54:6802/1734 >> 10.200.1.51:6800/13263 pipe(0x1e48c80 sd=42 :55275 s=2 pgs=5530 cs=1 l=0 c=0x1eae2c0).fault, initiating reconnect
2013-11-12 09:47:14.169444 7f78fd346700 0 -- 10.200.1.54:6802/1734 >> 10.200.1.57:6804/8226 pipe(0xc1ed500 sd=43 :47978 s=2 pgs=16845 cs=1 l=0 c=0x1eae840).fault, initiating reconnect
2013-11-12 09:47:14.169988 7f78fd144700 0 -- 10.200.1.54:6802/1734 >> 10.200.1.59:6810/4862 pipe(0xc1ed280 sd=46 :37094 s=2 pgs=42297 cs=1 l=0 c=0x1eae6e0).fault, initiating reconnect

And here is roughly the same snip from just doing a 'sudo restart ceph-osd-all':

2013-11-12 09:56:36.658014 7f7918e9d780 1 -- 192.168.200.54:6811/1734 shutdown complete.
2013-11-12 09:56:37.556988 7f3793c21780 0 ceph version 0.72 (5832e2603c7db5d40b433d0953408993a9b7c217), process ceph-osd, pid 13723
2013-11-12 09:56:37.559314 7f3793c21780 1 filestore(/var/lib/ceph/osd/ceph-19) mount detected xfs
2013-11-12 09:56:37.559319 7f3793c21780 1 filestore(/var/lib/ceph/osd/ceph-19) disabling 'filestore replica fadvise' due to known issues with fadvise(DONTNEED) on xfs
2013-11-12 09:56:37.561350 7f3793c21780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-19) detect_features: FIEMAP ioctl is supported and appears to work
2013-11-12 09:56:37.561360 7f3793c21780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-19) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2013-11-12 09:56:37.562357 7f3793c21780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-19) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2013-11-12 09:56:37.571030 7f3793c21780 0 filestore(/var/lib/ceph/osd/ceph-19) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2013-11-12 09:56:37.574273 7f3793c21780 1 journal _open /var/lib/ceph/osd/ceph-19/journal fd 23: 10239344640 bytes, block size 4096 bytes, directio = 1, aio = 1
2013-11-12 09:56:37.578189 7f3793c21780 1 journal _open /var/lib/ceph/osd/ceph-19/journal fd 23: 10239344640 bytes, block size 4096 bytes, directio = 1, aio = 1
2013-11-12 09:56:37.578854 7f3793c21780 1 journal close /var/lib/ceph/osd/ceph-19/journal
2013-11-12 09:56:37.579638 7f3793c21780 1 filestore(/var/lib/ceph/osd/ceph-19) mount detected xfs
2013-11-12 09:56:37.581110 7f3793c21780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-19) detect_features: FIEMAP ioctl is supported and appears to work
2013-11-12 09:56:37.581118 7f3793c21780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-19) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2013-11-12 09:56:37.582014 7f3793c21780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-19) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2013-11-12 09:56:37.583365 7f3793c21780 0 filestore(/var/lib/ceph/osd/ceph-19) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2013-11-12 09:56:37.585765 7f3793c21780 1 journal _open /var/lib/ceph/osd/ceph-19/journal fd 24: 10239344640 bytes, block size 4096 bytes, directio = 1, aio = 1
2013-11-12 09:56:37.588281 7f3793c21780 1 journal _open /var/lib/ceph/osd/ceph-19/journal fd 24: 10239344640 bytes, block size 4096 bytes, directio = 1, aio = 1
2013-11-12 09:56:37.589782 7f3793c21780 0 <cls> cls/hello/cls_hello.cc:271: loading cls_hello
2013-11-12 09:56:39.723134 7f377488b700 0 -- 10.200.1.54:6807/13723 >> 10.200.1.56:6806/563 pipe(0xc87ca00 sd=155 :38290 s=1 pgs=17864 cs=2 l=0 c=0xc893160).fault
2013-11-12 09:56:39.728798 7f3775194700 0 -- 10.200.1.54:6807/13723 >> 10.200.1.52:6808/14464 pipe(0xc811000 sd=52 :51030 s=1 pgs=7473 cs=6 l=0 c=0xc7fbb00).fault
2013-11-12 09:56:39.807114 7f37787ca700 0 -- 10.200.1.54:6807/13723 >> 10.200.1.52:6805/14449 pipe(0xc756280 sd=72 :46552 s=1 pgs=10912 cs=96 l=0 c=0xc740420).fault
2013-11-12 09:56:39.852465 7f3778ccf700 0 -- 10.200.1.54:6807/13723 >> 10.200.1.57:6804/8226 pipe(0x2427780 sd=83 :48234 s=1 pgs=17251 cs=128 l=0 c=0x2406dc0).fault
2013-11-12 09:56:39.898327 7f377488b700 0 -- 10.200.1.54:6807/13723 >> 10.200.1.56:6806/563 pipe(0xc87ca00 sd=42 :40942 s=1 pgs=17945 cs=164 l=0 c=0xc893160).fault
2013-11-12 09:56:40.738437 7f3775ea1700 0 -- 10.200.1.54:6807/13723 >> 10.200.1.60:6810/32089 pipe(0xc7c2500 sd=72 :40289 s=2 pgs=33225 cs=109 l=0 c=0xc7fb840).fault with nothing to send, going to standby
2013-11-12 09:56:40.740185 7f376b2fd700 0 -- 10.200.1.54:6807/13723 >> 10.200.1.60:6810/32089 pipe(0xcd66a00 sd=279 :6807 s=0 pgs=0 cs=0 l=0 c=0xc79d000).accept connect_seq 0 vs existing 109 state standby
2013-11-12 09:56:40.740201 7f376b2fd700 0 -- 10.200.1.54:6807/13723 >> 10.200.1.60:6810/32089 pipe(0xcd66a00 sd=279 :6807 s=0 pgs=0 cs=0 l=0 c=0xc79d000).accept peer reset, then tried to connect to us, replacing
2013-11-12 09:56:41.639911 7f376fd47700 0 -- 192.168.200.54:6806/13723 >> 192.168.48.127:0/234188561 pipe(0xcf87a00 sd=127 :6806 s=0 pgs=0 cs=0 l=0 c=0xcb80580).accept peer addr is really 192.168.48.127:0/234188561 (socket is 192.168.48.127:60893/0)
2013-11-12 09:56:44.394952 7f37657a3700 0 -- 10.200.1.54:6807/13723 >> 10.200.1.54:6810/13792 pipe(0xcee7c80 sd=160 :6807 s=0 pgs=0 cs=0 l=0 c=0xd0d7160).accept connect_seq 0 vs existing 0 state connecting
2013-11-12 09:56:59.334100 7f3764396700 0 -- 192.168.200.54:6806/13723 >> 192.168.48.102:0/663636012 pipe(0xdbb9280 sd=197 :6806 s=0 pgs=0 cs=0 l=0 c=0xdbbc000).accept peer addr is really 192.168.48.102:0/663636012 (socket is 192.168.48.102:35496/0)
2013-11-12 09:57:45.805456 7f3764194700 0 -- 192.168.200.54:6806/13723 >> 192.168.48.103:0/1090276439 pipe(0xdbb9000 sd=180 :6806 s=0 pgs=0 cs=0 l=0 c=0xce83dc0).accept peer addr is really 192.168.48.103:0/1090276439 (socket is 192.168.48.103:41220/0)

After the 'restart ceph-osd-all' the admin sockets for all 4 OSDs on this host are present. Let me know if there is additional logging or assistance I can provide to narrow it down.

Thanks,
Berant

On Tue, Nov 12, 2013 at 4:03 AM, Joao Luis <joao.luis@xxxxxxxxxxx> wrote:

On Nov 12, 2013 2:38 AM, "Berant Lemmenes" <berant@xxxxxxxxxxxx> wrote:
>
> I noticed the same behavior on my dumpling cluster. They wouldn't show up after boot, but after a service restart they were there.
>
> I haven't tested a node reboot since I upgraded to emperor today. I'll give it a shot tomorrow.
>
> Thanks,
> Berant
>
> On Nov 11, 2013 9:29 PM, "Peter Matulis" <peter.matulis@xxxxxxxxxxxxx> wrote:
>>
>> After upgrading from Dumpling to Emperor on Ubuntu 12.04 I noticed the
>> admin sockets for each of my monitors were missing although the cluster
>> seemed to continue running fine. There wasn't anything under
>> /var/run/ceph. After restarting the service on each monitor node they
>> reappeared. Anyone?
>>
>> ~pmatulis

Odd behavior. The monitors do remove the admin socket on shutdown and proceed to create it when they start, but as long as they are running it should exist. Have you checked the logs for some error message that could provide more insight on the cause?

-Joao
--
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com