Re: mds port 6800 socket closed when accessing mount point

Yonggang Liu <myidpt@xxxxxxxxx> · Wed, 27 Oct 2010 15:05:27 -0400

Hi Colin,

Thank you for your reply. I ran "ps uax | grep ceph" on 4 machines to
check if the daemons are running. I got:
On client:
root      1721  0.0  0.0      0     0 ?        S    17:53   0:00 [ceph-msgr/0]
root      1725  0.0  0.0      0     0 ?        S    17:53   0:00
[ceph-writeback/]
root      1726  0.0  0.0      0     0 ?        S    17:53   0:00
[ceph-pg-invalid]
root      1727  0.0  0.0      0     0 ?        S    17:53   0:00 [ceph-trunc]
On mds0:
root      1930  0.5  1.2  48468  3280 ?        S<sl 15:55   0:01
/usr/bin/cmon -i 0 -c /tmp/ceph.conf.1869
root      1979  4.7  1.4  76340  3740 ?        S<sl 15:55   0:13
/usr/bin/cmds -i 0 -c /tmp/ceph.conf.1869
On osd0:
root      1657  0.4  7.4 214316 18988 ?        S<sl 15:55   0:01
/usr/bin/cosd -i 0 -c /tmp/ceph.conf.1869
On osd1:
root      1676  0.7 10.6 220624 27180 ?        S<sl 15:55   0:02
/usr/bin/cosd -i 1 -c /tmp/ceph.conf.1869

These are the last few lines of log files I got from /var/log/ceph/:
On mds0
mds.0.log:
2010-10-27 15:56:09.238267 b362fb70 failed to decode message of type
784 v4865: buffer::end_of_buffer
2010-10-27 15:56:09.238960 b362fb70 -- 192.168.89.133:6800/1978 >>
192.168.89.131:0/1344076202 pipe(0x9fceb20 sd=-1 pgs=9 cs=7 l=0).fault
with nothing to send, going to standby
2010-10-27 15:56:23.330586 b352eb70 -- 192.168.89.133:6800/1978 >>
192.168.89.131:0/1344076202 pipe(0x9fd0168 sd=12 pgs=0 cs=0
l=0).accept peer addr is really 192.168.89.131:0/1344076202 (socket is
192.168.89.131:41725/0)
2010-10-27 15:56:23.330744 b352eb70 -- 192.168.89.133:6800/1978 >>
192.168.89.131:0/1344076202 pipe(0x9fd0168 sd=12 pgs=0 cs=0
l=0).accept connect_seq 7 vs existing 7 state 3
2010-10-27 15:56:23.332473 b352eb70 failed to decode message of type
784 v4865: buffer::end_of_buffer
2010-10-27 15:56:23.332730 b352eb70 -- 192.168.89.133:6800/1978 >>
192.168.89.131:0/1344076202 pipe(0x9fd0168 sd=-1 pgs=10 cs=8
l=0).fault with nothing to send, going to standby
2010-10-27 15:56:52.022573 b332cb70 -- 192.168.89.133:6800/1978 >>
192.168.89.131:0/1344076202 pipe(0x9fd1720 sd=12 pgs=0 cs=0
l=0).accept peer addr is really 192.168.89.131:0/1344076202 (socket is
192.168.89.131:41726/0)
2010-10-27 15:56:52.022945 b332cb70 -- 192.168.89.133:6800/1978 >>
192.168.89.131:0/1344076202 pipe(0x9fd1720 sd=12 pgs=0 cs=0
l=0).accept connect_seq 8 vs existing 8 state 3
2010-10-27 15:56:52.025299 b332cb70 failed to decode message of type
784 v4865: buffer::end_of_buffer
2010-10-27 15:56:52.025569 b332cb70 -- 192.168.89.133:6800/1978 >>
192.168.89.131:0/1344076202 pipe(0x9fd1720 sd=-1 pgs=11 cs=9
l=0).fault with nothing to send, going to standby
2010-10-27 15:57:48.106337 b362fb70 -- 192.168.89.133:6800/1978 >>
192.168.89.131:0/1344076202 pipe(0x9fd07c0 sd=12 pgs=0 cs=0
l=0).accept peer addr is really 192.168.89.131:0/1344076202 (socket is
192.168.89.131:41727/0)
2010-10-27 15:57:48.106522 b362fb70 -- 192.168.89.133:6800/1978 >>
192.168.89.131:0/1344076202 pipe(0x9fd07c0 sd=12 pgs=0 cs=0
l=0).accept connect_seq 9 vs existing 9 state 3
2010-10-27 15:57:48.109498 b362fb70 failed to decode message of type
784 v4865: buffer::end_of_buffer
2010-10-27 15:57:48.109761 b362fb70 -- 192.168.89.133:6800/1978 >>
192.168.89.131:0/1344076202 pipe(0x9fd07c0 sd=-1 pgs=12 cs=10
l=0).fault with nothing to send, going to standby

On osd0
osd.0.log:
2010-10-27 15:55:01.470502 --- 1633 opened log /var/log/ceph/osd.0.log ---
ceph version 0.22.1 (commit:7464f9688001aa89f9673ba14e6d075d0ee33541)
2010-10-27 15:55:01.485828 b72d38e0 filestore(/data/osd0) mkfs in /data/osd0
2010-10-27 15:55:01.486106 b72d38e0 filestore(/data/osd0) mkfs
removing old file fsid
2010-10-27 15:55:01.516527 b72d38e0 filestore(/data/osd0) mkjournal
created journal on /journal
2010-10-27 15:55:01.516734 b72d38e0 filestore(/data/osd0) mkfs done in
/data/osd0
2010-10-27 15:55:01.519606 b72d38e0 filestore(/data/osd0) mount did
NOT detect btrfs
2010-10-27 15:55:01.519794 b72d38e0 filestore(/data/osd0) mount found snaps <>
2010-10-27 15:55:01.548307 b5acfb70 FileStore::op_tp worker finish
2010-10-27 15:55:01.548455 b52ceb70 FileStore::op_tp worker finish
2010-10-27 15:55:01.548812 b72d38e0 journal close /journal
2010-10-27 15:55:25.201288 --- 1656 opened log /var/log/ceph/osd.0.log ---
ceph version 0.22.1 (commit:7464f9688001aa89f9673ba14e6d075d0ee33541)
2010-10-27 15:55:25.223381 b74508e0 filestore(/data/osd0) mount did
NOT detect btrfs
2010-10-27 15:55:25.224253 b74508e0 filestore(/data/osd0) mount found snaps <>
2010-10-27 15:55:25.225396 b74508e0 journal read_entry 4096 : seq 1 203 bytes
2010-10-27 15:55:26.841749 abffeb70 -- 0.0.0.0:6801/1656 >>
192.168.89.135:6801/1675 pipe(0x94ced18 sd=13 pgs=0 cs=0 l=0).connect
claims to be 0.0.0.0:6801/1675 not 192.168.89.135:6801/1675 -
presumably this is the same node!
2010-10-27 15:56:04.219613 abaf9b70 -- 192.168.89.134:6800/1656 >>
192.168.89.131:0/1344076202 pipe(0x96463e0 sd=16 pgs=0 cs=0
l=0).accept peer addr is really 192.168.89.131:0/1344076202 (socket is
192.168.89.131:36746/0)

On osd1
osd.1.log:
2010-10-27 15:54:59.752615 --- 1652 opened log /var/log/ceph/osd.1.log ---
ceph version 0.22.1 (commit:7464f9688001aa89f9673ba14e6d075d0ee33541)
2010-10-27 15:54:59.766128 b73518e0 filestore(/data/osd1) mkfs in /data/osd1
2010-10-27 15:54:59.766658 b73518e0 filestore(/data/osd1) mkfs
removing old file fsid
2010-10-27 15:54:59.796938 b73518e0 filestore(/data/osd1) mkjournal
created journal on /journal
2010-10-27 15:54:59.797816 b73518e0 filestore(/data/osd1) mkfs done in
/data/osd1
2010-10-27 15:54:59.800957 b73518e0 filestore(/data/osd1) mount did
NOT detect btrfs
2010-10-27 15:54:59.801087 b73518e0 filestore(/data/osd1) mount found snaps <>
2010-10-27 15:54:59.832202 b534cb70 FileStore::op_tp worker finish
2010-10-27 15:54:59.832504 b5b4db70 FileStore::op_tp worker finish
2010-10-27 15:54:59.832723 b73518e0 journal close /journal
2010-10-27 15:55:23.050042 --- 1675 opened log /var/log/ceph/osd.1.log ---
ceph version 0.22.1 (commit:7464f9688001aa89f9673ba14e6d075d0ee33541)
2010-10-27 15:55:23.056671 b72e18e0 filestore(/data/osd1) mount did
NOT detect btrfs
2010-10-27 15:55:23.056921 b72e18e0 filestore(/data/osd1) mount found snaps <>
2010-10-27 15:55:23.057368 b72e18e0 journal read_entry 4096 : seq 1 203 bytes
2010-10-27 15:55:23.207216 b12d4b70 osd1 2 map says i am down or have
a different address.  switching to boot state.
2010-10-27 15:55:23.207540 b12d4b70 log [WRN] : map e2 wrongly marked me down

I noticed on osd1, the last two lines of osd.1.log are unusual, but
I'm not sure if they are the reason to the problem ...

Thank you very much,

On Wed, Oct 27, 2010 at 1:10 PM, Colin McCabe <cmccabe@xxxxxxxxxxxxxx> wrote:
> Hi Yonggang,
>
> Are all of the daemons still running? What is at the end of the logfiles?
>
> regards,
> Colin
>
>
> On Wed, Oct 27, 2010 at 9:42 AM, Yonggang Liu <myidpt@xxxxxxxxx> wrote:
>> Hello,
>>
>> I'm totally new to Ceph. Last a few days I set up 4 VMs to run Ceph:
>> "mds0" for the metadata server and monitor, "osd0" and "osd1" for two
>> data servers, and "client" for the client machine. The VMs are running
>> Debian 5.0 with kernel 2.6.32-5-686 (Ceph module is enabled).
>> I followed "Building kernel client" and "Debian" from the wiki, and I
>> was able to start Ceph and mount Ceph at the client. But the problem
>> is, the mounted point always fail with an infinite response time
>> (after I mount Ceph for about 1 min or less). To illustrate it better,
>> I will show you the information I got on the client and mds0 machines:
>>
>> mds0 (192.168.89.133):
>> debian:~# mkcephfs -c /etc/ceph/ceph.conf --allhosts -v
>> (A lot of info)
>> debian:~# /etc/init.d/ceph -a start
>> (some info)
>>
>> client (192.168.89.131):
>> debian:~# mount -t ceph 192.168.89.133:/ /ceph
>> debian:~# cd /ceph
>> debian:/ceph# cp ~/app_ch.xls .
>> debian:/ceph# ls
>> (waiting for ever)
>> ^C
>>
>> After the failure I ran dmesg at the client side and got:
>> client (192.168.89.131):
>> debian:/ceph# dmesg -c
>> [  636.664425] ceph: loaded (mon/mds/osd proto 15/32/24, osdmap 5/5 5/5)
>> [  636.694973] ceph: client4100 fsid 423ad64c-bbf0-3011-bb47-36a89f8787c6
>> [  636.700716] ceph: mon0 192.168.89.133:6789 session established
>> [  664.114551] ceph: mds0 192.168.89.133:6800 socket closed
>> [  664.848722] ceph: mds0 192.168.89.133:6800 socket closed
>> [  665.914923] ceph: mds0 192.168.89.133:6800 socket closed
>> [  667.840396] ceph: mds0 192.168.89.133:6800 socket closed
>> [  672.054106] ceph: mds0 192.168.89.133:6800 socket closed
>> [  680.894531] ceph: mds0 192.168.89.133:6800 socket closed
>> [  696.928496] ceph: mds0 192.168.89.133:6800 socket closed
>> [  720.171754] ceph: mds0 caps stale
>> [  728.999701] ceph: mds0 192.168.89.133:6800 socket closed
>> [  794.640943] ceph: mds0 192.168.89.133:6800 socket closed
>>
>> Immediately after the failure, I ran netstat at mds0:
>> mds0 (192.168.89.133):
>> debian:~# netstat -anp
>> Active Internet connections (servers and established)
>> Proto Recv-Q Send-Q Local Address           Foreign Address
>> State       PID/Program name
>> tcp        0      0 0.0.0.0:6800            0.0.0.0:*
>> LISTEN      1889/cmds
>> tcp        0      0 0.0.0.0:22              0.0.0.0:*
>> LISTEN      1529/sshd
>> tcp        0      0 192.168.89.133:6789     0.0.0.0:*
>> LISTEN      1840/cmon
>> tcp        0      0 192.168.89.133:6789     192.168.89.131:56855
>> ESTABLISHED 1840/cmon
>> tcp        0      0 192.168.89.133:43647    192.168.89.133:6789
>> ESTABLISHED 1889/cmds
>> tcp        0      0 192.168.89.133:22       192.168.89.1:58304
>> ESTABLISHED 1530/0
>> tcp        0      0 192.168.89.133:39826    192.168.89.134:6800
>> ESTABLISHED 1889/cmds
>> tcp        0      0 192.168.89.133:6789     192.168.89.134:41289
>> ESTABLISHED 1840/cmon
>> tcp        0      0 192.168.89.133:6800     192.168.89.131:52814
>> TIME_WAIT   -
>> tcp        0      0 192.168.89.133:6789     192.168.89.135:41021
>> ESTABLISHED 1840/cmon
>> tcp        0      0 192.168.89.133:42069    192.168.89.135:6800
>> ESTABLISHED 1889/cmds
>> tcp        0      0 192.168.89.133:6789     192.168.89.133:43647
>> ESTABLISHED 1840/cmon
>> tcp        0      0 192.168.89.133:6800     192.168.89.131:52815
>> TIME_WAIT   -
>> tcp        0      0 192.168.89.133:6800     192.168.89.131:52816
>> TIME_WAIT   -
>> tcp6       0      0 :::22                   :::*
>> LISTEN      1529/sshd
>> udp        0      0 0.0.0.0:68              0.0.0.0:*
>>         1490/dhclient3
>> Active UNIX domain sockets (servers and established)
>> Proto RefCnt Flags       Type       State         I-Node   PID/Program
>> name    Path
>> unix  2      [ ]         DGRAM                    2972     546/udevd
>>        @/org/kernel/udev/udevd
>> unix  4      [ ]         DGRAM                    5343
>> 1358/rsyslogd       /dev/log
>> unix  2      [ ]         DGRAM                    5662     1530/0
>> unix  2      [ ]         DGRAM                    5486     1490/dhclient3
>> debian:~#
>> debian:~# dmesg -c
>> debian:~# (nothing shows up)
>>
>> I saw the port 6800 on the metadata server talking with the client is
>> on "TIME_WAIT" stage. That means the connection is closed.
>> This is the ceph.conf I have:
>> [global]
>>       pid file = /var/run/ceph/$type.$id.pid
>> [mon]
>>       mon data = /data/mon$id
>>       mon subscribe interval = 6000
>>       mon osd down out interval = 6000
>> [mon0]
>>       host = mds0
>>       mon addr = 192.168.89.133:6789
>> [mds]
>>       mds session timeout = 6000
>>       mds session autoclose = 6000
>>       mds client lease = 6000
>>       keyring = /data/keyring.$name
>> [mds0]
>>       host = mds0
>> [osd]
>>       sudo = true
>>       osd data = /data/osd$id
>>       osd journal = /journal
>>       osd journal size = 1024
>>       filestore journal writeahead = true
>> [osd0]
>>       host = osd0
>> [osd1]
>>       host = osd1
>> [group everyone]
>>       addr = 0.0.0.0/0
>> [mount]
>>       allow = %everyone
>> ;-----------------------------------end-----------------------------------
>>
>> The Ceph version I was using is 0.22.1.
>>
>> Can anyone help me to solve this problem? Thanks in advance!
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>

-- 
Yonggang Liu
Advanced Computing and Information Systems Laboratory
University of Florida
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html