Hi Colin,

Thank you for your reply.

I ran "ps uax | grep ceph" on the 4 machines to check if the daemons are running. I got:

On client:
root 1721 0.0 0.0 0 0 ? S 17:53 0:00 [ceph-msgr/0]
root 1725 0.0 0.0 0 0 ? S 17:53 0:00 [ceph-writeback/]
root 1726 0.0 0.0 0 0 ? S 17:53 0:00 [ceph-pg-invalid]
root 1727 0.0 0.0 0 0 ? S 17:53 0:00 [ceph-trunc]

On mds0:
root 1930 0.5 1.2 48468 3280 ? S<sl 15:55 0:01 /usr/bin/cmon -i 0 -c /tmp/ceph.conf.1869
root 1979 4.7 1.4 76340 3740 ? S<sl 15:55 0:13 /usr/bin/cmds -i 0 -c /tmp/ceph.conf.1869

On osd0:
root 1657 0.4 7.4 214316 18988 ? S<sl 15:55 0:01 /usr/bin/cosd -i 0 -c /tmp/ceph.conf.1869

On osd1:
root 1676 0.7 10.6 220624 27180 ? S<sl 15:55 0:02 /usr/bin/cosd -i 1 -c /tmp/ceph.conf.1869

These are the last few lines of the log files I got from /var/log/ceph/:

On mds0 mds.0.log:
2010-10-27 15:56:09.238267 b362fb70 failed to decode message of type 784 v4865: buffer::end_of_buffer
2010-10-27 15:56:09.238960 b362fb70 -- 192.168.89.133:6800/1978 >> 192.168.89.131:0/1344076202 pipe(0x9fceb20 sd=-1 pgs=9 cs=7 l=0).fault with nothing to send, going to standby
2010-10-27 15:56:23.330586 b352eb70 -- 192.168.89.133:6800/1978 >> 192.168.89.131:0/1344076202 pipe(0x9fd0168 sd=12 pgs=0 cs=0 l=0).accept peer addr is really 192.168.89.131:0/1344076202 (socket is 192.168.89.131:41725/0)
2010-10-27 15:56:23.330744 b352eb70 -- 192.168.89.133:6800/1978 >> 192.168.89.131:0/1344076202 pipe(0x9fd0168 sd=12 pgs=0 cs=0 l=0).accept connect_seq 7 vs existing 7 state 3
2010-10-27 15:56:23.332473 b352eb70 failed to decode message of type 784 v4865: buffer::end_of_buffer
2010-10-27 15:56:23.332730 b352eb70 -- 192.168.89.133:6800/1978 >> 192.168.89.131:0/1344076202 pipe(0x9fd0168 sd=-1 pgs=10 cs=8 l=0).fault with nothing to send, going to standby
2010-10-27 15:56:52.022573 b332cb70 -- 192.168.89.133:6800/1978 >> 192.168.89.131:0/1344076202 pipe(0x9fd1720 sd=12 pgs=0 cs=0 l=0).accept peer addr is really 192.168.89.131:0/1344076202 (socket is 192.168.89.131:41726/0)
2010-10-27 15:56:52.022945 b332cb70 -- 192.168.89.133:6800/1978 >> 192.168.89.131:0/1344076202 pipe(0x9fd1720 sd=12 pgs=0 cs=0 l=0).accept connect_seq 8 vs existing 8 state 3
2010-10-27 15:56:52.025299 b332cb70 failed to decode message of type 784 v4865: buffer::end_of_buffer
2010-10-27 15:56:52.025569 b332cb70 -- 192.168.89.133:6800/1978 >> 192.168.89.131:0/1344076202 pipe(0x9fd1720 sd=-1 pgs=11 cs=9 l=0).fault with nothing to send, going to standby
2010-10-27 15:57:48.106337 b362fb70 -- 192.168.89.133:6800/1978 >> 192.168.89.131:0/1344076202 pipe(0x9fd07c0 sd=12 pgs=0 cs=0 l=0).accept peer addr is really 192.168.89.131:0/1344076202 (socket is 192.168.89.131:41727/0)
2010-10-27 15:57:48.106522 b362fb70 -- 192.168.89.133:6800/1978 >> 192.168.89.131:0/1344076202 pipe(0x9fd07c0 sd=12 pgs=0 cs=0 l=0).accept connect_seq 9 vs existing 9 state 3
2010-10-27 15:57:48.109498 b362fb70 failed to decode message of type 784 v4865: buffer::end_of_buffer
2010-10-27 15:57:48.109761 b362fb70 -- 192.168.89.133:6800/1978 >> 192.168.89.131:0/1344076202 pipe(0x9fd07c0 sd=-1 pgs=12 cs=10 l=0).fault with nothing to send, going to standby
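(A side note on the mds.0.log excerpt above: the repeated "failed to decode message of type 784" lines made me wonder whether I should double-check that the kernel client and the 0.22.1 userspace daemons match each other. If that is useful, this is what I plan to run to collect the version information. I am assuming here that the client code was built as a loadable module, so modinfo can see it, and that the daemons were installed from the Debian packages; please tell me if there is a better way to gather this.)

On client:
debian:~# uname -r
debian:~# modinfo ceph | head
debian:~# dmesg | grep 'ceph: loaded'

On mds0, osd0 and osd1:
debian:~# dpkg -l | grep ceph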
On osd0 osd.0.log:
2010-10-27 15:55:01.470502 --- 1633 opened log /var/log/ceph/osd.0.log ---
ceph version 0.22.1 (commit:7464f9688001aa89f9673ba14e6d075d0ee33541)
2010-10-27 15:55:01.485828 b72d38e0 filestore(/data/osd0) mkfs in /data/osd0
2010-10-27 15:55:01.486106 b72d38e0 filestore(/data/osd0) mkfs removing old file fsid
2010-10-27 15:55:01.516527 b72d38e0 filestore(/data/osd0) mkjournal created journal on /journal
2010-10-27 15:55:01.516734 b72d38e0 filestore(/data/osd0) mkfs done in /data/osd0
2010-10-27 15:55:01.519606 b72d38e0 filestore(/data/osd0) mount did NOT detect btrfs
2010-10-27 15:55:01.519794 b72d38e0 filestore(/data/osd0) mount found snaps <>
2010-10-27 15:55:01.548307 b5acfb70 FileStore::op_tp worker finish
2010-10-27 15:55:01.548455 b52ceb70 FileStore::op_tp worker finish
2010-10-27 15:55:01.548812 b72d38e0 journal close /journal
2010-10-27 15:55:25.201288 --- 1656 opened log /var/log/ceph/osd.0.log ---
ceph version 0.22.1 (commit:7464f9688001aa89f9673ba14e6d075d0ee33541)
2010-10-27 15:55:25.223381 b74508e0 filestore(/data/osd0) mount did NOT detect btrfs
2010-10-27 15:55:25.224253 b74508e0 filestore(/data/osd0) mount found snaps <>
2010-10-27 15:55:25.225396 b74508e0 journal read_entry 4096 : seq 1 203 bytes
2010-10-27 15:55:26.841749 abffeb70 -- 0.0.0.0:6801/1656 >> 192.168.89.135:6801/1675 pipe(0x94ced18 sd=13 pgs=0 cs=0 l=0).connect claims to be 0.0.0.0:6801/1675 not 192.168.89.135:6801/1675 - presumably this is the same node!
2010-10-27 15:56:04.219613 abaf9b70 -- 192.168.89.134:6800/1656 >> 192.168.89.131:0/1344076202 pipe(0x96463e0 sd=16 pgs=0 cs=0 l=0).accept peer addr is really 192.168.89.131:0/1344076202 (socket is 192.168.89.131:36746/0)

On osd1 osd.1.log:
2010-10-27 15:54:59.752615 --- 1652 opened log /var/log/ceph/osd.1.log ---
ceph version 0.22.1 (commit:7464f9688001aa89f9673ba14e6d075d0ee33541)
2010-10-27 15:54:59.766128 b73518e0 filestore(/data/osd1) mkfs in /data/osd1
2010-10-27 15:54:59.766658 b73518e0 filestore(/data/osd1) mkfs removing old file fsid
2010-10-27 15:54:59.796938 b73518e0 filestore(/data/osd1) mkjournal created journal on /journal
2010-10-27 15:54:59.797816 b73518e0 filestore(/data/osd1) mkfs done in /data/osd1
2010-10-27 15:54:59.800957 b73518e0 filestore(/data/osd1) mount did NOT detect btrfs
2010-10-27 15:54:59.801087 b73518e0 filestore(/data/osd1) mount found snaps <>
2010-10-27 15:54:59.832202 b534cb70 FileStore::op_tp worker finish
2010-10-27 15:54:59.832504 b5b4db70 FileStore::op_tp worker finish
2010-10-27 15:54:59.832723 b73518e0 journal close /journal
2010-10-27 15:55:23.050042 --- 1675 opened log /var/log/ceph/osd.1.log ---
ceph version 0.22.1 (commit:7464f9688001aa89f9673ba14e6d075d0ee33541)
2010-10-27 15:55:23.056671 b72e18e0 filestore(/data/osd1) mount did NOT detect btrfs
2010-10-27 15:55:23.056921 b72e18e0 filestore(/data/osd1) mount found snaps <>
2010-10-27 15:55:23.057368 b72e18e0 journal read_entry 4096 : seq 1 203 bytes
2010-10-27 15:55:23.207216 b12d4b70 osd1 2 map says i am down or have a different address. switching to boot state.
2010-10-27 15:55:23.207540 b12d4b70 log [WRN] : map e2 wrongly marked me down

I noticed that on osd1 the last two lines of osd.1.log look unusual, but I'm not sure whether they are the reason for the problem ...
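If it would help, I can also capture the monitor's view of the cluster right after the mount hangs. I believe the ceph command-line tool that ships with 0.22.1 supports the following (run on mds0, where the monitor lives); please correct me if the invocation is different on this version:

debian:~# ceph -s
debian:~# ceph health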
>> I followed "Building kernel client" and "Debian" from the wiki, and I >> was able to start Ceph and mount Ceph at the client. But the problem >> is, the mounted point always fail with an infinite response time >> (after I mount Ceph for about 1 min or less). To illustrate it better, >> I will show you the information I got on the client and mds0 machines: >> >> mds0 (192.168.89.133): >> debian:~# mkcephfs -c /etc/ceph/ceph.conf --allhosts -v >> (A lot of info) >> debian:~# /etc/init.d/ceph -a start >> (some info) >> >> client (192.168.89.131): >> debian:~# mount -t ceph 192.168.89.133:/ /ceph >> debian:~# cd /ceph >> debian:/ceph# cp ~/app_ch.xls . >> debian:/ceph# ls >> (waiting for ever) >> ^C >> >> After the failure I ran dmesg at the client side and got: >> client (192.168.89.131): >> debian:/ceph# dmesg -c >> [ 636.664425] ceph: loaded (mon/mds/osd proto 15/32/24, osdmap 5/5 5/5) >> [ 636.694973] ceph: client4100 fsid 423ad64c-bbf0-3011-bb47-36a89f8787c6 >> [ 636.700716] ceph: mon0 192.168.89.133:6789 session established >> [ 664.114551] ceph: mds0 192.168.89.133:6800 socket closed >> [ 664.848722] ceph: mds0 192.168.89.133:6800 socket closed >> [ 665.914923] ceph: mds0 192.168.89.133:6800 socket closed >> [ 667.840396] ceph: mds0 192.168.89.133:6800 socket closed >> [ 672.054106] ceph: mds0 192.168.89.133:6800 socket closed >> [ 680.894531] ceph: mds0 192.168.89.133:6800 socket closed >> [ 696.928496] ceph: mds0 192.168.89.133:6800 socket closed >> [ 720.171754] ceph: mds0 caps stale >> [ 728.999701] ceph: mds0 192.168.89.133:6800 socket closed >> [ 794.640943] ceph: mds0 192.168.89.133:6800 socket closed >> >> Immediately after the failure, I ran netstat at mds0: >> mds0 (192.168.89.133): >> debian:~# netstat -anp >> Active Internet connections (servers and established) >> Proto Recv-Q Send-Q Local Address Foreign Address >> State PID/Program name >> tcp 0 0 0.0.0.0:6800 0.0.0.0:* >> LISTEN 1889/cmds >> tcp 0 0 0.0.0.0:22 0.0.0.0:* >> LISTEN 1529/sshd >> tcp 0 0 192.168.89.133:6789 0.0.0.0:* >> LISTEN 1840/cmon >> tcp 0 0 192.168.89.133:6789 192.168.89.131:56855 >> ESTABLISHED 1840/cmon >> tcp 0 0 192.168.89.133:43647 192.168.89.133:6789 >> ESTABLISHED 1889/cmds >> tcp 0 0 192.168.89.133:22 192.168.89.1:58304 >> ESTABLISHED 1530/0 >> tcp 0 0 192.168.89.133:39826 192.168.89.134:6800 >> ESTABLISHED 1889/cmds >> tcp 0 0 192.168.89.133:6789 192.168.89.134:41289 >> ESTABLISHED 1840/cmon >> tcp 0 0 192.168.89.133:6800 192.168.89.131:52814 >> TIME_WAIT - >> tcp 0 0 192.168.89.133:6789 192.168.89.135:41021 >> ESTABLISHED 1840/cmon >> tcp 0 0 192.168.89.133:42069 192.168.89.135:6800 >> ESTABLISHED 1889/cmds >> tcp 0 0 192.168.89.133:6789 192.168.89.133:43647 >> ESTABLISHED 1840/cmon >> tcp 0 0 192.168.89.133:6800 192.168.89.131:52815 >> TIME_WAIT - >> tcp 0 0 192.168.89.133:6800 192.168.89.131:52816 >> TIME_WAIT - >> tcp6 0 0 :::22 :::* >> LISTEN 1529/sshd >> udp 0 0 0.0.0.0:68 0.0.0.0:* >> 1490/dhclient3 >> Active UNIX domain sockets (servers and established) >> Proto RefCnt Flags Type State I-Node PID/Program >> name Path >> unix 2 [ ] DGRAM 2972 546/udevd >> @/org/kernel/udev/udevd >> unix 4 [ ] DGRAM 5343 >> 1358/rsyslogd /dev/log >> unix 2 [ ] DGRAM 5662 1530/0 >> unix 2 [ ] DGRAM 5486 1490/dhclient3 >> debian:~# >> debian:~# dmesg -c >> debian:~# (nothing shows up) >> >> I saw the port 6800 on the metadata server talking with the client is >> on "TIME_WAIT" stage. That means the connection is closed. 
>> This is the ceph.conf I have:
>> [global]
>>         pid file = /var/run/ceph/$type.$id.pid
>> [mon]
>>         mon data = /data/mon$id
>>         mon subscribe interval = 6000
>>         mon osd down out interval = 6000
>> [mon0]
>>         host = mds0
>>         mon addr = 192.168.89.133:6789
>> [mds]
>>         mds session timeout = 6000
>>         mds session autoclose = 6000
>>         mds client lease = 6000
>>         keyring = /data/keyring.$name
>> [mds0]
>>         host = mds0
>> [osd]
>>         sudo = true
>>         osd data = /data/osd$id
>>         osd journal = /journal
>>         osd journal size = 1024
>>         filestore journal writeahead = true
>> [osd0]
>>         host = osd0
>> [osd1]
>>         host = osd1
>> [group everyone]
>>         addr = 0.0.0.0/0
>> [mount]
>>         allow = %everyone
>> ;-----------------------------------end-----------------------------------
>>
>> The Ceph version I was using is 0.22.1.
>>
>> Can anyone help me solve this problem? Thanks in advance!
>

--
Yonggang Liu
Advanced Computing and Information Systems Laboratory
University of Florida