Hi,

On Wed, 27 Oct 2010, Yonggang Liu wrote:
> 2010-10-27 15:56:23.332473 b352eb70 failed to decode message of type
> 784 v4865: buffer::end_of_buffer

This is the problem.  Can you tell what 192.168.89.133:6800/1978 is?
(That 1978 is a pid, btw.)  I'm guessing it's probably a 'cfuse'?  (The
ceph daemons all start with 'c', but don't have 'ceph' in the name.)

784 is a client_caps message.  The 'v4865' looks like an uninitialized
variable to me, as it should be a small integer.  Whatever daemon that
is, can you run it with the '-v' argument to see exactly which version
it is?  (I've tacked a rough toy sketch of the kind of version mismatch
I have in mind onto the very bottom of this mail.)

Thanks!
sage

> 2010-10-27 15:56:23.332730 b352eb70 -- 192.168.89.133:6800/1978 >> 192.168.89.131:0/1344076202 pipe(0x9fd0168 sd=-1 pgs=10 cs=8 l=0).fault with nothing to send, going to standby
> 2010-10-27 15:56:52.022573 b332cb70 -- 192.168.89.133:6800/1978 >> 192.168.89.131:0/1344076202 pipe(0x9fd1720 sd=12 pgs=0 cs=0 l=0).accept peer addr is really 192.168.89.131:0/1344076202 (socket is 192.168.89.131:41726/0)
> 2010-10-27 15:56:52.022945 b332cb70 -- 192.168.89.133:6800/1978 >> 192.168.89.131:0/1344076202 pipe(0x9fd1720 sd=12 pgs=0 cs=0 l=0).accept connect_seq 8 vs existing 8 state 3
> 2010-10-27 15:56:52.025299 b332cb70 failed to decode message of type 784 v4865: buffer::end_of_buffer
> 2010-10-27 15:56:52.025569 b332cb70 -- 192.168.89.133:6800/1978 >> 192.168.89.131:0/1344076202 pipe(0x9fd1720 sd=-1 pgs=11 cs=9 l=0).fault with nothing to send, going to standby
> 2010-10-27 15:57:48.106337 b362fb70 -- 192.168.89.133:6800/1978 >> 192.168.89.131:0/1344076202 pipe(0x9fd07c0 sd=12 pgs=0 cs=0 l=0).accept peer addr is really 192.168.89.131:0/1344076202 (socket is 192.168.89.131:41727/0)
> 2010-10-27 15:57:48.106522 b362fb70 -- 192.168.89.133:6800/1978 >> 192.168.89.131:0/1344076202 pipe(0x9fd07c0 sd=12 pgs=0 cs=0 l=0).accept connect_seq 9 vs existing 9 state 3
> 2010-10-27 15:57:48.109498 b362fb70 failed to decode message of type 784 v4865: buffer::end_of_buffer
> 2010-10-27 15:57:48.109761 b362fb70 -- 192.168.89.133:6800/1978 >> 192.168.89.131:0/1344076202 pipe(0x9fd07c0 sd=-1 pgs=12 cs=10 l=0).fault with nothing to send, going to standby
>
> On osd0
> osd.0.log:
> 2010-10-27 15:55:01.470502 --- 1633 opened log /var/log/ceph/osd.0.log ---
> ceph version 0.22.1 (commit:7464f9688001aa89f9673ba14e6d075d0ee33541)
> 2010-10-27 15:55:01.485828 b72d38e0 filestore(/data/osd0) mkfs in /data/osd0
> 2010-10-27 15:55:01.486106 b72d38e0 filestore(/data/osd0) mkfs removing old file fsid
> 2010-10-27 15:55:01.516527 b72d38e0 filestore(/data/osd0) mkjournal created journal on /journal
> 2010-10-27 15:55:01.516734 b72d38e0 filestore(/data/osd0) mkfs done in /data/osd0
> 2010-10-27 15:55:01.519606 b72d38e0 filestore(/data/osd0) mount did NOT detect btrfs
> 2010-10-27 15:55:01.519794 b72d38e0 filestore(/data/osd0) mount found snaps <>
> 2010-10-27 15:55:01.548307 b5acfb70 FileStore::op_tp worker finish
> 2010-10-27 15:55:01.548455 b52ceb70 FileStore::op_tp worker finish
> 2010-10-27 15:55:01.548812 b72d38e0 journal close /journal
> 2010-10-27 15:55:25.201288 --- 1656 opened log /var/log/ceph/osd.0.log ---
> ceph version 0.22.1 (commit:7464f9688001aa89f9673ba14e6d075d0ee33541)
> 2010-10-27 15:55:25.223381 b74508e0 filestore(/data/osd0) mount did NOT detect btrfs
> 2010-10-27 15:55:25.224253 b74508e0 filestore(/data/osd0) mount found snaps <>
> 2010-10-27 15:55:25.225396 b74508e0 journal read_entry 4096 : seq 1 203 bytes
> 2010-10-27 15:55:26.841749 abffeb70 -- 0.0.0.0:6801/1656 >> 192.168.89.135:6801/1675 pipe(0x94ced18 sd=13 pgs=0 cs=0 l=0).connect claims to be 0.0.0.0:6801/1675 not 192.168.89.135:6801/1675 - presumably this is the same node!
> 2010-10-27 15:56:04.219613 abaf9b70 -- 192.168.89.134:6800/1656 >> 192.168.89.131:0/1344076202 pipe(0x96463e0 sd=16 pgs=0 cs=0 l=0).accept peer addr is really 192.168.89.131:0/1344076202 (socket is 192.168.89.131:36746/0)
>
> On osd1
> osd.1.log:
> 2010-10-27 15:54:59.752615 --- 1652 opened log /var/log/ceph/osd.1.log ---
> ceph version 0.22.1 (commit:7464f9688001aa89f9673ba14e6d075d0ee33541)
> 2010-10-27 15:54:59.766128 b73518e0 filestore(/data/osd1) mkfs in /data/osd1
> 2010-10-27 15:54:59.766658 b73518e0 filestore(/data/osd1) mkfs removing old file fsid
> 2010-10-27 15:54:59.796938 b73518e0 filestore(/data/osd1) mkjournal created journal on /journal
> 2010-10-27 15:54:59.797816 b73518e0 filestore(/data/osd1) mkfs done in /data/osd1
> 2010-10-27 15:54:59.800957 b73518e0 filestore(/data/osd1) mount did NOT detect btrfs
> 2010-10-27 15:54:59.801087 b73518e0 filestore(/data/osd1) mount found snaps <>
> 2010-10-27 15:54:59.832202 b534cb70 FileStore::op_tp worker finish
> 2010-10-27 15:54:59.832504 b5b4db70 FileStore::op_tp worker finish
> 2010-10-27 15:54:59.832723 b73518e0 journal close /journal
> 2010-10-27 15:55:23.050042 --- 1675 opened log /var/log/ceph/osd.1.log ---
> ceph version 0.22.1 (commit:7464f9688001aa89f9673ba14e6d075d0ee33541)
> 2010-10-27 15:55:23.056671 b72e18e0 filestore(/data/osd1) mount did NOT detect btrfs
> 2010-10-27 15:55:23.056921 b72e18e0 filestore(/data/osd1) mount found snaps <>
> 2010-10-27 15:55:23.057368 b72e18e0 journal read_entry 4096 : seq 1 203 bytes
> 2010-10-27 15:55:23.207216 b12d4b70 osd1 2 map says i am down or have a different address. switching to boot state.
> 2010-10-27 15:55:23.207540 b12d4b70 log [WRN] : map e2 wrongly marked me down
>
> I noticed that on osd1 the last two lines of osd.1.log are unusual, but
> I'm not sure if they are the reason for the problem ...
>
> Thank you very much,
>
>
> On Wed, Oct 27, 2010 at 1:10 PM, Colin McCabe <cmccabe@xxxxxxxxxxxxxx> wrote:
> > Hi Yonggang,
> >
> > Are all of the daemons still running?  What is at the end of the logfiles?
> >
> > regards,
> > Colin
> >
> >
> > On Wed, Oct 27, 2010 at 9:42 AM, Yonggang Liu <myidpt@xxxxxxxxx> wrote:
> >> Hello,
> >>
> >> I'm totally new to Ceph.  Over the last few days I set up 4 VMs to run
> >> Ceph: "mds0" for the metadata server and monitor, "osd0" and "osd1" for
> >> two data servers, and "client" for the client machine.  The VMs are
> >> running Debian 5.0 with kernel 2.6.32-5-686 (the Ceph module is enabled).
> >> I followed "Building kernel client" and "Debian" from the wiki, and I
> >> was able to start Ceph and mount it at the client.  But the problem is,
> >> the mount point always fails with an infinite response time (about
> >> 1 min or less after I mount Ceph).  To illustrate it better, I will
> >> show you the information I got on the client and mds0 machines:
> >>
> >> mds0 (192.168.89.133):
> >> debian:~# mkcephfs -c /etc/ceph/ceph.conf --allhosts -v
> >> (A lot of info)
> >> debian:~# /etc/init.d/ceph -a start
> >> (some info)
> >>
> >> client (192.168.89.131):
> >> debian:~# mount -t ceph 192.168.89.133:/ /ceph
> >> debian:~# cd /ceph
> >> debian:/ceph# cp ~/app_ch.xls .
> >> debian:/ceph# ls
> >> (waiting forever)
> >> ^C
> >>
> >> After the failure I ran dmesg at the client side and got:
> >> client (192.168.89.131):
> >> debian:/ceph# dmesg -c
> >> [ 636.664425] ceph: loaded (mon/mds/osd proto 15/32/24, osdmap 5/5 5/5)
> >> [ 636.694973] ceph: client4100 fsid 423ad64c-bbf0-3011-bb47-36a89f8787c6
> >> [ 636.700716] ceph: mon0 192.168.89.133:6789 session established
> >> [ 664.114551] ceph: mds0 192.168.89.133:6800 socket closed
> >> [ 664.848722] ceph: mds0 192.168.89.133:6800 socket closed
> >> [ 665.914923] ceph: mds0 192.168.89.133:6800 socket closed
> >> [ 667.840396] ceph: mds0 192.168.89.133:6800 socket closed
> >> [ 672.054106] ceph: mds0 192.168.89.133:6800 socket closed
> >> [ 680.894531] ceph: mds0 192.168.89.133:6800 socket closed
> >> [ 696.928496] ceph: mds0 192.168.89.133:6800 socket closed
> >> [ 720.171754] ceph: mds0 caps stale
> >> [ 728.999701] ceph: mds0 192.168.89.133:6800 socket closed
> >> [ 794.640943] ceph: mds0 192.168.89.133:6800 socket closed
> >>
> >> Immediately after the failure, I ran netstat at mds0:
> >> mds0 (192.168.89.133):
> >> debian:~# netstat -anp
> >> Active Internet connections (servers and established)
> >> Proto Recv-Q Send-Q Local Address          Foreign Address        State        PID/Program name
> >> tcp        0      0 0.0.0.0:6800           0.0.0.0:*              LISTEN       1889/cmds
> >> tcp        0      0 0.0.0.0:22             0.0.0.0:*              LISTEN       1529/sshd
> >> tcp        0      0 192.168.89.133:6789    0.0.0.0:*              LISTEN       1840/cmon
> >> tcp        0      0 192.168.89.133:6789    192.168.89.131:56855   ESTABLISHED  1840/cmon
> >> tcp        0      0 192.168.89.133:43647   192.168.89.133:6789    ESTABLISHED  1889/cmds
> >> tcp        0      0 192.168.89.133:22      192.168.89.1:58304     ESTABLISHED  1530/0
> >> tcp        0      0 192.168.89.133:39826   192.168.89.134:6800    ESTABLISHED  1889/cmds
> >> tcp        0      0 192.168.89.133:6789    192.168.89.134:41289   ESTABLISHED  1840/cmon
> >> tcp        0      0 192.168.89.133:6800    192.168.89.131:52814   TIME_WAIT    -
> >> tcp        0      0 192.168.89.133:6789    192.168.89.135:41021   ESTABLISHED  1840/cmon
> >> tcp        0      0 192.168.89.133:42069   192.168.89.135:6800    ESTABLISHED  1889/cmds
> >> tcp        0      0 192.168.89.133:6789    192.168.89.133:43647   ESTABLISHED  1840/cmon
> >> tcp        0      0 192.168.89.133:6800    192.168.89.131:52815   TIME_WAIT    -
> >> tcp        0      0 192.168.89.133:6800    192.168.89.131:52816   TIME_WAIT    -
> >> tcp6       0      0 :::22                  :::*                   LISTEN       1529/sshd
> >> udp        0      0 0.0.0.0:68             0.0.0.0:*                           1490/dhclient3
> >> Active UNIX domain sockets (servers and established)
> >> Proto RefCnt Flags Type  State  I-Node PID/Program name  Path
> >> unix  2      [ ]   DGRAM        2972   546/udevd         @/org/kernel/udev/udevd
> >> unix  4      [ ]   DGRAM        5343   1358/rsyslogd     /dev/log
> >> unix  2      [ ]   DGRAM        5662   1530/0
> >> unix  2      [ ]   DGRAM        5486   1490/dhclient3
> >> debian:~#
> >> debian:~# dmesg -c
> >> debian:~# (nothing shows up)
> >>
> >> I saw that port 6800 on the metadata server, which talks to the client,
> >> is in the "TIME_WAIT" state.  That means the connection was closed.
> >> This is the ceph.conf I have:
> >> [global]
> >>     pid file = /var/run/ceph/$type.$id.pid
> >> [mon]
> >>     mon data = /data/mon$id
> >>     mon subscribe interval = 6000
> >>     mon osd down out interval = 6000
> >> [mon0]
> >>     host = mds0
> >>     mon addr = 192.168.89.133:6789
> >> [mds]
> >>     mds session timeout = 6000
> >>     mds session autoclose = 6000
> >>     mds client lease = 6000
> >>     keyring = /data/keyring.$name
> >> [mds0]
> >>     host = mds0
> >> [osd]
> >>     sudo = true
> >>     osd data = /data/osd$id
> >>     osd journal = /journal
> >>     osd journal size = 1024
> >>     filestore journal writeahead = true
> >> [osd0]
> >>     host = osd0
> >> [osd1]
> >>     host = osd1
> >> [group everyone]
> >>     addr = 0.0.0.0/0
> >> [mount]
> >>     allow = %everyone
> >> ;-----------------------------------end-----------------------------------
> >>
> >> The Ceph version I was using is 0.22.1.
> >>
> >> Can anyone help me solve this problem?  Thanks in advance!
> >>
>
> --
> Yonggang Liu
> Advanced Computing and Information Systems Laboratory
> University of Florida
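P.S. In case a concrete picture of the failure mode helps, here is a toy
sketch of the kind of version/format mismatch I have in mind.  To be
clear, this is not the real Ceph buffer or message code -- the Buffer
class and the encode_old/decode_new names are made up for the example.
The point is just that when the two ends of a connection are built from
different releases and disagree about the wire format, the receiver can
pull a nonsense "version" out of the payload, trust it, and then read
past the end of the data it was actually sent, which surfaces as a
buffer::end_of_buffer-style decode failure like the one in your log.

#include <cstdint>
#include <cstring>
#include <iostream>
#include <stdexcept>
#include <vector>

// Stand-in for the exception the log line refers to.
struct end_of_buffer : std::runtime_error {
  end_of_buffer() : std::runtime_error("buffer::end_of_buffer") {}
};

// Minimal append/consume byte buffer, loosely modeling a wire payload.
class Buffer {
  std::vector<uint8_t> data_;
  size_t off_ = 0;
public:
  void put_u32(uint32_t v) {
    const uint8_t* p = reinterpret_cast<const uint8_t*>(&v);
    data_.insert(data_.end(), p, p + sizeof(v));
  }
  uint32_t get_u32() {
    if (off_ + sizeof(uint32_t) > data_.size())
      throw end_of_buffer();          // ran past what the peer actually sent
    uint32_t v;
    std::memcpy(&v, data_.data() + off_, sizeof(v));
    off_ += sizeof(v);
    return v;
  }
};

// What an *older* sender might put on the wire: two fields, no version word.
void encode_old(Buffer& b, uint32_t ino, uint32_t caps) {
  b.put_u32(ino);
  b.put_u32(caps);
}

// What a *newer* receiver expects: a leading version word, then the fields.
void decode_new(Buffer& b) {
  uint32_t version = b.get_u32();   // actually the sender's 'ino' field
  uint32_t ino     = b.get_u32();   // actually the sender's 'caps' field
  uint32_t caps    = b.get_u32();   // nothing left on the wire -> throws
  std::cout << "v" << version << " ino " << ino << " caps " << caps << "\n";
}

int main() {
  Buffer wire;
  encode_old(wire, 4865, 1);        // 4865 is just an arbitrary value here
  try {
    decode_new(wire);
  } catch (const end_of_buffer& e) {
    // Same shape as the log line in question.
    std::cerr << "failed to decode message: " << e.what() << "\n";
  }
  return 0;
}

Running that prints "failed to decode message: buffer::end_of_buffer",
which is why knowing exactly which binary (and which version) is sitting
at 192.168.89.133:6800/1978 would help.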