Hi Wido,

Thanks for your answer and your kind help. I have tried to include all the useful information below, but something may still be missing; let me know if you want me to run more tests.

Please find the output of ceph -s below:

[root@node91 ~]# ceph -s
2011-09-26 22:48:08.048659    pg v297: 792 pgs: 792 active+clean; 24 KB data, 80512 KB used, 339 GB / 340 GB avail
2011-09-26 22:48:08.049742   mds e5: 1/1/1 up {0=alpha=up:active}, 1 up:standby
2011-09-26 22:48:08.049764   osd e5: 4 osds: 4 up, 4 in
2011-09-26 22:48:08.049800   log 2011-09-26 19:38:14.372125 osd3 138.96.126.95:6800/2973 242 : [INF] 2.1p3 scrub ok
2011-09-26 22:48:08.049847   mon e1: 3 mons at {alpha=138.96.126.91:6789/0,beta=138.96.126.92:6789/0,gamma=138.96.126.93:6789/0}

The same command ten minutes later, after cfuse has hung on the client node:

[root@node91 ~]# ceph -s
2011-09-26 23:07:49.403774    pg v335: 792 pgs: 101 active, 276 active+clean, 415 active+clean+degraded; 4806 KB data, 114 MB used, 339 GB / 340 GB avail; 24/56 degraded (42.857%)
2011-09-26 23:07:49.404847   mds e5: 1/1/1 up {0=alpha=up:active}, 1 up:standby
2011-09-26 23:07:49.404867   osd e13: 4 osds: 2 up, 4 in
2011-09-26 23:07:49.404929   log 2011-09-26 23:07:46.093670 mds0 138.96.126.91:6800/4682 2 : [INF] closing stale session client4124 138.96.126.91:0/5563 after 455.778957
2011-09-26 23:07:49.404966   mon e1: 3 mons at {alpha=138.96.126.91:6789/0,beta=138.96.126.92:6789/0,gamma=138.96.126.93:6789/0}

[root@node91 ~]# /etc/init.d/ceph -a status
=== mon.alpha ===
running...
=== mon.beta ===
running...
=== mon.gamma ===
running...
=== mds.alpha ===
running...
=== mds.beta ===
running...
=== osd.0 ===
dead.
=== osd.1 ===
running...
=== osd.2 ===
running...
=== osd.3 ===
dead.

Finally, here are the last lines of the osd.0 log (see also my note at the end of this mail):

2011-09-26 22:57:06.822182 7faf6a6f8700 -- 138.96.126.92:6802/3157 >> 138.96.126.93:6801/3162 pipe(0x7faf50001320 sd=20 pgs=0 cs=0 l=0).accept connect_seq 2 vs existing 1 state 3
2011-09-26 23:07:09.084901 7faf8e1b5700 FileStore: sync_entry timed out after 600 seconds.
ceph version 0.34 (commit:2f039eeeb745622b866d80feda7afa055e15f6d6)
2011-09-26 23:07:09.084934 1: (SafeTimer::timer_thread()+0x323) [0x5c95a3]
2011-09-26 23:07:09.084943 2: (SafeTimerThread::entry()+0xd) [0x5cbc7d]
2011-09-26 23:07:09.084950 3: /lib64/libpthread.so.0() [0x31fec077e1]
2011-09-26 23:07:09.084957 4: (clone()+0x6d) [0x31fe4e18ed]
2011-09-26 23:07:09.084963 *** Caught signal (Aborted) **
 in thread 0x7faf8e1b5700
 ceph version 0.34 (commit:2f039eeeb745622b866d80feda7afa055e15f6d6)
 1: /usr/bin/cosd() [0x649ca9]
 2: /lib64/libpthread.so.0() [0x31fec0f4c0]
 3: (gsignal()+0x35) [0x31fe4329a5]
 4: (abort()+0x175) [0x31fe434185]
 5: (__assert_fail()+0xf5) [0x31fe42b935]
 6: (SyncEntryTimeout::finish(int)+0x130) [0x683400]
 7: (SafeTimer::timer_thread()+0x323) [0x5c95a3]
 8: (SafeTimerThread::entry()+0xd) [0x5cbc7d]
 9: /lib64/libpthread.so.0() [0x31fec077e1]
 10: (clone()+0x6d) [0x31fe4e18ed]

ceph.conf:

[global]
max open files = 131072
log file = /var/log/ceph/$name.log
pid file = /var/run/ceph/$name.pid

[mon]
mon data = /data/$name
mon clock drift allowed = 1

[mon.alpha]
host = node91
mon addr = 138.96.126.91:6789

[mon.beta]
host = node92
mon addr = 138.96.126.92:6789

[mon.gamma]
host = node93
mon addr = 138.96.126.93:6789

[mds]
keyring = /data/keyring.$name

[mds.alpha]
host = node91

[mds.beta]
host = node92

[osd]
osd data = /data/$name
osd journal = /data/$name/journal
osd journal size = 1000

[osd.0]
host = node92

[osd.1]
host = node93

[osd.2]
host = node94

[osd.3]
host = node95

----

Thank you once more for your help.

Regards,
Cédric

On 23 Sep 2011, at 19:20, Wido den Hollander wrote:

> Hi,
>
> Could you send us your ceph.conf and the output of "ceph -s"?
>
> Wido
>
> On Fri, 2011-09-23 at 17:58 +0200, Cedric Morandin wrote:
>> Hi everybody,
>>
>> I didn't find a ceph-users list, so I'm posting here. If this is not the right place, please let me know.
>> I'm currently trying to test Ceph, but I'm probably doing something wrong because I'm seeing really strange behaviour.
>>
>> Context:
>> Ceph compiled and installed on five CentOS 6 machines.
>> A btrfs partition is available on each machine.
>> This partition is mounted under /data/osd.[0-3].
>> Clients use cfuse compiled for FC11 (2.6.29.4-167.fc11.x86_64).
>>
>> What happens:
>> I configured everything in ceph.conf and started the ceph daemons on all nodes.
>> When I issue "ceph health", I get a HEALTH_OK answer.
>> I can access the filesystem through cfuse and create some files on it, but when I try to create files bigger than 2 or 3 MB, the filesystem hangs.
>> When I try to copy an entire directory (the Ceph sources, for instance), I get the same problem.
>> When the system is in this state, the cosd daemon dies on the OSD machines: [INF] osd0 out (down for 304.836218)
>> Even killing it doesn't release the mount point:
>> cosd  9170  root  10uW  REG  8,6      8  2506754  /data/osd.0/fsid
>> cosd  9170  root  11r   DIR  8,6   4096  2506753  /data/osd.0
>> cosd  9170  root  12r   DIR  8,6  24576  2506755  /data/osd.0/current
>> cosd  9170  root  13u   REG  8,6      4  2506757  /data/osd.0/current/commit_op_seq
>>
>> I tried changing some parameters, but it always ends in the same problem:
>> I tried both the 0.34 and 0.35 releases, and both btrfs and ext3 with the user_xattr option.
>> I also tried the cfuse client on one of the CentOS 6 machines.
>>
>> I read everything on http://ceph.newdream.net/wiki but I can't figure out the problem.
>> Does anybody have a clue about the problem's origin?
>>
>> Regards,
>>
>> Cedric Morandin
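P.S. One more detail about the osd.0 crash shown above: the assert fires in SyncEntryTimeout::finish() right after the "sync_entry timed out after 600 seconds" message, so it looks like the filestore sync never completed within the 600-second limit. If that limit corresponds to an OSD option (I believe it is "filestore commit timeout", but I'm not sure of the exact name), I could raise it in the [osd] section as a test, just to see whether the sync is merely very slow rather than stuck, for example:

[osd]
osd data = /data/$name
osd journal = /data/$name/journal
osd journal size = 1000
; assumption on my side: this is the option behind the 600 s sync_entry timeout
filestore commit timeout = 1800

Of course that would only hide the symptom if the underlying filesystem is really hanging.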
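P.P.S. In case it helps with reproducing: the hang I describe in my original message quoted above can be triggered with nothing more than a single large write through cfuse, for example (assuming the cfuse mount point is /mnt/ceph; adjust to yours):

dd if=/dev/zero of=/mnt/ceph/testfile bs=1M count=10

Anything bigger than a few MB is enough to hang the mount here; small files work fine.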