Hi Wido,

Thanks for your answer and your kind help. I have tried to include all the useful information below, but something may still be missing; let me know if you want me to run more tests.

Please find the output of ceph -s below:

[root@node91 ~]# ceph -s
2011-09-26 22:48:08.048659    pg v297: 792 pgs: 792 active+clean; 24 KB data, 80512 KB used, 339 GB / 340 GB avail
2011-09-26 22:48:08.049742   mds e5: 1/1/1 up {0=alpha=up:active}, 1 up:standby
2011-09-26 22:48:08.049764   osd e5: 4 osds: 4 up, 4 in
2011-09-26 22:48:08.049800   log 2011-09-26 19:38:14.372125 osd3 138.96.126.95:6800/2973 242 : [INF] 2.1p3 scrub ok
2011-09-26 22:48:08.049847   mon e1: 3 mons at {alpha=138.96.126.91:6789/0,beta=138.96.126.92:6789/0,gamma=138.96.126.93:6789/0}

The same command ten minutes later, after cfuse has hung on the client node:

[root@node91 ~]# ceph -s
2011-09-26 23:07:49.403774    pg v335: 792 pgs: 101 active, 276 active+clean, 415 active+clean+degraded; 4806 KB data, 114 MB used, 339 GB / 340 GB avail; 24/56 degraded (42.857%)
2011-09-26 23:07:49.404847   mds e5: 1/1/1 up {0=alpha=up:active}, 1 up:standby
2011-09-26 23:07:49.404867   osd e13: 4 osds: 2 up, 4 in
2011-09-26 23:07:49.404929   log 2011-09-26 23:07:46.093670 mds0 138.96.126.91:6800/4682 2 : [INF] closing stale session client4124 138.96.126.91:0/5563 after 455.778957
2011-09-26 23:07:49.404966   mon e1: 3 mons at {alpha=138.96.126.91:6789/0,beta=138.96.126.92:6789/0,gamma=138.96.126.93:6789/0}

[root@node91 ~]# /etc/init.d/ceph -a status
=== mon.alpha ===
running...
=== mon.beta ===
running...
=== mon.gamma ===
running...
=== mds.alpha ===
running...
=== mds.beta ===
running...
=== osd.0 ===
dead.
=== osd.1 ===
running...
=== osd.2 ===
running...
=== osd.3 ===
dead.

Finally, here are the last lines of the osd.0 log (see also my note at the end of this mail):

2011-09-26 22:57:06.822182 7faf6a6f8700 -- 138.96.126.92:6802/3157 >> 138.96.126.93:6801/3162 pipe(0x7faf50001320 sd=20 pgs=0 cs=0 l=0).accept connect_seq 2 vs existing 1 state 3
2011-09-26 23:07:09.084901 7faf8e1b5700 FileStore: sync_entry timed out after 600 seconds.
ceph version 0.34 (commit:2f039eeeb745622b866d80feda7afa055e15f6d6)
2011-09-26 23:07:09.084934 1: (SafeTimer::timer_thread()+0x323) [0x5c95a3]
2011-09-26 23:07:09.084943 2: (SafeTimerThread::entry()+0xd) [0x5cbc7d]
2011-09-26 23:07:09.084950 3: /lib64/libpthread.so.0() [0x31fec077e1]
2011-09-26 23:07:09.084957 4: (clone()+0x6d) [0x31fe4e18ed]
2011-09-26 23:07:09.084963 *** Caught signal (Aborted) **
 in thread 0x7faf8e1b5700
 ceph version 0.34 (commit:2f039eeeb745622b866d80feda7afa055e15f6d6)
 1: /usr/bin/cosd() [0x649ca9]
 2: /lib64/libpthread.so.0() [0x31fec0f4c0]
 3: (gsignal()+0x35) [0x31fe4329a5]
 4: (abort()+0x175) [0x31fe434185]
 5: (__assert_fail()+0xf5) [0x31fe42b935]
 6: (SyncEntryTimeout::finish(int)+0x130) [0x683400]
 7: (SafeTimer::timer_thread()+0x323) [0x5c95a3]
 8: (SafeTimerThread::entry()+0xd) [0x5cbc7d]
 9: /lib64/libpthread.so.0() [0x31fec077e1]
 10: (clone()+0x6d) [0x31fe4e18ed]

ceph.conf:

[global]
max open files = 131072
log file = /var/log/ceph/$name.log
pid file = /var/run/ceph/$name.pid

[mon]
mon data = /data/$name
mon clock drift allowed = 1

[mon.alpha]
host = node91
mon addr = 138.96.126.91:6789

[mon.beta]
host = node92
mon addr = 138.96.126.92:6789

[mon.gamma]
host = node93
mon addr = 138.96.126.93:6789

[mds]
keyring = /data/keyring.$name

[mds.alpha]
host = node91

[mds.beta]
host = node92

[osd]
osd data = /data/$name
osd journal = /data/$name/journal
osd journal size = 1000

[osd.0]
host = node92

[osd.1]
host = node93

[osd.2]
host = node94

[osd.3]
host = node95

----

Thank you once more for your help.

Regards,
Cédric

On 23 Sep 2011, at 19:20, Wido den Hollander wrote:

> Hi,
>
> Could you send us your ceph.conf and the output of "ceph -s"?
>
> Wido
>
> On Fri, 2011-09-23 at 17:58 +0200, Cedric Morandin wrote:
>> Hi everybody,
>>
>> I didn't find a ceph-users list, so I'm posting here. If this is not the right place, please let me know.
>> I'm currently trying to test Ceph, but I'm probably doing something wrong because I'm seeing really strange behaviour.
>>
>> Context:
>> Ceph compiled and installed on five CentOS 6 machines.
>> A btrfs partition is available on each machine.
>> This partition is mounted under /data/osd.[0-3].
>> Clients use cfuse compiled for FC11 (2.6.29.4-167.fc11.x86_64).
>>
>> What happens:
>> I configured everything in ceph.conf and started the ceph daemons on all nodes.
>> When I issue "ceph health", I get a HEALTH_OK answer.
>> I can access the filesystem through cfuse and create some files on it, but when I try to create files bigger than 2 or 3 MB, the filesystem hangs.
>> When I try to copy an entire directory (the Ceph sources, for instance), I get the same problem.
>> When the system is in this state, the cosd daemon dies on the OSD machines: [INF] osd0 out (down for 304.836218)
>> Even killing it doesn't release the mount point:
>> cosd  9170  root  10uW  REG  8,6      8  2506754  /data/osd.0/fsid
>> cosd  9170  root  11r   DIR  8,6   4096  2506753  /data/osd.0
>> cosd  9170  root  12r   DIR  8,6  24576  2506755  /data/osd.0/current
>> cosd  9170  root  13u   REG  8,6      4  2506757  /data/osd.0/current/commit_op_seq
>>
>> I tried changing some parameters, but it always ends in the same problem:
>> I tried both the 0.34 and 0.35 releases, and both btrfs and ext3 with the user_xattr option.
>> I also tried the cfuse client on one of the CentOS 6 machines.
>>
>> I read everything on http://ceph.newdream.net/wiki but I can't figure out the problem.
>> Does anybody have a clue about the problem's origin?
>>
>> Regards,
>>
>> Cedric Morandin
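P.S. One more detail about the osd.0 crash shown above: the assert fires in SyncEntryTimeout::finish() right after the "sync_entry timed out after 600 seconds" message, so it looks like the filestore sync never completed within the 600-second limit. If that limit corresponds to an OSD option (I believe it is "filestore commit timeout", but I'm not sure of the exact name), I could raise it in the [osd] section as a test, just to see whether the sync is merely very slow rather than stuck, for example:

[osd]
osd data = /data/$name
osd journal = /data/$name/journal
osd journal size = 1000
; assumption on my side: this is the option behind the 600 s sync_entry timeout
filestore commit timeout = 1800

Of course that would only hide the symptom if the underlying filesystem is really hanging.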
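P.P.S. In case it helps with reproducing: the hang I describe in my original message quoted above can be triggered with nothing more than a single large write through cfuse, for example (assuming the cfuse mount point is /mnt/ceph; adjust to yours):

dd if=/dev/zero of=/mnt/ceph/testfile bs=1M count=10

Anything bigger than a few MB is enough to hang the mount here; small files work fine.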