Re: Ceph hangs when accessed

2011/9/27 Cédric Morandin <cedric.morandin@xxxxxxxx>:
> Hi Wido,
>
> Thanks for your answer and your kind help.
> I tried to give you all useful information but maybe something is missing.
> Let me know if you want me to do more tests.
>
> Please find the output of ceph -s below:
> [root@node91 ~]# ceph -s
> 2011-09-26 22:48:08.048659    pg v297: 792 pgs: 792 active+clean; 24 KB data, 80512 KB used, 339 GB / 340 GB avail
> 2011-09-26 22:48:08.049742   mds e5: 1/1/1 up {0=alpha=up:active}, 1 up:standby
> 2011-09-26 22:48:08.049764   osd e5: 4 osds: 4 up, 4 in
> 2011-09-26 22:48:08.049800   log 2011-09-26 19:38:14.372125 osd3 138.96.126.95:6800/2973 242 : [INF] 2.1p3 scrub ok
> 2011-09-26 22:48:08.049847   mon e1: 3 mons at {alpha=138.96.126.91:6789/0,beta=138.96.126.92:6789/0,gamma=138.96.126.93:6789/0}
>
> The same command ten minutes after the cfuse hangs on the client node :
>
> [root@node91 ~]# ceph -s
> 2011-09-26 23:07:49.403774    pg v335: 792 pgs: 101 active, 276 active+clean, 415 active+clean+degraded; 4806 KB data, 114 MB used, 339 GB / 340 GB avail; 24/56 degraded (42.857%)
> 2011-09-26 23:07:49.404847   mds e5: 1/1/1 up {0=alpha=up:active}, 1 up:standby
> 2011-09-26 23:07:49.404867   osd e13: 4 osds: 2 up, 4 in
> 2011-09-26 23:07:49.404929   log 2011-09-26 23:07:46.093670 mds0 138.96.126.91:6800/4682 2 : [INF] closing stale session client4124 138.96.126.91:0/5563 after 455.778957
> 2011-09-26 23:07:49.404966   mon e1: 3 mons at {alpha=138.96.126.91:6789/0,beta=138.96.126.92:6789/0,gamma=138.96.126.93:6789/0}
>
> [root@node91 ~]# /etc/init.d/ceph -a status
> === mon.alpha ===
> running...
> === mon.beta ===
> running...
> === mon.gamma ===
> running...
> === mds.alpha ===
> running...
> === mds.beta ===
> running...
> === osd.0 ===
> dead.
> === osd.1 ===
> running...
> === osd.2 ===
> running...
> === osd.3 ===
> dead.
>
> I finally paste the last lines of osd.0 :
>
> 2011-09-26 22:57:06.822182 7faf6a6f8700 -- 138.96.126.92:6802/3157 >> 138.96.126.93:6801/3162 pipe(0x7faf50001320 sd=20 pgs=0 cs=0 l=0).accept connect_seq 2 vs existing 1 state 3
> 2011-09-26 23:07:09.084901 7faf8e1b5700 FileStore: sync_entry timed out after 600 seconds.
>  ceph version 0.34 (commit:2f039eeeb745622b866d80feda7afa055e15f6d6)
> 2011-09-26 23:07:09.084934 1: (SafeTimer::timer_thread()+0x323) [0x5c95a3]
> 2011-09-26 23:07:09.084943 2: (SafeTimerThread::entry()+0xd) [0x5cbc7d]
> 2011-09-26 23:07:09.084950 3: /lib64/libpthread.so.0() [0x31fec077e1]
> 2011-09-26 23:07:09.084957 4: (clone()+0x6d) [0x31fe4e18ed]
> 2011-09-26 23:07:09.084963 *** Caught signal (Aborted) **
>  in thread 0x7faf8e1b5700
>  ceph version 0.34 (commit:2f039eeeb745622b866d80feda7afa055e15f6d6)
>  1: /usr/bin/cosd() [0x649ca9]
>  2: /lib64/libpthread.so.0() [0x31fec0f4c0]
>  3: (gsignal()+0x35) [0x31fe4329a5]
>  4: (abort()+0x175) [0x31fe434185]
>  5: (__assert_fail()+0xf5) [0x31fe42b935]
>  6: (SyncEntryTimeout::finish(int)+0x130) [0x683400]
>  7: (SafeTimer::timer_thread()+0x323) [0x5c95a3]
>  8: (SafeTimerThread::entry()+0xd) [0x5cbc7d]
>  9: /lib64/libpthread.so.0() [0x31fec077e1]
>  10: (clone()+0x6d) [0x31fe4e18ed]
Maybe the underlying filesystem (btrfs/ext4) is busy or hung, so that the
sync commit took more than 600 seconds; the OSD then assumes the store is
dead and uses ceph_abort to terminate the cosd process.
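
To make that failure mode concrete, here is a minimal sketch of the watchdog pattern (plain C++, not the actual FileStore code; the one-second fake sync and the message text are my own illustration, only the 600-second timeout is taken from your log):

// Illustrative sketch only -- not the real Ceph FileStore code.
// The fake one-second "sync" and the log text are assumptions;
// only the 600-second timeout matches the log above.
#include <chrono>
#include <condition_variable>
#include <cstdlib>
#include <iostream>
#include <mutex>
#include <thread>

static std::mutex m;
static std::condition_variable cv;
static bool sync_done = false;

// Stand-in for the real commit (btrfs snapshot / sync to disk).
static void do_sync() {
    std::this_thread::sleep_for(std::chrono::seconds(1)); // pretend work
    {
        std::lock_guard<std::mutex> lk(m);
        sync_done = true;
    }
    cv.notify_all();
}

int main() {
    const auto timeout = std::chrono::seconds(600); // "timed out after 600 seconds"
    std::thread syncer(do_sync);

    std::unique_lock<std::mutex> lk(m);
    if (!cv.wait_for(lk, timeout, [] { return sync_done; })) {
        // The filesystem never answered: dying loudly is safer than
        // serving a silently wedged OSD, hence the SIGABRT backtrace.
        std::cerr << "FileStore: sync_entry timed out\n";
        std::abort();
    }
    syncer.join();
    return 0;
}

With a responsive disk the wait returns almost immediately; when the underlying btrfs hangs, the predicate never becomes true, the wait times out, and the process dies with SIGABRT, which matches the backtrace you pasted.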
> ceph.conf:
>
> [global]
>        max open files = 131072
>        log file = /var/log/ceph/$name.log
>        pid file = /var/run/ceph/$name.pid
> [mon]
>        mon data = /data/$name
>        mon clock drift allowed = 1
> [mon.alpha]
>        host = node91
>        mon addr = 138.96.126.91:6789
> [mon.beta]
>        host = node92
>        mon addr = 138.96.126.92:6789
> [mon.gamma]
>        host = node93
>        mon addr = 138.96.126.93:6789
> [mds]
>
>        keyring = /data/keyring.$name
> [mds.alpha]
>        host = node91
> [mds.beta]
>        host = node92
> [osd]
>        osd data = /data/$name
>        osd journal = /data/$name/journal
>        osd journal size = 1000
> [osd.0]
>        host = node92
> [osd.1]
>        host = node93
> [osd.2]
>        host = node94
> [osd.3]
>        host = node95
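
If the disks are merely slow rather than truly wedged, you could also try raising the commit timeout in the [osd] section, something like the following (assuming your build exposes the option under this name; 600 seconds is the default that matches the log above):

[osd]
        filestore commit timeout = 1800

But if btrfs itself has hung, a longer timeout only delays the abort; it would be worth checking dmesg on the OSD nodes for btrfs errors first.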
>
> ----
>
> Thank you one more time for your help.
>
> Regards
>
> Cédric
>
> On 23 Sept 2011, at 19:20, Wido den Hollander wrote:
>
>> Hi.
>>
>> Could you send us your ceph.conf and the output of "ceph -s"?
>>
>> Wido
>>
>> On Fri, 2011-09-23 at 17:58 +0200, Cedric Morandin wrote:
>>> Hi everybody,
>>>
>>> I didn't find any ceph-users list, so I'm posting here. If this is not the right place, please let me know.
>>> I'm currently trying to test ceph, but I'm probably doing something wrong because I'm seeing really strange behavior.
>>>
>>> Context:
>>> Ceph compiled and installed on five Centos6 machines.
>>> A BTRFS partition is available on each machine.
>>> This partition is mounted under /data/osd.[0-3]
>>> Clients are using cfuse compiled for FC11 (2.6.29.4-167.fc11.x86_64)
>>>
>>> What happens:
>>> I configured everything in ceph.conf and started the ceph daemons on all nodes.
>>> When I issue ceph health, I get a HEALTH_OK answer.
>>> I can access the filesystem through cfuse and create some files on it, but when I try to create files bigger than 2 or 3 MB, the filesystem hangs.
>>> When I try to copy an entire directory (the ceph sources, for instance), I have the same problem.
>>> When the system is in this state, the cosd daemon dies on the OSD machines: [INF] osd0 out (down for 304.836218)
>>> Even killing it doesn't release the mountpoint:
>>> cosd       9170      root   10uW     REG                8,6          8    2506754 /data/osd.0/fsid
>>> cosd       9170      root   11r      DIR                8,6       4096    2506753 /data/osd.0
>>> cosd       9170      root   12r      DIR                8,6      24576    2506755 /data/osd.0/current
>>> cosd       9170      root   13u      REG                8,6          4    2506757 /data/osd.0/current/commit_op_seq
>>>
>>>
>>> I tried changing some parameters, but it results in the same problem:
>>> I tried both the 0.34 and 0.35 releases, using both btrfs and ext3 with the user_xattr option.
>>> I also tried the cfuse client on one of the CentOS 6 machines.
>>>
>>> I read everything on http://ceph.newdream.net/wiki but I can't figure out the problem.
>>> Does anybody have a clue as to the problem's origin?
>>>
>>> Regards,
>>>
>>> Cedric Morandin
>>>
>>>
>>>
>>
>>
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

