Hi

> ----- "David Saez Padros" <david at ols.es> wrote:
>> in the problem we have the server also hang to the point that there
>> where no way to access it and we end rebooting the server to gain
>> acces to it
>
> Do you mean you were unable to login to the machine over the network?
> unable to have a responsive console shell? machine would not respond
> to ICMP on the network? Do you still have the logfiles and volfiles
> and can you describe the steps to reproduce in a bug report?

> As a thumb rule, if your server hangs to the degree of not even having
> a usable shell, it just means that heavy IO via glusterfs triggered
> some bug in the operating system. try to get kernel output via dmesg
> or console logs if you have any. glusterfsd only issues system calls
> and does not do anything funky with the server. Think of some
> application local to the server causing such a hung. glusterfsd is no
> different in that respect.

We had the same problem again with the following setup: server1 and
server2 are Dell PE 2900 machines running Debian (kernel 2.6.26-2 x64)
and gluster 2.0.1-1. Each server has 6 SAS disks unified and exported
with gluster, which are mounted as a replicated gluster fs on all
clients. The problem happened when server2 was under heavy load (8 wrf
processes running) and one client was copying a large number of files
to the replicated gluster fs. We have been running the same setup on
the same computers, but using nfs instead of glusterfs, for almost 2
years without having this problem.

When the problem happened server2 was totally locked: it responds to
pings and accepts ssh connections, but once the username and password
are correctly entered nothing happens, and after some seconds ssh is
disconnected. Direct terminal access (keyboard) is also impossible.
The kernel log shows a "BUG: soft lockup - CPU#1 stuck for ..." for
each core (all for wrf processes), with a trace at different points of
wrf.exe. Every time we had this problem it was triggered by copying a
lot of files to the gluster fs.

It may or may not be related to glusterfs, but the worst thing is that
when this happens, access to the replicated gluster fs from any client
also hangs. In that case df hangs on one of the glusterfs shares (df
cannot be terminated in any way, including ctrl-c or kill -KILL).
Although df shows the other share, any ls operation on that share
hangs, and ls also cannot be terminated in any way, including ctrl-c
or kill -KILL. The glusterfs log only shows lines like these:

[2009-08-28 09:19:28] E [client-protocol.c:292:call_bail] data2: bailing out frame LOOKUP(32) frame sent = 2009-08-28 08:49:18. frame-timeout = 1800
[2009-08-28 09:23:38] E [client-protocol.c:292:call_bail] data2: bailing out frame LOOKUP(32) frame sent = 2009-08-28 08:53:28. frame-timeout = 1800

Once server2 has been rebooted, all gluster fs become available again
on all clients and the hung df and ls processes terminate, but it is
difficult to understand why a replicated share that must survive the
failure of one server does not.

-- 
Best regards ...

----------------------------------------------------------------
   David Saez Padros                http://www.ols.es
   On-Line Services 2000 S.L.       telf    +34 902 50 29 75
----------------------------------------------------------------
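
P.S. In case it helps anyone reading similar logs, here is a minimal
sketch of how the call_bail lines quoted above can be compared against
their frame-timeout. The regular expression and the helper name
parse_call_bail are my own assumptions, inferred only from the two log
lines in this mail; this is not a documented glusterfs log format.

```python
import re
from datetime import datetime

# Pattern inferred from the two log lines quoted in this mail
# (an assumption, not an official description of the log format).
CALL_BAIL_RE = re.compile(
    r"\[(?P<logged>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] E "
    r"\[client-protocol\.c:\d+:call_bail\] (?P<volume>\S+): "
    r"bailing out frame (?P<op>\w+)\(\d+\) "
    r"frame sent = (?P<sent>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\. "
    r"frame-timeout = (?P<timeout>\d+)"
)
TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_call_bail(line):
    """Return details of one call_bail entry, or None if the line differs."""
    match = CALL_BAIL_RE.search(line)
    if match is None:
        return None
    logged = datetime.strptime(match.group("logged"), TS_FORMAT)
    sent = datetime.strptime(match.group("sent"), TS_FORMAT)
    return {
        "volume": match.group("volume"),
        "op": match.group("op"),
        # Seconds between sending the frame and giving up on it.
        "waited": int((logged - sent).total_seconds()),
        "timeout": int(match.group("timeout")),
    }


# The two entries from the client log, copied verbatim.
LOG_LINES = [
    "[2009-08-28 09:19:28] E [client-protocol.c:292:call_bail] data2: "
    "bailing out frame LOOKUP(32) frame sent = 2009-08-28 08:49:18. "
    "frame-timeout = 1800",
    "[2009-08-28 09:23:38] E [client-protocol.c:292:call_bail] data2: "
    "bailing out frame LOOKUP(32) frame sent = 2009-08-28 08:53:28. "
    "frame-timeout = 1800",
]

if __name__ == "__main__":
    for entry in map(parse_call_bail, LOG_LINES):
        print(entry)
```

For both entries the LOOKUP sent to data2 waited 1810 seconds, just
past the 1800 second frame-timeout. If I read this correctly, the
client kept each request pending for about 30 minutes before bailing
out, which would match the df and ls hangs we saw.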