Hi

>>> As far as debugging the system hang is concerned, you need to be
>>> looking for kernel logs and dmesg output. You really are wasting your
>>> time trying to debug a kernel fs hang by looking for logs from a user
>>> application. The kernel oops backtrace shows you exactly where the
>>> kernel is locking up.
>> there is no oops backtrace and nothing in the logs that could say what
>> is happening apart from the CPU stuck messages
>
> Can you post the complete kernel output from that session?

That is what I sent you on 28/08/2009:

server1 and server2 are Dell PE 2900 computers with Debian (kernel
2.6.26-2 x64) running gluster 2.0.1-1. Each server has 6 SAS disks
unified and exported with gluster, which are mounted as a replicated
gluster fs on all clients.

The problem happened when server2 was under heavy load (8 wrf processes
running) and one client was copying a large number of files to the
replicated gluster fs. We have been running the same setup on the same
computers, but using NFS instead of glusterfs, for almost 2 years
without having this problem.

When the problem happened server2 was totally locked: it responds to
pings and accepts ssh connections, but once the username and password
are correctly entered nothing happens and after some seconds ssh is
disconnected. Direct terminal access (keyboard) is also impossible.
The kernel log shows a "BUG: soft lockup - CPU#1 stuck for ..." for each
core (all for wrf processes) with a trace at different points of wrf.exe
(or other daemons).

Every time we had this problem it was triggered by copying a lot of
files to the gluster fs. It may or may not be related to glusterfs, but
the worst thing is that when this happens, access to the replicated
gluster fs from any client also hangs. In that case df hangs on one of
the glusterfs shares (df cannot be terminated in any way, including
ctrl-c or kill -KILL). Although df shows the other share, any ls
operation on that share hangs, and ls also cannot be terminated in any
way, including ctrl-c or kill -KILL.

The glusterfs log only shows lines like these:

[2009-08-28 09:19:28] E [client-protocol.c:292:call_bail] data2: bailing out frame LOOKUP(32) frame sent = 2009-08-28 08:49:18. frame-timeout = 1800
[2009-08-28 09:23:38] E [client-protocol.c:292:call_bail] data2: bailing out frame LOOKUP(32) frame sent = 2009-08-28 08:53:28. frame-timeout = 1800

Once server2 has been rebooted all gluster fs become available again on
all clients and the hung df and ls processes terminate, but it is
difficult to understand why a replicated share, which should survive the
failure of one server, does not.

--
Best regards ...

----------------------------------------------------------------
   David Saez Padros                http://www.ols.es
   On-Line Services 2000 S.L.       telf    +34 902 50 29 75
----------------------------------------------------------------
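
P.S. The call_bail lines quoted above can be checked against the
frame-timeout they report. The following is only a minimal sketch in
Python, assuming each log entry sits on one line in exactly the layout
quoted above; the function name parse_bail and the regular expression
are illustrative and not part of glusterfs.

```python
import re
from datetime import datetime

# Assumption: the line layout matches the call_bail entries quoted in
# this message. Other glusterfs versions may log a different format.
BAIL_RE = re.compile(
    r"\[(?P<logged>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]\s+E\s+"
    r"\[client-protocol\.c:\d+:call_bail\]\s+(?P<volume>\S+):\s+"
    r"bailing out frame\s+(?P<op>\w+)\(\d+\)\s+"
    r"frame sent = (?P<sent>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\s+"
    r"frame-timeout = (?P<timeout>\d+)"
)
TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_bail(line):
    """Return details of one call_bail entry, or None if it does not match."""
    m = BAIL_RE.search(line)
    if m is None:
        return None
    logged = datetime.strptime(m["logged"], TS_FORMAT)
    sent = datetime.strptime(m["sent"], TS_FORMAT)
    return {
        "volume": m["volume"],
        "op": m["op"],
        # Seconds between sending the frame and giving up on it.
        "waited": int((logged - sent).total_seconds()),
        "timeout": int(m["timeout"]),
    }


if __name__ == "__main__":
    sample = (
        "[2009-08-28 09:19:28] E [client-protocol.c:292:call_bail] data2: "
        "bailing out frame LOOKUP(32) frame sent = 2009-08-28 08:49:18. "
        "frame-timeout = 1800"
    )
    print(parse_bail(sample))
```

For the two entries quoted above this gives a wait of 1810 seconds each,
i.e. the client only gave up on each LOOKUP to data2 after the full
1800 second frame-timeout had passed.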