Replication not working on server hang

david at ols.es (David Saez Padros) · Fri, 28 Aug 2009 10:49:32 +0200

Hi

> ----- "David Saez Padros" <david at ols.es> wrote:
>> when we have the problem the server also hangs, to the point that
>> there is no way to access it and we end up rebooting the server to
>> regain access to it
> 
> Do you mean you were unable to log in to the machine over the network? Unable to get a responsive console shell? The machine would not respond to ICMP on the network? Do you still have the logfiles and volfiles, and can you describe the steps to reproduce in a bug report?
> As a rule of thumb, if your server hangs to the degree of not even having a usable shell, it just means that heavy IO via glusterfs triggered some bug in the operating system. Try to get kernel output via dmesg or console logs if you have any. glusterfsd only issues system calls and does not do anything funky with the server. Think of some application local to the server causing such a hang; glusterfsd is no different in that respect.

We had the same problem again with the following setup:

server1 and server2 are Dell PE 2900 computers with Debian
(kernel 2.6.26-2 x64) running gluster 2.0.1-1. Each server
has 6 SAS disks unified and exported with gluster, which are
mounted as a replicated gluster fs on all clients.

The problem happened when server2 was under heavy load (8 wrf
processes running) and one client was copying a large number
of files to the replicated gluster fs. We ran the same setup
on the same computers, using nfs instead of glusterfs, for
almost 2 years without having this problem.

When the problem happened, server2 was totally locked up: it
responded to pings and accepted ssh connections, but after the
username and password were entered correctly nothing happened
and after a few seconds ssh disconnected. Direct terminal access
(keyboard) was also impossible. The kernel log shows a
"BUG: soft lockup - CPU#1 stuck for ..." for each core (all for
wrf processes) with a trace at different points of wrf.exe.
Every time we had this problem it was triggered by copying a
lot of files to the gluster fs.

It may or may not be related to glusterfs, but the worst thing is
that when this happens, access to the replicated gluster fs from any
client also hangs. In that case df hangs on one of the glusterfs
shares (df cannot be terminated in any way, including ctrl-c or
kill -KILL). Although df shows the other share, any ls operation on
that share hangs, and ls likewise cannot be terminated in any way,
including ctrl-c or kill -KILL.
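
A monitoring script can check for this state without getting stuck the way df did, by running the probe in a child process and never waiting on it without a deadline. The following is only a minimal Python sketch of that idea: the function name `probe_responds`, the default timeout, and the mount point `/mnt/data2` in the usage note are hypothetical and not taken from the setup described above.

```python
import subprocess
import time


def probe_responds(command, timeout_s=5.0, poll_interval_s=0.05):
    """Return True if `command` exits within timeout_s seconds, else False.

    subprocess.run(..., timeout=...) is deliberately avoided: on timeout it
    kills the child and then waits for it, and that wait could block forever
    if the child is stuck inside a filesystem call that ignores SIGKILL.
    """
    child = subprocess.Popen(
        command,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if child.poll() is not None:  # child has exited, so the probe answered
            return True
        time.sleep(poll_interval_s)
    # Best effort only: the signal may not take effect while the child is
    # blocked in the kernel, so we do not wait for it afterwards.
    child.kill()
    return False
```

For example, `probe_responds(["stat", "-f", "/mnt/data2"], timeout_s=10)` would be expected to return False on a hung share instead of blocking the caller, although the stuck child process itself may remain until the server recovers.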

The glusterfs log only shows lines like these:

[2009-08-28 09:19:28] E [client-protocol.c:292:call_bail] data2: bailing out frame LOOKUP(32) frame sent = 2009-08-28 08:49:18. frame-timeout = 1800
[2009-08-28 09:23:38] E [client-protocol.c:292:call_bail] data2: bailing out frame LOOKUP(32) frame sent = 2009-08-28 08:53:28. frame-timeout = 1800
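
Each entry carries both the time it was logged and the time the frame was sent, so the time a client spent blocked on each LOOKUP can be read straight from the log. A minimal Python sketch, assuming the log format is exactly as in the two entries above; the names `BAIL_RE` and `parse_bails` are made up for illustration and are not part of glusterfs:

```python
import re
from datetime import datetime

# Matches call_bail entries as they appear above; \s+ after "bailing"
# tolerates the line wrap introduced by the mail client.
BAIL_RE = re.compile(
    r"\[(?P<logged>[\d-]+ [\d:]+)\] E \[client-protocol\.c:\d+:call_bail\] "
    r"(?P<volume>\S+): bailing\s+out frame (?P<op>\w+)\(\d+\) "
    r"frame sent = (?P<sent>[\d-]+ [\d:]+)\. frame-timeout = (?P<timeout>\d+)"
)
TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_bails(log_text):
    """Return (volume, operation, seconds_waited, frame_timeout) per entry."""
    entries = []
    for m in BAIL_RE.finditer(log_text):
        logged = datetime.strptime(m["logged"], TS_FORMAT)
        sent = datetime.strptime(m["sent"], TS_FORMAT)
        waited = int((logged - sent).total_seconds())
        entries.append((m["volume"], m["op"], waited, int(m["timeout"])))
    return entries


SAMPLE = """\
[2009-08-28 09:19:28] E [client-protocol.c:292:call_bail] data2: bailing out frame LOOKUP(32) frame sent = 2009-08-28 08:49:18. frame-timeout = 1800
[2009-08-28 09:23:38] E [client-protocol.c:292:call_bail] data2: bailing out frame LOOKUP(32) frame sent = 2009-08-28 08:53:28. frame-timeout = 1800
"""

if __name__ == "__main__":
    for volume, op, waited, timeout in parse_bails(SAMPLE):
        print(f"{volume}: {op} waited {waited}s (frame-timeout {timeout}s)")
```

For both entries the difference is 1810 seconds, which suggests that each lookup stayed blocked for the full 1800-second frame-timeout (about 30 minutes) before the client gave up on it.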

Once server2 has been rebooted, every gluster fs becomes available
again on all clients and the hung df and ls processes terminate.
Still, it is difficult to understand why a replicated share, which
should survive the failure of one server, does not.

-- 
Best regards ...

----------------------------------------------------------------
    David Saez Padros                http://www.ols.es
    On-Line Services 2000 S.L.       telf    +34 902 50 29 75
----------------------------------------------------------------