The continuing story ...

david at ols.es (David Saez Padros) · Wed, 09 Sep 2009 11:00:14 +0200

Hi

>>> As far as debugging the system hang is concerned, you need to be
>>> looking for kernel logs and dmesg output. You really are wasting your
>>> time trying to debug a kernel fs hang by looking for logs from a user
>>> application. The kernel oops backtrace shows you exactly where the
>>> kernel is locking up.
>> there is no oops backtrace and nothing in the logs that could say
>> what is happening apart from the CPU stuck messages
> 
> Can you post the complete kernel output from that session?

That is what I sent you on 28/08/2009:

server1 and server2 are Dell PE 2900 computers with Debian
(kernel 2.6.26-2 x64) running gluster 2.0.1-1. Each server
has 6 SAS disks unified and exported with gluster, which are
mounted as a replicated gluster fs on all clients.

The problem happened when server2 was under heavy load (8 wrf
processes running) and one client was copying a large number
of files to the replicated gluster fs. We have been running
the same setup on the same computers, but using NFS instead
of glusterfs, for almost 2 years without having this problem.

When the problem happened server2 was totally locked: it responded
to pings and accepted ssh connections, but once the username and
password were correctly entered nothing happened and after some
seconds ssh disconnected. Direct terminal access (keyboard) was
also impossible. The kernel log shows a "BUG: soft lockup - CPU#1
stuck for ..." message for each core (all for wrf processes) with a
trace at different points of wrf.exe (or other daemons). Every time
we had this problem it was triggered by copying a lot of files to
the gluster fs.

It may or may not be related to glusterfs, but the worst thing is
that when this happens access to the replicated gluster fs from any
client also hangs. In that case df hangs on one of the glusterfs
shares (df cannot be terminated in any way, including ctrl-c or
kill -KILL). Although df shows the other share, any ls operation on
that share hangs, and ls also cannot be terminated in any way,
including ctrl-c or kill -KILL.
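
A process that ignores kill -KILL is usually in uninterruptible sleep
(state "D"), blocked inside the kernel waiting on I/O. As a rough way
to list such processes on a client, here is a minimal sketch that
assumes a Linux /proc filesystem; the function names parse_stat and
uninterruptible_processes are made up for illustration and are not
part of any gluster tool.

```python
import os

def parse_stat(text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # The file looks like "pid (comm) state ..."; comm may itself contain
    # spaces or parentheses, so split at the LAST closing parenthesis.
    head, _, tail = text.rpartition(")")
    pid_str, _, comm = head.partition("(")
    return int(pid_str), comm, tail.split()[0]

def uninterruptible_processes(proc_root="/proc"):
    """List (pid, comm) for every process currently in state 'D'."""
    if not os.path.isdir(proc_root):
        return []  # not a Linux-style /proc; nothing to scan
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited (or was unreadable) during the scan
        if state == "D":
            found.append((pid, comm))
    return found

if __name__ == "__main__":
    for pid, comm in uninterruptible_processes():
        print(pid, comm)
```

If the hung df and ls show up in that list, they are waiting inside
the kernel and no signal will be delivered until that wait ends.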

The glusterfs log only shows lines like these:

[2009-08-28 09:19:28] E [client-protocol.c:292:call_bail] data2: bailing out frame LOOKUP(32) frame sent = 2009-08-28 08:49:18. frame-timeout = 1800
[2009-08-28 09:23:38] E [client-protocol.c:292:call_bail] data2: bailing out frame LOOKUP(32) frame sent = 2009-08-28 08:53:28. frame-timeout = 1800
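
As a sanity check on those timestamps, here is a minimal sketch that
works out how long each frame waited before the client bailed out.
The regular expression is inferred only from the two lines above and
the name bail_delay is made up for illustration; other glusterfs
versions may format these messages differently.

```python
import re
from datetime import datetime

# Pattern inferred from the two sample log lines only (an assumption).
# "\s+" after "bailing" tolerates the line wrap added by the mail client.
BAIL_RE = re.compile(
    r"\[(?P<logged>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\].*?"
    r"bailing\s+out frame (?P<op>\w+)\(\d+\) "
    r"frame sent = (?P<sent>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\. "
    r"frame-timeout = (?P<timeout>\d+)"
)
TIME_FMT = "%Y-%m-%d %H:%M:%S"

def bail_delay(line):
    """Return (operation, seconds_waited, frame_timeout) or None."""
    m = BAIL_RE.search(line)
    if m is None:
        return None
    logged = datetime.strptime(m["logged"], TIME_FMT)
    sent = datetime.strptime(m["sent"], TIME_FMT)
    waited = int((logged - sent).total_seconds())
    return m["op"], waited, int(m["timeout"])

SAMPLE = (
    "[2009-08-28 09:19:28] E [client-protocol.c:292:call_bail] data2: "
    "bailing out frame LOOKUP(32) frame sent = 2009-08-28 08:49:18. "
    "frame-timeout = 1800"
)

if __name__ == "__main__":
    print(bail_delay(SAMPLE))
```

For both lines the result is 1810 seconds waited against a
frame-timeout of 1800, so each LOOKUP was abandoned only after the
full 30 minute timeout had passed.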

Once server2 has been rebooted all gluster filesystems become
available again on all clients and the hung df and ls processes
terminate, but it is difficult to understand why a replicated share,
which should survive the failure of one server, does not.

-- 
Best regards ...

----------------------------------------------------------------
    David Saez Padros                http://www.ols.es
    On-Line Services 2000 S.L.       telf    +34 902 50 29 75
----------------------------------------------------------------