The continuing story ...

avati at gluster.com (Anand Avati) · Tue, 8 Sep 2009 05:37:09 -0700

>> > I doubt that this can be a real solution. My guess is that glusterfsd runs
>> > into some race condition where it locks itself up completely.
>> > It is not funny to debug something the like on a production setup. Best would
>> > be to have debugging output sent from the servers' glusterfsd directly to a
>> > client to save the logs. I would not count on syslog in this case, if it
>> > survives one could use a serial console for syslog output though.

I'm going to go through this yet again, at the risk of frustrating
you. glusterfsd (on the server side) is just another process that
makes only ordinary system calls. If glusterfsd has a race condition
and locks itself up, then it locks _only its own process_ up. What you
have is a frozen system. There is no way glusterfsd can lock up your
system through plain VFS system calls, even intentionally. It is a
pure user-space process and has no power to lock up the system. The
worst glusterfsd can do to your system is deadlock its own process,
resulting in a hang on the glusterfs fuse mountpoint, or segfault and
leave a core dump.
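
To make the distinction concrete, here is a minimal sketch (not
glusterfsd code; the names HANG_SNIPPET and run_demo are made up for
illustration). It starts a child process that deadlocks itself on a
non-reentrant lock, then shows that the parent keeps running and can
still observe and kill the child. It assumes a Python 3 interpreter is
available as sys.executable.

```python
import subprocess
import sys

# Hypothetical stand-in for a buggy daemon: the child takes a
# non-reentrant lock twice, so its only thread blocks forever.
HANG_SNIPPET = "import threading; l = threading.Lock(); l.acquire(); l.acquire()"


def run_demo(wait_seconds=1.0):
    """Start a self-deadlocking child and report what the parent sees."""
    child = subprocess.Popen([sys.executable, "-c", HANG_SNIPPET])
    try:
        child.wait(timeout=wait_seconds)
        hung = False  # child exited on its own (not expected here)
    except subprocess.TimeoutExpired:
        hung = True   # child is stuck, but the parent is still running
    # The rest of the system is unaffected: the stuck process can be killed.
    child.kill()
    child.wait()
    return hung, child.returncode


if __name__ == "__main__":
    hung, code = run_demo()
    print("child hung:", hung, "exit status after kill:", code)
```

The child hangs, the parent does not. That is the worst case for a
user-space deadlock. A machine that stops responding altogether is a
different kind of failure.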

Please consult system/kernel programmers you trust, or ask on the
kernel-devel mailing list. The system freeze you are facing is not
something that _any_ user-space application can cause. The fact that
the freeze happens only when glusterfsd is running does NOT make
glusterfsd _responsible_ for it. I'm not sure you understand how user
processes and the kernel interact. Consider this almost-perfect
analogy: if you run an ftp daemon on a system and the system freezes
in the way you describe, you blame the kernel, not the ftp daemon.
glusterfsd is no different from an ftp daemon in how much damage it
can potentially do.

glusterfs has other bugs, we admit it, but what you are describing
here is really a problem in the kernel. I say this confidently because
glusterfsd CANNOT freeze a system, even intentionally. It is a
user-space process. If glusterfs has a bug, then it segfaults or the
process hangs. That is fundamentally different from a system lock-up.

As far as your problem is concerned, we can point you to the right
place if you send us the kernel/dmesg logs. Please understand that
even if we wanted to solve your server lock-up problem with some
hypothetical fix in glusterfs, it is just not possible, even
theoretically. The fix you need is not in glusterfs. System lock-ups
are not fixed by changing a user-space application.

> The system acts as pure server for both glusterfs and nfs. It has no fuse nor
> nfs client mount points.

However, if you are facing hangs on the glusterfs fuse mountpoint,
then it is very likely a glusterfs bug. We are very much interested in
hearing about those issues.

Avati