The continuing story ...

avati at gluster.com (Anand Avati) · Wed, 9 Sep 2009 10:25:03 +0530

>  Although it is clear that the bug itself is a kernel bug, it is also
>  clear that glusterfs is triggering that bug. The same system under
>  the same load but using nfs instead of gluster does not have this
>  problem. This problem also does not happen when copying lots of data
>  using scp. Also, I have never seen hangs like this in more than
>  10 years of using unix boxes. But the stranger thing is that this
>  is a bug that can make glusterfs totally unusable, and the developers
>  do not seem to care even about finding out what exactly is causing
>  that problem.

I would like to politely disagree with your final statement. In a
previous thread we have indeed promised that we will be fixing the
timeout techniques to take into consideration the situation where the
backend fs is hanging so that the entire glusterfs volume does not
become unusable.

As far as debugging the system hang is concerned, you need to be
looking for kernel logs and dmesg output. You really are wasting your
time trying to debug a kernel fs hang by looking for logs from a user
application. The kernel oops backtrace shows you exactly where the
kernel is locking up. Take the backtrace to the kernel developers and
they will tell you the next step. It is for this very reason the
kernel supports serial console logging to extract hints when the
system cannot log to files.
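
As a rough illustration of what to look for once you have captured the
kernel log (for example the saved output of dmesg, or the text received
over a serial console enabled with a boot parameter such as
console=ttyS0,115200, where the port and speed depend on your hardware),
here is a minimal Python sketch. The function name and the marker list
are my own assumptions for illustration and are not exhaustive; this is
not part of glusterfs.

```python
import re

# Strings that commonly appear in Linux kernel logs around an oops or a
# hung task. Illustrative only; a real log may use other wording.
MARKERS = re.compile(
    r"(Oops|BUG:|Call Trace:|blocked for more than \d+ seconds)"
)

def find_kernel_trouble(log_text):
    """Return (line_number, line) pairs for lines matching MARKERS."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if MARKERS.search(line):
            hits.append((number, line))
    return hits
```

Feed it the captured log text and start reading at the reported line
numbers; the lines following a "Call Trace:" marker are the backtrace
the kernel developers will ask for.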

It is not that we do not want to help, but there is only so much we
can do as a user application. We issue system calls and process the
result. Programmatically figuring out which system call is hanging
(with weird and awkwardly implemented ad-hoc timeouts in the code)
takes considerable effort, and the hint you get from it is worth far
less than going directly to the heart of the problem: get the kernel
backtrace from a serial console and you will be just one step from
your solution.
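
To make the point about ad-hoc timeouts concrete, here is a minimal
Python sketch of the idea, assuming a helper I am calling
call_with_timeout (a hypothetical name, not anything in glusterfs). It
runs a blocking call in a worker thread and gives up waiting after a
timeout. Note how little it tells you: only that a given call did not
return, not where in the kernel it is stuck.

```python
import os
import threading

def call_with_timeout(func, args=(), timeout=5.0):
    """Run func(*args) in a daemon thread.

    Returns (finished, result). If the call raised, result is the
    exception object. If it did not return in time, (False, None).
    """
    outcome = {}

    def worker():
        try:
            outcome["value"] = func(*args)
        except Exception as exc:
            outcome["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    if thread.is_alive():
        # The call is still blocked. We know which call it was, but the
        # thread stays stuck and we cannot see why from user space.
        return False, None
    if "error" in outcome:
        return True, outcome["error"]
    return True, outcome["value"]

# Example: stat the current directory with a 2 second limit.
finished, result = call_with_timeout(os.stat, args=(".",), timeout=2.0)
print("stat finished:", finished)
```

On a hung backend fs the same stat would come back with finished set to
False, which is about all a user application can learn on its own.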

If you can also post back a link to the thread on the appropriate ML
where you post your kernel backtrace, we would be interested to keep a
watch on it, or provide more (specific) info if found necessary by
those developers. Almost always the kernel backtrace would be
sufficient. That is the correct first step for debugging this problem.

Avati