Hi,

I have experienced similar problems (df hanging, transport endpoint
disconnected, server locked ...) for no apparent reason when copying a
lot of files to the glusterfs file system. I have no idea whether this
is the real cause, but since I changed something in the configuration
the problems have stopped (so far).

I had two servers exporting two unified bricks that were replicated on
the clients (two different replicated bricks). On each client I had two
vol files, one for each gluster brick, but both of them used the same
names for the bricks (the vol files were identical except for the
option remote-subvolume brick values). Once I changed the brick names
so that no name in one vol file matched a name in the other, the
problems disappeared.

I am not sure whether this was the cause, but so far the problem has
not appeared again.

> On Sat, 22 Aug 2009 05:42:45 -0500 (CDT)
> Anand Avati <avati at gluster.com> wrote:
>
>>> It is perfectly clear to us that glusterfs(d) is the reason for the
>>> box becoming instable and producing a hang even on a local fs (you
>>> cannot df on the exported partition for example).
>>> We will therefore continue with debugging as told before.
>> glusterfsd is just another application as far as the backend export
>> filesystem is concerned. If your backend export fs is hung and
>> refuses to respond to df, I would refuse to accept that glusterfsd is
>> guilty. If your backend filesystem ended in that state, it is a bug
>> in the backend fs. glusterfsd is just another application which
>> issues system calls and does not do anything funky at all. If an
>> application issuing system calls is causing the export fs to stop
>> responding to df, it is not the fault of the application. If you can
>> get dmesg output at the time of such a hang, that might have some
>> hard evidence.
>>
>> Avati
>
> Ok, please stay serious.
> As described in my original email from 19th
> effectively _all_ four physical boxes have not-moving (I deny to use "hanging"
> here) gluster processes. The mount points on the clients hang (which made
> bonnies stop), the primary server looks pretty much ok, but does obviously
> serve nothing to the clients, and the secondary has a hanging local fs for
> what causes ever.
> Now can you please elaborate how you come to the conclusion that this complete
> service lock up derives from one hanging fs on one secondary server of a
> replicate setup (which you declare as the cause and I as the effect of locked
> up gluster processes).
>
> ?
>

--
Best regards, and see you soon ...

----------------------------------------------------------------
David Saez Padros                http://www.ols.es
On-Line Services 2000 S.L.       telf +34 902 50 29 75
----------------------------------------------------------------