Re: Too many open files

Anand Avati <avati@xxxxxxxxxxxxx> · Fri, 6 Apr 2007 13:30:47 -0700

Brent,
  can you please send me your spec files? I am able to 'ls' without
any problems and no fd leak is observed. I have loaded just
cluster/afr, and previously had loaded all performance xlators on
both the server and client side together, and in both cases things
worked perfectly fine.

 I'm guessing the encryption makefile issue caused a bad build? (things
were changed in libglusterfs). The makefile is committed now, though
(along with the -l fix). Please do a make uninstall/clean/install,
since quite a chunk of changes have gone in over the last few days.

avati

On Fri, Apr 06, 2007 at 03:33:30PM -0400, Brent A Nelson wrote:
> glusterfsd dies on both nodes almost immediately (I can ls successfully 
> once before it dies, but cd in and they're dead).  The glusterfs processes 
> are still running, but I of course have "Transport endpoint is not 
> connected."
> 
> Also, glusterfsd and glusterfs no longer seem to know where to log by 
> default and refuse to start unless I give the -l option on each.
> 
> Thanks,
> 
> Brent
> 
> On Fri, 6 Apr 2007, Anand Avati wrote:
> 
> >Brent,
> > the fix has been committed. can you please check if it works for you?
> >
> > regards,
> > avati
> >
> >On Thu, Apr 05, 2007 at 02:09:26AM -0400, Brent A Nelson wrote:
> >>That's correct.  I had commented out unify when narrowing down the mtime
> >>bug (which turned out to be writebehind) and then decided I had no reason
> >>to put it back in for this two-brick filesystem.  It was mounted without
> >>unify when this issue occurred.
> >>
> >>Thanks,
> >>
> >>Brent
> >>
> >>On Wed, 4 Apr 2007, Anand Avati wrote:
> >>
> >>>Can you confirm that you were NOT using unify in the setup?
> >>>
> >>>regards,
> >>>avati
> >>>
> >>>
> >>>On Thu, Apr 05, 2007 at 01:09:16AM -0400, Brent A Nelson wrote:
> >>>>Awesome!
> >>>>
> >>>>Thanks,
> >>>>
> >>>>Brent
> >>>>
> >>>>On Wed, 4 Apr 2007, Anand Avati wrote:
> >>>>
> >>>>>Brent,
> >>>>>thank you so much for your efforts of sending the output!
> >>>>>from the log it is clear the leaked fds are all for directories. Indeed
> >>>>>there was an issue with releasedir() call reaching all the nodes. The
> >>>>>fix should be committed today to tla.
> >>>>>
> >>>>>Thanks!!
> >>>>>
> >>>>>avati
> >>>>>
> >>>>>
> >>>>>
> >>>>>On Wed, Apr 04, 2007 at 09:18:48PM -0400, Brent A Nelson wrote:
> >>>>>>I avoided restarting, as this issue would take a while to reproduce.
> >>>>>>
> >>>>>>jupiter01 and jupiter02 are mirrors of each other.  All performance
> >>>>>>translators are in use, except for writebehind (due to the mtime bug).
> >>>>>>
> >>>>>>jupiter01:
> >>>>>>ls -l /proc/26466/fd |wc
> >>>>>>65536  655408 7358168
> >>>>>>See attached for ls -l output.
> >>>>>>
> >>>>>>jupiter02:
> >>>>>>ls -l /proc/3651/fd |wc
> >>>>>>ls -l /proc/3651/fd
> >>>>>>total 11
> >>>>>>lrwx------ 1 root root 64 2007-04-04 20:43 0 -> /dev/null
> >>>>>>lrwx------ 1 root root 64 2007-04-04 20:43 1 -> /dev/null
> >>>>>>lrwx------ 1 root root 64 2007-04-04 20:43 10 -> socket:[2565251]
> >>>>>>lrwx------ 1 root root 64 2007-04-04 20:43 2 -> /dev/null
> >>>>>>l-wx------ 1 root root 64 2007-04-04 20:43 3 -> /var/log/glusterfs/glusterfsd.log
> >>>>>>lrwx------ 1 root root 64 2007-04-04 20:43 4 -> socket:[2255275]
> >>>>>>lrwx------ 1 root root 64 2007-04-04 20:43 5 -> socket:[2249710]
> >>>>>>lr-x------ 1 root root 64 2007-04-04 20:43 6 -> eventpoll:[2249711]
> >>>>>>lrwx------ 1 root root 64 2007-04-04 20:43 7 -> socket:[2255306]
> >>>>>>lr-x------ 1 root root 64 2007-04-04 20:43 8 -> /etc/glusterfs/glusterfs-client.vol
> >>>>>>lr-x------ 1 root root 64 2007-04-04 20:43 9 -> /etc/glusterfs/glusterfs-client.vol
> >>>>>>
> >>>>>>Note that it looks like all those extra directories listed on jupiter01
> >>>>>>were locally rsynced from jupiter01's Lustre filesystems onto the
> >>>>>>glusterfs client on jupiter01.  A very large rsync from a different
> >>>>>>machine to jupiter02 didn't go nuts.
> >>>>>>
> >>>>>>Thanks,
> >>>>>>
> >>>>>>Brent
> >>>>>>
> >>>>>>On Wed, 4 Apr 2007, Anand Avati wrote:
> >>>>>>
> >>>>>>>Brent,
> >>>>>>>I hope the system is still in the same state to dig some info out.
> >>>>>>>To verify that it is a file descriptor leak, can you please run this
> >>>>>>>test. On the server, run ps ax and get the PID of glusterfsd. then do
> >>>>>>>an ls -l on /proc/<pid>/fd/ and please mail the output of that. That
> >>>>>>>should give a precise idea of what is happening.
> >>>>>>>If the system has been reset out of the state, please give us the
> >>>>>>>spec file you are using and the commands you ran (of some major jobs
> >>>>>>>like heavy rsync) so that we will try to reproduce the error in our
> >>>>>>>setup.
> >>>>>>>
> >>>>>>>regards,
> >>>>>>>avati
> >>>>>>>
> >>>>>>>
> >>>>>>>On Wed, Apr 04, 2007 at 01:12:33PM -0400, Brent A Nelson wrote:
> >>>>>>>>I put a 2-node GlusterFS mirror into use internally yesterday, as
> >>>>>>>>GlusterFS was looking pretty solid, and I rsynced a whole bunch of
> >>>>>>>>stuff to it.  Today, however, an ls on any of the three clients gives me:
> >>>>>>>>
> >>>>>>>>ls: /backup: Too many open files
> >>>>>>>>
> >>>>>>>>It looks like glusterfsd hit a limit.  Is this a bug (glusterfs/glusterfsd
> >>>>>>>>forgetting to close files; essentially, a file descriptor leak), or do I
> >>>>>>>>just need to increase the limit somewhere?
> >>>>>>>>
> >>>>>>>>Thanks,
> >>>>>>>>
> >>>>>>>>Brent
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>_______________________________________________
> >>>>>>>>Gluster-devel mailing list
> >>>>>>>>Gluster-devel@xxxxxxxxxx
> >>>>>>>>http://lists.nongnu.org/mailman/listinfo/gluster-devel
> >>>>>>>>
> >>>>>>>
> >>>>>>>--
> >>>>>>>Shaw's Principle:
> >>>>>>>    Build a system that even a fool can use,
> >>>>>>>    and only a fool will want to use it.
> >>>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >>
> >
> >
> 

-- 
Shaw's Principle:
        Build a system that even a fool can use,
        and only a fool will want to use it.
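
The check requested earlier in the thread (list /proc/<pid>/fd for glusterfsd and see what the descriptors point at) can be roughly sketched in Python. This is only an illustration, not anything from the GlusterFS tree: the function name summarize_fds and its categories are made up here, and it assumes a Linux-style /proc where each entry under fd/ is a symlink to the open object.

```python
import os
from collections import Counter


def summarize_fds(fd_dir):
    """Tally the symlink targets in a /proc/<pid>/fd-style directory by kind.

    Hypothetical helper: the name and the categories are illustrative only.
    """
    kinds = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # Entry vanished (descriptor closed) or is not a symlink; skip it.
            continue
        if target.startswith("socket:"):
            kinds["socket"] += 1
        elif os.path.isdir(target):
            kinds["directory"] += 1
        elif os.path.isfile(target):
            kinds["file"] += 1
        else:
            # Devices such as /dev/null, eventpoll handles, pipes, and so on.
            kinds["other"] += 1
    return kinds
```

Run as root against the server process, for example summarize_fds("/proc/26466/fd") with the glusterfsd PID from the jupiter01 listing, a "directory" count that keeps growing would be consistent with the releasedir() problem described above, while a healthy server like the jupiter02 listing should show only a handful of sockets, the log file, and the spec file.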