Re: Too many open files

Anand Avati <avati@xxxxxxxxxxxxx> · Fri, 6 Apr 2007 08:32:37 -0700

Brent,
  The fix has been committed. Can you please check whether it works for you?

  regards,
  avati

On Thu, Apr 05, 2007 at 02:09:26AM -0400, Brent A Nelson wrote:
> That's correct.  I had commented out unify when narrowing down the mtime 
> bug (which turned out to be writebehind) and then decided I had no reason 
> to put it back in for this two-brick filesystem.  It was mounted without 
> unify when this issue occurred.
> 
> Thanks,
> 
> Brent
> 
> On Wed, 4 Apr 2007, Anand Avati wrote:
> 
> >Can you confirm that you were NOT using unify in the setup?
> >
> >regards,
> >avati
> >
> >
> >On Thu, Apr 05, 2007 at 01:09:16AM -0400, Brent A Nelson wrote:
> >>Awesome!
> >>
> >>Thanks,
> >>
> >>Brent
> >>
> >>On Wed, 4 Apr 2007, Anand Avati wrote:
> >>
> >>>Brent,
> >>>Thank you so much for taking the trouble to send the output!
> >>>From the log it is clear that the leaked fds are all for directories.
> >>>Indeed, there was an issue with the releasedir() call reaching all the
> >>>nodes. The fix should be committed to tla today.
> >>>
> >>>Thanks!!
> >>>
> >>>avati
> >>>
> >>>
> >>>
> >>>On Wed, Apr 04, 2007 at 09:18:48PM -0400, Brent A Nelson wrote:
> >>>>I avoided restarting, as this issue would take a while to reproduce.
> >>>>
> >>>>jupiter01 and jupiter02 are mirrors of each other.  All performance
> >>>>translators are in use, except for writebehind (due to the mtime bug).
> >>>>
> >>>>jupiter01:
> >>>>ls -l /proc/26466/fd |wc
> >>>> 65536  655408 7358168
> >>>>See attached for ls -l output.
> >>>>
> >>>>jupiter02:
> >>>>ls -l /proc/3651/fd |wc
> >>>>ls -l /proc/3651/fd
> >>>>total 11
> >>>>lrwx------ 1 root root 64 2007-04-04 20:43 0 -> /dev/null
> >>>>lrwx------ 1 root root 64 2007-04-04 20:43 1 -> /dev/null
> >>>>lrwx------ 1 root root 64 2007-04-04 20:43 10 -> socket:[2565251]
> >>>>lrwx------ 1 root root 64 2007-04-04 20:43 2 -> /dev/null
> >>>>l-wx------ 1 root root 64 2007-04-04 20:43 3 -> /var/log/glusterfs/glusterfsd.log
> >>>>lrwx------ 1 root root 64 2007-04-04 20:43 4 -> socket:[2255275]
> >>>>lrwx------ 1 root root 64 2007-04-04 20:43 5 -> socket:[2249710]
> >>>>lr-x------ 1 root root 64 2007-04-04 20:43 6 -> eventpoll:[2249711]
> >>>>lrwx------ 1 root root 64 2007-04-04 20:43 7 -> socket:[2255306]
> >>>>lr-x------ 1 root root 64 2007-04-04 20:43 8 -> /etc/glusterfs/glusterfs-client.vol
> >>>>lr-x------ 1 root root 64 2007-04-04 20:43 9 -> /etc/glusterfs/glusterfs-client.vol
> >>>>
> >>>>Note that it looks like all those extra directories listed on jupiter01
> >>>>were locally rsynced from jupiter01's Lustre filesystems onto the
> >>>>glusterfs client on jupiter01.  A very large rsync from a different
> >>>>machine to jupiter02 didn't go nuts.
> >>>>
> >>>>Thanks,
> >>>>
> >>>>Brent
> >>>>
> >>>>On Wed, 4 Apr 2007, Anand Avati wrote:
> >>>>
> >>>>>Brent,
> >>>>>I hope the system is still in the same state so we can dig some info out.
> >>>>>To verify that it is a file descriptor leak, can you please run this
> >>>>>test: on the server, run ps ax and get the PID of glusterfsd, then do
> >>>>>an ls -l on /proc/<pid>/fd/ and mail us the output. That should give
> >>>>>a precise idea of what is happening.
> >>>>>If the system is no longer in that state, please send us the spec
> >>>>>file you are using and the commands you ran (for the major jobs, such
> >>>>>as the heavy rsync) so that we can try to reproduce the error in our
> >>>>>setup.
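The check requested above can be scripted. This is a hedged sketch assuming a Linux /proc filesystem; fd_count and fd_summary are hypothetical helper names, and pgrep -o -x glusterfsd (the oldest process named exactly glusterfsd) is just one way to find the PID.

```shell
#!/bin/sh
# Helpers for inspecting a process's open descriptors via /proc (Linux only).

# fd_count PID: number of descriptors PID currently holds.
fd_count() {
    # ls -A prints one entry per line when piped and, unlike ls -l,
    # adds no leading "total" line.
    ls -A "/proc/$1/fd" | wc -l
}

# fd_summary PID: tally descriptor targets, most frequent first.
# Suffixes like "socket:[2565251]" are reduced to "socket" so that sockets,
# pipes, etc. group together; file and directory paths are kept as-is.
fd_summary() {
    for fd in /proc/"$1"/fd/*; do
        readlink "$fd" 2>/dev/null
    done | sed 's/:\[.*]$//' | sort | uniq -c | sort -rn
}
```

On the server, something like pid=$(pgrep -o -x glusterfsd), then fd_count "$pid" and fd_summary "$pid" | head, shows the total and whether one kind of target dominates.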
> >>>>>
> >>>>>regards,
> >>>>>avati
> >>>>>
> >>>>>
> >>>>>On Wed, Apr 04, 2007 at 01:12:33PM -0400, Brent A Nelson wrote:
> >>>>>>I put a 2-node GlusterFS mirror into use internally yesterday, as
> >>>>>>GlusterFS was looking pretty solid, and I rsynced a whole bunch of
> >>>>>>stuff to it.  Today, however, an ls on any of the three clients gives
> >>>>>>me:
> >>>>>>
> >>>>>>ls: /backup: Too many open files
> >>>>>>
> >>>>>>It looks like glusterfsd hit a limit.  Is this a bug
> >>>>>>(glusterfs/glusterfsd forgetting to close files; essentially, a file
> >>>>>>descriptor leak), or do I just need to increase the limit somewhere?
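On the limit question, a hedged sketch that compares a process's open-descriptor count with its soft "Max open files" limit. It assumes a kernel that provides /proc/<pid>/limits (Linux 2.6.24 and later, so newer than the systems in this thread); fd_headroom is a hypothetical name.

```shell
#!/bin/sh
# fd_headroom PID: print "<open descriptors> <soft limit>" for PID.
# The soft limit is the 4th field of the "Max open files" row in
# /proc/PID/limits (fields: Max open files SOFT HARD files).
# The body runs in a subshell, so "open" and "limit" stay local.
fd_headroom() (
    open=$(ls -A "/proc/$1/fd" | wc -l)
    limit=$(awk '/^Max open files/ {print $4}' "/proc/$1/limits")
    echo "$open $limit"
)
```

If the first number is at or near the second, the daemon has exhausted its limit. The limit can be raised with ulimit -n in the shell that starts glusterfsd (going above the hard limit requires root), but if descriptors are leaking, a higher limit only delays the error.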
> >>>>>>
> >>>>>>Thanks,
> >>>>>>
> >>>>>>Brent
> >>>>>>
> >>>>>>
> >>>>>>_______________________________________________
> >>>>>>Gluster-devel mailing list
> >>>>>>Gluster-devel@xxxxxxxxxx
> >>>>>>http://lists.nongnu.org/mailman/listinfo/gluster-devel
> >>>>>>
> >>>>>
> >>>>>--
> >>>>>Shaw's Principle:
> >>>>>     Build a system that even a fool can use,
> >>>>>     and only a fool will want to use it.
> >>>>>
> >>>
> >>>
> >>>
> >>>
> >>
> >
> >
> 
