Re: About file descriptor leak in glusterfsd daemon after network failure

Hi Niels,

We have tested the patch for some days. It works well when the gluster peer status
changes to disconnected. However, if we restore the network just before the peer
status changes to disconnected, we found that glusterfsd still opens a new fd and
leaves the old one unreleased, even after we stop the process that holds the file.

Why does glusterfsd open a new fd instead of reusing the originally opened one?
Does glusterfsd have any mechanism to reclaim such leaked fds?
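
For reference, this is roughly what our service daemon does with the file on the
FUSE mount (a minimal sketch only; the mount path and file name are made up,
adjust them to your setup). While it holds the lock we look at /proc/<pid>/fd of
the glusterfsd processes on both servers, which is where the second, never-closed
fd shows up after the cable is pulled and re-plugged:

/* minimal flock reproducer - hypothetical path, adjust to your mount */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/file.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/mnt/testvol/lockfile";  /* file on the gluster FUSE mount */

    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* non-blocking exclusive lock; after the network flap a second run of
     * this program fails here with EWOULDBLOCK ("Resource temporarily
     * unavailable"), because the brick still holds the lock on the old fd */
    if (flock(fd, LOCK_EX | LOCK_NB) < 0) {
        fprintf(stderr, "flock: %s\n", strerror(errno));
        close(fd);
        return 1;
    }

    printf("lock held by pid %d, check /proc/<glusterfsd pid>/fd on the bricks\n",
           getpid());
    pause();  /* hold the lock until this process is stopped */

    flock(fd, LOCK_UN);
    close(fd);
    return 0;
}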



2014-08-20 21:54 GMT+08:00 Niels de Vos <ndevos@xxxxxxxxxx>:
On Wed, Aug 20, 2014 at 07:16:16PM +0800, Jaden Liang wrote:
> Hi gluster-devel team,
>
> We are running a 2-replica volume on 2 servers. One of our service daemons
> opens a file in the volume with 'flock'. We can see each glusterfsd daemon
> open the replica file on its own server (in /proc/<pid>/fd). When we pull out
> the cable of one server for about 10 minutes and then re-plug it, we find
> that glusterfsd opens a 'NEW' file descriptor while still holding the old
> one that was opened on the first file access.
>
> Then we stop our service daemon, but glusterfsd (on the re-plugged server)
> only closes the new fd and leaves the old fd open, which we think may be an
> fd leak. When we restart our service daemon, it flocks the same file and the
> flock fails with errno 'Resource temporarily unavailable'.
>
> However, this situation does not reproduce every time, but it comes up often.
> We are still looking into the source code of glusterfsd, but it is not an
> easy job, so we want to ask for some help here. Here are our questions:
>
> 1. Has this issue been solved? Or is it a known issue?
> 2. Does anyone know the file descriptor maintenance logic in
> glusterfsd (server-side)? When is an fd closed, and when is it kept open?

I think you are hitting bug 1129787:
- https://bugzilla.redhat.com/show_bug.cgi?id=1129787
   file locks are not released within an acceptable time when
   a fuse-client uncleanly disconnects

There has been a (short) discussion about this earlier, see
http://supercolony.gluster.org/pipermail/gluster-devel/2014-May/040748.html

Updating the proposed change is on my TODO list. In the end, the
network.ping-timeout option should be used to define both the timeout towards
the storage servers (as it does now) and the timeout from the storage servers
to the GlusterFS client.

You can try out the patch at http://review.gluster.org/8065 and see if
the network.tcp-timeout option works for you. Just remember that the
option will get folded into the network.ping-timeout one later on. If you
are interested in sending an updated patch, let me know :)

Cheers,
Niels



--
Best regards,
Jaden Liang
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-devel
