EBADFD with large number of concurrent files

B.Candler at pobox.com (Brian Candler) · Mon, 6 Aug 2012 22:43:24 +0100

I have an application with 48 processes, each of which opens 1000 files
(a different set of files for each process).  The files are on a
distributed gluster volume, spread across two nodes.

It works initially, but after a while some of the processes abort.  perror
prints "File descriptor in bad state" (I think this means EBADFD).
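
For anyone wanting to confirm which errno that perror text belongs to, here
is a minimal sketch, assuming Linux and Python 3. The helper name
errno_names_for is made up for illustration. It also shows that EBADFD is a
different error from the more common EBADF ("Bad file descriptor").

```python
import errno
import os

def errno_names_for(message):
    """Return the names of all errno values whose strerror() text equals message."""
    # errno.errorcode maps numeric codes to symbolic names for this platform.
    return sorted(
        name for code, name in errno.errorcode.items()
        if os.strerror(code) == message
    )

# EBADFD is Linux-specific, so the first lookup may come back empty elsewhere.
print(errno_names_for("File descriptor in bad state"))
print(errno_names_for("Bad file descriptor"))
```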

This is with glusterfs 3.3.0 under Ubuntu 12.04 (on both the storage nodes
and the application servers).

Looking on the two backend bricks, each has two glusterfsd processes.  On
both bricks, the one with the lower pid has 24168 open FDs
(ls /proc/<pid>/fd | wc -l), and also 1.5-2.5GB of RSS.  So it's pretty
clear that glusterfsd keeps one open file handle per file opened by the
client. That's pretty reasonable.
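
The per-process count above can be collected for every glusterfsd at once.
This is only a sketch, assuming Linux /proc and Python 3, run as a user
allowed to read the target processes' fd directories. The function name
open_fd_counts is illustrative.

```python
import os

def open_fd_counts(process_name):
    """Return {pid: open FD count} for processes whose comm equals process_name."""
    counts = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/sys
        try:
            with open(f"/proc/{entry}/comm") as f:
                if f.read().strip() != process_name:
                    continue
            # Same number as `ls /proc/<pid>/fd | wc -l`
            counts[int(entry)] = len(os.listdir(f"/proc/{entry}/fd"))
        except (FileNotFoundError, ProcessLookupError, PermissionError):
            # The process exited mid-scan, or its fd directory is not readable
            continue
    return counts

for pid, count in sorted(open_fd_counts("glusterfsd").items()):
    print(pid, count)
```

Running it periodically on each storage node would show whether the count
keeps growing up to the point where the clients start failing.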

I don't think I'm hitting a system limit for this:

# cat /proc/sys/fs/file-max
808870

and it's clearly working for the first few minutes.  So I wonder if anyone
has any other suggestions for why EBADFD is getting returned after a while?
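
Note that /proc/sys/fs/file-max is the system-wide ceiling. The per-process
limit (RLIMIT_NOFILE) is a separate setting and can be read from
/proc/<pid>/limits. A minimal sketch for comparing the two against a
process's current usage, assuming Linux and Python 3 (the function names
are illustrative, not part of any gluster tooling):

```python
import os

def nofile_limits(pid="self"):
    """Return (soft, hard) 'Max open files' limits from /proc/<pid>/limits."""
    with open(f"/proc/{pid}/limits") as f:
        for line in f:
            if line.startswith("Max open files"):
                # Fields after splitting: Max, open, files, soft, hard, units
                soft, hard = line.split()[3:5]
                return tuple(
                    None if v == "unlimited" else int(v) for v in (soft, hard)
                )
    raise RuntimeError("no 'Max open files' line found")

def system_file_max():
    """Return the system-wide limit from /proc/sys/fs/file-max."""
    with open("/proc/sys/fs/file-max") as f:
        return int(f.read())

def fd_headroom(pid="self"):
    """Soft limit minus current open FD count; None if the limit is unlimited."""
    soft, _hard = nofile_limits(pid)
    if soft is None:
        return None
    return soft - len(os.listdir(f"/proc/{pid}/fd"))

print("per-process (soft, hard):", nofile_limits())
print("system-wide file-max:", system_file_max())
print("headroom for this process:", fd_headroom())
```

Passing a glusterfsd pid instead of the default "self" would show how close
that process is to its own limit. Exhausting the per-process limit normally
shows up as EMFILE ("Too many open files") rather than EBADFD, so this is
mainly a way to rule that cause out.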

Thanks,

Brian.