Thanks, Jeff!

I ran readdir.c on all 23 bricks on the gluster nfs server to which my
test clients are connected (one client that's working, and one that's
not; and I ran it on those, too). The results are attached.

The values it prints are all well within 32 bits, *except* for one
that's suspiciously the max 32-bit signed int:

$ cat readdir.out.* | awk '{print $1}' | sort | uniq | tail
0x000000000000fd59
0x000000000000fd6b
0x000000000000fd7d
0x000000000000fd8f
0x000000000000fda1
0x000000000000fdb3
0x000000000000fdc5
0x000000000000fdd7
0x000000000000fde8
0x000000007fffffff

That outlier is the same subdirectory on all 23 bricks. Could this be
the issue?

Thanks,

John

On Fri, Jun 14, 2013 at 11:05 AM, John Brunelle
<john_brunelle at harvard.edu> wrote:
> Thanks for the reply, Vijay. I set that parameter to "On", but it
> hasn't helped, and in fact things seem a bit worse. After making the
> change on the volume and dropping caches on some test clients, some
> now see no subdirectories at all. In my tests before, clients went
> back to seeing all the subdirectories after dropping caches, and only
> after a while did they start disappearing again (and the count had
> never dropped to zero before).
>
> Any other ideas?
>
> Thanks,
>
> John
>
> On Fri, Jun 14, 2013 at 10:35 AM, Vijay Bellur <vbellur at redhat.com> wrote:
>> On 06/13/2013 03:38 PM, John Brunelle wrote:
>>>
>>> Hello,
>>>
>>> We're having an issue with our distributed gluster filesystem:
>>>
>>> * gluster 3.3.1 servers and clients
>>> * distributed volume -- 69 bricks (4.6T each) split evenly across 3 nodes
>>> * xfs backend
>>> * nfs clients
>>> * nfs.enable-ino32: On
>>>
>>> * servers: CentOS 6.3, 2.6.32-279.14.1.el6.centos.plus.x86_64
>>> * clients: CentOS 5.7, 2.6.18-274.12.1.el5
>>>
>>> We have a directory containing 3,343 subdirectories. On some clients,
>>> ls lists only a subset of the directories (a different number on
>>> different clients).
>>> On others, ls gets stuck in a getdents loop and
>>> consumes more and more memory until it hits ENOMEM. On yet others,
>>> it works fine. Having the bad clients remount or drop caches makes
>>> the problem temporarily go away, but eventually it comes back. The
>>> issue sounds a lot like bug #838784, but we are using xfs on the
>>> backend, and this seems like more of a client issue.
>>
>>
>> Turning on "cluster.readdir-optimize" can help readdir when a directory
>> contains a number of sub-directories and there are more bricks in the
>> volume. Do you observe any change with this option enabled?
>>
>> -Vijay
>>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: readdir_output.tar.bz2
Type: application/x-bzip2
Size: 327378 bytes
Desc: not available
URL: <http://supercolony.gluster.org/pipermail/gluster-users/attachments/20130614/f91c9ec0/attachment-0001.bz2>
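[For readers following along: Jeff's actual readdir.c isn't included in
this thread, but based on the output format above it presumably walks a
brick directory and prints each entry's d_off. A minimal sketch of such
a helper might look like the following; dump_offsets is a hypothetical
name, and the glibc-specific d_off field of struct dirent is assumed. A
maximum d_off pinned at 0x7fffffff (2^31 - 1) is exactly what a 64-bit
directory offset truncated to a 32-bit signed value would look like.]

```c
#define _GNU_SOURCE  /* expose d_off in struct dirent on glibc */
#include <dirent.h>
#include <stdio.h>

/* Print "0x%016llx name" for every entry in path, matching the
 * formatting of the readdir.out.* files above, and report the
 * largest d_off seen. Returns 0 on success, -1 if opendir fails. */
static int dump_offsets(const char *path, long long *max_off)
{
    DIR *dir = opendir(path);
    if (!dir)
        return -1;

    struct dirent *de;
    long long max = 0;
    while ((de = readdir(dir)) != NULL) {
        long long off = (long long)de->d_off;
        printf("0x%016llx %s\n", (unsigned long long)off, de->d_name);
        if (off > max)
            max = off;  /* 0x7fffffff here would flag the outlier */
    }
    closedir(dir);

    if (max_off)
        *max_off = max;
    return 0;
}
```

Running this on each brick's copy of the problem directory and piping
the first column through the sort | uniq pipeline shown above would
reproduce the check; any offset at exactly 0x7fffffff is suspect.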