Hi team,

I hate "me too" emails; they're sometimes not at all constructive. But I feel I really ought to chip in from real-world systems that we use in anger and at massive scale. We also use NFS to "mask" this and other performance issues. The cluster.readdir-optimize option gave us similar results, unfortunately. We reported our other challenge back last summer, but it stalled:

http://www.gluster.org/pipermail/gluster-users/2013-June/036252.html

We now also see a new NFS phenotype, pasted below, which is again causing real heartburn. Small files are always difficult for any FS; it might be worth doing some regression testing with small-file directory scenarios, since it's an easy reproducer on even moderately sized gluster clusters. I understand performance hangs and issues are tough to track down; I just wanted to say that we really do see them, and have tried many things to avoid them. Hope some good progress can be made.

Here's the note from my team:

We were hitting 30-minute timeouts on getxattr/system.posix_acl_access calls on directories in an NFS v3 mount (with the acl option) of a 10-node, 40-brick gluster 3.4.0 volume. strace shows where the client hangs:

$ strace -tt -T getfacl d6h_take1
...
18:43:57.929225 lstat("d6h_take1", {st_mode=S_IFDIR|0755, st_size=7024, ...}) = 0 <0.257107>
18:43:58.186461 getxattr("d6h_take1", "system.posix_acl_access", 0x7fffdf2b9f50, 132) = -1 ENODATA (No data available) <1806.296893>
19:14:04.483556 stat("d6h_take1", {st_mode=S_IFDIR|0755, st_size=7024, ...}) = 0 <0.642362>
19:14:05.126025 getxattr("d6h_take1", "system.posix_acl_default", 0x7fffdf2b9f50, 132) = -1 ENODATA (No data available) <0.000024>
19:14:05.126114 stat("d6h_take1", {st_mode=S_IFDIR|0755, st_size=7024, ...}) = 0 <0.000010>
...

Load on the servers was moderate. While the above was hanging, getfacl worked nearly instantaneously on that directory on all bricks.
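One more data point on the 30-minute figure: the ~1806 s getxattr above, and the "timeout = 1800" in the bail message below, line up with gluster's default network.frame-timeout of 1800 seconds. Lowering it does not fix the underlying hang, but it makes the bails surface faster while debugging. This is a hedged example only; the volume name (ours is holyscratch) and the 600-second value are illustrative:

```shell
# Illustrative only: make call_bail fire after 10 minutes instead of 30.
gluster volume set holyscratch network.frame-timeout 600

# Confirm the reconfigured option is listed:
gluster volume info holyscratch
```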
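Since small-file directories are such an easy reproducer, here's roughly the kind of minimal sketch we run in test. Everything in it is a placeholder, not our production setup: TESTDIR defaults to a local temp directory, and 1000 is an arbitrary file count. Point TESTDIR at a gluster mount to see the latency; on local disk it finishes in well under a second.

```shell
# Hedged reproducer sketch (paths and counts are illustrative placeholders):
# populate one directory with many small files, then time the metadata
# sweep that ls -l / du / getfacl all end up doing.
TESTDIR="${TESTDIR:-$(mktemp -d)}/smallfile_repro"
mkdir -p "$TESTDIR"

for i in $(seq 1 1000); do
    echo "x" > "$TESTDIR/f$i"
done

# On an afflicted volume, the per-entry stat/getxattr storm below is
# where the time goes; locally it is nearly instantaneous.
time ls -l "$TESTDIR" > /dev/null
command -v getfacl > /dev/null && time getfacl "$TESTDIR" > /dev/null

echo "files created: $(ls "$TESTDIR" | wc -l)"
```

Comparing wall-clock time for the same sweep on the bricks versus through the NFS or FUSE mount is what showed us the gap most clearly.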
When it finally hit the 30-minute timeout, gluster logged it in nfs.log:

[2014-03-31 23:14:04.481154] E [rpc-clnt.c:207:call_bail] 0-holyscratch-client-36: bailing out frame type(GlusterFS 3.3) op(GETXATTR(18)) xid = 0x8168809x sent = 2014-03-31 22:43:58.442411. timeout = 1800
[2014-03-31 23:14:04.481233] W [client-rpc-fops.c:1112:client3_3_getxattr_cbk] 0-holyscratch-client-36: remote operation failed: Transport endpoint is not connected. Path: <gfid:b116fb01-b13d-448a-90d0-a8693a98698b> (b116fb01-b13d-448a-90d0-a8693a98698b). Key: (null)

Other than that, we didn't see anything directly related in the nfs or brick logs, or anything out of sorts with the gluster services. A couple of other errors raise eyebrows, but these are on different directories (neighbors of the example above) and at different times:

holyscratch07: /var/log/glusterfs/nfs.log:[2014-03-31 19:30:47.794454] I [dht-layout.c:630:dht_layout_normalize] 0-holyscratch-dht: found anomalies in /ramanathan_lab/dhuh/d9_take2_BGI/Diffreg. holes=1 overlaps=0
holyscratch07: /var/log/glusterfs/nfs.log:[2014-03-31 19:31:47.794447] I [dht-layout.c:630:dht_layout_normalize] 0-holyscratch-dht: found anomalies in /ramanathan_lab/dhuh/d9_take2_BGI/Diffreg. holes=1 overlaps=0
holyscratch07: /var/log/glusterfs/nfs.log:[2014-03-31 19:33:47.802135] I [dht-layout.c:630:dht_layout_normalize] 0-holyscratch-dht: found anomalies in /ramanathan_lab/dhuh/d9_take2_BGI/Diffreg. holes=1 overlaps=0
holyscratch07: /var/log/glusterfs/nfs.log:[2014-03-31 19:34:47.802182] I [dht-layout.c:630:dht_layout_normalize] 0-holyscratch-dht: found anomalies in /ramanathan_lab/dhuh/d9_take2_BGI/Diffreg. holes=1 overlaps=0
holyscratch07: /var/log/glusterfs/nfs.log:[2014-03-31 19:36:47.764329] I [dht-layout.c:630:dht_layout_normalize] 0-holyscratch-dht: found anomalies in /ramanathan_lab/dhuh/d9_take2_BGI/Diffreg.
holes=1 overlaps=0
holyscratch07: /var/log/glusterfs/nfs.log:[2014-03-31 19:37:47.773164] I [dht-layout.c:630:dht_layout_normalize] 0-holyscratch-dht: found anomalies in /ramanathan_lab/dhuh/d9_take2_BGI/Diffreg. holes=1 overlaps=0
holyscratch07: /var/log/glusterfs/nfs.log:[2014-03-31 19:39:47.774285] I [dht-layout.c:630:dht_layout_normalize] 0-holyscratch-dht: found anomalies in /ramanathan_lab/dhuh/d9_take2_BGI/Diffreg. holes=1 overlaps=0
holyscratch07: /var/log/glusterfs/nfs.log:[2014-03-31 19:40:47.780338] I [dht-layout.c:630:dht_layout_normalize] 0-holyscratch-dht: found anomalies in /ramanathan_lab/dhuh/d9_take2_BGI/Diffreg. holes=1 overlaps=0
holyscratch07: /var/log/glusterfs/nfs.log:[2014-03-31 19:42:47.730345] I [dht-layout.c:630:dht_layout_normalize] 0-holyscratch-dht: found anomalies in /ramanathan_lab/dhuh/d9_take2_BGI/Diffreg. holes=1 overlaps=0
holyscratch08: /var/log/glusterfs/bricks/holyscratch08_03-brick.log:[2014-03-31 00:57:51.973565] E [posix-helpers.c:696:posix_handle_pair] 0-holyscratch-posix: /holyscratch08_03/brick/ramanathan_lab/dhuh/d9_take2_BGI/cuffdiffRN.txt: key:system.posix_acl_access error:Invalid argument
holyscratch08: /var/log/glusterfs/bricks/holyscratch08_03-brick.log:[2014-03-31 01:18:12.345818] E [posix-helpers.c:696:posix_handle_pair] 0-holyscratch-posix: /holyscratch08_03/brick/ramanathan_lab/dhuh/d9_take2_BGI/cuffdiffRN.txt: key:system.posix_acl_access error:Invalid argument
holyscratch05: /var/log/glusterfs/bricks/holyscratch05_04-brick.log:[2014-03-31 21:16:37.057674] E [posix-helpers.c:696:posix_handle_pair] 0-holyscratch-posix: /holyscratch05_04/brick/ramanathan_lab/dhuh/d9_take2_BGI/Diffreg/cuffdiffRN.txt: key:system.posix_acl_access error:Invalid argument

--
dr. james cuff, assistant dean for research computing, harvard university | division of science | thirty eight oxford street, cambridge. ma. 02138 | +1 617 384 7647 | http://rc.fas.harvard.edu

On Wed, Apr 9, 2014 at 9:52 AM, <james.bellinger@xxxxxxxxxxxxxxxx> wrote:
> I am seeing something perhaps similar: 3.4.2-1, 2 servers, each with 1
> brick, replicated. A du of a local (ZFS) directory tree of 297834 files
> and 525GB takes about 17 minutes. A du of the gluster copy is still not
> finished after 22 hours. Network activity had been about 5-6KB/sec until
> (I gather) du hit a directory with 22450 files, when activity jumped to
> 300KB/sec (200 packets/sec) for about 15-20 minutes. If I assume that the
> spike came from scanning the two largest directories, that works out to
> about 8K of traffic per file, and about 5 packets.
>
> A 3.3.2 gluster installation that we are trying to retire is not afflicted
> this way.
>
> James Bellinger
>
>> Am I the only person using Gluster suffering from very slow directory
>> access? It's so seriously bad that it almost makes Gluster unusable.
>>
>> Using NFS instead of the Fuse client masks the problem as long as the
>> directories are cached, but it's still hellishly slow when you first
>> access them.
>>
>> Has there been any progress at all on fixing this bug?
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=1067256
>>
>> Cheers,
>>
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users@xxxxxxxxxxx
>> http://supercolony.gluster.org/mailman/listinfo/gluster-users