Hi James,

This definitely looks worthy of investigation. Could you file a bug? We
need to get our guys on this. Thanks for doing your homework. Send us the
BZ #, and we'll start poking around.

-JM

----- Original Message -----
> Hey Joe!
>
> Yeah, we are all XFS all the time round here - none of that nasty ext4
> combo that we know causes raised levels of mercury :-)
>
> As for brick errors, we have not seen any; we have been busy grepping
> and alerting on anything suspect in our logs. Mind you, there are
> hundreds of brick logs to search through, so I won't claim we haven't
> missed one, but after asking the boys in chat just now they are pretty
> convinced that was not the smoking gun. I'm sure they will chip in on
> this thread if there is anything.
>
> j.
>
> --
> dr. james cuff, assistant dean for research computing, harvard
> university | division of science | thirty eight oxford street,
> cambridge. ma. 02138 | +1 617 384 7647 | http://rc.fas.harvard.edu
>
> On Wed, Apr 9, 2014 at 10:36 AM, Joe Julian <joe@xxxxxxxxxxxxxxxx> wrote:
> > What's the backend filesystem?
> > Were there any brick errors, probably around 2014-03-31 22:44:04 (half
> > an hour before the frame timeout)?
> >
> > On April 9, 2014 7:10:58 AM PDT, James Cuff <james_cuff@xxxxxxxxxxx> wrote:
> >> Hi team,
> >>
> >> I hate "me too" emails - sometimes they're not at all constructive -
> >> but I feel I really ought to chip in from real-world systems that we
> >> use in anger and at massive scale here.
> >>
> >> So we also use NFS to "mask" this and other performance issues. The
> >> cluster.readdir-optimize option gave us similar results, unfortunately.
> >>
> >> We reported our other challenge back last summer, but we stalled on this:
> >>
> >> http://www.gluster.org/pipermail/gluster-users/2013-June/036252.html
> >>
> >> We also unfortunately now see a new NFS phenotype, pasted below, which
> >> is again causing real heartburn.
> >>
> >> Small files are always difficult for any FS; it might be worth doing
> >> some regression testing with small-file directory scenarios in test -
> >> it's an easy reproducer on even moderately sized gluster clusters. I
> >> hope some good progress can be made, and I understand performance
> >> hangs and issues like this are tough to track down. I just wanted to
> >> say that we really do see them, and have tried many things to avoid
> >> them.
> >>
> >> Here's the note from my team:
> >>
> >> We were hitting 30-minute timeouts on getxattr/system.posix_acl_access
> >> calls on directories in an NFS v3 mount (w/ acl option) of a 10-node,
> >> 40-brick gluster 3.4.0 volume. Strace shows where the client hangs:
> >>
> >> $ strace -tt -T getfacl d6h_take1
> >> ...
> >> 18:43:57.929225 lstat("d6h_take1", {st_mode=S_IFDIR|0755,
> >>   st_size=7024, ...}) = 0 <0.257107>
> >> 18:43:58.186461 getxattr("d6h_take1", "system.posix_acl_access",
> >>   0x7fffdf2b9f50, 132) = -1 ENODATA (No data available) <1806.296893>
> >> 19:14:04.483556 stat("d6h_take1", {st_mode=S_IFDIR|0755,
> >>   st_size=7024, ...}) = 0 <0.642362>
> >> 19:14:05.126025 getxattr("d6h_take1", "system.posix_acl_default",
> >>   0x7fffdf2b9f50, 132) = -1 ENODATA (No data available) <0.000024>
> >> 19:14:05.126114 stat("d6h_take1", {st_mode=S_IFDIR|0755,
> >>   st_size=7024, ...}) = 0 <0.000010>
> >> ...
> >>
> >> Load on the servers was moderate. While the above was hanging, getfacl
> >> worked nearly instantaneously on that directory on all bricks.
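> >>
> >> (For illustration, a minimal sketch of that brick-side spot check -
> >> the host range and the brick-directory glob below are only guesses
> >> based on the log excerpts further down, and the directory path is a
> >> placeholder:)
> >>
> >> $ dir=path/to/d6h_take1   # placeholder: volume-relative path of the hanging dir
> >> $ for h in holyscratch{01..10}; do
> >>     ssh "$h" "for b in /${h}_*/brick; do echo \$b; time getfacl \$b/$dir >/dev/null; done"
> >>   done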
> >>
> >> When it finally hit the 30-minute timeout, gluster logged it in
> >> nfs.log:
> >>
> >> [2014-03-31 23:14:04.481154] E [rpc-clnt.c:207:call_bail]
> >>   0-holyscratch-client-36: bailing out frame type(GlusterFS 3.3)
> >>   op(GETXATTR(18)) xid = 0x8168809x sent = 2014-03-31 22:43:58.442411.
> >>   timeout = 1800
> >> [2014-03-31 23:14:04.481233] W
> >>   [client-rpc-fops.c:1112:client3_3_getxattr_cbk]
> >>   0-holyscratch-client-36: remote operation failed: Transport endpoint
> >>   is not connected. Path: <gfid:b116fb01-b13d-448a-90d0-a8693a98698b>
> >>   (b116fb01-b13d-448a-90d0-a8693a98698b). Key: (null)
> >>
> >> Other than that, we didn't see anything directly related in the nfs or
> >> brick logs, or anything out of sorts with the gluster services. A
> >> couple of other errors raise eyebrows, but these are for different
> >> directories (neighbors of the example above) and at different times:
> >>
> >> holyscratch07: /var/log/glusterfs/nfs.log:[2014-03-31 19:30:47.794454]
> >>   I [dht-layout.c:630:dht_layout_normalize] 0-holyscratch-dht: found
> >>   anomalies in /ramanathan_lab/dhuh/d9_take2_BGI/Diffreg. holes=1 overlaps=0
> >> holyscratch07: /var/log/glusterfs/nfs.log:[2014-03-31 19:31:47.794447]
> >>   I [dht-layout.c:630:dht_layout_normalize] 0-holyscratch-dht: found
> >>   anomalies in /ramanathan_lab/dhuh/d9_take2_BGI/Diffreg. holes=1 overlaps=0
> >> holyscratch07: /var/log/glusterfs/nfs.log:[2014-03-31 19:33:47.802135]
> >>   I [dht-layout.c:630:dht_layout_normalize] 0-holyscratch-dht: found
> >>   anomalies in /ramanathan_lab/dhuh/d9_take2_BGI/Diffreg. holes=1 overlaps=0
> >> holyscratch07: /var/log/glusterfs/nfs.log:[2014-03-31 19:34:47.802182]
> >>   I [dht-layout.c:630:dht_layout_normalize] 0-holyscratch-dht: found
> >>   anomalies in /ramanathan_lab/dhuh/d9_take2_BGI/Diffreg. holes=1 overlaps=0
> >> holyscratch07: /var/log/glusterfs/nfs.log:[2014-03-31 19:36:47.764329]
> >>   I [dht-layout.c:630:dht_layout_normalize] 0-holyscratch-dht: found
> >>   anomalies in /ramanathan_lab/dhuh/d9_take2_BGI/Diffreg. holes=1 overlaps=0
> >> holyscratch07: /var/log/glusterfs/nfs.log:[2014-03-31 19:37:47.773164]
> >>   I [dht-layout.c:630:dht_layout_normalize] 0-holyscratch-dht: found
> >>   anomalies in /ramanathan_lab/dhuh/d9_take2_BGI/Diffreg. holes=1 overlaps=0
> >> holyscratch07: /var/log/glusterfs/nfs.log:[2014-03-31 19:39:47.774285]
> >>   I [dht-layout.c:630:dht_layout_normalize] 0-holyscratch-dht: found
> >>   anomalies in /ramanathan_lab/dhuh/d9_take2_BGI/Diffreg. holes=1 overlaps=0
> >> holyscratch07: /var/log/glusterfs/nfs.log:[2014-03-31 19:40:47.780338]
> >>   I [dht-layout.c:630:dht_layout_normalize] 0-holyscratch-dht: found
> >>   anomalies in /ramanathan_lab/dhuh/d9_take2_BGI/Diffreg. holes=1 overlaps=0
> >> holyscratch07: /var/log/glusterfs/nfs.log:[2014-03-31 19:42:47.730345]
> >>   I [dht-layout.c:630:dht_layout_normalize] 0-holyscratch-dht: found
> >>   anomalies in /ramanathan_lab/dhuh/d9_take2_BGI/Diffreg. holes=1 overlaps=0
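> >>
> >> (If those layout holes turn out to be real rather than transient, the
> >> usual remedy would be a fix-layout rebalance - a sketch only, with the
> >> volume name "holyscratch" inferred from the 0-holyscratch-dht
> >> translator name in the logs:)
> >>
> >> $ gluster volume rebalance holyscratch fix-layout start
> >> $ gluster volume rebalance holyscratch status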
> >>
> >> holyscratch08: /var/log/glusterfs/bricks/holyscratch08_03-brick.log:
> >>   [2014-03-31 00:57:51.973565] E [posix-helpers.c:696:posix_handle_pair]
> >>   0-holyscratch-posix:
> >>   /holyscratch08_03/brick/ramanathan_lab/dhuh/d9_take2_BGI/cuffdiffRN.txt:
> >>   key:system.posix_acl_access error:Invalid argument
> >> holyscratch08: /var/log/glusterfs/bricks/holyscratch08_03-brick.log:
> >>   [2014-03-31 01:18:12.345818] E [posix-helpers.c:696:posix_handle_pair]
> >>   0-holyscratch-posix:
> >>   /holyscratch08_03/brick/ramanathan_lab/dhuh/d9_take2_BGI/cuffdiffRN.txt:
> >>   key:system.posix_acl_access error:Invalid argument
> >> holyscratch05: /var/log/glusterfs/bricks/holyscratch05_04-brick.log:
> >>   [2014-03-31 21:16:37.057674] E [posix-helpers.c:696:posix_handle_pair]
> >>   0-holyscratch-posix:
> >>   /holyscratch05_04/brick/ramanathan_lab/dhuh/d9_take2_BGI/Diffreg/cuffdiffRN.txt:
> >>   key:system.posix_acl_access error:Invalid argument
> >>
> >> --
> >> dr. james cuff, assistant dean for research computing, harvard
> >> university | division of science | thirty eight oxford street,
> >> cambridge. ma. 02138 | +1 617 384 7647 | http://rc.fas.harvard.edu
> >>
> >> On Wed, Apr 9, 2014 at 9:52 AM, <james.bellinger@xxxxxxxxxxxxxxxx> wrote:
> >>> I am seeing something perhaps similar: 3.4.2-1, two servers, each
> >>> with one brick, replicated. A du of a local (ZFS) directory tree of
> >>> 297834 files and 525GB takes about 17 minutes. A du of the gluster
> >>> copy is still not finished after 22 hours. Network activity had been
> >>> about 5-6KB/sec until (I gather) du hit a directory with 22450 files,
> >>> when activity jumped to 300KB/sec (200 packets/sec) for about 15-20
> >>> minutes. If I assume that the spike came from scanning the two
> >>> largest directories, that works out to about 8K of traffic and about
> >>> 5 packets per file.
> >>>
> >>> A 3.3.2 gluster installation that we are trying to retire is not
> >>> afflicted this way.
> >>>
> >>> James Bellinger
> >>>
> >>>> Am I the only person using Gluster suffering from very slow
> >>>> directory access? It's so seriously bad that it almost makes
> >>>> Gluster unusable.
> >>>>
> >>>> Using NFS instead of the Fuse client masks the problem as long as
> >>>> the directories are cached, but it's still hellishly slow when you
> >>>> first access them.
> >>>>
> >>>> Has there been any progress at all on fixing this bug?
> >>>>
> >>>> https://bugzilla.redhat.com/show_bug.cgi?id=1067256
> >>>>
> >>>> Cheers,
> >
> > --
> > Sent from my Android device with K-9 Mail. Please excuse my brevity.
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-users