A similar issue was fixed in the master branch recently. Can you apply http://review.gluster.org/4459 to your source, rebuild, and retest, and see if the issue gets fixed for you? It is quite a trivial patch and might even apply as-is to the 3.2.7 source.

Avati

On Mon, Feb 18, 2013 at 11:29 AM, Douglas Colkitt <douglas.colkitt at gmail.com> wrote:

> Hi, I'm running into a rather strange and frustrating bug, and I'm wondering
> if anyone on the mailing list might have some insight into what might be
> causing it. I'm running a cluster of two dozen nodes, where the processing
> nodes are also the gluster bricks (using the SLURM resource manager). Each
> node has the gluster volume mounted natively (not NFS). All nodes are using
> v3.2.7. Each job on a node runs a shell script like so:
>
> containerDir=$1
> groupNum=$2
> mkdir -p $containerDir
> ./generateGroupGen.py $groupNum >$containerDir/$groupNum.out
>
> The following jobs are then run:
>
> runGroupGen [glusterDirectory] 1
> runGroupGen [glusterDirectory] 2
> runGroupGen [glusterDirectory] 3
> ...
>
> Typically about 200 jobs launch within milliseconds of each other, so the
> glusterfs/fuse mount receives a large number of simultaneous directory-create
> and file-create system calls within a very short time.
>
> For some jobs the output file inside the directory exists but contains no
> output. When this occurs, it is always the case that either all jobs on a
> node behave normally or all fail to produce output. It should be noted that
> there are no error messages generated by the processes themselves, and all
> processes on the no-output node exit without an error code. In that sense
> the failure is silent, but it corrupts the data, which is dangerous. The only
> indication of error is entries (on the no-output nodes) in
> /var/log/distrib-glusterfs.log of the form:
>
> [2013-02-18 05:55:31.382279] E [client3_1-fops.c:2228:client3_1_lookup_cbk] 0-volume1-client-16: remote operation failed: Stale NFS file handle
> [2013-02-18 05:55:31.382302] E [client3_1-fops.c:2228:client3_1_lookup_cbk] 0-volume1-client-17: remote operation failed: Stale NFS file handle
> [2013-02-18 05:55:31.382327] E [client3_1-fops.c:2228:client3_1_lookup_cbk] 0-volume1-client-18: remote operation failed: Stale NFS file handle
> [2013-02-18 05:55:31.640791] W [inode.c:1044:inode_path] (-->/usr/lib/glusterfs/3.2.7/xlator/mount/fuse.so(+0xe8fd) [0x7fa8341868fd] (-->/usr/lib/glusterfs/3.2.7/xlator/mount/fuse.so(+0xa6bb) [0x7fa8341826bb] (-->/usr/lib/glusterfs/3.2.7/xlator/mount/fuse.so(fuse_loc_fill+0x1c6) [0x7fa83417d156]))) 0-volume1/inode: no dentry for non-root inode -69777006931: 0a37836d-e9e5-4cc1-8bd2-e8a49947959b
> [2013-02-18 05:55:31.640865] W [fuse-bridge.c:561:fuse_getattr] 0-glusterfs-fuse: 2298073: GETATTR 140360215569520 (fuse_loc_fill() failed)
> [2013-02-18 05:55:31.641672] W [inode.c:1044:inode_path] (-->/usr/lib/glusterfs/3.2.7/xlator/mount/fuse.so(+0xe8fd) [0x7fa8341868fd] (-->/usr/lib/glusterfs/3.2.7/xlator/mount/fuse.so(+0xa6bb) [0x7fa8341826bb] (-->/usr/lib/glusterfs/3.2.7/xlator/mount/fuse.so(fuse_loc_fill+0x1c6) [0x7fa83417d156]))) 0-volume1/inode: no dentry for non-root inode -69777006931: 0a37836d-e9e5-4cc1-8bd2-e8a49947959b
> [2013-02-18 05:55:31.641724] W [fuse-bridge.c:561:fuse_getattr] 0-glusterfs-fuse: 2298079: GETATTR 140360215569520 (fuse_loc_fill() failed)
> ...
>
> Sometimes on these events, and sometimes not, there are also log entries
> (on both normal and abnormal nodes) of the form:
>
> [2013-02-18 03:35:28.679681] I [dht-common.c:525:dht_revalidate_cbk] 0-volume1-dht: mismatching layouts for /inSample/pred/20110831
>
> I understand from reading the mailing list that the dentry errors and the
> mismatched-layout errors are both non-fatal warnings and that the metadata
> will become internally consistent regardless. But these errors only happen
> at times when I'm slamming the glusterfs system with the creation of a bunch
> of small files in a very short burst, as described above. So their presence
> seems to be related to the failure.
>
> I think the issue is almost assuredly related to the delayed propagation of
> glusterfs directory metadata. Some nodes are creating directories
> simultaneously with other nodes, and the two are producing inconsistencies
> in the dht layout information. My hypothesis is that, while Node A is still
> writing, the process of resolving the inconsistencies and propagating the
> metadata from Node B renders the location Node A is writing to disconnected
> from its supposed path (hence the "no dentry" errors).
>
> I've made some effort to go through the glusterfs source code, particularly
> the dht-related files. The way dht normalizes anomalies could be the
> problem, but I've failed to find anything specific.
>
> Has anyone else run into a problem like this, or does anyone have insight
> into what might be causing it or how to avoid it?
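
For reference, a rough sketch of the apply/rebuild/retest procedure Avati suggests, assuming the change at http://review.gluster.org/4459 has been downloaded from its Gerrit page as a plain patch file (the file name 4459.patch, the -p1 strip level, and the source directory name are assumptions; configure options should match those of the existing 3.2.7 build):

cd glusterfs-3.2.7                     # unpacked 3.2.7 source tree (assumed location)
patch -p1 --dry-run < ../4459.patch    # first check whether the patch applies cleanly
patch -p1 < ../4459.patch              # apply it for real
./configure && make                    # rebuild with the same options as the running build
sudo make install                      # reinstall, then remount the volume and retest

If it does not apply cleanly to 3.2.7, Avati notes the patch is trivial, so porting the rejected hunks by hand should be feasible.

On the "how to avoid it" question: one possible mitigation, not taken from this thread and purely a sketch, is to create the container directory once, from a single node, before the job array is submitted, so that 200 concurrent mkdir -p calls do not race on the same gluster path. The mount path, the job count, and the use of sbatch for submission are illustrative assumptions:

containerDir=/mnt/gluster/inSample/pred/20110831    # hypothetical gluster mount path
mkdir -p "$containerDir"                            # create the directory once, up front
for groupNum in $(seq 1 200); do
    sbatch runGroupGen "$containerDir" "$groupNum"  # assumed SLURM submission of the per-group script
done

This does not address the underlying dht race; it only removes the burst of simultaneous directory creations, while the simultaneous file creations inside the directory remain.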