This bug/warning/error has not necessarily been associated with data loss,
but we are finding that our gluster fs is interrupting our cluster jobs with
'Stale NFS file handle' warnings like this (on the client):

[2013-01-03 12:30:59.149230] W [client3_1-fops.c:2630:client3_1_lookup_cbk]
0-gl-client-0: remote operation failed: Stale NFS file handle. Path:
/bio/krthornt/build_div/yak/line06_CY08A/prinses
(3b0aa7b2-bf7f-4b27-b515-32e94b1206e3)

(and 7 more, with timestamps differing by <<1 s).

The dir mentioned existed before the job was asked to read from it, and
shortly after the SGE job failed I checked that the glusterfs (/bio) was
still mounted and that the dir was still r/w.

We are getting these errors infrequently but fairly regularly (a couple of
times a week, usually during a big array job that heavily reads from a
particular dir), and I haven't seen any resolution of the fault besides the
error text being reworded. I know it's not necessarily an NFS problem, but
I haven't seen a fix from the gluster folks.

Our glusterfs on this system is set up like this (over QDR IB / TCP-over-IB):

$ gluster volume info gl

Volume Name: gl
Type: Distribute
Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
Status: Started
Number of Bricks: 8
Transport-type: tcp,rdma
Bricks:
Brick1: bs2:/raid1
Brick2: bs2:/raid2
Brick3: bs3:/raid1
Brick4: bs3:/raid2
Brick5: bs4:/raid1
Brick6: bs4:/raid2
Brick7: bs1:/raid1
Brick8: bs1:/raid2
Options Reconfigured:
auth.allow: 10.2.*.*,10.1.*.*
performance.io-thread-count: 64
performance.quick-read: on
performance.io-cache: on
nfs.disable: on
performance.cache-size: 268435456
performance.flush-behind: on
performance.write-behind-window-size: 1024MB

and otherwise appears to be happy.

We were having a low-level problem with the RAID servers, where this
LSI/3ware error was temporally close (~2 minutes) to the gluster error:

LSI 3DM2 alert -- host: biostor4.oit.uci.edu
Jan 03, 2013 03:32:09PM - Controller 6 ERROR - Drive timeout detected:
encl=1, slot=3

This error seemed to be related to construction around our data center and
the dust that came with it. We have had tens of these LSI/3ware errors with
no related gluster errors or apparent problems with the RAIDs. No drives
were ejected from the RAIDs and the errors did not repeat. 3ware explains:
<http://cholla.mmto.org/computers/3ware/3dm2/en/3DM_2_OLH-8-6.html>

==============================
009h Drive timeout detected
The 3ware RAID controller has a sophisticated recovery mechanism to handle
various types of failures of a disk drive. One such possible failure of a
disk drive is a failure of a command that is pending from the 3ware RAID
controller to complete within a reasonable amount of time. If the 3ware
RAID controller detects this condition, it notifies the user, prior to
entering the recovery phase, by displaying this AEN.

Possible causes of APORT time-outs include a bad or intermittent disk
drive, power cable or interface cable.
==============================

We've checked into this and it doesn't seem to be related, but I thought
I'd bring it up.
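As a minimal sketch (not part of the original post), a client-side check
like the following would distinguish an ESTALE failure on the affected path
from an ordinary unmount or permission problem; /bio is the mount point
mentioned above, everything else is illustrative:

/* Illustrative only: stat() a path on the Gluster mount and report
 * whether a failure is ESTALE, which is what the "Stale NFS file
 * handle" log line above corresponds to. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/bio";  /* mount point from the post */
    struct stat st;

    if (stat(path, &st) == -1) {
        if (errno == ESTALE)
            fprintf(stderr, "%s: stale handle (ESTALE): %s\n", path, strerror(errno));
        else
            fprintf(stderr, "%s: stat failed: %s\n", path, strerror(errno));
        return 1;
    }
    printf("%s: ok (mode %04o)\n", path, (unsigned)st.st_mode & 07777);
    return 0;
}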
hjm

On Thursday, August 23, 2012 09:54:13 PM Joe Julian wrote:
> *Bug 832694* <https://bugzilla.redhat.com/show_bug.cgi?id=832694>
> - ESTALE error text should be reworded
>
> On 08/23/2012 09:50 PM, Kaushal M wrote:
> > The "Stale NFS file handle" message is the default string given by
> > strerror() for errno ESTALE.
> > Gluster uses ESTALE as the errno to indicate that the file being
> > referred to no longer exists, i.e. the reference is stale.
> >
> > - Kaushal
> >
> > On Fri, Aug 24, 2012 at 7:03 AM, Jules Wang <lancelotds at 163.com> wrote:
> >     Hi, Jon,
> >
> >     I also met the same issue, and reported a bug
> >     (https://bugzilla.redhat.com/show_bug.cgi?id=851381).
> >
> >     In the bug report, I share a simple way to reproduce the bug.
> >     Have fun.
> >
> >     Best Regards,
> >     Jules Wang.
> >
> >     At 2012-08-23 23:02:34, "Bùi Hùng Việt" <buihungviet at gmail.com> wrote:
> >         Hi Jon,
> >         I have no answer for you. Just want to share with you guys
> >         that I met the same issue with this message. In my gluster
> >         system, Gluster client log files have a lot of these messages.
> >         I tried to ask and found nothing on the Web. Amazingly,
> >         Gluster has been running for a long time :)
> >
> >         On Thu, Aug 23, 2012 at 8:43 PM, Jon Tegner <tegner at renget.se> wrote:
> >             Hi, I'm a bit curious about error messages of the type
> >             "remote operation failed: Stale NFS file handle". All
> >             clients using the file system use the Gluster Native
> >             Client, so why should a stale NFS file handle be reported?
> >
> >             Regards,
> >
> >             /jon
> >
> > _______________________________________________
> > Gluster-users mailing list
> > Gluster-users at gluster.org
> > http://gluster.org/cgi-bin/mailman/listinfo/gluster-users

---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
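To illustrate Kaushal's point above, a minimal C program (illustrative, not
from the thread) shows that the message is simply the libc string for errno
ESTALE; the exact wording ("Stale NFS file handle" vs. "Stale file handle")
depends on the libc version in use:

/* Illustrative only: print the libc message for ESTALE, the errno that
 * Gluster propagates when a file reference no longer resolves. */
#include <errno.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    printf("ESTALE = %d: %s\n", ESTALE, strerror(ESTALE));
    return 0;
}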