No subject

This bug/warning/error has not necessarily been associated with data loss, but we are finding that our gluster fs is interrupting our cluster jobs with 'Stale NFS file handle' warnings like this (on the client):

[2013-01-03 12:30:59.149230] W [client3_1-fops.c:2630:client3_1_lookup_cbk]
0-gl-client-0: remote operation failed: Stale NFS file handle. Path:
/bio/krthornt/build_div/yak/line06_CY08A/prinses
(3b0aa7b2-bf7f-4b27-b515-32e94b1206e3)

(and 7 more, identical except for timestamps, all within <<1s of each other).
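
(If it helps to quantify these: the warnings are easy to tally from the glusterfs client log, which is named after the mount point, so for our /bio mount it should be something like the following; adjust the path for your setup:)

$ # count the Stale-handle warnings in the client log
$ grep -c 'Stale NFS file handle' /var/log/glusterfs/bio.log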

The dir mentioned existed before the job was asked to read from it, and shortly after the SGE job failed, I checked that the glusterfs (/bio) was still mounted and that the dir was still r/w.
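
(Roughly the sanity checks I mean, in case anyone wants to compare notes; the test file name is just a throwaway:)

$ mount | grep /bio       # glusterfs still mounted?
$ stat /bio/krthornt/build_div/yak/line06_CY08A/prinses   # dir still resolves
$ touch /bio/krthornt/wtest && rm /bio/krthornt/wtest     # still writable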

We are getting these errors infrequently but fairly regularly (a couple of times a week, usually during a big array job that heavily reads from a particular dir), and I haven't seen any resolution of the fault besides the wording of the message being corrected. I know it's not necessarily an NFS problem, but I haven't seen a fix from the gluster folks.
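
(As Kaushal points out in the thread quoted below, the 'NFS' in the message is just libc's strerror() text for ESTALE, so it appears even when NFS is nowhere in the picture. A one-line demo; the exact string depends on your glibc version, and newer versions drop the 'NFS':)

$ python -c 'import os, errno; print(os.strerror(errno.ESTALE))'
Stale NFS file handle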

Our glusterfs on this system is set up like this (over QDR InfiniBand, tcp over IB):

$ gluster volume info gl
Volume Name: gl
Type: Distribute
Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
Status: Started
Number of Bricks: 8
Transport-type: tcp,rdma
Bricks:
Brick1: bs2:/raid1
Brick2: bs2:/raid2
Brick3: bs3:/raid1
Brick4: bs3:/raid2
Brick5: bs4:/raid1
Brick6: bs4:/raid2
Brick7: bs1:/raid1
Brick8: bs1:/raid2
Options Reconfigured:
auth.allow: 10.2.*.*,10.1.*.*
performance.io-thread-count: 64
performance.quick-read: on
performance.io-cache: on
nfs.disable: on
performance.cache-size: 268435456
performance.flush-behind: on
performance.write-behind-window-size: 1024MB

and otherwise appears to be happy.
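
(A quick way to confirm that much; on gluster 3.3 and later, this should show all 8 brick processes online:)

$ gluster volume status gl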

We were having a low-level problem with the RAID servers, where this LSI/3ware error was temporally close (~2m) to the gluster error:

LSI 3DM2 alert -- host: biostor4.oit.uci.edu
Jan 03, 2013 03:32:09PM - Controller 6
ERROR - Drive timeout detected: encl=1, slot=3

This error seemed to be related to construction around our data center and the dust that came with it. We have had 10s of these LSI/3ware errors with no related gluster errors or apparent problems with the RAIDs. No drives were ejected from the RAIDs and the errors did not repeat. 3ware explains:
<http://cholla.mmto.org/computers/3ware/3dm2/en/3DM_2_OLH-8-6.html>
================================
009h Drive timeout detected

The 3ware RAID controller has a sophisticated recovery mechanism to handle various types of failures of a disk drive. One such possible failure of a disk drive is a failure of a command that is pending from the 3ware RAID controller to complete within a reasonable amount of time. If the 3ware RAID controller detects this condition, it notifies the user, prior to entering the recovery phase, by displaying this AEN.

Possible causes of APORT time-outs include a bad or intermittent disk drive, power cable or interface cable.
================================
We've checked into this and it doesn't seem to be related, but I thought I'd bring it up.
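
(For anyone watching for the same thing: the 3ware CLI can dump controller and per-port state; this is a sketch assuming the controller enumerates as c6 as in the alert above, and note that port numbers don't always match enclosure slot numbers:)

$ tw_cli /c6 show          # controller, unit, and drive summary
$ tw_cli /c6/p3 show all   # details for a single port/drive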

hjm





On Thursday, August 23, 2012 09:54:13 PM Joe Julian wrote:
> *Bug 832694* <https://bugzilla.redhat.com/show_bug.cgi?id=832694>
> -ESTALE error text should be reworded
>
> On 08/23/2012 09:50 PM, Kaushal M wrote:
> > The "Stale NFS file handle" message is the default string given by
> > strerror() for errno ESTALE.
> > Gluster uses ESTALE as errno to indicate that the file being referred
> > to no longer exists, i.e. the reference is stale.
> >
> > - Kaushal
> >
> > On Fri, Aug 24, 2012 at 7:03 AM, Jules Wang <lancelotds at 163.com
> > <mailto:lancelotds at 163.com>> wrote:
> >     Hi, Jon,
> >
> >         I also met the same issue, and reported a bug
> >         (https://bugzilla.redhat.com/show_bug.cgi?id=851381).
> >
> >         In the bug report, I share a simple way to reproduce the bug.
> >         Have fun.
> >
> >     Best Regards.
> >     Jules Wang.
> >
> >     At 2012-08-23 23:02:34, "Bùi Hùng Việt" <buihungviet at gmail.com
> >     <mailto:buihungviet at gmail.com>> wrote:
> >         Hi Jon,
> >         I have no answer for you. Just want to share with you guys
> >         that I met the same issue with this message. In my gluster
> >         system, Gluster client log files have a lot of these messages.
> >         I tried to ask and found nothing on the Web. Amazingly,
> >         Gluster has been running for a long time :)
> >
> >         On Thu, Aug 23, 2012 at 8:43 PM, Jon Tegner <tegner at renget.se
> >         <mailto:tegner at renget.se>> wrote:
> >             Hi, I'm a bit curious about error messages of the type
> >             "remote operation failed: Stale NFS file handle". All
> >             clients using the file system use the Gluster Native Client,
> >             so why should a stale NFS file handle be reported?
> >
> >             Regards,
> >
> >             /jon

---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)


