I have an IBM GPFS cluster of 8 servers on shared SAN running RHEL5. They use
GPFS's included Clustered NFS (CNFS) to load balance NFS to clients via
round-robin DNS and autofs. This has worked fine for 3 years. Recently I
upgraded one of the boxes to RHEL6. Base GPFS works fine on it. But CNFS
does not. My first issue was that once CNFS was running on the RHEL6 server,
it refused all mounts, logging this error in syslog:

  rpc.mountd[16482]: authenticated mount request from bourget.nmr.mgh.harvard.edu:626 for /gpfs/nsdg01/itgroup (/gpfs/nsdg01/itgroup)
  rpc.mountd[16482]: internal: no supported addresses in nfs_client
  rpc.mountd[16482]: getfh failed: Operation not permitted

I opened a service ticket with IBM but got nowhere with them.
I eventually found this thread on the web about a bug in cltsetup():

  http://comments.gmane.org/gmane.linux.nfs/41432

I applied this patch to the stock RHEL6 nfs-utils source, rebuilt,
and the above problem went away.
The curious thing about this patch is I did not have the problem when running
RHEL6 NFS outside of GPFS nor did IBM have the problem with CNFS on their test
systems on RHEL6. So it is a mystery what was triggering it for me. I
mention this only in case it has bearing on the current problem that has me
stuck.
With CNFS running and allowing mounts on the RHEL6 box, I killed the box.
Everything failed over fine to one of the old RHEL5 servers. Clients that had
mounted NFS shares from the RHEL6 box could still see them. I then brought
the RHEL6 box back up, and it retook the virtual IP assigned to it. But
client mounts then failed with "Stale NFS file handle".
So essentially, NFS failover from a RHEL5 box to the RHEL6 box fails (but
works the other way). And before you ask, yes, /etc/exports is exactly the
same on all boxes, with the same "fsid=" assigned to each share.
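
For anyone wanting to repeat the fsid check rather than eyeball it, here is a
minimal sketch of how it could be scripted. It assumes you have already
collected the text of /etc/exports from each server. The function names are
made up for illustration, and the parsing is deliberately simple: it does not
handle quoted paths or backslash line continuations.

```python
import re

def parse_fsids(exports_text):
    """Map each exported path to the set of fsid= values found on its line."""
    fsids = {}
    for raw in exports_text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        path = line.split()[0]                # first token is the export path
        values = re.findall(r"fsid=([^,)\s]+)", line)
        fsids.setdefault(path, set()).update(values)
    return fsids

def diff_fsids(exports_by_host):
    """Return {path: {host: fsids}} for paths whose fsid sets differ by host."""
    parsed = {host: parse_fsids(text) for host, text in exports_by_host.items()}
    paths = set().union(*(p.keys() for p in parsed.values()))
    mismatches = {}
    for path in sorted(paths):
        per_host = {h: frozenset(p.get(path, ())) for h, p in parsed.items()}
        if len(set(per_host.values())) > 1:   # hosts disagree on this path
            mismatches[path] = per_host
    return mismatches
```

An empty result from diff_fsids() means every host exports every path with the
same fsid values. A path missing from one host also shows up as a mismatch.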
IBM does not see this problem on their test systems. They have no idea
and are just having me do "shot in the dark" upgrades and downgrades
on various things.
I am hoping someone on this list knows what else a "Stale NFS file handle"
can mean in this situation, when it is not an fsid mismatch, and can
point me in a direction of what could be going wrong.
In case it helps, here is a tcpdump of the packets on the server when
the Stale NFS file handle error happens:

  15:08:04.543680 IP bourget.nmr.mgh.harvard.edu.12492987 > gpfstest.nmr.mgh.harvard.edu.nfs: 112 getattr fh Unknown/01000100978A0100000000000000000000000000000000000000000000000000
  15:08:04.543721 IP gpfstest.nmr.mgh.harvard.edu.nfs > bourget.nmr.mgh.harvard.edu.12492987: reply ok 28 getattr ERROR: Stale NFS file handle
  15:08:04.544494 IP bourget.nmr.mgh.harvard.edu.979 > gpfstest.nmr.mgh.harvard.edu.nfs: Flags [.], ack 1811398839, win 1460, options [nop,nop,TS val 2056226500 ecr 1989675411], length 0

Also, if I bring down the RHEL6 box, so the failover occurs again to one
of the RHEL5 boxes, the client mount starts working again.
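
The file handle in the tcpdump above can be picked apart to see which fsid the
client is presenting. This is only a sketch under two assumptions that are not
confirmed anywhere in this thread: that the handle uses the Linux knfsd
"version 1" layout (four one-byte fields for version, auth type, fsid type and
fileid type, followed by the fsid), and that the fsid was stored in
little-endian order, as an x86 server would. The function name is made up.

```python
import struct

def decode_knfsd_fh(hex_handle):
    """Decode the first 8 bytes of a handle, ASSUMING the knfsd v1 layout."""
    raw = bytes.fromhex(hex_handle[:16])      # only the 8-byte header is used
    version, auth_type, fsid_type, fileid_type = raw[:4]
    info = {
        "version": version,
        "auth_type": auth_type,
        "fsid_type": fsid_type,        # 1 is assumed to mean a numeric fsid=
        "fileid_type": fileid_type,    # 0 is assumed to mean the export root
    }
    if version == 1 and fsid_type == 1:
        # Assumption: 32-bit fsid in little-endian (x86 host) byte order.
        info["fsid"] = struct.unpack("<I", raw[4:8])[0]
    return info

print(decode_knfsd_fh(
    "01000100978A0100000000000000000000000000000000000000000000000000"))
```

If those assumptions hold, the handle refers to the root of an export with a
numeric fsid of 101015 (0x00018A97), which can be compared against the "fsid="
value for that share in /etc/exports on the RHEL6 box.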
---------------------------------------------------------------
Paul Raines http://help.nmr.mgh.harvard.edu
MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
149 (2301) 13th Street Charlestown, MA 02129 USA