>> OK
>> I've watched wireshark on cluster1 during start-up of cluster2 (with a
>> kernel > linux-2.6.32), which first tries RPC program 100003 and then
>> 100005. The result is that cluster1 doesn't get a datagram for the
>> port lookup of program 100003:
>> http://net.razik.de/linux/T5120/cluster2_NFSROOT_MOUNT.png
>>
>> The first ARP request in the screenshot came _after_ the <tag> in this
>> kernel log:
>>
>> [ 6492.807917] IP-Config: Complete:
>> [ 6492.807978]      device=eth0, addr=137.226.167.242, mask=255.255.255.224, gw=137.226.167.225,
>> [ 6492.808227]      host=cluster2, domain=, nis-domain=(none),
>> [ 6492.808312]      bootserver=255.255.255.255, rootserver=137.226.167.241, rootpath=
>> [ 6492.808570] Looking up port of RPC 100003/2 on 137.226.167.241
>> [ 6493.886014] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx
>> [ 6493.905840] ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
>> <tag>
>> [ 6527.827055] rpcbind: server 137.226.167.241 not responding, timed out
>> [ 6527.827237] Root-NFS: Unable to get nfsd port number from server, using default
>> [ 6527.827353] Looking up port of RPC 100005/1 on 137.226.167.241
>> [ 6527.842212] VFS: Mounted root (nfs filesystem) on device 0:15.
>>
>> So I don't think that it's a problem of the hardware between the
>> machines. There's no reason why I wouldn't see an ARP request from
>> cluster2 that was sent _before_ the <tag> if there were one. I think
>> cluster2 never sends the request for program 100003.
>> What do you think?
>
> This agrees with our initial assessment that the first RPC request is
> failing. The RPC client never gets the request through cluster2's
> network stack because the NIC hasn't re-initialized when the request
> is sent.
>
> It looks like your system does a PXE boot, which provides the IP
> configuration shown above. But then the kernel resets the NIC. During
> that reset, the kernel is attempting to contact the NFS server to
> mount the root file system.
>
> We've set up NFSROOT to use UDP so that it is relatively immune to
> these initialization-order problems. The RPC client should be retrying
> the lost request, but apparently it isn't. What if you added
> "retrans=10" to cluster2's mount options? (On the chance that the
> mount option setting is copied to the rpcbind client's RPC
> transport...)
>
> IMO the correct way to fix this is to provide proper serialization in
> the networking layer so that RPC requests are not even attempted until
> the NIC is ready to carry traffic. That may be a pipe dream, though.

Many thanks to the three of you for your help! Now I'm sure that I
haven't misconfigured anything...

But I don't see a workaround to get NFSROOT mounted during start-up
with a kernel >= 2.6.37. It would be very sad if no one could use these
nice Oracle (SUN) machines because of this bug.

Do you know a kernel developer who might try to write a patch for this
problem? Or do you have another idea of what I could do?

Regards,
Lukas
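
P.S.: So that I try the "retrans=10" suggestion correctly: if I
understand the kernel's nfsroot documentation right, NFS mount options
for NFSROOT are passed in the "nfsroot=" parameter on the kernel
command line. I would therefore boot cluster2 with something like the
following (an untested guess; /export/cluster2 is only a placeholder
for my real export path, the server address is the rootserver from the
log above, and the existing ip= setting stays as it is):

    root=/dev/nfs nfsroot=137.226.167.241:/export/cluster2,udp,retrans=10

Please correct me if that is not what you meant.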