Am 10.09.2013 um 23:55 schrieb "J. Bruce Fields" <bfields@xxxxxxxxxxxx>: > On Tue, Sep 10, 2013 at 09:48:12PM +0000, Joschi Brauchle wrote: >> Am 10.09.2013 um 22:35 schrieb "J. Bruce Fields" <bfields@xxxxxxxxxxxx>: >> >>> On Tue, Sep 10, 2013 at 08:49:05PM +0200, Joschi Brauchle wrote: >>>> Hello everyone, >>>> >>>> we are administrating an NFS high-availability cluster running on >>>> SLES11SP1 with kernel 2.6.32.59. Just recently, one of the cluster >>>> machines was updated to SLES11SP3 with kernel 3.0.82. >>>> >>>> >>>> We are now experiencing severe hangs on NFS clients when the >>>> SLES11SP3 server is running the NFS services. An strace on the >>>> hanging processes on the client side show that is is waiting up to >>>> 60+ seconds for a "utime()" call to complete. >>>> >>>> >>>> The problem we see is matching the problem described in the thread >>>> "v3.5 nfsd4 regression; utime sometimes takes 40+ seconds to >>>> return". If the NFS server is running on SLES11SP3, the little test >>>> program provided in this tread hangs at the "utime()" call for 60+ >>>> seconds. It hangs each time it is run! It finishes right away with 0 >>>> seconds delay is SLES11SP1 is providing NFS services, each time. >>>> >>>> >>>> Now, in the serverside logfiles of SLES11SP3 we see these messages >>>> (not so on SP1): >>>> -------------- >>>> kernel: [99381.184976] RPC: AUTH_GSS upcall timed out. >>>> kernel: [99381.184978] Please check user daemon is running. >>>> -------------- >>>> >>>> We have always been running the NFS server without rpc.gssd on the >>>> server side, as the init script for the nfsserver also does not >>>> start rpc.gssd. >>>> >>>> >>>> Once we started rpc.gssd on the SLES11SP3 server, using the test >>>> utility on the client shows that the first call to "utime()" >>>> succeeds right away, the second call takes ~25s to complete. But >>>> now, any consecutive runs of the utility finish with no more delay. >>>> >>>> >>>> So can anyone confirm that with kernel 3.0+ the rpc.gssd daemon is >>>> also required on the server side for correct operation? >>>> >>>> Has there been a change between kernel 2.6.32.59 and 3.0.x? >>>> >>>> Thus, is the init script of the nfsserver in SLES11SP3 indeed >>>> missing to start rpc.gssd? >>> >>> It should be starting rpc.gssd to allow callbacks, yes. >>> >>> --b. >> >> Ok, we will run rpc.gssd on the server. Thanks. >> >> Could you please comment on having the nfs clients hang on utime() calls is to be expected when *not* running rpc.gssd? Or is this a problem that needs to be investigated? > > I think what happens is the utime call breaks a delegation, and the > delay is because the lack of gssd prevents the server from calling back > to the client to tell it that its delegation is broken, so the > delegation has to time out. > > That said, the server does a null callback to the client to test whether > callbacks are working before it gives out any delegations, so I'm > surprised it wouldn't have noticed the broken callbacks then. > > --b. Is there any information I can provide to figure this out? At what time is the null callback sent to the client? Maybe I can tcpdump that sequence... Best regards, J Brauchle
Attachment:
smime.p7s
Description: S/MIME cryptographic signature