Debugging a kernel crash in svc_process_common() on the client (NFS 4.1)

Hi,

I am debugging rare kernel crashes in svc_process_common() (in the 
'sunrpc' kernel module) that some of our customers have hit. 
Unfortunately, I am still unable to reproduce them and can see no 
obvious fix in the mainline kernel.

Any hints on how to debug the issue further would be very helpful.

The OS is Virtuozzo 7. The crashes happened with at least two kernels, 
based on the RHEL kernels 3.10.0-862.11.6 and 3.10.0-693.21.1.

The crashes happened when the customers' systems were writing backups of 
some data to NFS shares. NFS 4.1 was used. No RDMA.

The backtrace looks like this:

#0 svc_process_common [sunrpc]
#1 bc_svc_process [sunrpc]
#2 nfs41_callback_svc [nfsv4]
#3 kthread

Each time, the crash happened here:

	/* Setup reply header */
	rqstp->rq_xprt->xpt_ops->xpo_prep_reply_hdr(rqstp);

The 'struct svc_xprt' instance that rqstp->rq_xprt pointed to was filled 
with invalid data, so accessing 
rqstp->rq_xprt->xpt_ops->xpo_prep_reply_hdr caused the crash.
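
For illustration, the pointer chain at the crash site can be pictured 
with a small userspace model. This is only a sketch and not kernel 
code: every model_* name is hypothetical, the structures are reduced to 
the one field each that matters here, and comparing against a single 
expected ops table is a simplification (in the real kernel the table 
depends on the transport type). It shows how a check on the ops pointer 
could catch an overwritten object before the indirect call is made.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct model_rqst;

/* Hypothetical stand-in for the transport ops table. */
struct model_xprt_ops {
	void (*prep_reply_hdr)(struct model_rqst *rqstp);
};

/* Hypothetical stand-in for 'struct svc_xprt' (only the ops pointer). */
struct model_xprt {
	const struct model_xprt_ops *ops;
};

/* Hypothetical stand-in for 'struct svc_rqst'. */
struct model_rqst {
	struct model_xprt *xprt;
	int hdr_prepared;
};

static void model_prep_reply_hdr(struct model_rqst *rqstp)
{
	rqstp->hdr_prepared = 1;
}

static const struct model_xprt_ops model_ops = {
	.prep_reply_hdr = model_prep_reply_hdr,
};

/*
 * Call the reply-header hook only if the ops pointer still refers to the
 * expected table. Returns false, without dereferencing the ops pointer,
 * when the object looks overwritten.
 */
static bool model_prep_reply_checked(struct model_rqst *rqstp)
{
	assert(rqstp != NULL && rqstp->xprt != NULL);
	if (rqstp->xprt->ops != &model_ops)
		return false;
	rqstp->xprt->ops->prep_reply_hdr(rqstp);
	return true;
}

/* Simulate the memory being reused: fill the object with an arbitrary byte. */
static void model_xprt_clobber(struct model_xprt *xprt)
{
	memset(xprt, 0x6b, sizeof(*xprt));
}
```

In this model, model_prep_reply_checked() succeeds on a freshly 
initialized object and returns false after model_xprt_clobber(). A 
similar check placed before the call in svc_process_common() would not 
fix anything, but it might turn the crash into a more informative 
report.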

I checked the crash dumps and found that the memory page allocated for 
that 'struct svc_xprt' (for the 'struct svc_sock' that contains it, to 
be exact) had been given to another, unrelated process by that time. So 
it seems that the processing of the backchannel request on these NFS 
clients could race with something that called svc_xprt_put() for that 
'struct svc_xprt' instance.
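
The suspected race can be pictured with a userspace model of get/put 
reference counting. This is a sketch under stated assumptions, not the 
kernel's implementation: refcounted_xprt and its helpers are 
hypothetical stand-ins for 'struct svc_xprt' and its reference count, 
the object is only marked as freed rather than released, and the 
"concurrent" put is simulated inline instead of in a second thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a reference-counted transport object. */
struct refcounted_xprt {
	atomic_int refcount;
	bool freed;	/* set instead of freeing, so the model can observe it */
};

/* Start with one reference, held by the owner of the object. */
static void refcounted_xprt_init(struct refcounted_xprt *x)
{
	assert(x != NULL);
	atomic_init(&x->refcount, 1);
	x->freed = false;
}

static void refcounted_xprt_get(struct refcounted_xprt *x)
{
	atomic_fetch_add(&x->refcount, 1);
}

/* Drop one reference; the last put "frees" the object. */
static void refcounted_xprt_put(struct refcounted_xprt *x)
{
	if (atomic_fetch_sub(&x->refcount, 1) == 1)
		x->freed = true;
}

/*
 * Model of processing one request while the owner drops its reference.
 * Returns true if the object was still alive at the point of use.
 */
static bool process_request_model(struct refcounted_xprt *x, bool take_ref)
{
	bool alive;

	if (take_ref)
		refcounted_xprt_get(x);

	/* Simulated teardown elsewhere: the owner's reference goes away. */
	refcounted_xprt_put(x);

	/* Point of use, analogous to the xpo_prep_reply_hdr() call. */
	alive = !x->freed;

	if (take_ref)
		refcounted_xprt_put(x);
	return alive;
}
```

With take_ref set to false, the object is already marked freed at the 
point of use; with take_ref set to true, it stays alive until the 
request drops its own reference. Whether the backchannel path in these 
kernels holds such a reference for the whole request is exactly the 
open question.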

At first, I thought it might be a race between a backchannel request 
and umount of the NFS share (although I have no indication that the 
customers' systems tried to unmount it). So I added a delay to 
bc_svc_process(), opened a file on an NFS share from one NFS client, and 
replaced the file from another client to make the server recall the 
delegation and thus trigger a backchannel request. Then I closed the 
files and tried to umount the NFS share. Everything went OK, no crash: 
umount waited until the client had processed the backchannel request, as 
it should have.
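
The two client-side steps of that experiment could be scripted with 
helpers along these lines. This is a minimal sketch with assumptions: 
hold_open() and replace_file() are hypothetical names, replacing the 
file by rename() over it is only one possible way to do the 
replacement, and whether the server grants and then recalls a 
delegation depends on the server. The delay in bc_svc_process() is a 
kernel change and is not shown.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

/*
 * First client: open the file read-only and keep it open for a while,
 * so that the server has a chance to hand out a delegation.
 * Returns 0 on success, -1 on error.
 */
static int hold_open(const char *path, unsigned int seconds)
{
	int fd;

	assert(path != NULL);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	sleep(seconds);
	return close(fd);
}

/*
 * Second client: write new contents to a temporary file and rename it
 * over the original, replacing the file in one step.
 * Returns 0 on success, -1 on error.
 */
static int replace_file(const char *path, const char *tmp_path,
			const char *data)
{
	FILE *f;

	assert(path != NULL && tmp_path != NULL && data != NULL);
	f = fopen(tmp_path, "w");
	if (f == NULL)
		return -1;
	if (fputs(data, f) == EOF) {
		fclose(f);
		return -1;
	}
	if (fclose(f) != 0)
		return -1;
	return rename(tmp_path, path);
}
```

One client would call hold_open() on a file on the share while the 
other calls replace_file() on the same path during the hold period; 
both paths must be on the same NFS mount for rename() to work.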

I am new to this code, so I might be missing something obvious. However, 
at the moment I cannot see how bc_svc_process() could race with the 
freeing of that 'struct svc_sock'.

Any ideas?

Regards,
Evgenii



