On 28 Jun 2019, at 14:33, Alan Post wrote:

> On Fri, Jun 21, 2019 at 02:47:23PM -0600, Alan Post wrote:
>>> Verifying this is the problem could be done by setting up some rolling
>>> network captures, but sometimes it can be hard to not have the capture
>>> fill up with continuing traffic from other processes.
>>>
>>
>> I did go ahead and set up a rolling capture between this NFS
>> server and one rack of clients--I hope I can catch the event as
>> it happens.  Time will tell.
>>
>
> I've run this rolling capture and did catch four candidate events.
> I haven't confirmed any of them are real--I don't really know
> what it is I'm looking for, so I've been approaching the problem
> by incrementally/recursively throwing stuff out and manually
> working through what's left.
>
> As far as I understand it, for a particular xid, there should be a
> call and a reply.  The approach I took then was to pull out these
> fields from my capture and ignore RPC calls where both are present
> in my capture.  It seems this is simplistic, as the number of RPC
> calls I have without an attendant reply isn't lining up with my
> incident window.

Does your capture report dropped packets?  If so, maybe you need to
increase the capture buffer.

There are also the sunrpc:xprt_transmit and sunrpc:xprt_complete_rqst
tracepoints, which should show the xids.

> In one example, I have a series of READ calls which cease
> generating RPC reply messages as the offset for the file continues
> to increase.  After a couple/few dozen messages, the RPC replies
> continue as they were.  Is there a normal or routine explanation
> for this?
>
> RFC 5531 and the NetworkTracing page on wiki.linux-nfs.org have
> been quite helpful bringing me up to speed.  If any of you have
> advice or guidance or can clarify my understanding of how the
> call/reply RPC mechanism works, I'd appreciate it.

Seems like you understand it.  Do you have specific questions?

Ben
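
The xid matching discussed in this thread can be sketched roughly as
follows. This is only an illustration under stated assumptions, not the
method anyone in the thread actually used: the Msg record layout, the
"client" key, and the "grace" parameter are all hypothetical, and the
records are assumed to have already been extracted from the capture by
some other means. The msg_type values (0 for a call, 1 for a reply) are
the ones defined in RFC 5531. Two details are easy to miss: an xid is
only meaningful per client, so calls are keyed on (client, xid), and a
retransmitted call reuses its xid, so it should not count twice.

```python
from collections import namedtuple

# Hypothetical record layout, not the output format of any particular tool.
# "client" is assumed to identify the client endpoint in BOTH directions
# (source of a call, destination of a reply).
Msg = namedtuple("Msg", "time client xid msg_type")

CALL, REPLY = 0, 1  # msg_type values from RFC 5531


def unanswered_calls(msgs, grace=0.0):
    """Return calls with no matching reply in the capture, oldest first.

    grace (hypothetical knob, in seconds) skips calls sent within that
    long of the last captured message, since their replies may simply
    fall outside the capture window.
    """
    pending = {}  # (client, xid) -> first call seen for that key
    last_time = None
    for m in sorted(msgs, key=lambda m: m.time):
        key = (m.client, m.xid)
        last_time = m.time
        if m.msg_type == CALL:
            # A retransmission reuses the xid; keep only the first call.
            pending.setdefault(key, m)
        elif m.msg_type == REPLY:
            pending.pop(key, None)
    if last_time is None:
        return []
    cutoff = last_time - grace
    late = [c for c in pending.values() if c.time <= cutoff]
    return sorted(late, key=lambda c: c.time)
```

If the capture itself dropped packets, a reply that was really sent can
be missing from the input, so a call listed here is only a candidate.
That is one reason the count of unanswered calls may not line up with
an incident window.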