> On Jan 17, 2022, at 10:50 AM, Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
>
> On Sat, Jan 15, 2022 at 07:46:06PM +0000, Chuck Lever III wrote:
>>
>>> On Jan 15, 2022, at 3:14 AM, Jonathan Woithe <jwoithe@xxxxxxxxxx> wrote:
>>>
>>> Hi Chuck
>>>
>>> Thanks for your response.
>>>
>>> On Fri, Jan 14, 2022 at 03:18:01PM +0000, Chuck Lever III wrote:
>>>>> Recently we migrated an NFS server from a 32-bit environment running
>>>>> kernel 4.14.128 to a 64-bit 5.15.x kernel. The NFS configuration remained
>>>>> unchanged between the two systems.
>>>>>
>>>>> On two separate occasions since the upgrade (5 Jan under 5.15.10, 14 Jan
>>>>> under 5.15.12) the kernel has oopsed at around the time that an NFS client
>>>>> machine is turned on for the day. On both occasions the call trace was
>>>>> essentially identical. The full oops sequence is at the end of this email.
>>>>> The oops was not observed when running the 4.14.128 kernel.
>>>>>
>>>>> Is there anything more I can provide to help track down the cause of the
>>>>> oops?
>>>>
>>>> A possible culprit is 7f024fcd5c97 ("Keep read and write fds with each
>>>> nlm_file"), which was introduced in or around v5.15. You could try a
>>>> simple test and back the server down to v5.14.y to see if the problem
>>>> persists.
>>>
>>> I could do this, but only perhaps on Monday when I'm next on site. It may
>>> take a while to get an answer though, since it seems we hit the fault only
>>> around once every 2 weeks. Since it's a production server we are of course
>>> limited in the things I can do.
>>>
>>> I *may* be able to set up another system as an NFS server and hit that with
>>> repeated mount requests. That could help reduce the time we have to wait
>>> for an answer.
>>
>> Given the callback information you provided, I believe that the problem
>> is due to a client reboot, not a mount request. The callback shows the
>> crash occurs while your server is processing an SM_NOTIFY request from
>> one of your clients.
>>
>>
>>> Is it worth considering a revert of 7f024fcd5c97? I guess it depends on how
>>> many later patches depended on it.
>>
>> You can try reverting 7f024fcd5c97, but as I recall there are some
>> subsequent changes that depend on that one.
>
> NLM locking on reexports would stop working. Which is a new (and
> imperfect) feature, so less important than avoiding this NULL
> dereference, if push came to shove. But, let's see if we can just fix
> it.....

Agreed. I was suggesting reverting only as an experiment.

--
Chuck Lever