Re: [Bug report] Recurring oops, 5.15.x, possibly during or soon after client mount

On Sat, Jan 15, 2022 at 07:46:06PM +0000, Chuck Lever III wrote:
> 
> > On Jan 15, 2022, at 3:14 AM, Jonathan Woithe <jwoithe@xxxxxxxxxx> wrote:
> > 
> > Hi Chuck
> > 
> > Thanks for your response.
> > 
> > On Fri, Jan 14, 2022 at 03:18:01PM +0000, Chuck Lever III wrote:
> >>> Recently we migrated an NFS server from a 32-bit environment running 
> >>> kernel 4.14.128 to a 64-bit 5.15.x kernel.  The NFS configuration remained
> >>> unchanged between the two systems.
> >>> 
> >>> On two separate occasions since the upgrade (5 Jan under 5.15.10, 14 Jan
> >>> under 5.15.12) the kernel has oopsed at around the time that an NFS client
> >>> machine is turned on for the day.  On both occasions the call trace was
> >>> essentially identical.  The full oops sequence is at the end of this email. 
> >>> The oops was not observed when running the 4.14.128 kernel.
> >>> 
> >>> Is there anything more I can provide to help track down the cause of the
> >>> oops?
> >> 
> >> A possible culprit is 7f024fcd5c97 ("Keep read and write fds with each
> >> nlm_file"), which was introduced in or around v5.15.  You could try a
> >> simple test and back the server down to v5.14.y to see if the problem
> >> persists.
> > 
> > I could do this, but probably not until Monday when I'm next on site.  It may
> > take a while to get an answer though, since we seem to hit the fault only
> > about once every 2 weeks.  Since it's a production server we are of course
> > limited in what we can do.
> > 
> > I *may* be able to set up another system as an NFS server and hit that with
> > repeated mount requests.  That could help reduce the time we have to wait
> > for an answer.
> 
> Given the call trace you provided, I believe that the problem
> is due to a client reboot, not a mount request. The call trace shows the
> crash occurs while your server is processing an SM_NOTIFY request from
> one of your clients.
> 
> 
> > Is it worth considering a revert of 7f024fcd5c97?  I guess it depends on how
> > many later patches depended on it.
> 
> You can try reverting 7f024fcd5c97, but as I recall there are some
> subsequent changes that depend on that one.

Reverting it would break NLM locking on reexports.  That is a new (and
imperfect) feature, so it matters less than avoiding this NULL
dereference, if push comes to shove.  But let's see if we can just fix
it.

--b.


