On Sun, 2024-07-28 at 03:39 -0700, blokos wrote:
> On 7/28/2024 1:33 AM, Dan Aloni wrote:
> > On 2024-07-28 02:57:42, Hristo Venev wrote:
> > > On Sun, 2024-07-28 at 02:34 +0200, Hristo Venev wrote:
> > > > On Sun, 2024-07-21 at 16:40 +0000, Trond Myklebust wrote:
> > > > > On Sun, 2024-07-21 at 14:03 +0300, Dan Aloni wrote:
> > > > > > On 2024-07-16 16:09:54, Trond Myklebust wrote:
> > > > > > > [..]
> > > > > > > gdb -batch -quiet -ex 'list *(nfs_folio_find_private_request+0x3c)' -ex quit nfs.ko
> > > > > > >
> > > > > > > I suspect this will show that the problem is occurring
> > > > > > > inside the function folio_get_private(), but I'd like to
> > > > > > > be sure that is the case.
> > > > > >
> > > > > > I would suspect that `->private_data` gets corrupted somehow.
> > > > > > Maybe the folio_test_private() call needs to be protected by
> > > > > > either the &mapping->i_private_lock, or the folio lock?
> > > > >
> > > > > If the problem is indeed happening in "folio_get_private()",
> > > > > then the dereferenced address value of 00000000000003a6 would
> > > > > seem to indicate that the pointer value of 'folio' itself is
> > > > > screwed up, doesn't it?
> > > >
> > > > The NULL dereference appears to be at the `WARN_ON_ONCE(req->wb_head
> > > > != req);` check.
> > > >
> > > > On my kernel the offset inside `nfs_folio_find_private_request` is
> > > > +0x3f, but the faulting address is again 0x3a6, meaning that `req`
> > > > is for some reason set to 0x356 (the crash is on
> > > > `cmp %rbp,0x50(%rbp)`).
> > >
> > > ... and 0x356 happens to be NETFS_FOLIO_COPY_TO_CACHE. Maybe the
> > > NETFS_RREQ_USE_PGPRIV2 flag is lost somehow?
> >
> > Seems NETFS_FOLIO_COPY_TO_CACHE relates to fscache use; you are
> > activating that, right?
> >
> > Also, in addition to my suggestion earlier, I think perhaps we need
> > to use `folio_attach_private` and `folio_detach_private` instead of
> > directly using `folio_set_private`, for which the NFS client seems
> > to be the only direct user.

> On my side, yes, fscache is used.

Same here. Disabling caching (by not running cachefilesd; the fsc mount
option is still specified) seems to mitigate the issue. However, we'd
ideally like to keep caching on.
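For what it's worth, the numbers quoted above are internally consistent. A quick sanity check (all values taken from this thread: the 0x356 sentinel reported as NETFS_FOLIO_COPY_TO_CACHE, and the 0x50 offset read off the faulting `cmp %rbp,0x50(%rbp)` instruction, i.e. the `wb_head` field offset on that build):

```python
# If `req` holds the NETFS_FOLIO_COPY_TO_CACHE sentinel instead of a real
# nfs_page pointer, dereferencing req->wb_head at offset 0x50 would touch
# 0x356 + 0x50 = the reported faulting address.
NETFS_FOLIO_COPY_TO_CACHE = 0x356  # sentinel value, as quoted in the thread
WB_HEAD_OFFSET = 0x50              # per the `cmp %rbp,0x50(%rbp)` instruction

faulting_address = NETFS_FOLIO_COPY_TO_CACHE + WB_HEAD_OFFSET
print(hex(faulting_address))  # 0x3a6, matching the oops
```

So the crash really does look like the netfs sentinel leaking into folio private data rather than a corrupted `folio` pointer.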