On Fri, Oct 25, 2019 at 11:20:47AM -0400, J. Bruce Fields wrote:
> On Fri, Oct 25, 2019 at 12:22:36PM +1100, NeilBrown wrote:
> > I have a coredump from a machine that was running as an NFS server.
> > nfs4_laundromat was trying to expire a client, and in particular was
> > cleaning up the ->cl_openowners.
> > As there were 6.5 million of these, it took rather longer than the
> > softlockup timer thought was acceptable, and hence the core dump.
> >
> > Those open owners that I looked at had empty so_stateids lists, so I
> > would normally expect them to be on the close_lru and to be removed
> > fairly soon. But they weren't (only 32 openowners on close_lru).
> >
> > The only explanation I can think of for this is that maybe an OPEN
> > request successfully got through nfs4_process_open1(), thus creating an
> > open owner, but failed to get to or through nfs4_process_open2(), and
> > so didn't add a stateid. I *think* this can leave an openowner that is
> > unused but will never be cleaned up (until the client is expired, which
> > might be too late).
> >
> > Is this possible? If so, how should we handle those openowners which
> > never had a stateid?
> > In 3.0 (which is the kernel where I saw this) I could probably just put
> > the openowner on the close_lru when it is created.
> > In more recent kernels, it seems to be assumed that openowners are only
> > on close_lru if they have an oo_last_closed_stid. Would we need a
> > separate "never used lru", or should they just be destroyed as soon as
> > the open fails?
>
> Hopefully we can just throw the new openowner away when the open fails.
>
> But it looks like the new openowner is visible on global data structures
> by then, so we need to be sure somebody else isn't about to use it.

But, also, if this has only been seen on 3.0, it may have been fixed
already. It sounds like kind of a familiar problem, but I didn't spot a
relevant commit on a quick look through the logs.

--b.