uncollected nfsd open owners

NeilBrown <neilb@xxxxxxx> · Fri, 25 Oct 2019 12:22:36 +1100

Hi,
 I have a coredump from a machine that was running as an NFS server.
 nfs4_laundromat was trying to expire a client, and in particular was
 cleaning up the ->cl_openowners.
 As there were 6.5 million of these, it took rather longer than the
 softlockup timer thought was acceptable, and hence the core dump.

 Those open owners that I looked at had empty so_stateids lists, so I
 would normally expect them to be on the close_lru and to be removed
 fairly soon.  But they weren't (only 32 openowners on close_lru).

 The only explanation I can think of for this is that maybe an OPEN
 request successfully got through nfs4_process_open1(), thus creating an
 open owner, but failed to get to or through nfs4_process_open2(), and
 so didn't add a stateid.  I *think* this can leave an openowner that is
 unused but will never be cleaned up (until the client is expired, which
 might be too late).

 Is this possible?  If so, how should we handle those openowners which
 never had a stateid?
 In 3.0 (which is the kernel where I saw this) I could probably just put
 the openowner on the close_lru when it is created.
 In more recent kernels, it seems to be assumed that openowners are only
 on close_lru if they have an oo_last_closed_stid.  Would we need a
 separate "never used lru", or should they just be destroyed as soon as
 the open fails?
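
 To make the suspected leak concrete, here is a minimal userspace sketch,
 not nfsd code.  Every model_* name is hypothetical and only echoes the
 roles of nfs4_process_open1(), nfs4_process_open2(), cl_openowners and
 so_stateids.  It assumes each failing OPEN arrives with a fresh owner
 (real nfsd would look up an existing owner by name first), and it shows
 the "destroy as soon as the open fails" option next to the leaking path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model types; they are NOT the kernel structures. */
struct model_owner {
    struct model_owner *next;   /* link on the client's owner list */
    int nr_stateids;            /* stands in for so_stateids */
};

struct model_client {
    struct model_owner *owners; /* stands in for cl_openowners */
    size_t nr_owners;
};

/* Step 1 of an OPEN: create the owner and link it to the client. */
static struct model_owner *model_open1(struct model_client *clp)
{
    struct model_owner *oo = calloc(1, sizeof(*oo));

    if (!oo)
        return NULL;
    oo->next = clp->owners;
    clp->owners = oo;
    clp->nr_owners++;
    return oo;
}

/* Step 2: attach a stateid, unless the open fails before that point. */
static bool model_open2(struct model_owner *oo, bool fail)
{
    assert(oo != NULL);
    if (fail)
        return false;
    oo->nr_stateids++;
    return true;
}

/* Unlink and free an owner, but only if it never got a stateid. */
static void model_release_unused(struct model_client *clp,
                                 struct model_owner *oo)
{
    struct model_owner **pp;

    if (oo->nr_stateids)
        return;
    for (pp = &clp->owners; *pp; pp = &(*pp)->next) {
        if (*pp == oo) {
            *pp = oo->next;
            clp->nr_owners--;
            free(oo);
            return;
        }
    }
}

/* One OPEN.  With cleanup == false a failed open leaves its owner behind. */
static bool model_open(struct model_client *clp, bool fail, bool cleanup)
{
    struct model_owner *oo = model_open1(clp);

    if (!oo)
        return false;
    if (model_open2(oo, fail))
        return true;
    if (cleanup)
        model_release_unused(clp, oo);
    return false;
}
```

 Calling model_open(&clp, true, false) in a loop grows nr_owners by one
 per call, which is the pattern I suspect; passing cleanup == true keeps
 the list empty after the same failures.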

 Also, should we put a cond_resched() in some or all of those loops in
 __destroy_client()?
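
 The shape I have in mind is roughly the following, again as a userspace
 sketch rather than a patch.  The demo_* names and the batch size are
 hypothetical, and sched_yield() merely stands in for cond_resched().  In
 the kernel the reschedule point would also have to sit outside any
 spinlock, since cond_resched() must not be called from atomic context.

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical list node standing in for an open owner. */
struct demo_node {
    struct demo_node *next;
};

/* Build a list of up to n nodes; stops early if allocation fails. */
static struct demo_node *demo_build(size_t n)
{
    struct demo_node *head = NULL;

    while (n--) {
        struct demo_node *node = calloc(1, sizeof(*node));

        if (!node)
            break;
        node->next = head;
        head = node;
    }
    return head;
}

/*
 * Free every node, yielding the CPU after each full batch so that a very
 * long list cannot monopolise it.  Returns the number of yields made.
 */
static size_t demo_destroy_all(struct demo_node **head, size_t batch)
{
    size_t freed = 0, yields = 0;

    assert(batch > 0);
    while (*head) {
        struct demo_node *node = *head;

        *head = node->next;
        free(node);
        if (++freed % batch == 0) {
            sched_yield();  /* userspace stand-in for cond_resched() */
            yields++;
        }
    }
    return yields;
}
```

 With 6.5 million entries even a generous batch size gives the scheduler
 thousands of chances to run something else, which should be enough to
 keep the softlockup detector quiet.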

Thanks for your help,
NeilBrown