Fwd: Possible memory leak on nfsd

Chuck Lever <chuck.lever@xxxxxxxxxx> · Thu, 12 Dec 2024 11:15:33 -0500

Hi -

An NFSD page allocation on v6.1.y is triggering the OOM killer. The reporter
has provided a lot of detail, and we need some help narrowing down the
likely source of the leak. Any takers?

(We've asked the reporter to reproduce on a more recent kernel if
possible).

-------- Forwarded Message --------
Subject: Re: Possible memory leak on nfsd
Date: Thu, 12 Dec 2024 16:00:17 +0000
From: Chuck Lever via Bugspray Bot <bugbot@xxxxxxxxxx>
To: jlayton@xxxxxxxxxx, linux-nfs@xxxxxxxxxxxxxxx, trondmy@xxxxxxxxxx, 
cel@xxxxxxxxxx, anna@xxxxxxxxxx

Chuck Lever writes via Kernel.org Bugzilla:

From attachment 307290:

[29924.805968] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/user.slice/user-0.slice/user@0.service/init.scope,task=(sd-pam),pid=4503,uid=0
[29924.805991] Out of memory: Killed process 4503 ((sd-pam)) total-vm:173972kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:96kB oom_score_adj:100
[29925.425864] nfsd invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
[29925.425872] CPU: 0 PID: 1874 Comm: nfsd Kdump: loaded Tainted: G       E      6.1.119-1.el9.elrepo.x86_64 #1
[29925.425875] Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 2.22.2 09/12/2024
[29925.425877] Call Trace:
[29925.425880]  <TASK>
[29925.425885]  dump_stack_lvl+0x45/0x5e
[29925.425893]  dump_header+0x4a/0x213
[29925.425897]  oom_kill_process.cold+0xb/0x10
[29925.425901]  out_of_memory+0xed/0x2e0
[29925.425906]  __alloc_pages_slowpath.constprop.0+0x707/0x9d0
[29925.425916]  __alloc_pages+0x35d/0x370
[29925.425921]  __alloc_pages_bulk+0x3e5/0x680
[29925.425927]  svc_alloc_arg+0x81/0x1f0 [sunrpc]
[29925.425991]  svc_recv+0x1f/0x190 [sunrpc]
[29925.426043]  ? nfsd_inet6addr_event+0x110/0x110 [nfsd]
[29925.426080]  nfsd+0x87/0xc0 [nfsd]
[29925.426113]  kthread+0xe5/0x110
[29925.426118]  ? kthread_complete_and_exit+0x20/0x20
[29925.426122]  ret_from_fork+0x1f/0x30
[29925.426129]  </TASK>

NFSD is triggering the OOM killer because it frequently allocates up to 
256 pages at a time to fill the send and receive buffers. It is not 
necessarily the source of a leak.

The bulk page allocator is on the slow path here, suggesting there 
weren't any free pages available on the lists it normally checks first. 
So it is doing one-at-a-time order-0 allocations, a sign that memory is 
short.
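
For anyone triaging similar reports, here is a minimal sketch of pulling
the allocation context (task name, gfp_mask, order) out of the
"invoked oom-killer" header line. The regular expression and the helper
name parse_oom_header are my own for illustration, not part of any
kernel tooling, and the pattern assumes a task name without spaces.

```python
import re

# Pattern for the "<comm> invoked oom-killer: ..." header line.
# Assumption: the task name (comm) contains no whitespace.
OOM_HEADER = re.compile(
    r"(?P<comm>\S+) invoked oom-killer: "
    r"gfp_mask=(?P<mask>0x[0-9a-fA-F]+)\((?P<flags>[^)]*)\), "
    r"order=(?P<order>\d+)"
)

def parse_oom_header(line):
    """Return (comm, gfp_mask, flag_names, order), or None if no match."""
    m = OOM_HEADER.search(line)
    if m is None:
        return None
    return (m["comm"], int(m["mask"], 16), m["flags"], int(m["order"]))

# Header line from the report above, joined onto a single line.
LINE = ("[29925.425864] nfsd invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), "
        "order=0, oom_score_adj=0")

comm, mask, flags, order = parse_oom_header(LINE)
print(comm, hex(mask), flags, order)
```

An order of 0 together with GFP_KERNEL is what you would expect from
the single-page fallback described above.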

Node 1 appears to be short on free memory, but the system has not 
pushed into swap at all. If user-space anonymous memory were growing, 
we would expect to see swap in use. Kernel memory isn't swappable, so 
whatever is leaking is in the kernel proper.
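
To watch whether one node keeps losing free memory over time, something
like the following sketch can be run against the per-node meminfo text
(on a live system, /sys/devices/system/node/node1/meminfo). The sample
values below are made up for illustration and are NOT from the bug
report; the helper names are hypothetical too.

```python
import re

# Per-node lines look like "Node 1 MemFree:   123456 kB".
# Lines without a kB unit (e.g. HugePages counters) are skipped.
NODE_LINE = re.compile(r"Node\s+(\d+)\s+([^:]+):\s+(\d+)\s*kB")

def parse_node_meminfo(text):
    """Map field name -> value in kB for one node's meminfo text."""
    fields = {}
    for line in text.splitlines():
        m = NODE_LINE.match(line.strip())
        if m:
            fields[m.group(2)] = int(m.group(3))
    return fields

def free_fraction(fields):
    """Fraction of the node's memory that is currently free."""
    return fields["MemFree"] / fields["MemTotal"]

# Hypothetical sample, not taken from the report.
SAMPLE = """\
Node 1 MemTotal:       1000000 kB
Node 1 MemFree:          25000 kB
Node 1 MemUsed:         975000 kB
"""

fields = parse_node_meminfo(SAMPLE)
print(f"Node 1 free fraction: {free_fraction(fields):.3f}")
```

Sampling this periodically alongside swap usage would show whether free
memory on the node shrinks while swap stays untouched.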

The slab caches all look reasonably sized, so a slab leak is unlikely.
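
As a rough way to sanity-check slab sizes, here is a sketch that
estimates each cache's footprint from /proc/slabinfo-style text as
num_slabs * pagesperslab * page size. It assumes the version 2.x column
layout and 4 KiB pages (x86_64); the sample rows are invented for
illustration and are NOT the reporter's data.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB base pages, as on x86_64

def slab_footprints(text, page_size=PAGE_SIZE):
    """Return {cache_name: bytes}, estimated per cache line."""
    sizes = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith(("slabinfo", "#")):
            continue  # skip blank lines, the version line, the header
        parts = line.split()
        try:
            pages_per_slab = int(parts[5])
            num_slabs = int(parts[parts.index("slabdata") + 2])
        except (ValueError, IndexError):
            continue  # not a cache line in the expected layout
        sizes[parts[0]] = num_slabs * pages_per_slab * page_size
    return sizes

def largest(sizes, n=3):
    """The n caches with the biggest estimated footprint."""
    return sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Hypothetical sample, not taken from the report.
SAMPLE = """\
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
dentry       100000 105000  192 21 1 : tunables 0 0 0 : slabdata 5000 5000 0
kmalloc-4k     1000   1024 4096  8 8 : tunables 0 0 0 : slabdata  128  128 0
"""

for name, size in largest(slab_footprints(SAMPLE)):
    print(f"{name}: {size / 2**20:.1f} MiB")
```

If the total across all caches is small relative to the memory that has
gone missing, the leak is probably in pages allocated outside the slab
allocator.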

At this point we would want someone with some MM expertise to come in 
and help us nail down the leak.

View: https://bugzilla.kernel.org/show_bug.cgi?id=219535#c13
You can reply to this message to join the discussion.

--
Deet-doot-dot, I am a bot.
Kernel.org Bugzilla (bugspray 0.1-dev)