Hi -
An NFSD page allocation on v6.1.y is triggering the OOM killer. The
reporter has provided a lot of detail, and we need some help steering us
toward the likely leak culprit. Any takers?
(We've asked the reporter to reproduce on a more recent kernel, if
possible.)
-------- Forwarded Message --------
Subject: Re: Possible memory leak on nfsd
Date: Thu, 12 Dec 2024 16:00:17 +0000
From: Chuck Lever via Bugspray Bot <bugbot@xxxxxxxxxx>
To: jlayton@xxxxxxxxxx, linux-nfs@xxxxxxxxxxxxxxx, trondmy@xxxxxxxxxx,
cel@xxxxxxxxxx, anna@xxxxxxxxxx
Chuck Lever writes via Kernel.org Bugzilla:
From attachment 307290:
[29924.805968] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/user.slice/user-0.slice/user@0.service/init.scope,task=(sd-pam),pid=4503,uid=0
[29924.805991] Out of memory: Killed process 4503 ((sd-pam)) total-vm:173972kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:96kB oom_score_adj:100
[29925.425864] nfsd invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
[29925.425872] CPU: 0 PID: 1874 Comm: nfsd Kdump: loaded Tainted: G E 6.1.119-1.el9.elrepo.x86_64 #1
[29925.425875] Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 2.22.2 09/12/2024
[29925.425877] Call Trace:
[29925.425880] <TASK>
[29925.425885] dump_stack_lvl+0x45/0x5e
[29925.425893] dump_header+0x4a/0x213
[29925.425897] oom_kill_process.cold+0xb/0x10
[29925.425901] out_of_memory+0xed/0x2e0
[29925.425906] __alloc_pages_slowpath.constprop.0+0x707/0x9d0
[29925.425916] __alloc_pages+0x35d/0x370
[29925.425921] __alloc_pages_bulk+0x3e5/0x680
[29925.425927] svc_alloc_arg+0x81/0x1f0 [sunrpc]
[29925.425991] svc_recv+0x1f/0x190 [sunrpc]
[29925.426043] ? nfsd_inet6addr_event+0x110/0x110 [nfsd]
[29925.426080] nfsd+0x87/0xc0 [nfsd]
[29925.426113] kthread+0xe5/0x110
[29925.426118] ? kthread_complete_and_exit+0x20/0x20
[29925.426122] ret_from_fork+0x1f/0x30
[29925.426129] </TASK>
NFSD is triggering the OOM killer because it frequently allocates up to
256 pages at a time to fill the send and receive buffers. It is not
necessarily the source of a leak.
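For a rough sense of scale, the arithmetic behind that 256-page figure
can be sketched as below. The 4 KiB page size is an assumption (the
usual x86_64 base page size), and the thread counts are hypothetical
inputs, not values taken from the report.

```python
# Rough upper bound on the pages nfsd threads hold in their buffers.
PAGE_SIZE = 4096          # assumption: 4 KiB base pages on x86_64
PAGES_PER_THREAD = 256    # upper bound quoted in the comment above


def buffer_bytes(nr_threads: int, pages: int = PAGES_PER_THREAD,
                 page_size: int = PAGE_SIZE) -> int:
    """Return the most bytes nr_threads nfsd threads could hold."""
    if nr_threads < 0:
        raise ValueError("nr_threads must be non-negative")
    return nr_threads * pages * page_size


if __name__ == "__main__":
    # Hypothetical thread counts, for illustration only.
    for threads in (8, 64, 512):
        mib = buffer_bytes(threads) / (1024 * 1024)
        print(f"{threads:4d} threads -> at most {mib:.0f} MiB")
```

At these sizes the buffers alone seem unlikely to exhaust memory on a
two-node server, which fits the point that NFSD is where the shortage
shows up rather than necessarily its cause.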
The bulk page allocator is on the slow path here, which suggests no
free pages were available on the lists it normally checks first. It has
fallen back to one-at-a-time order-0 allocations, a sign that memory is
short.
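The gfp_mask in the trace is consistent with that reading: GFP_KERNEL
permits direct reclaim, so the allocation can enter the slow path and,
if reclaim fails, call out_of_memory(). A minimal sketch that decodes
the mask, assuming the v6.1 flag bit values listed in the comments
(they are not taken from the report and differ between kernel
versions):

```python
# Bit values as we understand them from the v6.1 GFP flag definitions.
# Treat them as assumptions and check them against the tree in use.
GFP_BITS = {
    0x40: "__GFP_IO",
    0x80: "__GFP_FS",
    0x400: "__GFP_DIRECT_RECLAIM",
    0x800: "__GFP_KSWAPD_RECLAIM",
}


def decode_gfp(mask: int) -> list[str]:
    """Name the known flag bits set in mask, lowest bit first."""
    names = [name for bit, name in sorted(GFP_BITS.items()) if mask & bit]
    known = sum(bit for bit in GFP_BITS if mask & bit)
    leftover = mask & ~known
    if leftover:
        # Bits this small table does not cover are reported, not guessed.
        names.append(f"unknown:{leftover:#x}")
    return names


if __name__ == "__main__":
    # 0xcc0 is the mask printed in the OOM report above.
    print(", ".join(decode_gfp(0xCC0)))
```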
Node 1 appears to be short on free memory, yet the system has not
pushed into swap at all. Kernel memory isn't swappable, so whatever is
leaking is in the kernel proper.
The slab caches all look reasonably sized, so a slab leak is unlikely.
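One way to put a number on memory that is neither free, swappable, nor
in slab is to subtract the accounted fields in a meminfo dump from
MemTotal. The sketch below is a heuristic only: the choice of fields in
ACCOUNTED is our assumption, and the sample values are made up for
illustration, not taken from the reporter's attachments. It accepts
both the /proc/meminfo format and the per-node
/sys/devices/system/node/nodeN/meminfo format.

```python
# Heuristic: fields treated as "accounted for" (an assumption, not a
# kernel-defined set). Whatever remains was allocated some other way.
ACCOUNTED = ("MemFree", "Buffers", "Cached", "SwapCached", "AnonPages",
             "Slab", "KernelStack", "PageTables", "Percpu", "Hugetlb")


def parse_meminfo(text: str) -> dict[str, int]:
    """Map each field name to its numeric value (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep or not name.split() or not rest.split():
            continue
        try:
            # Per-node files prefix names with "Node N"; keep the last word.
            fields[name.split()[-1]] = int(rest.split()[0])
        except ValueError:
            continue
    return fields


def unaccounted_kb(fields: dict[str, int]) -> int:
    """MemTotal minus every ACCOUNTED field that is present."""
    return fields["MemTotal"] - sum(fields.get(k, 0) for k in ACCOUNTED)


if __name__ == "__main__":
    # Made-up sample in per-node format, for illustration only.
    sample = """\
Node 1 MemTotal:       1000 kB
Node 1 MemFree:         100 kB
Node 1 AnonPages:        50 kB
Node 1 Slab:             40 kB
"""
    print(unaccounted_kb(parse_meminfo(sample)), "kB unaccounted")
```

A large remainder alongside normal-looking slab would be consistent
with pages taken straight from the page allocator, though the number
says nothing about which caller took them.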
At this point we would want someone with some MM expertise to come in
and help us nail down the leak.
View: https://bugzilla.kernel.org/show_bug.cgi?id=219535#c13
You can reply to this message to join the discussion.
--
Deet-doot-dot, I am a bot.
Kernel.org Bugzilla (bugspray 0.1-dev)