Re: exec error: BUG: Bad rss-counter

ebiederm@xxxxxxxxxxxx (Eric W. Biederman) · Mon, 01 Mar 2021 14:43:51 -0600

Ilya Lipnitskiy <ilya.lipnitskiy@xxxxxxxxx> writes:

> Eric, All,
>
> The following error appears when running Linux 5.10.18 on an embedded
> MIPS mt7621 target:
> [    0.301219] BUG: Bad rss-counter state mm:(ptrval) type:MM_ANONPAGES val:1
>
> Being a very generic error, I started digging and added a stack dump
> before the BUG:
> Call Trace:
> [<80008094>] show_stack+0x30/0x100
> [<8033b238>] dump_stack+0xac/0xe8
> [<800285e8>] __mmdrop+0x98/0x1d0
> [<801a6de8>] free_bprm+0x44/0x118
> [<801a86a8>] kernel_execve+0x160/0x1d8
> [<800420f4>] call_usermodehelper_exec_async+0x114/0x194
> [<80003198>] ret_from_kernel_thread+0x14/0x1c
>
> So that's how I got to looking at fs/exec.c and noticed quite a few
> changes last year. Turns out this message only occurs once very early
> at boot during the very first call to kernel_execve. current->mm is
> NULL at this stage, so acct_arg_size() is effectively a no-op.

If you believe this is a new error you could bisect the kernel
to see which change introduced the behavior you are seeing.

> More digging, and I traced the RSS counter increment to:
> [<8015adb4>] add_mm_counter_fast+0xb4/0xc0
> [<80160d58>] handle_mm_fault+0x6e4/0xea0
> [<80158aa4>] __get_user_pages.part.78+0x190/0x37c
> [<8015992c>] __get_user_pages_remote+0x128/0x360
> [<801a6d9c>] get_arg_page+0x34/0xa0
> [<801a7394>] copy_string_kernel+0x194/0x2a4
> [<801a880c>] kernel_execve+0x11c/0x298
> [<800420f4>] call_usermodehelper_exec_async+0x114/0x194
> [<80003198>] ret_from_kernel_thread+0x14/0x1c
>
> In fact, I also checked vma_pages(bprm->vma) and lo and behold it is set to 1.
>
> How is fs/exec.c supposed to handle implied RSS increments that happen
> due to page faults when discarding the bprm structure? In this case,
> the bug-generating kernel_execve call never succeeded, it returned -2,
> but I didn't trace exactly what failed.

Unless I am mistaken any left over pages should be purged by exit_mmap
which is called by mmput before mmput calls mmdrop.

AKA it looks very very fishy this happens and this does not look like
an execve error.

On the other hand it would be good to know why kernel_execve is failing.
Then the error handling paths could be scrutinized, and we can check to
see if everything that should happen on an error path does.

> Interestingly, this "BUG:" message is timing-dependent. If I wait a
> bit before calling free_bprm after bprm_execve the message seems to go
> away (there are 3 other cores running and calling into kernel_execve
> at the same time, so there is that). The error also only ever happens
> once (probably because no more page faults happen?).
>
> I don't know enough to propose a proper fix here. Is it decrementing
> the bprm->mm RSS counter to account for that page fault? Or is
> current->mm being NULL a bigger problem?

This is call_usermode_helper calls kernel_execve from a kernel thread
forked by kthreadd.  Which means current->mm == NULL is expected, and
current->active_mm == &init_mm.

Similarly I bprm->mm having an incremented RSS counter appears correct.

The question is why doesn't that count get consistently cleaned up.

> Apologies in advance, but I have looked hard and do not see a clear
> resolution for this even in the latest kernel code.

I may be blind but I see two possibilities.

1) There is a memory stomp that happens early on and bad timing causes
   the memory stomp to result in an elevated rss count.

2) There is a buggy error handling path, and whatever failure you are
    running into that early in boot walks through that buggy failure
    path.

I don't think this is a widespread issue or yours would not be the first
report like this I have seen.

The two productive paths I can see for tracing down your problem are:
1) git bisect (assuming you have a known good version)
2) Figuring out what exec failed.

I really think exec_mmap should have cleaned up anything in the mm.  So
the fact that it doesn't worries me.

Eric