Re: exec error: BUG: Bad rss-counter

Zhou Yanjie <zhouyanjie@xxxxxxxxxxxxxx> · Tue, 30 Mar 2021 12:56:07 +0800

Hi Ilya,

On 2021/3/29 上午10:48, Ilya Lipnitskiy wrote:
On Sat, Mar 20, 2021 at 8:59 AM Zhou Yanjie <zhouyanjie@xxxxxxxxxxxxxx> wrote:
Hi Ilya,

On 2021/3/3 下午11:55, Ilya Lipnitskiy wrote:
On Wed, Mar 3, 2021 at 7:50 AM Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote:
Ilya Lipnitskiy <ilya.lipnitskiy@xxxxxxxxx> writes:

On Tue, Mar 2, 2021 at 11:37 AM Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote:
Ilya Lipnitskiy <ilya.lipnitskiy@xxxxxxxxx> writes:

On Mon, Mar 1, 2021 at 12:43 PM Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote:
Ilya Lipnitskiy <ilya.lipnitskiy@xxxxxxxxx> writes:

Eric, All,

The following error appears when running Linux 5.10.18 on an embedded
MIPS mt7621 target:
[    0.301219] BUG: Bad rss-counter state mm:(ptrval) type:MM_ANONPAGES val:1

Being a very generic error, I started digging and added a stack dump
before the BUG:
Call Trace:
[<80008094>] show_stack+0x30/0x100
[<8033b238>] dump_stack+0xac/0xe8
[<800285e8>] __mmdrop+0x98/0x1d0
[<801a6de8>] free_bprm+0x44/0x118
[<801a86a8>] kernel_execve+0x160/0x1d8
[<800420f4>] call_usermodehelper_exec_async+0x114/0x194
[<80003198>] ret_from_kernel_thread+0x14/0x1c

So that's how I got to looking at fs/exec.c and noticed quite a few
changes last year. Turns out this message only occurs once very early
at boot during the very first call to kernel_execve. current->mm is
NULL at this stage, so acct_arg_size() is effectively a no-op.
If you believe this is a new error you could bisect the kernel
to see which change introduced the behavior you are seeing.

More digging, and I traced the RSS counter increment to:
[<8015adb4>] add_mm_counter_fast+0xb4/0xc0
[<80160d58>] handle_mm_fault+0x6e4/0xea0
[<80158aa4>] __get_user_pages.part.78+0x190/0x37c
[<8015992c>] __get_user_pages_remote+0x128/0x360
[<801a6d9c>] get_arg_page+0x34/0xa0
[<801a7394>] copy_string_kernel+0x194/0x2a4
[<801a880c>] kernel_execve+0x11c/0x298
[<800420f4>] call_usermodehelper_exec_async+0x114/0x194
[<80003198>] ret_from_kernel_thread+0x14/0x1c

In fact, I also checked vma_pages(bprm->vma) and lo and behold it is set to 1.

How is fs/exec.c supposed to handle implied RSS increments that happen
due to page faults when discarding the bprm structure? In this case,
the bug-generating kernel_execve call never succeeded, it returned -2,
but I didn't trace exactly what failed.
Unless I am mistaken any left over pages should be purged by exit_mmap
which is called by mmput before mmput calls mmdrop.
Good to know. Some more digging and I can say that we hit this error
when trying to unmap PFN 0 (is_zero_pfn(pfn) returns TRUE,
vm_normal_page returns NULL, zap_pte_range does not decrement
MM_ANONPAGES RSS counter). Is my understanding correct that PFN 0 is
usable, but special? Or am I totally off the mark here?
It would be good to know if that is the page that get_user_pages_remote
returned to copy_string_kernel.  The zero page that is always zero,
should never be returned when a writable mapping is desired.
Indeed, pfn 0 is returned from get_arg_page: (page is 0x809cf000,
page_to_pfn(page) is 0) and it is the same page that is being freed and not
refcounted in mmput/zap_pte_range. Confirmed with good old printk. Also,
ZERO_PAGE(0)==0x809fc000 -> PFN 5120.

I think I have found the problem though, after much digging and thanks to all
the information provided. init_zero_pfn() gets called too late (after
the call to
is_zero_pfn(0) from mmput returns true), until then zero_pfn == 0, and after,
zero_pfn == 5120. Boom.

So PFN 0 is special, but only for a little bit, enough for something
on my system
to call kernel_execve :)

Question: is my system not supposed to be calling kernel_execve this
early or does
init_zero_pfn() need to happen earlier? init_zero_pfn is currently a
core_initcall.
Looking quickly it seems that init_zero_pfn() is in mm/memory.c and is
common for both mips and x86.  Further it appears init_zero_pfn() has
been that was since 2009 a13ea5b75964 ("mm: reinstate ZERO_PAGE").

Given the testing that x86 gets and that nothing like this has been
reported it looks like whatever driver is triggering the kernel_execve
is doing something wrong.
Because honestly.  If the zero page isn't working there is not a chance
that anything in userspace is working so it is clearly much too early.

I suspect there is some driver that is initialized very early that is
doing something that looks innocuous (like triggering a hotplug event)
and that happens to cause a call_usermode_helper which then calls
kernel_execve.
I will investigate the offenders more closely. However, I do not
notice this behavior on the same system based on the 5.4 kernel. Is it

I also encountered this problem on Ingenic X1000 and X1830. This is the
printed information:

[    0.120715] BUG: Bad rss-counter state mm:(ptrval)
   type:MM_ANONPAGES val:1

I tested kernel 5.9, kernel 5.10, kernel 5.11, and kernel 5.12, only
kernel 5.9 did not have this problem, so we can know that this problem
was introduced in kernel 5.10, have you found any effective solution?
Try:

diff --git a/mm/memory.c b/mm/memory.c
index c8e357627318..1fd753245369 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -166,7 +166,7 @@ static int __init init_zero_pfn(void)
         zero_pfn = page_to_pfn(ZERO_PAGE(0));
         return 0;
  }
-core_initcall(init_zero_pfn);
+early_initcall(init_zero_pfn);


It works, thanks!


Best regards!


  void mm_trace_rss_stat(struct mm_struct *mm, int member, long count)
  {