On Wed, Mar 3, 2021 at 7:50 AM Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote: > > Ilya Lipnitskiy <ilya.lipnitskiy@xxxxxxxxx> writes: > > > On Tue, Mar 2, 2021 at 11:37 AM Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote: > >> > >> Ilya Lipnitskiy <ilya.lipnitskiy@xxxxxxxxx> writes: > >> > >> > On Mon, Mar 1, 2021 at 12:43 PM Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote: > >> >> > >> >> Ilya Lipnitskiy <ilya.lipnitskiy@xxxxxxxxx> writes: > >> >> > >> >> > Eric, All, > >> >> > > >> >> > The following error appears when running Linux 5.10.18 on an embedded > >> >> > MIPS mt7621 target: > >> >> > [ 0.301219] BUG: Bad rss-counter state mm:(ptrval) type:MM_ANONPAGES val:1 > >> >> > > >> >> > Being a very generic error, I started digging and added a stack dump > >> >> > before the BUG: > >> >> > Call Trace: > >> >> > [<80008094>] show_stack+0x30/0x100 > >> >> > [<8033b238>] dump_stack+0xac/0xe8 > >> >> > [<800285e8>] __mmdrop+0x98/0x1d0 > >> >> > [<801a6de8>] free_bprm+0x44/0x118 > >> >> > [<801a86a8>] kernel_execve+0x160/0x1d8 > >> >> > [<800420f4>] call_usermodehelper_exec_async+0x114/0x194 > >> >> > [<80003198>] ret_from_kernel_thread+0x14/0x1c > >> >> > > >> >> > So that's how I got to looking at fs/exec.c and noticed quite a few > >> >> > changes last year. Turns out this message only occurs once very early > >> >> > at boot during the very first call to kernel_execve. current->mm is > >> >> > NULL at this stage, so acct_arg_size() is effectively a no-op. > >> >> > >> >> If you believe this is a new error you could bisect the kernel > >> >> to see which change introduced the behavior you are seeing. > >> >> > >> >> > More digging, and I traced the RSS counter increment to: > >> >> > [<8015adb4>] add_mm_counter_fast+0xb4/0xc0 > >> >> > [<80160d58>] handle_mm_fault+0x6e4/0xea0 > >> >> > [<80158aa4>] __get_user_pages.part.78+0x190/0x37c > >> >> > [<8015992c>] __get_user_pages_remote+0x128/0x360 > >> >> > [<801a6d9c>] get_arg_page+0x34/0xa0 > >> >> > [<801a7394>] copy_string_kernel+0x194/0x2a4 > >> >> > [<801a880c>] kernel_execve+0x11c/0x298 > >> >> > [<800420f4>] call_usermodehelper_exec_async+0x114/0x194 > >> >> > [<80003198>] ret_from_kernel_thread+0x14/0x1c > >> >> > > >> >> > In fact, I also checked vma_pages(bprm->vma) and lo and behold it is set to 1. > >> >> > > >> >> > How is fs/exec.c supposed to handle implied RSS increments that happen > >> >> > due to page faults when discarding the bprm structure? In this case, > >> >> > the bug-generating kernel_execve call never succeeded, it returned -2, > >> >> > but I didn't trace exactly what failed. > >> >> > >> >> Unless I am mistaken any left over pages should be purged by exit_mmap > >> >> which is called by mmput before mmput calls mmdrop. > >> > Good to know. Some more digging and I can say that we hit this error > >> > when trying to unmap PFN 0 (is_zero_pfn(pfn) returns TRUE, > >> > vm_normal_page returns NULL, zap_pte_range does not decrement > >> > MM_ANONPAGES RSS counter). Is my understanding correct that PFN 0 is > >> > usable, but special? Or am I totally off the mark here? > >> > >> It would be good to know if that is the page that get_user_pages_remote > >> returned to copy_string_kernel. The zero page that is always zero, > >> should never be returned when a writable mapping is desired. > > > > Indeed, pfn 0 is returned from get_arg_page: (page is 0x809cf000, > > page_to_pfn(page) is 0) and it is the same page that is being freed and not > > refcounted in mmput/zap_pte_range. Confirmed with good old printk. Also, > > ZERO_PAGE(0)==0x809fc000 -> PFN 5120. > > > > I think I have found the problem though, after much digging and thanks to all > > the information provided. init_zero_pfn() gets called too late (after > > the call to > > is_zero_pfn(0) from mmput returns true), until then zero_pfn == 0, and after, > > zero_pfn == 5120. Boom. > > > > So PFN 0 is special, but only for a little bit, enough for something > > on my system > > to call kernel_execve :) > > > > Question: is my system not supposed to be calling kernel_execve this > > early or does > > init_zero_pfn() need to happen earlier? init_zero_pfn is currently a > > core_initcall. > > Looking quickly it seems that init_zero_pfn() is in mm/memory.c and is > common for both mips and x86. Further it appears init_zero_pfn() has > been that was since 2009 a13ea5b75964 ("mm: reinstate ZERO_PAGE"). > > Given the testing that x86 gets and that nothing like this has been > reported it looks like whatever driver is triggering the kernel_execve > is doing something wrong. > > Because honestly. If the zero page isn't working there is not a chance > that anything in userspace is working so it is clearly much too early. > > I suspect there is some driver that is initialized very early that is > doing something that looks innocuous (like triggering a hotplug event) > and that happens to cause a call_usermode_helper which then calls > kernel_execve. I will investigate the offenders more closely. However, I do not notice this behavior on the same system based on the 5.4 kernel. Is it possible that last year's exec changes have exposed this issue? Not blaming exec at all, just making sure I understand the problem better. Ilya