Re: [PATCH v8 23/70] mm/mmap: change do_brk_flags() to expand existing VMA and add do_brk_munmap()

Juergen Gross <jgross@xxxxxxxx> · Mon, 2 May 2022 09:08:16 +0200

On 02.05.22 02:14, Liam Howlett wrote:
* Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> [220428 21:16]:
On Fri, 29 Apr 2022 00:38:50 +0000 Liam Howlett <liam.howlett@xxxxxxxxxx> wrote:

mm/mmap.c: In function 'do_brk_flags':
mm/mmap.c:2908:17: error: implicit declaration of function
	'khugepaged_enter_vma_merge'; did you mean 'khugepaged_enter_vma'?

It appears that this is later fixed, but it hurts bisectability
(and prevents me from finding the actual build failure in linux-next
when trying to build corenet64_smp_defconfig).

Yeah, that khugepaged_enter_vma_merge was renamed in another patch set.
Andrew made the correction but kept the patch as it was.  I think the
suggested change is right.. if you read the commit that introduced
khugepaged_enter_vma(), it seems right at least.

Things are a bit crazy lately.  Merge issues with mapletree, merge
issues with mglru on mapletree, me doing a bunch of retooling to start
publishing/merging via git, mapletree runtime issues, etc.

I've dropped the mapletree patches again.  Please scoop up all known
fixes and redo against the (non-rebasing) mm-stable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Okay, sounds good.

I have been porting my patches over and hit a bit of a snag. It looked
like my patches were not booting on the s390 - but not all the time. So
I reverted back to mm-stable (059342d1dd4e) and found that also failed
to boot sometimes on my qemu setup.  When it fails it's ~4-5sec into
booting.  The last thing I see is:

"[    4.668916] Spectre V2 mitigation: execute trampolines"

I've bisected back to commit e553f62f10d9 (mm, page_alloc: fix
build_zonerefs_node())

With this commit, I am unable to boot one out of three times.  When
using the previous commit I was not able to get it to hang after trying
10+ times.  This is a qemu s390 install with KASAN on and I see no error
messages.  I think it's likely it is this patch, but it's not guaranteed.

This sounds like a race condition during the setup of memory zones.

I could imagine my patch is triggering this problem, but it should
not be the real root cause.

I'm no expert regarding zone setup, but I think it might help to
print some zone data when the problem happens. I have no real idea
which data is needed, but maybe someone else can help here. The
following diff should detect the problematic case (it might show
false positives, though):

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0e42038382c1..23f029f39985 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6132,6 +6132,9 @@ static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
                zone_type--;
                zone = pgdat->node_zones + zone_type;
                if (populated_zone(zone)) {
+                       if (!managed_zone(zone)) {
+                               /* Print some data regarding the zone. */
+                       }
                        zoneref_set_zone(zone, &zonerefs[nr_zones++]);
                        check_highest_zone(zone_type);
                }
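
For illustration only, the condition the diff tests for can be modelled in
user space as below. This is a sketch under assumptions, not kernel code:
struct zone_model and the *_model/report_suspect_zone names are hypothetical
stand-ins, and the choice of fields to print is a guess at what might be
useful. It only mirrors the idea that populated_zone() looks at present
pages while managed_zone() looks at pages managed by the buddy allocator.

```c
#include <stdio.h>

/* Hypothetical user-space stand-in for the kernel's struct zone.
 * Only the fields relevant to the check are modelled. */
struct zone_model {
	const char *name;
	unsigned long zone_start_pfn;
	unsigned long spanned_pages;
	unsigned long present_pages;
	unsigned long managed_pages;
};

/* Models populated_zone(): the zone has physically present pages. */
static int populated_zone_model(const struct zone_model *z)
{
	return z->present_pages != 0;
}

/* Models managed_zone(): the buddy allocator manages at least one page. */
static int managed_zone_model(const struct zone_model *z)
{
	return z->managed_pages != 0;
}

/*
 * Writes a one-line report into buf when the zone is populated but not
 * managed, i.e. the case the diff above wants to catch.
 * Returns 1 if a report was written, 0 otherwise.
 */
static int report_suspect_zone(int nid, const struct zone_model *z,
			       char *buf, size_t len)
{
	if (!populated_zone_model(z) || managed_zone_model(z))
		return 0;

	snprintf(buf, len,
		 "node %d zone %s: start_pfn=%lu spanned=%lu present=%lu managed=%lu",
		 nid, z->name, z->zone_start_pfn, z->spanned_pages,
		 z->present_pages, z->managed_pages);
	return 1;
}
```

In the kernel itself the placeholder comment would presumably be filled with
a pr_info() call printing comparable fields of the real struct zone; which
fields are actually needed is still an open question.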


Juergen
