Re: Possible regression with file madvise(MADV_COLLAPSE)

Hi Avi,

On 10/10/24 1:54 AM, Avi Kivity wrote:
On Linux 6.10.10 with CONFIG_READ_ONLY_THP_FOR_FS=y,
madvise(MADV_COLLAPSE) on program text fails with EINVAL.

To reproduce, compile the reproducer with

clang -g -o text-hugepage text-hugepage.c \
	-fuse-ld=lld \
	-Wl,-zcommon-page-size=2097152 -Wl,-zmax-page-size=2097152 \
	-Wl,-z,separate-loadable-segments

and run:

$ strace -e trace=madvise ./text-hugepage
madvise(0x400000, 2097152, MADV_HUGEPAGE) = 0
madvise(0x400000, 2097152, MADV_POPULATE_READ) = 0
madvise(0x400000, 2097152, MADV_COLLAPSE) = -1 EINVAL (Invalid argument)

(the funky linker options are needed to make sure the .text vma spans a
hugepage).


I say "possible regression" since I haven't tried it with an older
kernel, but I believe it worked at some point or other seeing that
others managed to get it to work.

==== text-hugepage.c ====
#include <stdlib.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#include <sys/mman.h>

static
void
try_remap_text_segment() {
     FILE *fp = fopen("/proc/self/maps", "r");
     if (!fp) {
         return;
     }
     char *buf = NULL;
     size_t n = 0;
     while (getline(&buf, &n, fp) >= 0) {
         char *lstart = buf;
         char *lmid = strchr(lstart, '-');
         if (!lmid) {
             continue;
         }
         *lmid++ = '\0';
         char *lend = strchr(lmid, ' ');
         if (!lend) {
             continue;
         }
         *lend = '\0';
         size_t start = strtoul(lstart, NULL, 16);
         size_t end = strtoul(lmid, NULL, 16);
         uintptr_t some_text_addr = (uintptr_t)&try_remap_text_segment;
         if (some_text_addr >= start && some_text_addr < end) {
             end &= ~(uintptr_t)0x1fffff;
             madvise((void*)start, end - start, MADV_HUGEPAGE);
             madvise((void*)start, end - start, MADV_POPULATE_READ);
             madvise((void*)start, end - start, MADV_COLLAPSE);
             break;
         }
     }
     free(buf);
     fclose(fp);
}

void
huge_function() {
     // Make sure .text has a huge page full of stuff
     asm volatile (".fill 4000000, 1, 0x90");
}

int
main() {
     try_remap_text_segment();
}
==== end text-hugepage.c ====
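
The reproducer only rounds the end of the mapping down to a 2MB boundary and
relies on the linker options to align the start. As a rough sketch (not part
of the original reproducer), the range computation could be factored out so
that both ends are aligned explicitly. The names HPAGE_SIZE, struct hrange and
maps_line_to_hrange() are hypothetical, and the 2MB PMD size is an assumption
that only holds for 4KB base pages.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed PMD size: 2 MiB, which holds for 4 KiB base pages only. */
#define HPAGE_SIZE ((uintptr_t)2 << 20)

/* Hypothetical helper type: a huge-page-aligned [start, end) range. */
struct hrange {
    uintptr_t start;
    uintptr_t end;
};

/*
 * Parse the "start-end" prefix of one /proc/<pid>/maps line and shrink the
 * range to whole huge pages: start is rounded up, end is rounded down.
 * Returns 1 if at least one full huge page fits, 0 otherwise.
 */
int maps_line_to_hrange(const char *line, uintptr_t hsize, struct hrange *out)
{
    unsigned long long start, end;

    assert(hsize && (hsize & (hsize - 1)) == 0); /* power of two required */

    if (sscanf(line, "%llx-%llx", &start, &end) != 2)
        return 0;

    out->start = ((uintptr_t)start + hsize - 1) & ~(hsize - 1); /* round up */
    out->end = (uintptr_t)end & ~(hsize - 1);                   /* round down */
    return out->end > out->start;
}
```

With a range from this helper, the three madvise() calls would be issued on
(void *)r.start with length r.end - r.start, and skipped entirely when the
helper returns 0.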


I'm able to reproduce the issue with the upstream kernel (v6.12-rc2) on ARM64, where the
base page size is 4KB. I looked into the issue because of commit
d659b715e94a ("mm/huge_memory: avoid PMD-size page cache if needed"), which makes
madvise(MADV_COLLAPSE) return -EINVAL on ARM64 when the base page size is 64KB.

To reproduce the issue, I have to drop the clean page cache and recompile
the test program every time.

[root@dhcp-10-26-1-237 issue]# cat Makefile
default:
	@echo 1 > /proc/sys/vm/drop_caches
	@gcc test.c -o test
	./test
[root@dhcp-10-26-1-237 issue]# make
./test
test: test.c:54: try_remap_text_segment: Assertion `ret == 0' failed.      <<< Error from madvise(MADV_COLLAPSE)
make: *** [Makefile:4: default] Aborted (core dumped)

I traced it a bit and found that SCAN_FAIL is returned, as the following call trace indicates.
However, the program ("test") is opened read-only, so I don't understand how PG_dirty
gets set.

Backtrace
=========
sys_madvise
  do_madvise
    madvise_behavior_valid
    madvise_walk_vmas
      madvise_vma_behavior
        can_modify_vma_madv
        madvise_collapse
          thp_vma_allowable_order
          hpage_collapse_scan_file
            collapse_file
              folio_test_dirty          # SCAN_FAIL returned here
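
One way to cross-check the dirty state from userspace might be to look up the
page through /proc/self/pagemap and /proc/kpageflags. The sketch below is only
that: the helper names are hypothetical, the bit positions (PFN in pagemap
bits 0-54, present in bit 63, KPF_DIRTY as kpageflags bit 4) are taken from
the kernel's pagemap documentation, and root is needed to see real PFNs.
Whether the flag reported there matches what folio_test_dirty() sees in
collapse_file() is an assumption I have not verified.

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

#define PM_PRESENT  (1ULL << 63)       /* pagemap: page present in RAM */
#define PM_PFN_MASK ((1ULL << 55) - 1) /* pagemap: bits 0-54 hold the PFN */
#define KPF_DIRTY   4                  /* kpageflags: bit index of DIRTY */

/* Extract the PFN from a pagemap entry; returns 0 if the page is not present. */
int pagemap_entry_to_pfn(uint64_t entry, uint64_t *pfn)
{
    if (!(entry & PM_PRESENT))
        return 0;
    *pfn = entry & PM_PFN_MASK;
    return 1;
}

/* Test the DIRTY bit of a kpageflags entry. */
int kpageflags_is_dirty(uint64_t flags)
{
    return (int)((flags >> KPF_DIRTY) & 1);
}

/* Byte offset of the 8-byte pagemap entry that describes vaddr. */
uint64_t pagemap_offset(uint64_t vaddr, uint64_t page_size)
{
    return vaddr / page_size * 8;
}

/*
 * Returns 1 if the page backing addr is reported dirty, 0 if clean, and -1
 * if the lookup fails (not root, page not present, files unavailable).
 */
int addr_is_dirty(const void *addr)
{
    uint64_t entry, flags, pfn;
    uint64_t psz = (uint64_t)sysconf(_SC_PAGESIZE);
    int ret = -1;
    int pm = open("/proc/self/pagemap", O_RDONLY);
    int kf = open("/proc/kpageflags", O_RDONLY);

    if (pm >= 0 && kf >= 0 &&
        pread(pm, &entry, 8, (off_t)pagemap_offset((uintptr_t)addr, psz)) == 8 &&
        pagemap_entry_to_pfn(entry, &pfn) && pfn != 0 &&
        pread(kf, &flags, 8, (off_t)(pfn * 8)) == 8)
        ret = kpageflags_is_dirty(flags);

    if (pm >= 0)
        close(pm);
    if (kf >= 0)
        close(kf);
    return ret;
}
```

Calling addr_is_dirty() on each base page of the text range right before
madvise(MADV_COLLAPSE) would show whether any page is already reported dirty
at that point.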

Snapshot of /proc/`pidof test`/smaps before calling madvise(MADV_COLLAPSE).

[root@dhcp-10-26-1-237 issue]# cat /proc/`pidof test`/smaps | head -n 25
00400000-00600000 r-xp 00000000 fd:05 101812754                          /home/gavin/sandbox/issue/test
Size:               2048 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                2048 kB
Pss:                2048 kB
Pss_Dirty:             0 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:      2048 kB
Private_Dirty:         0 kB
Referenced:         2048 kB
Anonymous:             0 kB
KSM:                   0 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:           1
VmFlags: rd ex mr mw me hg
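
In the snapshot above FilePmdMapped is 0 kB, which is expected before the
collapse. To check the same field after madvise(MADV_COLLAPSE) from inside the
test program, something like the following sketch could be used. The helper
name smaps_sum_kb() is hypothetical; it only parses smaps-formatted text that
the caller has already read into a buffer.

```c
#include <stdio.h>
#include <string.h>

/*
 * Sum every "<key>: <n> kB" line found in smaps-formatted text.
 * Returns the total in kB, or -1 if the key never appears.
 */
long smaps_sum_kb(const char *text, const char *key)
{
    size_t klen = strlen(key);
    long total = -1;
    long val;
    const char *line = text;

    while (line && *line) {
        /* Require ':' right after the key so "Pss" does not match "Pss_Dirty". */
        if (strncmp(line, key, klen) == 0 && line[klen] == ':' &&
            sscanf(line + klen + 1, "%ld", &val) == 1)
            total = (total < 0 ? 0 : total) + val;

        /* Advance to the start of the next line, if any. */
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return total;
}
```

A non-zero result for "FilePmdMapped" after the call would indicate that at
least part of the text mapping is now mapped with a PMD.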

Thanks,
Gavin




