Re: [bug/regression] libhugetlbfs testsuite failures and OOMs eventually kill my system

On 10/14/2016 01:48 AM, Jan Stancek wrote:
> On 10/14/2016 01:26 AM, Mike Kravetz wrote:
>>
>> Hi Jan,
>>
>> Any chance you can get the contents of /sys/kernel/mm/hugepages
>> before and after the first run of libhugetlbfs testsuite on Power?
>> Perhaps a script like:
>>
>> cd /sys/kernel/mm/hugepages
>> for f in hugepages-*/*; do
>> 	n=`cat $f`;
>> 	echo -e "$n\t$f";
>> done
>>
>> Just want to make sure the numbers look as they should.
>>
> 
> Hi Mike,
> 
> Numbers are below. I have also isolated a single testcase from the "func"
> group of tests: corrupt-by-cow-opt [1]. This test stops working after I
> run it 19 times (with 20 hugepages). If I disable this test, the
> "func" group tests all pass repeatedly.

Thanks Jan,

I appreciate your efforts.

> 
> [1] https://github.com/libhugetlbfs/libhugetlbfs/blob/master/tests/corrupt-by-cow-opt.c
> 
> Regards,
> Jan
> 
> Kernel is v4.8-14230-gb67be92, with reboot between each run.
> 1) Only func tests
> System boot
> After setup:
> 20      hugepages-16384kB/free_hugepages
> 20      hugepages-16384kB/nr_hugepages
> 20      hugepages-16384kB/nr_hugepages_mempolicy
> 0       hugepages-16384kB/nr_overcommit_hugepages
> 0       hugepages-16384kB/resv_hugepages
> 0       hugepages-16384kB/surplus_hugepages
> 0       hugepages-16777216kB/free_hugepages
> 0       hugepages-16777216kB/nr_hugepages
> 0       hugepages-16777216kB/nr_hugepages_mempolicy
> 0       hugepages-16777216kB/nr_overcommit_hugepages
> 0       hugepages-16777216kB/resv_hugepages
> 0       hugepages-16777216kB/surplus_hugepages
> 
> After func tests:
> ********** TEST SUMMARY
> *                      16M
> *                      32-bit 64-bit
> *     Total testcases:     0     85
> *             Skipped:     0      0
> *                PASS:     0     81
> *                FAIL:     0      4
> *    Killed by signal:     0      0
> *   Bad configuration:     0      0
> *       Expected FAIL:     0      0
> *     Unexpected PASS:     0      0
> * Strange test result:     0      0
> 
> 26      hugepages-16384kB/free_hugepages
> 26      hugepages-16384kB/nr_hugepages
> 26      hugepages-16384kB/nr_hugepages_mempolicy
> 0       hugepages-16384kB/nr_overcommit_hugepages
> 1       hugepages-16384kB/resv_hugepages
> 0       hugepages-16384kB/surplus_hugepages
> 0       hugepages-16777216kB/free_hugepages
> 0       hugepages-16777216kB/nr_hugepages
> 0       hugepages-16777216kB/nr_hugepages_mempolicy
> 0       hugepages-16777216kB/nr_overcommit_hugepages
> 0       hugepages-16777216kB/resv_hugepages
> 0       hugepages-16777216kB/surplus_hugepages
> 
> After test cleanup:
>  umount -a -t hugetlbfs
>  hugeadm --pool-pages-max ${HPSIZE}:0
> 
> 1       hugepages-16384kB/free_hugepages
> 1       hugepages-16384kB/nr_hugepages
> 1       hugepages-16384kB/nr_hugepages_mempolicy
> 0       hugepages-16384kB/nr_overcommit_hugepages
> 1       hugepages-16384kB/resv_hugepages
> 1       hugepages-16384kB/surplus_hugepages
> 0       hugepages-16777216kB/free_hugepages
> 0       hugepages-16777216kB/nr_hugepages
> 0       hugepages-16777216kB/nr_hugepages_mempolicy
> 0       hugepages-16777216kB/nr_overcommit_hugepages
> 0       hugepages-16777216kB/resv_hugepages
> 0       hugepages-16777216kB/surplus_hugepages
> 

I am guessing the leaked reserve page is triggered by running the
test you isolated, corrupt-by-cow-opt.


> ---
> 
> 2) Only stress tests
> System boot
> After setup:
> 20      hugepages-16384kB/free_hugepages
> 20      hugepages-16384kB/nr_hugepages
> 20      hugepages-16384kB/nr_hugepages_mempolicy
> 0       hugepages-16384kB/nr_overcommit_hugepages
> 0       hugepages-16384kB/resv_hugepages
> 0       hugepages-16384kB/surplus_hugepages
> 0       hugepages-16777216kB/free_hugepages
> 0       hugepages-16777216kB/nr_hugepages
> 0       hugepages-16777216kB/nr_hugepages_mempolicy
> 0       hugepages-16777216kB/nr_overcommit_hugepages
> 0       hugepages-16777216kB/resv_hugepages
> 0       hugepages-16777216kB/surplus_hugepages
> 
> After stress tests:
> 20      hugepages-16384kB/free_hugepages
> 20      hugepages-16384kB/nr_hugepages
> 20      hugepages-16384kB/nr_hugepages_mempolicy
> 0       hugepages-16384kB/nr_overcommit_hugepages
> 17      hugepages-16384kB/resv_hugepages
> 0       hugepages-16384kB/surplus_hugepages
> 0       hugepages-16777216kB/free_hugepages
> 0       hugepages-16777216kB/nr_hugepages
> 0       hugepages-16777216kB/nr_hugepages_mempolicy
> 0       hugepages-16777216kB/nr_overcommit_hugepages
> 0       hugepages-16777216kB/resv_hugepages
> 0       hugepages-16777216kB/surplus_hugepages
> 
> After cleanup:
> 17      hugepages-16384kB/free_hugepages
> 17      hugepages-16384kB/nr_hugepages
> 17      hugepages-16384kB/nr_hugepages_mempolicy
> 0       hugepages-16384kB/nr_overcommit_hugepages
> 17      hugepages-16384kB/resv_hugepages
> 17      hugepages-16384kB/surplus_hugepages
> 0       hugepages-16777216kB/free_hugepages
> 0       hugepages-16777216kB/nr_hugepages
> 0       hugepages-16777216kB/nr_hugepages_mempolicy
> 0       hugepages-16777216kB/nr_overcommit_hugepages
> 0       hugepages-16777216kB/resv_hugepages
> 0       hugepages-16777216kB/surplus_hugepages
> 

This looks worse than the summary after running the functional tests.

> ---
> 
> 3) only corrupt-by-cow-opt
> 
> System boot
> After setup:
> 20      hugepages-16384kB/free_hugepages
> 20      hugepages-16384kB/nr_hugepages
> 20      hugepages-16384kB/nr_hugepages_mempolicy
> 0       hugepages-16384kB/nr_overcommit_hugepages
> 0       hugepages-16384kB/resv_hugepages
> 0       hugepages-16384kB/surplus_hugepages
> 0       hugepages-16777216kB/free_hugepages
> 0       hugepages-16777216kB/nr_hugepages
> 0       hugepages-16777216kB/nr_hugepages_mempolicy
> 0       hugepages-16777216kB/nr_overcommit_hugepages
> 0       hugepages-16777216kB/resv_hugepages
> 0       hugepages-16777216kB/surplus_hugepages
> 
> libhugetlbfs-2.18# env LD_LIBRARY_PATH=./obj64 ./tests/obj64/corrupt-by-cow-opt; /root/grab.sh
> Starting testcase "./tests/obj64/corrupt-by-cow-opt", pid 3298
> Write s to 0x3effff000000 via shared mapping
> Write p to 0x3effff000000 via private mapping
> Read s from 0x3effff000000 via shared mapping
> PASS
> 20      hugepages-16384kB/free_hugepages
> 20      hugepages-16384kB/nr_hugepages
> 20      hugepages-16384kB/nr_hugepages_mempolicy
> 0       hugepages-16384kB/nr_overcommit_hugepages
> 1       hugepages-16384kB/resv_hugepages
> 0       hugepages-16384kB/surplus_hugepages
> 0       hugepages-16777216kB/free_hugepages
> 0       hugepages-16777216kB/nr_hugepages
> 0       hugepages-16777216kB/nr_hugepages_mempolicy
> 0       hugepages-16777216kB/nr_overcommit_hugepages
> 0       hugepages-16777216kB/resv_hugepages
> 0       hugepages-16777216kB/surplus_hugepages

Leaked one reserve page

> 
> # env LD_LIBRARY_PATH=./obj64 ./tests/obj64/corrupt-by-cow-opt; /root/grab.sh
> Starting testcase "./tests/obj64/corrupt-by-cow-opt", pid 3312
> Write s to 0x3effff000000 via shared mapping
> Write p to 0x3effff000000 via private mapping
> Read s from 0x3effff000000 via shared mapping
> PASS
> 20      hugepages-16384kB/free_hugepages
> 20      hugepages-16384kB/nr_hugepages
> 20      hugepages-16384kB/nr_hugepages_mempolicy
> 0       hugepages-16384kB/nr_overcommit_hugepages
> 2       hugepages-16384kB/resv_hugepages
> 0       hugepages-16384kB/surplus_hugepages
> 0       hugepages-16777216kB/free_hugepages
> 0       hugepages-16777216kB/nr_hugepages
> 0       hugepages-16777216kB/nr_hugepages_mempolicy
> 0       hugepages-16777216kB/nr_overcommit_hugepages
> 0       hugepages-16777216kB/resv_hugepages
> 0       hugepages-16777216kB/surplus_hugepages

It is pretty consistent that we leak a reserve page every time this
test is run.
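The numbers line up with simple bookkeeping: the pool has 20 pages, each
run leaks one reservation, and (judging from your transcripts, where the
Bus error hits on run 20 with resv_hugepages at 19) a passing run needs
two unreserved pages, one for the shared mapping and one for the private
CoW copy. A sketch of that arithmetic, nothing hugetlbfs-specific:

```shell
# Model the pool bookkeeping from the transcripts: 20 pages in the
# pool, one reservation leaked per run, and a run needs two unreserved
# pages (shared mapping + private CoW copy) to pass.
nr_hugepages=20
resv=0
runs=0
while [ $(( nr_hugepages - resv )) -ge 2 ]; do
    resv=$(( resv + 1 ))     # the reservation that never gets released
    runs=$(( runs + 1 ))
done
echo "passing runs: $runs, resv_hugepages left: $resv"
```

This predicts 19 passing runs with resv_hugepages stuck at 19, which is
exactly what you observed.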

The interesting thing is that corrupt-by-cow-opt is a very simple
test case.  Commit 67961f9db8c4 potentially changes the return values
of the functions vma_has_reserves() and vma_needs/commit_reservation()
for the owner (HPAGE_RESV_OWNER) of private mappings.  Running the
test with and without the commit results in the same return values for
these routines on x86, and no leaked reserve pages there.

Is it possible to revert this commit and run the libhugetlbfs tests
(func and stress) again while monitoring the counts in /sys?  The
counts should go to zero after cleanup as you describe above.  I just
want to make sure that this commit is causing all the problems you
are seeing.  If it is, then we can consider reverting it, and I can
try to think of another way to address the original issue.
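To make the before/after comparison easier across runs, the grab loop
from earlier in the thread could be wrapped as a function that takes the
directory as an argument (the function name and default path here are my
own; the loop body is just the earlier script, parameterized):

```shell
# Snapshot hugepage counters as "value<TAB>relative-path" lines, like
# the grab script earlier in the thread, but taking the sysfs directory
# as an optional argument so two snapshots can be diffed.
snapshot_hugepages() {
    dir="${1:-/sys/kernel/mm/hugepages}"
    ( cd "$dir" || exit 1
      for f in hugepages-*/*; do
          printf '%s\t%s\n' "$(cat "$f")" "$f"
      done )
}
```

Then `snapshot_hugepages > before.txt`, run the tests,
`snapshot_hugepages > after.txt`, and `diff before.txt after.txt` shows
exactly which counters moved.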

Thanks for your efforts on this.  I cannot reproduce this on x86 or
sparc and do not see any similar symptoms on those architectures.

-- 
Mike Kravetz

> 
> (... output cut from ~17 iterations ...)
> 
> # env LD_LIBRARY_PATH=./obj64 ./tests/obj64/corrupt-by-cow-opt; /root/grab.sh
> Starting testcase "./tests/obj64/corrupt-by-cow-opt", pid 3686
> Write s to 0x3effff000000 via shared mapping
> Bus error
> 20      hugepages-16384kB/free_hugepages
> 20      hugepages-16384kB/nr_hugepages
> 20      hugepages-16384kB/nr_hugepages_mempolicy
> 0       hugepages-16384kB/nr_overcommit_hugepages
> 19      hugepages-16384kB/resv_hugepages
> 0       hugepages-16384kB/surplus_hugepages
> 0       hugepages-16777216kB/free_hugepages
> 0       hugepages-16777216kB/nr_hugepages
> 0       hugepages-16777216kB/nr_hugepages_mempolicy
> 0       hugepages-16777216kB/nr_overcommit_hugepages
> 0       hugepages-16777216kB/resv_hugepages
> 0       hugepages-16777216kB/surplus_hugepages
> 
> # env LD_LIBRARY_PATH=./obj64 ./tests/obj64/corrupt-by-cow-opt; /root/grab.sh
> Starting testcase "./tests/obj64/corrupt-by-cow-opt", pid 3700
> Write s to 0x3effff000000 via shared mapping
> FAIL    mmap() 2: Cannot allocate memory
> 20      hugepages-16384kB/free_hugepages
> 20      hugepages-16384kB/nr_hugepages
> 20      hugepages-16384kB/nr_hugepages_mempolicy
> 0       hugepages-16384kB/nr_overcommit_hugepages
> 19      hugepages-16384kB/resv_hugepages
> 0       hugepages-16384kB/surplus_hugepages
> 0       hugepages-16777216kB/free_hugepages
> 0       hugepages-16777216kB/nr_hugepages
> 0       hugepages-16777216kB/nr_hugepages_mempolicy
> 0       hugepages-16777216kB/nr_overcommit_hugepages
> 0       hugepages-16777216kB/resv_hugepages
> 0       hugepages-16777216kB/surplus_hugepages
> 
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@xxxxxxxxx