Re: [PATCH 2/2] selftests/migration: Disable NUMA balancing and check migration status

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Ryan Roberts <ryan.roberts@xxxxxxx> writes:

> On 09/08/2023 10:34, David Hildenbrand wrote:
>> On 09.08.23 06:21, Alistair Popple wrote:
>>>
>>> Ryan Roberts <ryan.roberts@xxxxxxx> writes:
>>>
>>>> On 07/08/2023 13:41, Alistair Popple wrote:
>>>>>
>>>>> Ryan Roberts <ryan.roberts@xxxxxxx> writes:
>>>>>
>>>>>> On 07/08/2023 07:39, Alistair Popple wrote:
>>>>>>> The migration selftest was only checking the return code and not the
>>>>>>> status array for migration success/failure. Update the test to check
>>>>>>> both. This uncovered a bug in the return code handling of
>>>>>>> do_pages_move().
>>>>>>>
>>>>>>> Also disable NUMA balancing as that can lead to unexpected migration
>>>>>>> failures.
>>>>>>>
>>>>>>> Signed-off-by: Alistair Popple <apopple@xxxxxxxxxx>
>>>>>>> Suggested-by: Ryan Roberts <ryan.roberts@xxxxxxx>
>>>>>>> ---
>>>>>>>
>>>>>>> Ryan, this will still cause the test to fail if a migration failed. I
>>>>>>> was unable to reproduce a migration failure for any cases on my system
>>>>>>> once I disabled NUMA balancing though so I'd be curious if you are
>>>>>>> still seeing failures with this patch applied. AFAIK there shouldn't
>>>>>>> be anything else that would be causing migration failure so would like
>>>>>>> to know what is causing failures. Thanks!
>>>>>>
>>>>>>
>>>>>> Hi Alistair,
>>>>>>
>>>>>> Afraid I'm still seeing unmigrated pages when running with these 2 patches:
>>>>>>
>>>>>>
>>>>>> #  RUN           migration.shared_anon ...
>>>>>> Didn't migrate 1 pages
>>>>>> # migration.c:183:shared_anon:Expected migrate(ptr, self->n1, self->n2)
>>>>>> (-2) == 0 (0)
>>>>>> # shared_anon: Test terminated by assertion
>>>>>> #          FAIL  migration.shared_anon
>>>>>> not ok 2 migration.shared_anon
>>>>>>
>>>>>>
>>>>>> I added some instrumentation; it usually fails on the second time
>>>>>> through the loop in migrate() but I've also seen it fail the first
>>>>>> time. Never seen it get though 2 iterations successfully though.
>>>>>
>>>>> Interesting. I guess migration failure is always possible for various
>>>>> reasons so I will update the test to report the number of failed
>>>>> migrations rather than making it a test failure.
>>>>
>>>> I find it odd that migration always succeeds for the private_anon and
>>>> private_anon_thp cases, but fails for the fork case. I guess there is something
>>>> about the page being RO (for CoW) or having higher mapcount/refcount that causes
>>>> the migration to fail?
>>>
>>> But the fork case uses a shared mapping, so there shouldn't be any
>>> RO/CoW AFAIK. A higher refcount than expected would cause migration
>>> failure, but there's nothing I can think of that would be causing that
>>> now NUMA balancing is disabled in the test (that caused migration
>>> failures for me in the private cases due to the concurrent migration
>>> attempts).
>> 
>> Yes, we do have MAP_SHARED | MAP_ANONYMOUS, which is just shmem with extra-steps.
>
> Yep my bad - not reading the code properly. I assumed it was private because the
> other one is.

Not a problem!

>> 
>> In add_page_for_migration(), we do have
>> 
>>     if (page_mapcount(page) > 1 && !migrate_all)
>> 
>> but the test seems to always specify MPOL_MF_MOVE_ALL when calling move_pages(),
>> so it should be ignored.
>> 
>> It would be helpful to dump the page when migration fails in that case.
>> 
>> 
>> Note that in add_page_for_migration(), we perform a follow_page() that used to
>> accidentally honor NUMA hinting faults. Fixed by
>> 
>> commit 3c471b04db7604974e3bea45e2d9c7c27a905536
>> Author: David Hildenbrand <david@xxxxxxxxxx>
>> Date:   Thu Aug 3 16:32:02 2023 +0200
>> 
>>     mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT
>> 
>> in mm-unstable.
>
> This fix does not change the behaviour of the test; it still fails.

It fixes the failures I was seeing for private_anon with NUMA balancing
enabled[1]. Clearly there is something different between Ryan's setup
and mine because he doesn't see failures for the private cases but I do.

However I am still seeing failures in the private_anon_thp case with
NUMA balancing enabled even with the patch. I haven't had time to dig
deeper - perhaps it is expected. From memory concurrent migrations can
take extra page references which could be upsetting things which is
initially why I figured it needed disabling.

> HOWEVER... It turns out the test has started passing in mm-unstable due to a
> commit after this one. See bisect:

As I said, I've never seen a failure for shared_anon on my setup so
haven't replicated this. Hopefully I will get some time to look more
closely soon.

[1] - Note that requires changing the test in this patch to still pass
'status' argument to move_pages() but remove the mbind() call that
disables NUMA balancing. Without the status argument the test passes
with NUMA balancing enabled anyway with or without the patch from David.

> Note terms are inverted:
> good => test *fails*
> bad => test *passes*
>
> Initial bounds:
> good: 3f943756e8b3 => commit suggested by DavidH (test still failing there)
> bad: 84d9f657acae  => mm-unstable head as of a few days ago
>
> All of these commits are from mm-unstable; none are in v6.5-rc3, which is where
> I originally reported the issue.
>
> git bisect start
> # bad: [84d9f657acaecc43dc52f25d52230db85fd5ffdd] mm: move vma locking out of
> vma_prepare and dup_anon_vma
> git bisect bad 84d9f657acaecc43dc52f25d52230db85fd5ffdd
> # good: [3f943756e8b3e0b5d0ea1f3087658eb559b0c7b0] mm/gup: reintroduce FOLL_NUMA
> as FOLL_HONOR_NUMA_FAULT
> git bisect good 3f943756e8b3e0b5d0ea1f3087658eb559b0c7b0
> # good: [7dfbad370c66f35789d193a758575d31aabd25a4] selftests/mm: optionally pass
> duration to transhuge-stress
> git bisect good 7dfbad370c66f35789d193a758575d31aabd25a4
> # bad: [fc99f767390266b5436575c00445d4445f6c9be6] mips: convert various
> functions to use ptdescs
> git bisect bad fc99f767390266b5436575c00445d4445f6c9be6
> # bad: [af89742c0bf319979e00cfb066ead6b510f3e296] powerpc/book3s64/vmemmap:
> switch radix to use a different vmemmap handling function
> git bisect bad af89742c0bf319979e00cfb066ead6b510f3e296
> # good: [dfebce290a7b44985eda4ddd76378cdc82e3541c] maple_tree: adjust node
> allocation on mas_rebalance()
> git bisect good dfebce290a7b44985eda4ddd76378cdc82e3541c
> # good: [9517da22a74a49102bcd6c8556eeceaca965b0a6] mm: move FAULT_FLAG_VMA_LOCK
> check down from do_fault()
> git bisect good 9517da22a74a49102bcd6c8556eeceaca965b0a6
> # bad: [5ec6b3644e50d859ebf4cba5cc29cfbda0e6d109] mm: change
> pudp_huge_get_and_clear_full take vm_area_struct as arg
> git bisect bad 5ec6b3644e50d859ebf4cba5cc29cfbda0e6d109
> # bad: [bbecc62bae72ec22e4276464a5ef359511923954] mm: handle faults that merely
> update the accessed bit under the VMA lock
> git bisect bad bbecc62bae72ec22e4276464a5ef359511923954
> # bad: [2132f14c5bc1b10ea964ab89bd6291fc05eaccaa] mm: handle swap and NUMA PTE
> faults under the VMA lock
> git bisect bad 2132f14c5bc1b10ea964ab89bd6291fc05eaccaa
> # bad: [8890e186b3470fc690d3022656e98c0c41e27eca] mm: run the fault-around code
> under the VMA lock
> git bisect bad 8890e186b3470fc690d3022656e98c0c41e27eca
> # first bad commit: [8890e186b3470fc690d3022656e98c0c41e27eca] mm: run the
> fault-around code under the VMA lock
>
> First commit where test passes:
>
> commit 8890e186b3470fc690d3022656e98c0c41e27eca
> Author: Matthew Wilcox (Oracle) <willy@xxxxxxxxxxxxx>
> Date:   Mon Jul 24 19:54:08 2023 +0100
>
>     mm: run the fault-around code under the VMA lock
>
>     The map_pages fs method should be safe to run under the VMA lock instead
>     of the mmap lock.  This should have a measurable reduction in contention
>     on the mmap lock.
>
>     Link: https://lkml.kernel.org/r/20230724185410.1124082-9-willy@xxxxxxxxxxxxx
>     Signed-off-by: Matthew Wilcox (Oracle) <willy@xxxxxxxxxxxxx>
>     Reviewed-by: Suren Baghdasaryan <surenb@xxxxxxxxxx>
>     Cc: Arjun Roy <arjunroy@xxxxxxxxxx>
>     Cc: Eric Dumazet <edumazet@xxxxxxxxxx>
>     Cc: Punit Agrawal <punit.agrawal@xxxxxxxxxxxxx>
>     Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
>
>  mm/memory.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
>
>
> It's not really clear to me why this change should turn the fail into a pass
> though... Perhaps migration tries to get the mmap_lock and if it fails, then it
> skips migration?
>
>> 
>> 
>> Maybe that fix does no longer require you to disable NUMA hinting for this test
>> case. Maybe you're lucky and it resolves the  shared_anon case as well, but I
>> somewhat doubt it :)
>> 
>> It doesn't really explain why the shared case would fail, though...
>> 






[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux