Re: [PATCH 2/2] selftests/migration: Disable NUMA balancing and check migration status

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Ryan Roberts <ryan.roberts@xxxxxxx> writes:

> On 07/08/2023 13:41, Alistair Popple wrote:
>> 
>> Ryan Roberts <ryan.roberts@xxxxxxx> writes:
>> 
>>> On 07/08/2023 07:39, Alistair Popple wrote:
>>>> The migration selftest was only checking the return code and not the
>>>> status array for migration success/failure. Update the test to check
>>>> both. This uncovered a bug in the return code handling of
>>>> do_pages_move().
>>>>
>>>> Also disable NUMA balancing as that can lead to unexpected migration
>>>> failures.
>>>>
>>>> Signed-off-by: Alistair Popple <apopple@xxxxxxxxxx>
>>>> Suggested-by: Ryan Roberts <ryan.roberts@xxxxxxx>
>>>> ---
>>>>
>>>> Ryan, this will still cause the test to fail if a migration failed. I
>>>> was unable to reproduce a migration failure for any cases on my system
>>>> once I disabled NUMA balancing though so I'd be curious if you are
>>>> still seeing failures with this patch applied. AFAIK there shouldn't
>>>> be anything else that would be causing migration failure so would like
>>>> to know what is causing failures. Thanks!
>>>
>>>
>>> Hi Alistair,
>>>
>>> Afraid I'm still seeing unmigrated pages when running with these 2 patches:
>>>
>>>
>>> #  RUN           migration.shared_anon ...
>>> Didn't migrate 1 pages
>>> # migration.c:183:shared_anon:Expected migrate(ptr, self->n1, self->n2) (-2) == 0 (0)
>>> # shared_anon: Test terminated by assertion
>>> #          FAIL  migration.shared_anon
>>> not ok 2 migration.shared_anon
>>>
>>>
>>> I added some instrumentation; it usually fails on the second time
>>> through the loop in migrate() but I've also seen it fail the first
>>> time. Never seen it get though 2 iterations successfully though.
>> 
>> Interesting. I guess migration failure is always possible for various
>> reasons so I will update the test to report the number of failed
>> migrations rather than making it a test failure. 
>
> I find it odd that migration always succeeds for the private_anon and
> private_anon_thp cases, but fails for the fork case. I guess there is something
> about the page being RO (for CoW) or having higher mapcount/refcount that causes
> the migration to fail?

But the fork case uses a shared mapping, so there shouldn't be any
RO/CoW AFAIK. A higher refcount than expected would cause migration
failure, but there's nothing I can think of that would be causing that
now NUMA balancing is disabled in the test (that caused migration
failures for me in the private cases due to the concurrent migration
attempts).

> I just modified access_mem() to access the page _after_ the one that is being
> migrated (the mmap is allocating 2M). In this case, shared_anon passes. So there
> is something about access_mem() reading the page that stops its from being
> migrated? - that would explain why we sometimes have a successful first
> iteration in migrate() - we are racing the fork'ed child.
>
> But its not clear why reading a page would cause migration to fail? If the
> reader thread wins, then the HW should just allow the read without faulting into
> the kernel - so no opportunity to change kernel state. If move_pages() wins,
> then access_mem would take a fault and see the migration entry and (presumably)
> wait for migration to complete?

That matches my understanding also, hence why I was failing the test if
a migration didn't succeed because I don't understand why it would be
failing.

>> I was mostly just
>> curious as to what would be causing the occasional failures for my own
>> understanding, but the failures themselves are unimportant.
>> 
>>> I did also try just this patch without the error handling update in the kernel, but it still fails in the same way.
>>>
>>> I'm running on arm64 in case that wasn't clear. Let me know if there is anything I can do to help debug.
>> 
>> Thanks! Unless you're concerned about the failures I am happy to ignore
>> them. Pages can fail to migrate for all sorts of reasons although I'm a
>> little suprised anonymous migrations are failing so frequently for you.
>
> Something about this doesn't smell right to me. Would be good to hear your
> thoughts on above before we close this and move on...

I agree, something seems odd here so happy to try and close it. I
haven't been able to replicate any failures on my local fake NUMA x86_64
development machine though. Will see if I can replicate on an ARM64
platform. Chances are something is taking an extra reference on the
page, it's just a matter of figuring out what...

>> 
>>> Thanks,
>>> Ryan
>>>
>>>
>>>>
>>>>  tools/testing/selftests/mm/migration.c | 18 +++++++++++++++++-
>>>>  1 file changed, 17 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/tools/testing/selftests/mm/migration.c b/tools/testing/selftests/mm/migration.c
>>>> index 379581567f27..cf079af5799b 100644
>>>> --- a/tools/testing/selftests/mm/migration.c
>>>> +++ b/tools/testing/selftests/mm/migration.c
>>>> @@ -51,6 +51,12 @@ FIXTURE_SETUP(migration)
>>>>  	ASSERT_NE(self->threads, NULL);
>>>>  	self->pids = malloc(self->nthreads * sizeof(*self->pids));
>>>>  	ASSERT_NE(self->pids, NULL);
>>>> +
>>>> +	/*
>>>> +	 * Disable NUMA balancing which can cause migration
>>>> +	 * failures.
>>>> +	 */
>>>> +	numa_set_membind(numa_all_nodes_ptr);
>>>>  };
>>>>  
>>>>  FIXTURE_TEARDOWN(migration)
>>>> @@ -62,13 +68,14 @@ FIXTURE_TEARDOWN(migration)
>>>>  int migrate(uint64_t *ptr, int n1, int n2)
>>>>  {
>>>>  	int ret, tmp;
>>>> -	int status = 0;
>>>>  	struct timespec ts1, ts2;
>>>>  
>>>>  	if (clock_gettime(CLOCK_MONOTONIC, &ts1))
>>>>  		return -1;
>>>>  
>>>>  	while (1) {
>>>> +		int status = NUMA_NUM_NODES + 1;
>>>> +
>>>>  		if (clock_gettime(CLOCK_MONOTONIC, &ts2))
>>>>  			return -1;
>>>>  
>>>> @@ -85,6 +92,15 @@ int migrate(uint64_t *ptr, int n1, int n2)
>>>>  			return -2;
>>>>  		}
>>>>  
>>>> +		/*
>>>> +		 * Note we should never see this because move_pages() should
>>>> +		 * have indicated a page couldn't migrate above.
>>>> +		 */
>>>> +		if (status < 0) {
>>>> +			printf("Page didn't migrate, error %d\n", status);
>>>> +			return -2;
>>>> +		}
>>>> +
>>>>  		tmp = n2;
>>>>  		n2 = n1;
>>>>  		n1 = tmp;
>> 





[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux