On 15.02.23 19:03, SeongJae Park wrote:
On Wed, 15 Feb 2023 14:16:05 +0100 David Hildenbrand <david@xxxxxxxxxx> wrote:
On 14.02.23 23:32, SeongJae Park wrote:
In the usual case, do_migrate_range() returns the migrate_pages() return
value, where zero means perfect success. If no page could be isolated,
however, it returns the isolate_{lru,movable}_page() return value, or
zero if all pfns were invalid, hugetlb, or hwpoisoned. So
do_migrate_range() returning zero means either perfect success or a
special case of total isolation failure.
Actually, the return value is not checked by any caller, so it might be
better to simply make it a void function. However, there is a TODO for
checking the return value.
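
For illustration only, here is a small user-space model of the
return-value behaviour described above. It is a sketch, not the kernel
code: enum pfn_kind, model_do_migrate_range(), isolate_err and
migrate_ret are hypothetical names, and migrate_ret merely stands in
for whatever migrate_pages() would have returned.

```c
#include <stddef.h>

/* Hypothetical per-pfn outcome; illustrative names, not kernel API. */
enum pfn_kind {
	PFN_SKIPPED,        /* invalid, hugetlb or hwpoisoned: ret untouched */
	PFN_ISOLATED,       /* isolation succeeded, page queued for migration */
	PFN_ISOLATE_FAILED, /* isolation failed with a negative error code */
};

/*
 * Model of the return value: the last isolation result is kept in ret,
 * and it is overwritten by the migration result only if at least one
 * page was queued.
 */
static int model_do_migrate_range(const enum pfn_kind *pfns, size_t n,
				  int isolate_err, int migrate_ret)
{
	int ret = 0;
	size_t queued = 0;

	for (size_t i = 0; i < n; i++) {
		switch (pfns[i]) {
		case PFN_SKIPPED:
			continue;
		case PFN_ISOLATED:
			ret = 0;
			queued++;
			break;
		case PFN_ISOLATE_FAILED:
			ret = isolate_err;
			break;
		}
	}

	if (queued)
		ret = migrate_ret;

	return ret;
}
```

In this model, a range made up only of skipped pfns and a range that
migrated perfectly both yield zero, which is the ambiguity in question.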
I'd prefer to not add more dead code ;) Let's not return an error instead.
Makes sense, I will send the next spin soon.
It's still unclear which kind of fatal migration issues we actually care
about and how to really detect them.
What do you think about treating the isolation/migration rate limit
(migrate_rs) hit in do_migrate_range() as fatal? It warns for the event
already, so definitely a bad sign.
If that's not bad enough to be treated as fatal, I think we could add yet
another rate limit whose hit would be considered fatal.
IIRC, there are some setups where offlining might take several minutes
(e.g., heavy O_DIRECT load) and that's to be expected.
So the existing code warns to aid debugging, but keeps trying. The
ratelimit is there to avoid producing too much debug output, not to
indicate that something is fatal.
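
As a rough user-space sketch of that pattern (not the kernel
implementation): struct rl_state, rl_allow() and retry_until_done()
below are hypothetical stand-ins, with time modelled as a simple tick
counter. The rate limit only decides whether a warning is emitted; it
never terminates the retry loop.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a rate limit state; not the kernel's struct. */
struct rl_state {
	long interval; /* window length, in ticks */
	int burst;     /* warnings allowed per window */
	long begin;    /* tick at which the current window started */
	int printed;   /* warnings emitted in the current window */
};

/* Return true if a warning may be emitted at tick 'now'. */
static bool rl_allow(struct rl_state *rs, long now)
{
	if (now - rs->begin >= rs->interval) {
		rs->begin = now;
		rs->printed = 0;
	}
	if (rs->printed < rs->burst) {
		rs->printed++;
		return true;
	}
	return false;
}

struct retry_stats {
	long attempts;
	long warnings;
};

/*
 * Retry once per tick until attempt number 'succeed_at' succeeds. A
 * failed attempt is counted as a warning only when the rate limit
 * allows it; hitting the limit suppresses output but never stops the
 * loop.
 */
static struct retry_stats retry_until_done(struct rl_state *rs, long succeed_at)
{
	struct retry_stats st = { 0, 0 };

	for (long tick = 0; ; tick++) {
		st.attempts++;
		if (st.attempts >= succeed_at)
			break;
		if (rl_allow(rs, tick))
			st.warnings++;
	}
	return st;
}
```

With a window of 100 ticks and a burst of 10, a run that needs 1000
attempts still completes, and only a fraction of its failures is
reported.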
--
Thanks,
David / dhildenb