On Wed, 15 Feb 2023 21:00:50 +0100 David Hildenbrand <david@xxxxxxxxxx> wrote:

> On 15.02.23 19:03, SeongJae Park wrote:
> > On Wed, 15 Feb 2023 14:16:05 +0100 David Hildenbrand <david@xxxxxxxxxx> wrote:
> >
> >> On 14.02.23 23:32, SeongJae Park wrote:
> >>> do_migrate_range() returns migrate_pages() return value, which zero
> >>> means perfect success, in usual cases. If all pages are failed to be
> >>> isolated, however, it returns isolate_{lru,movalbe}_page() return
> >>> values, or zero if all pfn were invalid, were hugetlb or hwpoisoned. So
> >>> do_migrate_range() returning zero means either perfect success, or
> >>> special cases of isolation total failure.
> >>>
> >>> Actually, the return value is not checked by any caller, so it might be
> >>> better to simply make it a void function. However, there is a TODO for
> >>> checking the return value.
> >>
> >> I'd prefer to not add more dead code ;) Let's not return an error instead.
> >
> > Makes sense, I will send next spin soon.
> >
> >>
> >> It's still unclear which kind of fatal migration issues we actually care
> >> about and how to really detect them.
> >
> > What do you think about treating the isolation/migration rate limit
> > (migrate_rs) hit in do_migrate_range() as fatal? It warns for the event
> > already, so definitely a bad sign.
> >
> > If that's not that bad enough to be treated as fatal, I think we could have yet
> > another rate limit to be considered fatal.
>
> IIRC, there are some setups where offlining might take several minutes
> (e.g., heavy O_DIRECT load) and that's to be expected.
>
> So the existing code warns for better debugging, but keeps trying. So
> the ratelimit is rather to not produce too much debug output, not to
> really indicate that something is fatal.

Thank you for clarification, David!


Thanks,
SJ

>
> --
> Thanks,
>
> David / dhildenb