On 26 Sep 2023, at 14:19, David Hildenbrand wrote: > On 21.09.23 16:47, Zi Yan wrote: >> On 21 Sep 2023, at 6:19, David Hildenbrand wrote: >> >>> On 21.09.23 04:31, Zi Yan wrote: >>>> On 20 Sep 2023, at 13:23, Zi Yan wrote: >>>> >>>>> On 20 Sep 2023, at 12:04, Johannes Weiner wrote: >>>>> >>>>>> On Wed, Sep 20, 2023 at 09:48:12AM -0400, Johannes Weiner wrote: >>>>>>> On Wed, Sep 20, 2023 at 08:07:53AM +0200, Vlastimil Babka wrote: >>>>>>>> On 9/20/23 03:38, Zi Yan wrote: >>>>>>>>> On 19 Sep 2023, at 20:32, Mike Kravetz wrote: >>>>>>>>> >>>>>>>>>> On 09/19/23 16:57, Zi Yan wrote: >>>>>>>>>>> On 19 Sep 2023, at 14:47, Mike Kravetz wrote: >>>>>>>>>>> >>>>>>>>>>>> --- a/mm/page_alloc.c >>>>>>>>>>>> +++ b/mm/page_alloc.c >>>>>>>>>>>> @@ -1651,8 +1651,13 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page, >>>>>>>>>>>> end = pageblock_end_pfn(pfn) - 1; >>>>>>>>>>>> >>>>>>>>>>>> /* Do not cross zone boundaries */ >>>>>>>>>>>> +#if 0 >>>>>>>>>>>> if (!zone_spans_pfn(zone, start)) >>>>>>>>>>>> start = zone->zone_start_pfn; >>>>>>>>>>>> +#else >>>>>>>>>>>> + if (!zone_spans_pfn(zone, start)) >>>>>>>>>>>> + start = pfn; >>>>>>>>>>>> +#endif >>>>>>>>>>>> if (!zone_spans_pfn(zone, end)) >>>>>>>>>>>> return false; >>>>>>>>>>>> I can still trigger warnings. >>>>>>>>>>> >>>>>>>>>>> OK. One thing to note is that the page type in the warning changed from >>>>>>>>>>> 5 (MIGRATE_ISOLATE) to 0 (MIGRATE_UNMOVABLE) with my suggested change. >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Just to be really clear, >>>>>>>>>> - the 5 (MIGRATE_ISOLATE) warning was from the __alloc_pages call path. >>>>>>>>>> - the 0 (MIGRATE_UNMOVABLE) as above was from the alloc_contig_range call >>>>>>>>>> path WITHOUT your change. >>>>>>>>>> >>>>>>>>>> I am guessing the difference here has more to do with the allocation path? >>>>>>>>>> >>>>>>>>>> I went back and reran focusing on the specific migrate type. >>>>>>>>>> Without your patch, and coming from the alloc_contig_range call path, >>>>>>>>>> I got two warnings of 'page type is 0, passed migratetype is 1' as above. >>>>>>>>>> With your patch I got one 'page type is 0, passed migratetype is 1' >>>>>>>>>> warning and one 'page type is 1, passed migratetype is 0' warning. >>>>>>>>>> >>>>>>>>>> I could be wrong, but I do not think your patch changes things. >>>>>>>>> >>>>>>>>> Got it. Thanks for the clarification. >>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> One idea about recreating the issue is that it may have to do with size >>>>>>>>>>>> of my VM (16G) and the requested allocation sizes 4G. However, I tried >>>>>>>>>>>> to really stress the allocations by increasing the number of hugetlb >>>>>>>>>>>> pages requested and that did not help. I also noticed that I only seem >>>>>>>>>>>> to get two warnings and then they stop, even if I continue to run the >>>>>>>>>>>> script. >>>>>>>>>>>> >>>>>>>>>>>> Zi asked about my config, so it is attached. >>>>>>>>>>> >>>>>>>>>>> With your config, I still have no luck reproducing the issue. I will keep >>>>>>>>>>> trying. Thanks. >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Perhaps try running both scripts in parallel? >>>>>>>>> >>>>>>>>> Yes. It seems to do the trick. >>>>>>>>> >>>>>>>>>> Adjust the number of hugetlb pages allocated to equal 25% of memory? >>>>>>>>> >>>>>>>>> I am able to reproduce it with the script below: >>>>>>>>> >>>>>>>>> while true; do >>>>>>>>> echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages& >>>>>>>>> echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages& >>>>>>>>> wait >>>>>>>>> echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages >>>>>>>>> echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages >>>>>>>>> done >>>>>>>>> >>>>>>>>> I will look into the issue. >>>>>>> >>>>>>> Nice! >>>>>>> >>>>>>> I managed to reproduce it ONCE, triggering it not even a second after >>>>>>> starting the script. But I can't seem to do it twice, even after >>>>>>> several reboots and letting it run for minutes. >>>>>> >>>>>> I managed to reproduce it reliably by cutting the nr_hugepages >>>>>> parameters respectively in half. >>>>>> >>>>>> The one that triggers for me is always MIGRATE_ISOLATE. With some >>>>>> printk-tracing, the scenario seems to be this: >>>>>> >>>>>> #0 #1 >>>>>> start_isolate_page_range() >>>>>> isolate_single_pageblock() >>>>>> set_migratetype_isolate(tail) >>>>>> lock zone->lock >>>>>> move_freepages_block(tail) // nop >>>>>> set_pageblock_migratetype(tail) >>>>>> unlock zone->lock >>>>>> del_page_from_freelist(head) >>>>>> expand(head, head_mt) >>>>>> WARN(head_mt != tail_mt) >>>>>> start_pfn = ALIGN_DOWN(MAX_ORDER_NR_PAGES) >>>>>> for (pfn = start_pfn, pfn < end_pfn) >>>>>> if (PageBuddy()) >>>>>> split_free_page(head) >>>>>> >>>>>> IOW, we update a pageblock that isn't MAX_ORDER aligned, then drop the >>>>>> lock. The move_freepages_block() does nothing because the PageBuddy() >>>>>> is set on the pageblock to the left. Once we drop the lock, the buddy >>>>>> gets allocated and the expand() puts things on the wrong list. The >>>>>> splitting code that handles MAX_ORDER blocks runs *after* the tail >>>>>> type is set and the lock has been dropped, so it's too late. >>>>> >>>>> Yes, this is the issue I can confirm as well. But it is intentional to enable >>>>> allocating a contiguous range at pageblock granularity instead of MAX_ORDER >>>>> granularity. With your changes below, it no longer works, because if there >>>>> is an unmovable page in >>>>> [ALIGN_DOWN(start_pfn, MAX_ORDER_NR_PAGES), pageblock_start_pfn(start_pfn)), >>>>> the allocation fails but it would succeed in current implementation. >>>>> >>>>> I think a proper fix would be to make move_freepages_block() split the >>>>> MAX_ORDER page and put the split pages in the right migratetype free lists. >>>>> >>>>> I am working on that. >>>> >>>> After spending half a day on this, I think it is much harder than I thought >>>> to get alloc_contig_range() working with the freelist migratetype hygiene >>>> patchset. Because alloc_contig_range() relies on racy migratetype changes: >>>> >>>> 1. pageblocks in the range are first marked as MIGRATE_ISOLATE to prevent >>>> another parallel isolation, but they are not moved to the MIGRATE_ISOLATE >>>> free list yet. >>>> >>>> 2. later in the process, isolate_freepages_range() is used to actually grab >>>> the free pages. >>>> >>>> 3. there was no problem when alloc_contig_range() works on MAX_ORDER aligned >>>> ranges, since MIGRATE_ISOLATE cannot be set in the middle of free pages or >>>> in-use pages. But it is not the case when alloc_contig_range() work on >>>> pageblock aligned ranges. Now during isolation phase, free or in-use pages >>>> will need to be split to get their subpages into the right free lists. >>>> >>>> 4. the hardest case is when a in-use page sits across two pageblocks, currently, >>>> the code just isolate one pageblock, migrate the page, and let split_free_page() >>>> to correct the free list later. But to strictly enforce freelist migratetype >>>> hygiene, extra work is needed at free page path to split the free page into >>>> the right freelists. >>>> >>>> I need more time to think about how to get alloc_contig_range() properly. >>>> Help is needed for the bullet point 4. >>> >>> >>> I once raised that we should maybe try making MIGRATE_ISOLATE a flag that preserves the original migratetype. Not sure if that would help here in any way. >> >> I have that in my backlog since you asked and have been delaying it. ;) Hopefully > > It's complicated and I wish I would have had more time to review it > back then ... or now to clean it up later. > > Unfortunately, nobody else did have the time to review it back then ... maybe we can > do better next time. David doesn't scale. > > Doing page migration from inside start_isolate_page_range()->isolate_single_pageblock() > really is sub-optimal (and mostly code duplication from alloc_contig_range). I felt the same when I wrote the code. But I thought it was the only way out. > >> I can do it after I fix this. That change might or might not help only if we make >> some redesign on how migratetype is managed. If MIGRATE_ISOLATE does not >> overwrite existing migratetype, the code might not need to split a page and move >> it to MIGRATE_ISOLATE freelist? > > Did someone test how memory offlining plays along with that? (I can try myself > within the next 1-2 weeks) > > There [mm/memory_hotplug.c:offline_pages] we always cover full MAX_ORDER ranges, > though. > > ret = start_isolate_page_range(start_pfn, end_pfn, > MIGRATE_MOVABLE, > MEMORY_OFFLINE | REPORT_FAILURE, > GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL); Since a full MAX_ORDER range is passed, no free page split will happen. > >> >> The fundamental issue in alloc_contig_range() is that to work at >> pageblock level, a page (>pageblock_order) can have one part is isolated and >> the rest is a different migratetype. {add_to,move_to,del_page_from}_free_list() >> now checks first pageblock migratetype, so such a page needs to be removed >> from its free_list, set MIGRATE_ISOLATE on one of the pageblock, split, and >> finally put back to multiple free lists. This needs to be done at isolation stage >> before free pages are removed from their free lists (the stage after isolation). > > One idea was to always isolate larger chunks, and handle movability checks/split/etc > at a later stage. Once isolation would be decoupled from the actual/original migratetype, > the could have been easier to handle (especially some corner cases I had in mind back then). I think it is a good idea. When I coded alloc_contig_range() up, I tried to accommodate existing set_migratetype_isolate(), which calls has_unmovable_pages(). If these two are decoupled, set_migrateype_isolate() can work on MAX_ORDER-aligned ranges and has_unmovable_pages() can still work on pageblock-aligned ranges. Let me give this a try. > >> If MIGRATE_ISOLATE is a separate flag and we are OK with leaving isolated pages >> in their original migratetype and check migratetype before allocating a page, >> that might help. But that might add extra work (e.g., splitting a partially >> isolated free page before allocation) in the really hot code path, which is not >> desirable. > > With MIGRATE_ISOLATE being a separate flag, one idea was to have not a single > separate isolate list, but one per "proper migratetype". But again, just some random > thoughts I had back then, I never had sufficient time to think it all through. Got it. I will think about it. One question on separate MIGRATE_ISOLATE: the implementation I have in mind is that MIGRATE_ISOLATE will need a dedicated flag bit instead of being one of migratetype. But now there are 5 migratetypes + MIGRATE_ISOLATE and PB_migratetype_bits is 3, so an extra migratetype_bit is needed. But current migratetype implementation is a word-based operation, requiring NR_PAGEBLOCK_BITS to be divisor of BITS_PER_LONG. This means NR_PAGEBLOCK_BITS needs to be increased from 4 to 8 to meet the requirement, wasting a lot of space. An alternative is to have a separate array for MIGRATE_ISOLATE, which requires additional changes. Let me know if you have a better idea. Thanks. -- Best Regards, Yan, Zi
Attachment:
signature.asc
Description: OpenPGP digital signature