On 17.11.21 04:04, Zi Yan wrote: > On 16 Nov 2021, at 3:58, David Hildenbrand wrote: > >> On 15.11.21 20:37, Zi Yan wrote: >>> From: Zi Yan <ziy@xxxxxxxxxx> >>> >>> Hi David, >> >> Hi, >> >> thanks for looking into this. >> Hi, sorry for the delay, I wasn't "actually working" last week, so now I'm back from holiday :) >>> >>> You suggested to make alloc_contig_range() deal with pageblock_order instead of >>> MAX_ORDER - 1 and get rid of MAX_ORDER - 1 dependency in virtio_mem[1]. This >>> patchset is my attempt to achieve that. Please take a look and let me know if >>> I am doing it correctly or not. >>> >>> From what my understanding, cma required alignment of >>> max(MAX_ORDER - 1, pageblock_order), because when MIGRATE_CMA was introduced, >>> __free_one_page() does not prevent merging two different pageblocks, when >>> MAX_ORDER - 1 > pageblock_order. But current __free_one_page() implementation >>> does prevent that. It should be OK to just align cma to pageblock_order. >>> alloc_contig_range() relies on MIGRATE_CMA to get free pages, so it can use >>> pageblock_order as alignment too. >> >> I wonder if that's sufficient. Especially the outer_start logic in >> alloc_contig_range() might be problematic. There are some ugly corner >> cases with free pages/allocations spanning multiple pageblocks and we >> only isolated a single pageblock. > > Thank you a lot for writing the list of these corner cases. They are > very helpful! > >> >> >> Regarding CMA, we have to keep the following cases working: >> >> a) Different pageblock types (MIGRATE_CMA and !MIGRATE_CMA) in MAX_ORDER >> - 1 page: >> [ MAX_ ORDER - 1 ] >> [ pageblock 0 | pageblock 1] >> >> Assume either pageblock 0 is MIGRATE_CMA or pageblock 1 is MIGRATE_CMA, >> but not both. We have to make sure alloc_contig_range() keeps working >> correctly. This should be the case even with your change, as we won't >> merging pages accross differing migratetypes. > > Yes. > >> >> b) Migrating/freeing a MAX_ ORDER - 1 page while partially isolated: >> [ MAX_ ORDER - 1 ] >> [ pageblock 0 | pageblock 1] >> >> Assume both are MIGRATE_CMA. Assume we want to either allocate from >> pageblock 0 or pageblock 1. Especially, assume we want to allocate from >> pageblock 1. While we would isolate pageblock 1, we wouldn't isolate >> pageblock 0. >> >> What happens if we either have a free page spanning the MAX_ORDER - 1 >> range already OR if we have to migrate a MAX_ORDER - 1 page, resulting >> in a free MAX_ORDER - 1 page of which only the second pageblock is >> isolated? We would end up essentially freeing a page that has mixed >> pageblocks, essentially placing it in !MIGRATE_ISOLATE free lists ... I >> might be wrong but I have the feeling that this would be problematic. >> > > This could happen when start_isolate_page_range() stumbles upon a compound > page with order >= pageblock_order or a free page with order >= > pageblock_order, but should not. start_isolate_page_range() should check > the actual page size, either compound page size or free page size, and set > the migratetype across pageblocks if the page is bigger than pageblock size. > More precisely set_migratetype_isolate() should do that. Right -- but then we have to extend the isolation range from within set_migratetype_isolate() :/ E.g., how should we know what we have to unisolate later .. > > >> c) Concurrent allocations: >> [ MAX_ ORDER - 1 ] >> [ pageblock 0 | pageblock 1] >> >> Assume b) but we have two concurrent CMA allocations to pageblock 0 and >> pageblock 1, which would now be possible as start_isolate_page_range() >> isolate would succeed on both. > > Two isolations will be serialized by the zone lock taken by > set_migratetype_isolate(), so the concurrent allocation would not be a problem. > If it is a MAX_ORDER-1 free page, the first comer should split it and only > isolate one of the pageblock then second one can isolate the other pageblock. Right, the issue is essentially b) above. > If it is a MAX_ORDER-1 compound page, the first comer should isolate both > pageblocks, then the second one would fail. WDYT? Possibly we could even have two independent CMA areas "colliding" within a MAX_ ORDER - 1 page. I guess the surprise would be getting an "-EAGAIN" from isolation, but the caller should properly handle that. Maybe it's really easier to do something along the lines I proposed below and always isolate the complete MAX_ORDER-1 range in alloc_contig_range(). We just have to "fix" isolation as I drafted. > > > In sum, it seems to me that the issue is page isolation code only sees > pageblock without check the actual page. When there are multiple pageblocks > belonging to one page, the problem appears. This should be fixed. > >> >> >> Regarding virtio-mem, we care about the following cases: >> >> a) Allocating parts from completely movable MAX_ ORDER - 1 page: >> [ MAX_ ORDER - 1 ] >> [ pageblock 0 | pageblock 1] >> >> Assume pageblock 0 and pageblock 1 are either free or contain only >> movable pages. Assume we allocated pageblock 0. We have to make sure we >> can allocate pageblock 1. The other way around, assume we allocated >> pageblock 1, we have to make sure we can allocate pageblock 0. >> >> Free pages spanning both pageblocks might be problematic. > > Can you elaborate a bit? If either of pageblock 0 and 1 is used by > virtio-mem, why do we care the other? If pageblock 0 and 1 belong to > the same page (either free or compound), they should have the same > migratetype. If we want to just allocate one of them, we can split > the free page or migrate the compound page then split the remaining > free page. The thing is: it has to work on ZONE_NORMAL and ZONE_MOVABLE as well. It's essentially the same issue as a) and b) in the CMA case, so it should be covered by these. > >> >> b) Allocate parts of partially movable MAX_ ORDER - 1 page: >> [ MAX_ ORDER - 1 ] >> [ pageblock 0 | pageblock 1] >> >> Assume pageblock 0 contains unmovable data but pageblock 1 not: we have >> to make sure we can allocate pageblock 1. Similarly, assume pageblock 1 >> contains unmovable data but pageblock 0 no: we have to make sure we can >> allocate pageblock 1. has_unmovable_pages() might allow for that. >> >> But, we want to fail early in case we want to allocate a single >> pageblock but it contains unmovable data. This could be either directly >> or indirectly. >> >> If we have an unmovable (compound) MAX_ ORDER - 1 and we'd try isolating >> pageblock 1, has_unmovable_pages() would always return "false" because >> we'd simply be skiping over any tail pages, and not detect the >> un-movability. > > OK. It seems to me that has_unmovable_pages() needs to be fixed to handle > such a situation. Right. > >> >> c) Migrating/freeing a MAX_ ORDER - 1 page while partially isolated: >> >> Same concern as for CMA b) >> >> >> So the biggest concern I have is dealing with migrating/freeing > >> pageblock_order pages while only having isolated a single pageblock. > > I agree. I think isolation code needs to be aware of >pageblock_order > pages and act accordingly. If it is a free page, split the page to > avoid isolating a subset of the page. If it is a compound page, either > fail the isolation or isolate the entire compound page instead. > >> >>> >>> In terms of virtio_mem, if I understand correctly, it relies on >>> alloc_contig_range() to obtain contiguous free pages and offlines them to reduce >>> guest memory size. As the result of alloc_contig_range() alignment change, >>> virtio_mem should be able to just align PFNs to pageblock_order. >> >> For virtio-mem it will most probably be desirable to first try >> allocating the MAX_ORDER -1 range if possible and then fallback to >> pageblock_order. But that's an additional change on top in virtio-mem code. >> > > Just to understand the motivation, is this because MAX_ORDER-1 range > would be faster than pageblock_order? What if MAX_ORDER-1 goes beyond > a memory section size (like my WIP patchset to increase MAX_ORDER > beyond the memory section size)? virtio-mem could first try PAGES_PER_SECTION, > then fall back to pageblock_order, right? My comment is only in the context of this patch series, not in context of eventually raising MAX_ORDER to exceed eventually a single memory section or even a memory block. Yes, it would be faster. What we do right now (if the complete memory block is used by Linux and thus not allocated by virtio-mem yet): a) Try allocating the the complete memory block. b) If it fails, fallback to essentially MAX_ORDER - 1 chunks In the context of this patch it would be reasonable to a) Try allocating the the complete memory block. b) If it fails, fallback to essentially MAX_ORDER - 1 chunks c) If it fails, fallback to essentially pageblock order chunks Things will be different if we change MAX_ORDER - 1. >> >> >> My take to teach alloc_contig_range() to properly handle would be the >> following: >> >> a) Convert MIGRATE_ISOLATE into a separate pageblock flag >> >> We would want to convert MIGRATE_ISOLATE into a separate pageblock >> flags, such that when we isolate a page block we preserve the original >> migratetype. While start_isolate_page_range() would set that bit, >> undo_isolate_page_range() would simply clear that bit. The buddy would >> use a single MIGRATE_ISOLATE queue as is: the original migratetype is >> only used for restoring the correct migratetype. This would allow for >> restoring e.g., MIGRATE_UNMOVABLE after isolating an unmovable pageblock >> (below) and not simply setting all such pageblocks to MIGRATE_MOVABLE >> when un-isolating. >> >> Ideally, we'd get rid of the "migratetype" parameter for >> alloc_contig_range(). However, even with the above change we have to >> make sure that memory offlining and ordinary alloc_contig_range() users >> will fail on MIGRATE_CMA pageblocks (has_unmovable_page() checks that as >> of today). We could achieve that differently, though (e.g., bool >> cma_alloc parameter instead). > > This might need to be done in a separate patch, since pageblock bits require > to be word aligned and it is 4 now. To convert MIGRATE_ISOLATE to a separate > bit, either NR_PAGEBLOCK_BITS needs to be increased to 8 or a separate > isolation bitmap array needs to be allocated. Or the migratetype information > can be stored temporarily during isolation process. I can look into it later. Right, we'd need 8 instead of 4 bits. But we could preserve the previous migratettype (MOVABLE, UNMOVABLE, CMA) ... when isolating and wouldn't have to magically punch in whatever someone told us. > > >> >> >> b) Allow isolating pageblocks with unmovable pages >> >> We'd pass the actual range of interest to start_isolate_page_range() and >> rework the code to check has_unmovable_pages() only on the range of >> interest, but considering overlapping larger allocations. E.g., if we >> stumble over a compound page, lookup the head an test if that page is >> movable/unmovable. > > This is an optimization to reduce isolation failure rate, right? This only > applies to the pageblocks at the beginning and the end of a range of interest. Right. And with a) we can simply isolate+unisolate without always changing the migratetype e.g., to MIGRATE_MOVABLE in case of virtio-mem. > >> >> c) Change alloc_contig_range() to not "extend" the range of interest to >> include pageblock of different type. Assume we're isolating a >> MIGRATE_CMA pageblock, only isolate a neighboring MIGRATE_CMA pageblock, >> not other pageblocks. > > But alloc_contig_range() would return these extended pageblocks at the end. > And if pageblock migratetype can be preserved during isolation (item (a) above), > this would not be a problem, right? We have to make sure that ordinary alloc_contig_range() and memory offlining don't allocate MIGRATE_CMA ranges. So when actually isolating we have to refuse isolating MIGRATE_CMA pageblocks. Handling that in the caller when adjusting the range keeps the logic in the actual isolation code is one option (cma=false - bail out when wanting to isolate MIGRATE_CMA). But there might be alternatives. Most probably we'd just have to check for the "range of interest". If cma=false we just have to make sure to not isolate MIGRATE_CMA inside the "range of interest". Yes that should work as well. > >> >> >> So we'd keep isolating complete MAX_ORDER - 1 pages unless c) prevents >> it. We'd allow isolating even pageblocks that contain unmovable pages on >> ZONE_NORMAL, and check via has_unmovable_pages() only if the range of >> interest contains unmovable pages, not the whole MAX_ORDER -1 range or >> even the whole pageblock. We'd not silently overwrite the pageblock type >> when restoring but instead restore the old migratetype. >> > I assume MAX_ORDER - 1 is an optimization for faster isolation speed. > If MAX_ORDER goes beyond a memory section size, I guess PAGES_PER_SECTION > is what you want, right? FYI, I am preparing a follow-up patch to replace > any MAX_ORDER use that is intended to indicate maximum physically contiguous > size with a new variable, MAX_PHYS_CONTIG_ORDER, which is PFN_SECTION_SHIFT > when SPARSEMEM and MAX_ORDER when FLATMEM. I would replace MAX_ORDER here > with the new variable. Yes, with MAX_ORDER changes it will be a different story. We could detect at runtime what actually makes sense. -- Thanks, David / dhildenb