On 12.11.19 22:05, David Hildenbrand wrote: > On 12.11.19 19:34, Alexander Duyck wrote: >> On Tue, 2019-11-12 at 14:04 +0100, David Hildenbrand wrote: >>>>>> fact is it is still invasive, just to different parts of the mm subsystem. >>>>> >>>>> I'd love to see how it uses the page isolation framework, and only has a >>>>> single hook to queue pages. I don't like the way pages are pulled out of >>>>> the buddy in Niteshs approach currently. What you have is cleaner. >>>> >>>> I don't see how you could use the page isolation framework to pull out >>>> free pages. Is there a thread somewhere on the topic that I missed? >>> >>> It's basically only isolating pages while reporting them, and not >>> pulling them out of the buddy (IOW, you move the pages to the isolate >>> queues where nobody is allowed to touch them, and setting the >>> migratetype properly). This e.g., makes other user of page isolation >>> (e.g., memory offlining, alloc_contig_range()) play nicely with these >>> isolated pages. "somebody else just isolated them, please try again." >> >> How so? If I understand correctly there isn't anything that prevents you >> from isolating an already isolated page, is there? Last I knew isolated > > mm/page_isolation.c:set_migratetype_isolate() > ... > if (is_migrate_isolate_page(page)) > goto out; > ... > -> Currently -EBUSY > >> pages are still considered "movable" since they are still buddy pages >> aren't they? > > They are neither movable nor unmovable AFAIK. They are temporarily > blocked. E.g., memory offlining currently returns -EBUSY if it cannot > isolate the page range. alloc_contig_range() does the same. Imagine > somebody allocating a gigantic page. You certainly cannot move the pages > that are isolated while allocating the page. But you can signal to the > caller to try again later. > >> >> Also this seems like it would have other implications since isolating a >> page kicks of the memory notifier so as a result a balloon driver would >> then free the pages back out so that they could be isolated with the >> assumption the region is going offline. > > Memory notifier? Balloon pages getting freed? No. > > The memory notifier is used for onlining/offlining, it is not involved here. > > I think what you mean is the "isolate notifier", which is only used by > CMM on PPC. > > See https://lkml.org/lkml/2019/10/31/487, where I rip that notifier out. > >> >>> start_isolate_page_range()/undo_isolate_page_range()/test_pages_isolated() >>> along with a lockless check if the page is free. >> >> Okay, that part I think I get. However doesn't all that logic more or less >> ignore the watermarks? It seems like you could cause an OOM if you don't >> have the necessary checks in place for that. > > Any approach that temporarily blocks some free pages from getting > allocated will essentially have this issue, no? I think one main design > point to minimize false OOMs was to limit the number of pages we report > at a time. Or what do you propose here in addition to that? > >> >>> I think it should be something like this (ignoring different >>> migratetypes and such for now) >>> >>> 1. Test lockless if page is free: Not free? Done. >> >> So this should help to reduce the liklihood of races in the steps below. >> However it might also be useful if the code had some other check to see if >> it was done other than just making a pass through the bitmap. > > Yes. > >> >> One thing I had brought up with Nitesh was the idea of maybe doing some >> sort of RCU bitmap type approach. Basically while we hold the zone lock we >> could swap out the old bitmap for a new one. We could probably even keep a >> counter at the start of the structure so that we could track how many bits >> are actually set there. Then it becomes less likely of having a race where >> you free a page and set the bit and the hinting thread tests and clears >> the bit but doesn't see the freed page since it is not synchronized. >> Otherwise your notification setup and reporting thread may need a few smp >> barriers added where necessary. > > Yes, swapping out the bitmap via RCU is also be a way to make memory > hotplug work. > > I was also thinking about a different bitmap approach. Store for each > section a bitmap. Use a meta bitmap with a bit for each section that > contains pages to report. Sparse zones and memory hot(un)plug would not > be a real issue anymore. > > One could go one step further and only have a bitmap with a bit for each > section. Only remember that some (large) page was not reported in that > section (e.g., after buddy merging). In the reporting thread, report all > free pages within that section. You could end up reporting the same page > a couple of times, but the question would be if this is relevant at all. > One would have to prototype and measure that. > > Long story short, I am not 100% a fan of the current "bitmap per zone" > approach but is is fairly simple to start with :) > >> >>> 2. start_isolate_page_range(): Busy? Rare race (with other isolate users >> >> Doesn't this have the side effect of draining all the percpu caches in >> order to make certain to flush the pages we isolated from there? > > While alloc_contig_range() e.g., calls lru_add_drain_all(), I don't > think isolation will. Where did you spot something like this in > mm/page_isolation.c? > >> >>> or with an allocation). Done. >>> 3. test_pages_isolated() >> >> So I have reviewed the code and I don't see how this could conflict with >> other callers isolating the pages. If anything it seems like if another >> thread has already isolated the pages you would end up getting a false >> positive, reporting the pages, and pulling them back out of isolation. > > Isolated pages cannot be isolated. This is tracked via the migratetype. > >> >>> 3a. no? Rare race, page not free anymore. undo_isolate_page_range() >> >> I would hope it is rare. However for something like a max order page I >> could easily see a piece of it having been pulled out. I would think this >> case would be exceedingly expensive since you would have to put back any >> pages you had previous moved into isolation. > > I guess it is rare, there is a tiny slot between checking if the page is > free and isolating it. Would have to see that in action. > >> >>> 3b. yes? Report, then undo_isolate_page_range() >>> >>> If we would run into performance issues with the current page isolation >>> implementation (esp. locking), I think there are some nice >>> cleanups/reworks possible of which all current users could benefit >>> (especially accross pageblocks). >> >> To me this feels a lot like what you had for this solution near the start. >> Only now instead of placing the pages into an array you are tracking a >> bitmap and then using that bitmap to populate the MIGRATE_ISOLATE lists. > > Now we have a clean MM interface to do that :) And yes, which data > structure we're using becomes irrelevant. > >> >> This sounds far more complex to me then it probably needs to be since just >> holding the pages with the buddy type cleared should be enough to make >> them temporarily unusable for other threads, and even in your case you are > > If you have a page that is not PageBuddy() and not movable within > ZONE_MOVABLE, has_unmovable_pages() will WARN_ON_ONCE(zone_idx(zone) == > ZONE_MOVABLE). This can be triggered via memory offlining, when > isolating the page range. > > If your approach does exactly that (clear PageBuddy() on a > ZONE_MOVABLE), it would be a bug. The only safe way is to have the > pageblock(s) isolated. > Minor correction: Only if your refcount is > 0. -- Thanks, David / dhildenb