On 08.04.19 20:09, Nitesh Narayan Lal wrote: > On 4/8/19 12:36 PM, David Hildenbrand wrote: >> On 06.04.19 02:09, Alexander Duyck wrote: >>> So I am starting this thread as a spot to collect my thoughts on the >>> current guest free page hinting design as well as point out a few >>> possible things we could do to improve upon it. >>> >>> 1. The current design isn't likely going to scale well to multiple >>> VCPUs. The issue specifically is that the zone lock must be held to >>> pull pages off of the free list and to place them back there once they >>> have been hinted upon. As a result it would likely make sense to try >>> to limit ourselves to only having one thread performing the actual >>> hinting so that we can avoid running into issues with lock contention >>> between threads. >> Makes sense. >> >>> 2. There are currently concerns about the hinting triggering false OOM >>> situations if too much memory is isolated while it is being hinted. My >>> thought on this is to simply avoid the issue by only hint on a limited >>> amount of memory at a time. Something like 64MB should be a workable >>> limit without introducing much in the way of regressions. However as a >>> result of this we can easily be overrun while waiting on the host to >>> process the hinting request. As such we will probably need a way to >>> walk the free list and free pages after they have been freed instead >>> of trying to do it as they are freed. >> We will need such a way in case we care about dropped hinting requests, yes. >> >>> 3. Even with the current buffering which is still on the larger side >>> it is possible to overrun the hinting limits if something causes the >>> host to stall and a large swath of memory is released. As such we are >>> still going to need some sort of scanning mechanism or will have to >>> live with not providing accurate hints. >> Yes, usually if there is a lot of guest activity, you could however >> assume that free pages might get reused either way soon. Of course, >> special cases are "freeing XGB and being idle afterwards". >> >>> 4. In my opinion, the code overall is likely more complex then it >>> needs to be. We currently have 2 allocations that have to occur every >>> time we provide a hint all the way to the host, ideally we should not >>> need to allocate more memory to provide hints. We should be able to >>> hold the memory use for a memory hint device constant and simply map >>> the page address and size to the descriptors of the virtio-ring. >> I don't think the two allocations are that complex. The only thing I >> consider complex is isolation a lot of pages from different zones etc. >> Two allocations, nobody really cares about that. Of course, the fact >> that we have to allocate memory from the VCPUs where we currently freed >> a page is not optimal. I consider that rather a problem/complex. >> >> Especially you have a point regarding scalability and multiple VCPUs. >> >>> With that said I have a few ideas that may help to address the 4 >>> issues called out above. The basic idea is simple. We use a high water >>> mark based on zone->free_area[order].nr_free to determine when to wake >>> up a thread to start hinting memory out of a given free area. From >>> there we allocate non-"Offline" pages from the free area and assign >>> them to the hinting queue up to 64MB at a time. Once the hinting is >>> completed we mark them "Offline" and add them to the tail of the >>> free_area. Doing this we should cycle the non-"Offline" pages slowly >>> out of the free_area. In addition the search cost should be minimal >>> since all of the "Offline" pages should be aggregated to the tail of >>> the free_area so all pages allocated off of the free_area will be the >>> non-"Offline" pages until we shift over to them all being "Offline". >>> This should be effective for MAX_ORDER - 1 and MAX_ORDER - 2 pages >>> since the only real consumer of add_to_free_area_tail is >>> __free_one_page which uses it to place a page with an order less than >>> MAX_ORDER - 2 on the tail of a free_area assuming that it should be >>> freeing the buddy of that page shortly. The only other issue with >>> adding to tail would be the memory shuffling which was recently added, >>> but I don't see that as being something that will be enabled in most >>> cases so we could probably just make the features mutually exclusive, >>> at least for now. >>> >>> So if I am not mistaken this would essentially require a couple >>> changes to the mm infrastructure in order for this to work. >>> >>> First we would need to split nr_free into two counters, something like >>> nr_freed and nr_bound. You could use nr_freed - nr_bound to get the >>> value currently used for nr_free. When we pulled the pages for hinting >>> we would reduce the nr_freed value and then add back to it when the >>> pages are returned. When pages are allocated they would increment the >>> nr_bound value. The idea behind this is that we can record nr_free >>> when we collect the pages and save it to some local value. This value >>> could then tell us how many new pages have been added that have not >>> been hinted upon. >> I can imagine that quite some people will have problems with such >> "virtualization specific changes" splattered around core memory >> management. Would there be a way to manage this data at a different >> place, out of core-mm and somehow work on it via callbacks? >> >>> In addition we will need some way to identify which pages have been >>> hinted on and which have not. The way I believe easiest to do this >>> would be to overload the PageType value so that we could essentially >>> have two values for "Buddy" pages. We would have our standard "Buddy" >>> pages, and "Buddy" pages that also have the "Offline" value set in the >>> PageType field. Tracking the Online vs Offline pages this way would >>> actually allow us to do this with almost no overhead as the mapcount >>> value is already being reset to clear the "Buddy" flag so adding a >>> "Offline" flag to this clearing should come at no additional cost. >> Just nothing here that this will require modifications to kdump >> (makedumpfile to be precise and the vmcore information exposed from the >> kernel), as kdump only checks for the the actual mapcount value to >> detect buddy and offline pages (to exclude them from dumps), they are >> not treated as flags. >> >> For now, any mapcount values are really only separate values, meaning >> not the separate bits are of interest, like flags would be. Reusing >> other flags would make our life a lot easier. E.g. PG_young or so. But >> clearing of these is then the problematic part. >> >> Of course we could use in the kernel two values, Buddy and BuddyOffline. >> But then we have to check for two different values whenever we want to >> identify a buddy page in the kernel. >> >>> Lastly we would need to create a specialized function for allocating >>> the non-"Offline" pages, and to tweak __free_one_page to tail enqueue >>> "Offline" pages. I'm thinking the alloc function it would look >>> something like __rmqueue_smallest but without the "expand" and needing >>> to modify the !page check to also include a check to verify the page >>> is not "Offline". As far as the changes to __free_one_page it would be >>> a 2 line change to test for the PageType being offline, and if it is >>> to call add_to_free_area_tail instead of add_to_free_area. >> As already mentioned, there might be scenarios where the additional >> hinting thread might consume too much CPU cycles, especially if there is >> little guest activity any you mostly spend time scanning a handful of >> free pages and reporting them. I wonder if we can somehow limit the >> amount of wakeups/scans for a given period to mitigate this issue. >> >> One main issue I see with your approach is that we need quite a lot of >> core memory management changes. This is a problem. I wonder if we can >> factor out most parts into callbacks. >> >> E.g. in order to detect where to queue a certain page (front/tail), call >> a callback if one is registered, mark/check pages in a core-mm unknown >> way as offline etc. >> >> I still wonder if there could be an easier way to combine recording of >> hints and one hinting thread, essentially avoiding scanning and some of >> the required core-mm changes. > In order to resolve the scalability issues associated with my > patch-series without compromising with free memory hints, I may explore > the idea described below: > - Use xbitmap (if possible - earlier suggested by Rik and Wei) > corresponding to each zone on a granularity of MAX_ORDER - 2, to track > the freed PFN's. > - Define and use counters corresponding to each zone to monitor the > amount of memory freed. > - As soon as the 64MB free memory threshold is hit wake up the kernel > thread which will scan this xbitmap and try to isolate the pages and > clear the corresponding bits. (We still have to acquire zone lock to > protect the respective xbitmap) > - Report the isolated pages back to the host in a synchronous manner. > I still have to work on several details of this idea including xbitmap, > but first would like to hear any suggestions/thoughts. As discussed offline, I think this is the right approach to follow. a) We have an asynchronous hinting thread as suggested by Alex. b) We record hints and only have to scan a xbitmap, not some random lists. c) We have minimal core mm changes Best of both worlds and sounds simple. -- Thanks, David / dhildenb