On 08.04.19 22:51, David Hildenbrand wrote:
> On 08.04.19 22:10, Alexander Duyck wrote:
>> On Mon, Apr 8, 2019 at 11:40 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
>>>
>>>>>>
>>>>>> In addition we will need some way to identify which pages have been
>>>>>> hinted on and which have not. The way I believe easiest to do this
>>>>>> would be to overload the PageType value so that we could essentially
>>>>>> have two values for "Buddy" pages. We would have our standard "Buddy"
>>>>>> pages, and "Buddy" pages that also have the "Offline" value set in the
>>>>>> PageType field. Tracking the Online vs Offline pages this way would
>>>>>> actually allow us to do this with almost no overhead as the mapcount
>>>>>> value is already being reset to clear the "Buddy" flag so adding an
>>>>>> "Offline" flag to this clearing should come at no additional cost.
>>>>>
>>>>> Just noting here that this will require modifications to kdump
>>>>> (makedumpfile to be precise and the vmcore information exposed from the
>>>>> kernel), as kdump only checks for the actual mapcount value to
>>>>> detect buddy and offline pages (to exclude them from dumps); they are
>>>>> not treated as flags.
>>>>>
>>>>> For now, any mapcount values are really only separate values, meaning
>>>>> the individual bits are not of interest, as they would be with flags.
>>>>> Reusing other flags would make our life a lot easier. E.g. PG_young or
>>>>> so. But clearing of these is then the problematic part.
>>>>>
>>>>> Of course we could use in the kernel two values, Buddy and BuddyOffline.
>>>>> But then we have to check for two different values whenever we want to
>>>>> identify a buddy page in the kernel.
>>>>
>>>> Actually this may not be working the way you think it is working.
>>>
>>> Trust me, I know how it works. That's why I was giving you the notice.
>>>
>>> Read the first paragraph again and ignore the others. I am only
>>> concerned about makedumpfile that has to be changed.
>>>
>>> PAGE_OFFLINE_MAPCOUNT_VALUE
>>> PAGE_BUDDY_MAPCOUNT_VALUE
>>>
>>> Once you find out how these values are used, you should understand what
>>> has to be changed and where.
>>
>> Ugh. Is there an official repo I am supposed to refer to for makedumpfile?
>>
>> As far as the changes needed I don't think this would necessitate
>> additional exports. We could probably just get away with having
>> makedumpfile generate a new value by simply doing an "&" of the two
>> values to determine what an offline buddy would be. If need be I can
>> submit a patch for that. I find it kind of annoying that the kernel is
>> handling identifying these bits one way, and makedumpfile is doing it
>> another way. It should have been set up to handle this all the same
>> way.
>>
>>>
>>>>>>
>>>>>> Lastly we would need to create a specialized function for allocating
>>>>>> the non-"Offline" pages, and to tweak __free_one_page to tail enqueue
>>>>>> "Offline" pages. I'm thinking the alloc function would look
>>>>>> something like __rmqueue_smallest but without the "expand" and needing
>>>>>> to modify the !page check to also include a check to verify the page
>>>>>> is not "Offline". As far as the changes to __free_one_page it would be
>>>>>> a 2 line change to test for the PageType being offline, and if it is
>>>>>> to call add_to_free_area_tail instead of add_to_free_area.
>>>>>
>>>>> As already mentioned, there might be scenarios where the additional
>>>>> hinting thread might consume too many CPU cycles, especially if there is
>>>>> little guest activity and you mostly spend time scanning a handful of
>>>>> free pages and reporting them. I wonder if we can somehow limit the
>>>>> amount of wakeups/scans for a given period to mitigate this issue.
>>>>
>>>> That is why I was talking about breaking nr_free into nr_freed and
>>>> nr_bound.
>>>> By doing that I can record the nr_free value to a
>>>> virtio-balloon specific location at the start of any walk and should
>>>> know exactly how many pages were freed between that call and the next
>>>> one. By ordering things such that we place the "Offline" pages on the
>>>> tail of the list it should make the search quite fast since we would
>>>> just be always allocating off of the head of the queue until we have
>>>> hinted everything in the queue. So when we hit the last call to alloc
>>>> the non-"Offline" pages and shut down our thread we can use the
>>>> nr_freed value that we recorded to know exactly how many pages have
>>>> been added that haven't been hinted.
>>>>
>>>>> One main issue I see with your approach is that we need quite a lot of
>>>>> core memory management changes. This is a problem. I wonder if we can
>>>>> factor out most parts into callbacks.
>>>>
>>>> I think that is something we can't get away from. However if we make
>>>> this generic enough there would likely be others beyond just the
>>>> virtualization drivers that could make use of the infrastructure. For
>>>> example being able to track the rate at which the free areas are
>>>> cycling in and out pages seems like something that would be useful
>>>> outside of just the virtualization areas.
>>>
>>> Might be, but might be the other extreme, people not wanting such
>>> special cases in core mm. I assume the latter until I see a very clear
>>> design where such stuff has been properly factored out.
>>
>> The only real pain point I am seeing right now is the assumptions
>> makedumpfile is currently making about how mapcount is being used to
>> indicate pagetype. If we patch it to fix it most of the other bits are
>> minor.
>
> I'll be curious how splitting etc. will be handled. Especially if you
> want to set Offline for all affected sub pages.
>

Answering that myself, I guess you are planning to change the buddy to
basically copy the offline value to sub-pages when splitting, also
attaching them to the tail of the list instead of the head.

-- 

Thanks,

David / dhildenb
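
For readers following the makedumpfile point, here is a small user-space sketch of why a page marked both "Buddy" and "Offline" would pass a flag-style test but fail a whole-value comparison. The constants are written from memory of include/linux/page-flags.h around v5.0 and should be treated as assumptions; the helper names page_type_has(), value_matches() and offline_buddy_value() are hypothetical and exist only for this illustration. It is not code from the kernel or from makedumpfile.

```c
#include <stdint.h>

/* Assumed values, modelled on include/linux/page-flags.h (circa v5.0). */
#define PAGE_TYPE_BASE 0xf0000000u
#define PG_buddy       0x00000080u
#define PG_offline     0x00000100u

/* A page type is "set" by clearing its bit in an all-ones field, so the
 * values exported for dump tools are the bitwise complements. */
#define PAGE_BUDDY_MAPCOUNT_VALUE   (~PG_buddy)
#define PAGE_OFFLINE_MAPCOUNT_VALUE (~PG_offline)

/* Flag-style test (hypothetical helper): the type is present when the
 * base bits are intact and the type's own bit is cleared. */
static int page_type_has(uint32_t page_type, uint32_t flag)
{
    return (page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE;
}

/* Value-style test (hypothetical helper): compare the whole field
 * against one exported value, as the thread says makedumpfile does. */
static int value_matches(uint32_t page_type, uint32_t value)
{
    return page_type == value;
}

/* The "&" of the two exported values, as suggested in the thread, gives
 * the field contents of a page that is both Buddy and Offline. */
static uint32_t offline_buddy_value(void)
{
    return PAGE_BUDDY_MAPCOUNT_VALUE & PAGE_OFFLINE_MAPCOUNT_VALUE;
}
```

Under these assumptions, page_type_has(offline_buddy_value(), PG_buddy) is true while value_matches(offline_buddy_value(), PAGE_BUDDY_MAPCOUNT_VALUE) is false, which is the mismatch a dump tool would have to be patched for.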