On 2/19/19 8:03 AM, David Hildenbrand wrote: >>>>> There are two main ways to avoid allocation: >>>>> 1. do not add extra data on top of each chunk passed >>>> If I am not wrong then this is close to what we have right now. >>> Yes, minus the kthread(s) and eventually with some sort of memory >>> allocation for the request. Once you're asynchronous via a notification >>> mechanisnm, there is no real need for a thread anymore, hopefully. >> Whether we should go with kthread or without it, I would like to do some >> performance comparison before commenting on this. >>>> One issue I see right now is that I am polling while host is freeing the >>>> memory. >>>> In the next version I could tie the logic which returns pages to the >>>> buddy and resets the per cpu array index value to 0 with the callback. >>>> (i.e.., it happens once we receive an response from the host) >>> The question is, what happens when freeing pages and the array is not >>> ready to be reused yet. In that case, you want to somehow continue >>> freeing pages without busy waiting or eventually not reporting pages. >> This is what happens right now. >> Having kthread or not should not effect this behavior. >> When the array is full the current approach simply skips collecting the >> free pages. > Well, it somehow does affect your implementation. If you have a kthread > you always have to synchronize against the VCPU: "Is the pcpu array > ready to be used again". > > Once you do it asynchronously from your VCPU without another thread > being involved, such synchronization is not required. Simply prepare a > request and send it off. Reuse the pcpu array instantly. At least that's > the theory :) > > If you have a guest bulk freeing a lot of memory, I guess temporarily > dropping free page hints could be counter-productive. It really depends > on how fast the thread gets scheduled and how long the hinting process > takes. Having another thread involved might add a lot to that latency to > that formula. We'll have to measure, but my gut feeling is that once we > do stuff asynchronously, there is no need for a thread anymore. This is true. > >>> The callback should put the pages back to the buddy and free the request >>> eventually to have a fully asynchronous mechanism. >>> >>>> Other change which I am testing right now is to only capture 'MAX_ORDER >>> I am not sure if this is an arbitrary number we came up with here. We >>> should really play with different orders to find a hot spot. I wouldn't >>> consider this high priority, though. Getting the whole concept right to >>> be able to deal with any magic number we come up should be the ultimate >>> goal. (stuff that only works with huge pages I consider not future >>> proof, especially regarding fragmented guests which can happen easily) >> Its quite possible that when we are only capturing MAX_ORDER - 1 and run >> a specific workload we don't get any memory back until we re-run the >> program and buddy finally starts merging of pages of order MAX_ORDER -1. >> This is why I think we may make this configurable from compile time and >> keep capturing MAX_ORDER - 1 so that we don't end up breaking anything. > Eventually pages will never get merged. Assume you have 1 page of a > MAX_ORDER - 1 chunk still allocated somewhere (e.g. !movable via > kmalloc). You skip reporting that chunk completely. Roughly 1mb/2mb/4mb > wasted (depending on the arch). This stuff can sum up. After the discussion, here are the changes on which I am planning to work next: 1. Get rid of the kthread and dynamically allocate a per-cpu array to hold the isolated pages. As soon as the initial per-cpu array is completely scanned, release it so that we don't end up blocking anything. 2. Continue capturing MAX_ORDER - 1, for now. Reduce the initial per-cpu array size to 256 for now. As we are doing asynchronous reporting we should be fine with a lower size array. 3. As soon as the host responds, release the pages back to the buddy from the callback and free the request. Benefits wrt current implementation: 1. We will not eat up performance due to kernel thread. 2. We will still be doing reporting asynchronously=> no blocking. 3. Hopefully, we will be able to free more memory. -- Regards Nitesh
Attachment:
signature.asc
Description: OpenPGP digital signature