On 7/18/19 11:34 AM, Alexander Duyck wrote:
> On Wed, Jul 17, 2019 at 10:14 PM Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
>> On Wed, Jul 17, 2019 at 09:43:52AM -0700, Alexander Duyck wrote:
>>> On Wed, Jul 17, 2019 at 3:28 AM Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
>>>> On Tue, Jul 16, 2019 at 02:06:59PM -0700, Alexander Duyck wrote:
>>>>> On Tue, Jul 16, 2019 at 10:41 AM Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
>>>>>
>>>>> <snip>
>>>>>
>>>>>>>> This is what I am saying. Having watched that patchset being
>>>>>>>> developed, I think that's simply because processing blocks
>>>>>>>> required mm core changes, which Wei was not up to pushing
>>>>>>>> through.
>>>>>>>>
>>>>>>>> If we did
>>>>>>>>
>>>>>>>> while (1) {
>>>>>>>> 	alloc_pages
>>>>>>>> 	add_buf
>>>>>>>> 	get_buf
>>>>>>>> 	free_pages
>>>>>>>> }
>>>>>>>>
>>>>>>>> we'd end up passing the same page to the balloon again and again.
>>>>>>>>
>>>>>>>> So we end up reserving lots of memory with alloc_pages instead.
>>>>>>>>
>>>>>>>> What I am saying is that now that you are developing
>>>>>>>> infrastructure to iterate over free pages, FREE_PAGE_HINT should
>>>>>>>> be able to use it too. Whether that's possible might be a good
>>>>>>>> indication of whether the new mm APIs make sense.
>>>>>>>
>>>>>>> The problem is the infrastructure as implemented isn't designed
>>>>>>> to do that. I am pretty certain this interface will have issues
>>>>>>> with being given small blocks to process at a time.
>>>>>>>
>>>>>>> Basically the design for the FREE_PAGE_HINT feature doesn't
>>>>>>> really have the concept of doing things a bit at a time. It is
>>>>>>> either filling, stopped, or done. From what I can tell it
>>>>>>> requires a configuration change for the virtio balloon interface
>>>>>>> to toggle between those states.
>>>>>>
>>>>>> Maybe I misunderstand what you are saying.
>>>>>>
>>>>>> The filling state can definitely report things a bit at a time.
>>>>>> It does not assume that all of guest free memory can fit in a VQ.
>>>>>
>>>>> I think where you and I may differ is that you are okay with just
>>>>> pulling pages until you hit OOM, or allocation failures. Do I have
>>>>> that right?
>>>>
>>>> This is exactly what the current code does. But that's an
>>>> implementation detail which came about because we failed to find
>>>> any other way to iterate over free blocks.
>>>
>>> I get that. However my concern is that it permeated other areas of
>>> the implementation in ways that make taking another approach much
>>> more difficult than it needs to be.
>>
>> The implementation would have to change to use an iterator,
>> obviously. But I don't see that it leaked out to the hypervisor
>> interface.
>>
>> In fact, take a look at virtio_balloon_shrinker_scan and you will see
>> that it calls shrink_free_pages without waiting for the device at
>> all.
>
> Yes, and in case you missed it earlier I am pretty sure that leads to
> possible memory corruption. I don't think it was tested enough to be
> able to say that it is safe.
>
> Specifically we cannot be clearing the dirty flag on pages that are in
> use. We should only be clearing that flag for pages that are
> guaranteed to not be in use.
>
>>>>> In my mind I am wanting to perform the hinting on a small block at
>>>>> a time and work through things iteratively.
>>>>>
>>>>> The problem is that FREE_PAGE_HINT doesn't have the option of
>>>>> returning pages until all pages have been pulled. It is run to
>>>>> completion and will keep filling the balloon until an allocation
>>>>> fails and the host says it is done.
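To make the filling/stopped/done states mentioned above concrete: the
guest can tell them apart from the command ID the host writes into
config space. A minimal sketch, assuming the stock UAPI constants
(VIRTIO_BALLOON_CMD_ID_STOP/DONE are real; the helper itself is
illustrative):

	#include <linux/virtio_balloon.h>

	/* Map the config-space command ID onto the three states. */
	static bool hinting_active(u32 cmd_id)
	{
		switch (cmd_id) {
		case VIRTIO_BALLOON_CMD_ID_STOP: /* stopped: host paused reporting */
		case VIRTIO_BALLOON_CMD_ID_DONE: /* done: hinted pages may be freed */
			return false;
		default:			 /* filling: keep reporting pages */
			return true;
		}
	}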
>>>> OK so there are two points. One is that FREE_PAGE_HINT does not
>>>> need to allocate a page at all. It really just wants to iterate
>>>> over free pages.
>>>
>>> I agree that it should just want to iterate over pages. However the
>>> issue I am trying to point out is that it doesn't have any
>>> guarantees on ordering and that is my concern. What I want to avoid
>>> is potentially corrupting memory.
>>
>> I get that. I am just trying to make sure you are aware that for
>> FREE_PAGE_HINT specifically ordering does not matter because it does
>> not care when the hypervisor used the buffers. It only cares that the
>> page was free after it got the request. Used buffers are only tracked
>> to avoid overflowing the VQ. This is different from your hinting
>> where you make it the responsibility of the guest to not allocate a
>> page before it has been used.
>
> Prove to me that the ordering does not matter. As far as I can tell it
> should, since this is being used to clear the bitmap and will affect
> migration. I'm pretty certain the page should not be freed until it
> has been processed. Otherwise I believe there is a risk of the page
> not being migrated, leading to memory corruption when the VM is
> finally migrated.
>
>>> So for example with my current hinting approach I am using the list
>>> of hints because I get back one completion indicating all of the
>>> hints have been processed. It is only at that point that I can go
>>> back and make the memory available for allocation again.
>>
>> Right. But just counting them would work just as well, no? At least
>> as long as you wait for everything to complete... If you want to
>> pipeline, see below.
>
> Yes, but if possible I would also want to try and keep the batch
> behavior that I have. We could count the descriptors processed,
> however that is still essentially all done via busy waiting in the
> FREE_PAGE_HINT logic.
>
>>> So one big issue right now with the FREE_PAGE_HINT approach is that
>>> it is designed to be all or nothing. Using the balloon makes it
>>> impossible for us to be incremental as all the pages are contained
>>> in one spot. What we would need is some way to associate a page with
>>> a given vq buffer.
>>
>> Sorry if I'm belaboring the obvious, but isn't this what 'void *data'
>> in virtqueue_add_inbuf is designed for? And if you only ever use
>> virtqueue_add_inbuf and virtqueue_add_outbuf on a given VQ, then you
>> can track two pointers using virtqueue_add_inbuf_ctx.
>
> I am still learning virtio so I wasn't aware of this piece until
> yesterday. For FREE_PAGE_HINT it would probably work as we would then
> have that association. For my page hinting I am still thinking I would
> prefer to just pass around a scatterlist, since that is the structure
> I would likely fill and then later drain of pages, versus just
> maintaining a list.
>
>>> Ultimately in order to really make the FREE_PAGE_HINT logic work
>>> with something like my page hinting logic it would need to work more
>>> like a network Rx ring in that we would associate a page per buffer
>>> and have some way of knowing the two are associated.
>>
>> Right. That's exactly how virtio net does it btw.
>
> Yeah, I saw that after reviewing the code yesterday.
>
>>>> The reason FREE_PAGE_HINT does not free up pages until we have
>>>> finished iterating over the free list is not a hypervisor API. The
>>>> reason is we don't want to keep getting the same address over and
>>>> over again.
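For what it's worth, here is a minimal sketch of the 'void *data'
association being discussed, using only the stock virtqueue API (the
two wrapper names are mine; locking and error handling are elided):

	#include <linux/virtio.h>
	#include <linux/scatterlist.h>
	#include <linux/mm.h>

	/* Queue one page as a hint; 'page' rides along as the token that
	 * virtqueue_get_buf() hands back once the host is done with it. */
	static int hint_one_page(struct virtqueue *vq, struct page *page)
	{
		struct scatterlist sg;

		sg_init_one(&sg, page_address(page), PAGE_SIZE);
		return virtqueue_add_inbuf(vq, &sg, 1, page, GFP_KERNEL);
	}

	/* Free only pages the host has actually finished with; this is
	 * the Rx-ring-style one-page-per-buffer ordering discussed above. */
	static void reclaim_processed_pages(struct virtqueue *vq)
	{
		struct page *page;
		unsigned int len;

		while ((page = virtqueue_get_buf(vq, &len)))
			__free_page(page);
	}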
>>>>> I would prefer to avoid that as I prefer to simply notify the
>>>>> host of a fixed block of pages at a time and let it process
>>>>> without having to have a thread on each side actively pushing
>>>>> pages, or listening for the incoming pages.
>>>>
>>>> Right. And FREE_PAGE_HINT can go even further. It can push a page
>>>> and let Linux use it immediately. It does not even need to wait for
>>>> the host to process anything unless the VQ gets full.
>>>
>>> If it is doing what you are saying it will be corrupting memory.
>>
>> No, and that is the hypervisor's responsibility.
>>
>> I think you are missing part of the picture here.
>>
>> Here is a valid implementation:
>>
>> Before asking for hints, the hypervisor write-protects all memory,
>> and logs all write faults. When the hypervisor gets the hint, if the
>> page has since been modified, the hint is ignored.
>
> No, here is the part where I think you missed the point. I was
> already aware of this. So my concern is this scenario.
>
> If you put a hint on the VQ and then free the memory back to the
> guest, what about the scenario where another process could allocate
> the memory and dirty it before we process the hint request on the
> host? In that case the page was dirtied, the hypervisor will have
> correctly write faulted and dirtied it, and then we came through and
> incorrectly marked it as being free. That is the scenario I am
> worried about, as I am pretty certain that leads to memory
> corruption.
>
>>> At a minimum it has to wait until the page has been processed and
>>> the dirty bit cleared before it can let Linux use it again. It is
>>> all a matter of keeping the dirty bit coherent. If we let Linux use
>>> it again immediately and then cleared the dirty bit we would open up
>>> a possible data corruption race during migration, as a dirty page
>>> might not be marked as such.
>>
>> I think you are talking about the dirty bit on the host, right?
>>
>> The implication is that calling MADV_FREE from qemu would not be a
>> good implementation of FREE_PAGE_HINT. And indeed, as far as I can
>> see it does nothing of the sort.
>
> I don't mean the dirty bit on the host, I am talking about the bitmap
> used to determine which pages need to be migrated. That is what this
> hint is updating, and it is also being tracked via the write
> protection of the pages at the start of migration.
>
> My concern is that we can end up losing track of pages that are
> updated if we are hinting after they have been freed back to the
> guest for reallocation.
>
>>>>>>>>> The basic idea with the bubble hinting was to essentially
>>>>>>>>> create mini balloons. As such I had based the code off of the
>>>>>>>>> balloon inflation code. The only spot where it really differs
>>>>>>>>> is that I needed the ability to pass higher order pages so I
>>>>>>>>> tweaked things and passed "hints" instead of "pfns".
>>>>>>>>
>>>>>>>> And that is fine. But there isn't really such a big difference
>>>>>>>> with FREE_PAGE_HINT, except FREE_PAGE_HINT triggers upon host
>>>>>>>> request and not in response to guest load.
>>>>>>>
>>>>>>> I disagree, I believe there is a significant difference.
>>>>>>
>>>>>> Yes there is, I just don't think it's in the iteration. The
>>>>>> iteration seems to be useful to hinting.
>>>>>
>>>>> I agree that iteration is useful to hinting. The problem is the
>>>>> FREE_PAGE_HINT code isn't really designed to be iterative. It is
>>>>> designed to run with a polling thread on each side and it is meant
>>>>> to be run to completion.
>>>>
>>>> Absolutely. But that's a bug I think.
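As a minimal sketch of the "valid implementation" Michael outlines
above, the hypervisor-side rule would be roughly this (all helper
names here are hypothetical):

	/* Process one free page hint from the guest. A hint for a page
	 * that was written after hinting started is stale and must be
	 * ignored; it must never clear the migration dirty bit. */
	static void handle_free_page_hint(unsigned long gfn)
	{
		if (page_written_since_hint_start(gfn))
			return;			/* stale hint: ignore it */

		clear_migration_dirty_bit(gfn);	/* safe to skip this round */
	}

Alexander's concern is that the actual clearing path performs no such
staleness check, so a write fault logged before the hint is processed
can be undone by the hint.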
>>> I think it is a part of the design. Basically in order to avoid
>>> corrupting memory it cannot return the page to the guest kernel
>>> until it has finished clearing the dirty bits associated with the
>>> pages.
>>
>> OK, I hope I clarified that that's not supposed to be the case.
>
> I think you might have missed something. I am pretty certain issues
> are still present.
>
>>>>>>> The FREE_PAGE_HINT code was implemented to be more of a
>>>>>>> streaming interface.
>>>>>>
>>>>>> It's implemented like this but it does not follow from the
>>>>>> interface. The implementation is a combination of attempts to
>>>>>> minimize the number of exits and minimize mm core changes.
>>>>>
>>>>> The problem is the interface doesn't have a good way of indicating
>>>>> that it is done with a block of pages.
>>>>>
>>>>> So what I am probably looking at, if I do an sg implementation for
>>>>> my hinting, is to provide one large sg block for all 32 of the
>>>>> pages I might be holding.
>>>>
>>>> Right now if you pass an sg it will try to allocate a buffer on
>>>> demand for you. If this is a problem I could come up with a new API
>>>> that lets the caller allocate the buffer. Let me know.
>>>>
>>>>> I'm assuming that will still be processed as one contiguous block.
>>>>> With that I can then at least maintain a single response per
>>>>> request.
>>>>
>>>> Why do you care? Won't a counter of outstanding pages be enough?
>>>> Down the road maybe we could actually try to pipeline things a bit.
>>>> So send 32 pages; once you get 16 of these back, send 16 more.
>>>> Better for SMP configs and does not hurt non-SMP too much. I am not
>>>> saying we need to do it right away though.
>>>
>>> So the big thing is we cannot give the page back to the guest kernel
>>> until we know the processing has been completed. In the case of the
>>> MADV_DONTNEED call it will zero out the entire page on the next
>>> access. If the guest kernel had already written data by the time we
>>> get to that, it would cause data corruption and kill the whole
>>> guest.
>>
>> Exactly, but FREE_PAGE_HINT does not cause qemu to call MADV_DONTNEED.
>
> No, instead it clears the bit indicating that the page is supposed to
> be migrated. The effect will not be all that different, just delayed
> until the VM is actually migrated.
>
>>>>>>> This is one of the things Linus kept complaining about in his
>>>>>>> comments. This code attempts to pull in ALL of the higher order
>>>>>>> pages, not just a smaller block of them.
>>>>>>
>>>>>> It wants to report all higher order pages eventually, yes. But
>>>>>> it's absolutely fine to report a chunk and then wait for the host
>>>>>> to process the chunk before reporting more.
>>>>>>
>>>>>> However, the interfaces we came up with for this would call into
>>>>>> virtio with a bunch of locks taken. The solution was to take
>>>>>> pages off the free list completely. That in turn means we can't
>>>>>> return them until we have processed all free memory.
>>>>>
>>>>> I get that. The problem is the interface is designed around run to
>>>>> completion. For example, it will sit there in a busy loop waiting
>>>>> for a free buffer because it knows the other side is supposed to
>>>>> be processing the pages already.
>>>>
>>>> I didn't get this part.
>>>
>>> I think the part you may not be getting is that we cannot let the
>>> guest use the page until the hint has been processed. Otherwise we
>>> risk corrupting memory. That is the piece that has me paranoid. If
>>> we end up performing a hint on a page that is in use somewhere in
>>> the kernel it will corrupt memory one way or another. That is the
>>> thing I have to
That is the thing I have to >>> avoid at all cost. >> You have to do it, sure. And that is because you do not >> assume that hypervisor does it for you. But FREE_PAGE_HINT doesn't, >> hypervisor takes care of that. > Sort of. The hypervisor is trying to do dirty page tracking, however > the FREE_PAGE_HINT interferes with that. That is the problem. If we > get that out of order then the hypervisor work will be undone and we > just make a mess of memory. > >>> That is why I have to have a way to know exactly which pages have been >>> processed and which haven't before I return pages to the guest. >>> Otherwise I am just corrupting memory. >> Sure. That isn't really hard though. > Agreed. > >>>>>>> Honestly the difference is >>>>>>> mostly in the hypervisor interface than what is needed for the kernel >>>>>>> interface, however the design of the hypervisor interface would make >>>>>>> doing things more incrementally much more difficult. >>>>>> OK that's interesting. The hypervisor interface is not >>>>>> documented in the spec yet. Let me take a stub at a writeup now. So: >>>>>> >>>>>> >>>>>> >>>>>> - hypervisor requests reporting by modifying command ID >>>>>> field in config space, and interrupting guest >>>>>> >>>>>> - in response, guest sends the command ID value on a special >>>>>> free page hinting VQ, >>>>>> followed by any number of buffers. Each buffer is assumed >>>>>> to be the address and length of memory that was >>>>>> unused *at some point after the time when command ID was sent*. >>>>>> >>>>>> Note that hypervisor takes pains to handle the case >>>>>> where memory is actually no longer free by the time >>>>>> it gets the memory. >>>>>> This allows guest driver to take more liberties >>>>>> and free pages without waiting for guest to >>>>>> use the buffers. >>>>>> >>>>>> This is also one of the reason we call this a free page hint - >>>>>> the guarantee that page is free is a weak one, >>>>>> in that sense it's more of a hint than a promise. >>>>>> That helps guarantee we don't create OOM out of blue. >>>> I would like to stress the last paragraph above. >>> The problem is we don't want to give bad hints. What we do based on >>> the hint is clear the dirty bit. If we clear it in err when the page >>> is actually in use it will lead to data corruption after migration. >> That's true for your patches. I get that. > No, it should be true for FREE_PAGE_HINT as well. The fact that it > isn't is a bug as far as I am concerned. If you are doing dirty page > tracking in the hypervisor you cannot expect it to behave well if the > guest is providing it with bad data. > >>> The idea with the hint is that you are saying the page is currently >>> not in use, however if you send that hint late and have already freed >>> the page back you can corrupt memory. >> >> That part is I think wrong - assuming "you" means upstream code. > Yes, I am referring to someone running FREE_PAGE_HINT code. I usually > try to replace them with "we" to make it clear I am not talking about > someone personally, it is a bad habit. > >>>>>> - guest eventually sends a special buffer signalling to >>>>>> host that it's done sending free pages. >>>>>> It then stops reporting until command id changes. >>>>> The pages are not freed back to the guest until the host reports that >>>>> it is "DONE" via a configuration change. Doing that stops any further >>>>> progress, and attempting to resume will just restart from the >>>>> beginning. >>>> Right but it's not a requirement. Host does not assume this at all. 
>>> The problem is nothing about the implementation was designed for
>>> iteration. What I would have to do is likely gut and rewrite the
>>> entire guest side of the FREE_PAGE_HINT code in order to make it
>>> work iteratively.
>>
>> Right. I agree.
>>
>>> As I mentioned, it would probably have to look more like a NIC Rx
>>> ring in its handling, because we would have to have some sort of way
>>> to associate the pages 1:1 to the buffers.
>>>
>>>>> The big piece this design is missing is the incremental
>>>>> notification that pages have been processed. The existing code
>>>>> just fills the vq with pages and keeps doing it until it cannot
>>>>> allocate any more pages. We would have to add logic to stop,
>>>>> flush, and resume to the existing framework.
>>>>
>>>> But not to the hypervisor interface. The hypervisor is fine with
>>>> pages being reused immediately. In fact, even before they are
>>>> processed.
>>>
>>> I don't think that is actually the case. If it does that I am pretty
>>> sure it will corrupt memory during migration.
>>>
>>> Take a look at qemu_guest_free_page_hint:
>>> https://github.com/qemu/qemu/blob/master/migration/ram.c#L3342
>>>
>>> I'm pretty sure that code is going in and clearing the dirty bitmap
>>> for memory.
>>
>> Yes it does. However the trick is that meanwhile kvm is logging new
>> writes. So the bitmap that is being cleared is the bitmap that was
>> logged before the request was sent to the guest.
>>
>>> If we were to allow a page to be allocated and used and then perform
>>> the hint, it is going to introduce a race where the page might be
>>> missed for migration and could result in memory corruption.
>>
>> commit c13c4153f76db23cac06a12044bf4dd346764059 has this explanation:
>>
>> Note: balloon will report pages which were free at the time of this
>> call. As the reporting happens asynchronously, dirty bit logging must
>> be enabled before this free_page_start call is made. Guest reporting
>> must be disabled before the migration dirty bitmap is synchronized.
>>
>> but over multiple iterations this seems to have been dropped from
>> code comments. Wei, would you mind going back and documenting the
>> APIs you used? They seem to be causing confusion ...
>
> The "Note" is the behavior I am seeing. Specifically, there is nothing
> in place to prevent the freed pages from causing corruption if they
> are freed before being hinted. The requirement should be that they
> cannot be freed until after they are hinted; that way the dirty bit
> logging will mark the page as dirty if it is accessed AFTER being
> hinted.
>
> If you do not guarantee the hinting has happened first, you could end
> up logging the dirty bit before the hint is processed and then clear
> the dirty bit due to the hint. It is pretty straightforward to resolve
> by just not putting the page into the balloon until after the hint has
> been processed.
>
>>>>>> - host can restart the process at any time by updating the
>>>>>>   command ID. That will make the guest stop and start from the
>>>>>>   beginning.
>>>>>>
>>>>>> - host can also stop the process by specifying a special command
>>>>>>   ID value.
>>>>>>
>>>>>> =========
>>>>>>
>>>>>> Now let's compare to what you have here:
>>>>>>
>>>>>> - At any time after boot, guest walks over free memory and sends
>>>>>>   addresses as buffers to the host
>>>>>>
>>>>>> - Memory reported is then guaranteed to be unused until the host
>>>>>>   has used the buffers
>>>>>>
>>>>>> Is the above a fair summary?
>>>>>>
>>>>>> So yes, there's a difference, but the specific bit of chunking is
>>>>>> the same imho.
>>>>>
>>>>> The big difference is that I am returning the pages after they are
>>>>> processed, while FREE_PAGE_HINT doesn't and isn't designed to.
>>>>
>>>> It doesn't, but the hypervisor *is* designed to support that.
>>>
>>> Not really, it seems like it is more just a side effect of things.
>>
>> I hope the commit log above is enough to convince you we did think
>> about this.
>
> Sorry, but no. I think the "note" convinced me there is a race
> condition, specifically in the shrinker case. We cannot free the page
> back to host memory until the hint has been processed, otherwise we
> will race with the dirty bit logging.
>
>>> Also as I mentioned before, I am not a huge fan of polling on both
>>> sides as it is just going to burn through CPU. If we are iterative
>>> and polling, it is going to end up with us potentially pushing one
>>> CPU at 100%, and if the one CPU doing the polling cannot keep up
>>> with the page updates coming from the other CPUs we would be stuck
>>> in that state for a while. I would have preferred to see something
>>> where the CPU would at least allow other tasks to occur while it is
>>> waiting for buffers to be returned by the host.
>>
>> You lost me here. What does polling have to do with it?
>
> This is just another issue I found. Specifically, busy polling while
> waiting on the host to process the hints. I'm not a fan of it and was
> just pointing it out.
>
>>>>> The problem is the interface doesn't allow for a good way to
>>>>> identify that any given block of pages has been processed and can
>>>>> be returned.
>>>>
>>>> And that's because FREE_PAGE_HINT does not care. It can return any
>>>> page at any point, even before the hypervisor saw it.
>>>
>>> I disagree, see my comment above.
>>
>> OK, let's see if the above is enough to convince you. Or maybe we
>> have a bug when the shrinker is invoked :) But I don't think so.
>
> I'm pretty sure there is a bug.
>
>>>>> Instead pages go in, but they don't come out until the
>>>>> configuration is changed and "DONE" is reported. The act of
>>>>> reporting "DONE" will reset things and start them all over, which
>>>>> kind of defeats the point.
>>>>
>>>> Right.
>>>>
>>>> But if you consider how we are using the shrinker you will see that
>>>> it's kind of broken. For example, not keeping track of allocated
>>>> pages means the count we return is broken while reporting is
>>>> active.
>>>>
>>>> I looked at fixing it, but really if we can just stop allocating
>>>> memory that would be way cleaner.
>>>
>>> Agreed. If we hit an OOM we should probably just stop the free page
>>> hinting and treat that as the equivalent of an allocation failure.
>>
>> And fix the shrinker count to include the pages in the vq. Yea.
>
> I don't know if we really want to touch the pages in the VQ. I would
> say that we should leave them alone.
>
>>> As-is I think this also has the potential for corrupting memory,
>>> since it will likely be returning the most recent pages added to the
>>> balloon, so the pages are likely still on the processing queue.
>>
>> That part is fine I think because of the above.
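If the count were to be fixed instead, a rough sketch of what that
callback could look like (field and helper names here are
illustrative, not from the driver):

	/* Shrinker count callback that also counts pages still sitting
	 * in the hinting VQ, so the total is not understated while
	 * reporting is active. */
	static unsigned long virtio_balloon_shrinker_count(struct shrinker *shrinker,
							   struct shrink_control *sc)
	{
		struct virtio_balloon *vb = shrinker_to_balloon(shrinker);

		return balloon_pages(vb) + pages_pending_in_hint_vq(vb);
	}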
>>>> For example, we allocate pages until the shrinker kicks in. Fair
>>>> enough, but in fact it would be better to do the reverse: trigger
>>>> the shrinker and then send as many free pages as we can to the
>>>> host.
>>>
>>> I'm not sure I understand this last part.
>>
>> Oh, basically what I am saying is this: one of the reasons to use
>> page hinting is when the host is short on memory. In that case, why
>> don't we use the shrinker to ask kernel drivers to free up memory?
>> Any memory freed could then be reported to the host.
>
> Didn't the balloon driver already have a feature like that where it
> could start shrinking memory if the host was under memory pressure?

If you are referring to auto-ballooning (I don't think it is merged),
it has its own set of disadvantages, such as easily leading to OOM,
memory corruption, and so on. VIRTIO_BALLOON_F_FREE_PAGE_HINT does
address some of those issues. However, it still requires external
control to initiate/stop the memory transaction.

> If so, how would adding another one add much value?
>
> The idea here is if the memory is free we just mark it as such. As
> long as we can do so with no noticeable overhead on the guest or
> host, why not just do it?

+1. This is the advantage that both of the hinting solutions are
trying to introduce.

--
Thanks
Nitesh