On Tue, Feb 19, 2019 at 01:57:14PM -0800, Alexander Duyck wrote: > On Tue, Feb 19, 2019 at 10:32 AM David Hildenbrand <david@xxxxxxxxxx> wrote: > > > > >>> This essentially just ends up being another trade-off of CPU versus > > >>> memory though. Assuming we aren't using THP we are going to take a > > >>> penalty in terms of performance but could then free individual pages > > >>> less than HUGETLB_PAGE_ORDER, but the CPU utilization is going to be > > >>> much higher in general even without the hinting. I figure for x86 we > > >>> probably don't have too many options since if I am not mistaken > > >>> MAX_ORDER is just one or two more than HUGETLB_PAGE_ORDER. > > >> > > >> THP is an implementation detail in the hypervisor. Yes, it is the common > > >> case on x86. But it is e.g. not available on s390x yet. And we also want > > >> this mechanism to work on s390x (e.g. for nested virtualization setups > > >> as discussed). > > >> > > >> If we e.g. report any granularity after merging was done in the buddy, > > >> we could end up reporting everything from page size up to MAX_SIZE - 1, > > >> the hypervisor could ignore hints below a certain magic number, if it > > >> makes its life easier. > > > > > > For each architecture we can do a separate implementation of what to > > > hint on. We already do that for bare metal so why would we have guests > > > do the same type of hinting in the virtualization case when there are > > > fundamental differences in page size and features in each > > > architecture? > > > > > > This is another reason why I think the hypercall approach is a better > > > idea since each architecture is likely going to want to handle things > > > differently and it would be a pain to try and sort that all out in a > > > virtio driver. > > > > I can't follow. We are talking about something as simple as a minimum > > page granularity here that can easily be configured. Nothing that > > screams for different implementations. But I get your point, we could > > tune for different architectures. > > I was thinking about the guest side of things. Basically if we need to > define different page orders for different architectures then we start > needing to do architecture specific includes. Then if we throw in > stuff like the fact that the first level of KVM can make use of the > host style hints then that is another thing that will be a difference > int he different architectures. Sorry didn't catch this one. What are host style hints? > I'm just worried this stuff is going > to start adding up to a bunch of "#ifdef" cruft if we are trying to do > this as a virtio driver. I agree we want to avoid that. And by comparison, if it's up to host or if it's tied to logic within guest (such as MAX_PAGE_ORDER as suggested by Linus) as opposed to CPU architecture, then virtio is easier as you can re-use config space and feature bits to negotiate host/guest capabilities. Doing hypercalls for that would add lots of hypercalls. I CC'd Wei Wang who implemented host-driven hints in the balloon right now. Wei I wonder - could you try changing from MAX_PAGE_ORDER to HUGETLB_PAGE_ORDER? Does this affect performance for you at all? Thanks! > > > > > >>> > > >>> As far as fragmentation my thought is that we may want to look into > > >>> adding support to the guest for prioritizing defragmentation on pages > > >>> lower than THP size. Then that way we could maintain the higher > > >>> overall performance with or without the hinting since shuffling lower > > >>> order pages around between guests would start to get expensive pretty > > >>> quick. > > >> > > >> My take would be, design an interface/mechanism that allows any kind of > > >> granularity. You can than balance between cpu overead and space shifting. > > > > > > The problem with using "any kind of granularity" is that in the case > > > of memory we are already having problems with 4K pages being deemed > > > too small of a granularity to be useful for anything and making > > > operations too expensive. > > > > No, sorry, s390x does it. And via batch reporting it could work. Not > > saying we should do page granularity, but "to be useful for anything" is > > just wrong. > > Yeah, I was engaging in a bit of hyperbole. I have had a headache this > morning so I am a bit cranky. > > So I am assuming the batching is the reason why you also have a > arch_alloc_page then for the s390 so that you can abort the hint if a > page is reallocated before the hint is processed then? I just want to > confirm so that my understanding of this is correct. > > If that is the case I would be much happier with an asynchronous page > hint setup as this doesn't deprive the guest of memory while waiting > on the hint. The current logic in the patches from Nitesh has the > pages unavailable to the guest while waiting on the hint and that has > me somewhat concerned as it is going to hurt cache locality as it will > guarantee that we cannot reuse the same page if we are doing a cycle > of alloc and free for the same page size. > > > > > > I'm open to using other page orders for other architectures. Nothing > > > says we have to stick with THP sized pages for all architectures. I > > > have just been focused on x86 and this seems like the best fit for the > > > balance between CPU and freeing of memory for now on that > > > architecture. > > > > > >> I feel like repeating myself, but on s390x hinting is done on page > > >> granularity, and I have never heard somebody say "how can I turn it off, > > >> this is slowing down my system too much.". All we know is that one > > >> hypercall per free is most probably not acceptable. We really have to > > >> play with the numbers. > > > > > > My thought was we could look at doing different implementations for > > > other architectures such as s390 and powerPC. Odds are the > > > implementations would be similar but have slight differences where > > > appropriate such as what order we should start hinting on, or if we > > > bypass the hypercall/virtio-balloon for a host native approach if > > > available. > > > > > >> I tend to like an asynchronous reporting approach as discussed in this > > >> thread, we would have to see if Nitesh could get it implemented. > > > > > > I agree it would be great if it could work. However I have concerns > > > given that work on this patch set dates back to 2017, major issues > > > such as working around device assignment have yet to be addressed, and > > > it seems like most of the effort is being focused on things that in my > > > opinion are being over-engineered for little to no benefit. > > > > I can understand that you are trying to push your solution. I would do > > the same. Again, I don't like a pure synchronous approach that works on > > one-element-at-a-time. Period. Other people might have other opinions. > > This is mine - luckily I don't have anything to say here :) > > > > MST also voted for an asynchronous solution if we can make it work. > > Nitesh made significant improvements since the 2017. Complicated stuff > > needs time. No need to rush. People have been talking about free page > > hinting since 2006. I talked to various people that experimented with > > bitmap based solutions two years ago. > > Now that I think I have a better understanding of how the s390x is > handling this I'm beginning to come around to the idea of an > asynchronous setup. The one thing that has been bugging me about the > asynchronous approach is the fact that the pages are not available to > the guest while waiting on the hint to be completed. If we can do > something like an arch_alloc_page and that would abort the hint and > allow us to keep the page available while waiting on the hint that > would be my preferred way of handling this. > > > So much to that, if you think your solution is the way to go, please > > follow up on it. Nitesh seems to have decided to look into the > > asynchronous approach you also called "great if it could work". As long > > as we don't run into elementary blockers there, to me it all looks like > > we are making progress, which is good. If we find out asynchronous > > doesn't work, synchronous is the only alternative. > > I plan to follow up in the next week or so. > > > And just so you don't get me wrong: Thanks for looking and working on > > this. And thanks for sharing your opinions and insights! However making > > a decision about going your way at this point does not seem reasonable > > to me. We have plenty of time. > > I appreciate the feedback. Sorry if I seemed a bit short. As I > mentioned I've had a headache most of the morning which hasn't really > helped my mood. > > Thanks. > > - Alex -- MST