On Tue, Feb 19, 2019 at 09:06:01AM +0100, David Hildenbrand wrote:
> On 19.02.19 00:47, Alexander Duyck wrote:
> > On Mon, Feb 18, 2019 at 9:42 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
> >>
> >> On 18.02.19 18:31, Alexander Duyck wrote:
> >>> On Mon, Feb 18, 2019 at 8:59 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
> >>>>
> >>>> On 18.02.19 17:49, Michael S. Tsirkin wrote:
> >>>>> On Sat, Feb 16, 2019 at 10:40:15AM +0100, David Hildenbrand wrote:
> >>>>>> It would be worth a try. My feeling is that a synchronous report after
> >>>>>> e.g. 512 frees should be acceptable, as it seems to be acceptable on
> >>>>>> s390x. (basically always enabled, nobody complains).
> >>>>>
> >>>>> What slips under the radar on an arch like s390 might
> >>>>> raise issues for a popular arch like x86. My fear would be
> >>>>> if it's only a problem e.g. for realtime. Then you get
> >>>>> a condition that's very hard to trigger and affects
> >>>>> worst case latencies.
> >>>>
> >>>> Realtime should never use free page hinting. Just like it should never
> >>>> use ballooning. Just like it should pin all pages in the hypervisor.
> >>>>
> >>>>>
> >>>>> But really, what business does something that is supposedly
> >>>>> an optimization have blocking a VCPU? We are just freeing up
> >>>>> lots of memory; why is it a good idea to slow that
> >>>>> process down?
> >>>>
> >>>> I first want to know that it is a problem before we declare it a
> >>>> problem. I provided an example (s390x) where it does not seem to be a
> >>>> problem. One hypercall ~every 512 frees. As simple as it can get.
> >>>>
> >>>> Not trying to deny that it could be a problem on x86, but then I assume
> >>>> it is only a problem in specific setups.
> >>>>
> >>>> I would much prefer a simple solution that can eventually be
> >>>> disabled in selected setups over a complicated solution that tries to fit
> >>>> all possible setups. Realtime is one of the examples where such stuff is
> >>>> to be disabled either way.
> >>>>
> >>>> Optimization of space comes with a price (here: execution time).
> >>>
> >>> One thing to keep in mind though is that if you are already having to
> >>> pull pages in and out of swap on the host in order to be able to provide
> >>> enough memory for the guests, the free page hinting should be a
> >>> significant win in terms of performance.
> >>
> >> Indeed. And also we are in a virtualized environment already, we can
> >> have any kind of sudden hiccups. (again, realtime has special
> >> requirements on the setup)
> >>
> >> Side note: I like your approach because it is simple. I don't like your
> >> approach because it cannot deal with fragmented memory. And that can
> >> happen easily.
> >>
> >> The idea I described here can similarly be an extension of your
> >> approach, merging in the "batched reporting" Nitesh proposed, so we can
> >> report on something < MAX_ORDER, similar to s390x. In the end it boils
> >> down to reporting via hypercall vs. reporting via virtio. The main point
> >> is that it is synchronous and batched. (and that we properly take care
> >> of the race between host freeing and guest allocation)
> >
> > I'd say the discussion is even simpler than that. My concern is more
> > synchronous versus asynchronous. I honestly think the cost for a
> > synchronous call is being overblown and we are likely to see the fault
> > and zeroing of pages cost more than the hypercall or virtio
> > transaction itself.
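Just to make the batched, synchronous variant concrete, a guest-side sketch
could look roughly like the below. All names here (hint_free_page,
report_free_pages, HINT_BATCH) are invented for illustration; this is neither
the s390x nor the virtio-balloon interface, and locking is omitted.

#define HINT_BATCH 512	/* one exit per 512 frees, as discussed for s390x */

struct hint_entry {
	unsigned long pfn;
	unsigned int order;
};

static struct hint_entry hint_buf[HINT_BATCH];
static unsigned int hint_count;

/* hypothetical hypercall/virtio wrapper; synchronous, one call per batch */
void report_free_pages(struct hint_entry *buf, unsigned int count);

/*
 * Called from the freeing path.  The buffered pages are assumed to be
 * kept away from the page allocator until report_free_pages() returns,
 * so the guest cannot hand out a page while the host is still discarding
 * it (the host-freeing vs. guest-allocation race mentioned above).
 */
void hint_free_page(unsigned long pfn, unsigned int order)
{
	hint_buf[hint_count].pfn = pfn;
	hint_buf[hint_count].order = order;

	if (++hint_count == HINT_BATCH) {
		report_free_pages(hint_buf, hint_count);
		hint_count = 0;
	}
}

The open question in this thread is exactly whether one such exit per 512
frees is cheap enough on x86.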
> 
> The overhead of page faults and zeroing should be mitigated by
> MADV_FREE, as Andrea correctly stated (thanks!). Then the call overhead
> (context switch) becomes relevant.
> 
> We have various discussions now :) And I think they are related:
> 
> synchronous versus asynchronous
> batched vs. non-batched
> MAX_ORDER - 1 vs. some other/no magic number
> 
> 1. A synchronous call without batching on every kfree is bad. The
> interface is fixed to big magic numbers; otherwise we end up having a
> hypercall on every kfree. This is your approach.
> 
> 2. Asynchronous calls without batching would most probably have similar
> problems with small granularities as we had when ballooning without
> batching. Just overhead we can avoid.
> 
> 3. Synchronous and batched is what s390x does. It can deal with page
> granularity. It is what I initially described in this sub-thread.
> 
> 4. Asynchronous and batched. This is the other approach we discussed
> yesterday. If we can get it implemented, I would be interested in
> performance numbers.
> 
> As far as I understood, Michael seems to favor something like 4 (and I
> assume eventually 2 if it is similarly fast). I am in favor of either 3
> or 4.

Well, Linus said big granularity is important for Linux MM
and not to bother with hinting small sizes.

Alex said the cost of a hypercall is dwarfed by a pagefault
after alloc. I would be curious whether async pagefault
can help things somehow, though.

> > 
> > Also, one reason why I am not a fan of working with anything less than
> > PMD order is because there have been issues in the past with false
> > memory leaks being created when hints were provided on THP pages that
> > essentially fragmented them. I guess khugepaged went through and
> > started trying to reassemble the huge pages, and as a result there have
> > been apps that ended up consuming more memory than they would have
> > otherwise, since they were using fragments of THP pages after doing an
> > MADV_DONTNEED on sections of the page.
> 
> I understand your concerns, but we should not let bugs in the hypervisor
> dictate the design. Bugs are there to be fixed. Interesting read,
> though, thanks!

Right, but if we break up a huge page we are then creating more
work for the hypervisor to reassemble it.

> -- 
> 
> Thanks,
> 
> David / dhildenb
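As an addendum on the MADV_FREE point: a host-side sketch of discarding a
hinted range could look like the below. Here hva and len are assumed to
describe the host mapping backing the hinted guest range (hypothetical
names), error handling is reduced to a bool, and MADV_FREE needs Linux 4.5+
and anonymous memory.

#define _DEFAULT_SOURCE
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Discard a hinted guest range in the hypervisor process.  MADV_FREE
 * reclaims the pages lazily and avoids the fault-plus-zero cost that
 * MADV_DONTNEED incurs on the next guest access.
 */
bool discard_hinted_range(void *hva, size_t len)
{
	/*
	 * Note: if len is below the THP size or hva is not THP-aligned,
	 * this can split a transparent huge page, which is the
	 * fragmentation concern raised for sub-PMD hints above.
	 */
	if (madvise(hva, len, MADV_FREE) == 0)
		return true;

	/* Fallback for kernels without MADV_FREE (pre-4.5). */
	return madvise(hva, len, MADV_DONTNEED) == 0;
}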