On 8/2/19 11:13 AM, Alexander Duyck wrote:
> On Fri, 2019-08-02 at 10:41 -0400, Nitesh Narayan Lal wrote:
>> On 8/1/19 6:24 PM, Alexander Duyck wrote:
>>> This series provides an asynchronous means of reporting to a hypervisor
>>> that a guest page is no longer in use and can have the data associated
>>> with it dropped. To do this I have implemented functionality that allows
>>> for what I am referring to as unused page reporting.
>>>
>>> The functionality for this is fairly simple. When enabled it will allocate
>>> statistics to track the number of reported pages in a given free area.
>>> When the number of free pages exceeds this value plus a high water value,
>>> currently 32, it will begin performing page reporting, which consists of
>>> pulling pages off of the free list and placing them into a scatterlist.
>>> The scatterlist is then given to the page reporting device, which performs
>>> the action required to make the pages "reported"; in the case of
>>> virtio-balloon this results in the pages being madvised as MADV_DONTNEED,
>>> and as such they are forced out of the guest. After this they are placed
>>> back on the free list, and an additional bit is set if they are not
>>> merged, indicating that they are a reported buddy page instead of a
>>> standard buddy page. The cycle then repeats, with additional non-reported
>>> pages being pulled, until the free areas all consist of reported pages.
>>>
>>> I am leaving a number of things hard-coded, such as limiting the lowest
>>> order processed to PAGEBLOCK_ORDER, and have left it up to the guest to
>>> determine the limit on how many pages it wants to allocate to process
>>> the hints. The upper limit for this is based on the size of the queue
>>> used to store the scatterlist.
>>>
>>> My primary testing has just been to verify that the memory is being freed
>>> after allocation by running memhog 40g on a 40g guest and watching the
>>> total free memory via /proc/meminfo on the host.
>>> With this I have verified that most of the memory is freed after each
>>> iteration. As far as performance goes, I have mainly been focusing on the
>>> will-it-scale/page_fault1 test running with 16 vcpus. With that I have
>>> seen up to a 2% difference between the base kernel without these patches
>>> and the patched kernel with virtio-balloon enabled or disabled.
>>
>> A couple of questions:
>>
>> - The 2% difference which you have mentioned, is it visible for
>>   all 16 cores or just the 16th core?
>> - I am assuming that the difference is seen for both "number of processes"
>>   and "number of threads" launched by page_fault1. Is that right?
>
> Really, the 2% is bordering on just being noise. Sometimes it is better,
> sometimes it is worse. However, I think it is just slight variability in
> the tests, since it doesn't usually form any specific pattern.
>
> I have been able to tighten it down a bit by splitting my guest over
> 2 nodes and pinning the vCPUs so that the nodes in the guest match up
> to the nodes in the host. Doing that I have seen results with less
> than 1% variability between with the patches and without.

Interesting. I usually pin the guest to a single NUMA node to avoid this.

> One thing I am looking at now is modifying the page_fault1 test to use THP
> instead of 4K pages, as I suspect there is a fair bit of overhead in
> accessing the pages 4K at a time vs 2M at a time. I am hoping with that I
> can put more pressure on the actual change and see if there are any
> additional spots I should optimize.

+1. Right now I don't think will-it-scale touches all the guest memory.

May I know how much memory will-it-scale/page_fault1 occupies in your case,
and how much of it you get back with your patch-set?

Do you have any plans to run other benchmarks as well, just to see the
impact on other sub-systems?

>>> One side effect of these patches is that the guest becomes much more
>>> resilient in terms of NUMA locality.
>>> With the pages being freed and then
>>> reallocated when used, the pages can end up much closer to the
>>> active thread, and as a result there can be situations where this patch
>>> set will out-perform the stock kernel when the guest memory is not local
>>> to the guest vCPUs.
>>
>> Was this the reason you were seeing better results for
>> page_fault1 earlier?
>
> Yes, I am thinking so. What I have found is that in the case where the
> patches are not applied on the guest, it takes a few runs for the numbers
> to stabilize. What I think was going on is that I was running memhog to
> initially fill the guest, and that was placing all the pages on one node
> or the other, and as such was causing additional variability as the pages
> were slowly being migrated over to the other node to rebalance the
> workload. One way I tested it was by trying the unpatched case with a
> direct-assigned device, since that forces it to pin the memory. In that
> case I was consistently getting bad results, as all the memory was forced
> to come from one node during the pre-allocation process.

I have also seen that the page_fault1 values take some time to stabilize on
an unmodified kernel. What I am wondering here is whether, on a single-NUMA
guest, doing the following will give the right/better idea or not:

1. Pin the guest to a single NUMA node.
2. Run memhog so that it touches all of the guest memory.
3. Run will-it-scale/page_fault1.

Compare/observe the values for the last core (this assumes the other core
values don't drastically differ).

-- 
Thanks
Nitesh