On 8/12/19 2:47 PM, Alexander Duyck wrote:
> On Mon, Aug 12, 2019 at 6:13 AM Nitesh Narayan Lal <nitesh@xxxxxxxxxx> wrote:
>> This patch introduces the core infrastructure for free page reporting in
>> virtual environments. It enables the kernel to track the free pages which
>> can be reported to its hypervisor so that the hypervisor could
>> free and reuse that memory as per its requirement.
>>
>> While the pages are getting processed in the hypervisor (e.g.,
>> via MADV_DONTNEED), the guest must not use them, otherwise data loss
>> would be possible. To avoid such a situation, these pages are
>> temporarily removed from the buddy. The number of pages removed
>> temporarily from the buddy is governed by the backend (virtio-balloon
>> in our case).
>>
>> To efficiently identify free pages that can be reported to the
>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
>> chunks are reported to the hypervisor - especially, to not break up THP
>> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
>> in the bitmap are an indication whether a page *might* be free, not a
>> guarantee. A new hook after buddy merging sets the bits.
>>
>> Bitmaps are stored per zone, protected by the zone lock. A workqueue
>> asynchronously processes the bitmaps, trying to isolate and report pages
>> that are still free. The backend (virtio-balloon) is responsible for
>> reporting these batched pages to the host synchronously. Once reporting/
>> freeing is complete, isolated pages are returned back to the buddy.
>>
>> Signed-off-by: Nitesh Narayan Lal <nitesh@xxxxxxxxxx>

[...]

>> +static void scan_zone_bitmap(struct page_reporting_config *phconf,
>> +                             struct zone *zone)
>> +{
>> +        unsigned long setbit;
>> +        struct page *page;
>> +        int count = 0;
>> +
>> +        sg_init_table(phconf->sg, phconf->max_pages);
>> +
>> +        for_each_set_bit(setbit, zone->bitmap, zone->nbits) {
>> +                /* Process only if the page is still online */
>> +                page = pfn_to_online_page((setbit << PAGE_REPORTING_MIN_ORDER) +
>> +                                           zone->base_pfn);
>> +                if (!page)
>> +                        continue;
>> +
> Shouldn't you be clearing the bit and dropping the reference to
> free_pages before you move on to the next bit? Otherwise you are going
> to be stuck with those aren't you?
>
>> +                spin_lock(&zone->lock);
>> +
>> +                /* Ensure page is still free and can be processed */
>> +                if (PageBuddy(page) && page_private(page) >=
>> +                    PAGE_REPORTING_MIN_ORDER)
>> +                        count = process_free_page(page, phconf, count);
>> +
>> +                spin_unlock(&zone->lock);
> So I kind of wonder just how much overhead you are taking for bouncing
> the zone lock once per page here. Especially since it can result in
> you not actually making any progress since the page may have already
> been reallocated.
>

I am wondering if there is a way to measure this overhead?

After thinking about this, I do understand your point. One possible way
I can think of to address this is to add a page_reporting_dequeue() hook
somewhere in the allocation path, so that the bit for a page gets cleared
as soon as the page is reallocated (a rough, untested sketch of what I
have in mind is appended below).

For some reason, I am not seeing this work as I would have expected, but
I don't have solid reasoning to share yet. It could simply be because I
am putting the hook in the wrong place. I will continue investigating
this.

In any case, I may be over-complicating things here, so please let me
know if there is a better way to do this. If this overhead is not
significant, we can probably live with it.

--
Thanks
Nitesh
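
To make the idea above a bit more concrete, here is the kind of
page_reporting_dequeue() hook I am thinking of. This is only a rough
sketch, not part of the posted series: it assumes the zone->bitmap,
zone->base_pfn and zone->nbits fields introduced by this patch, and it
assumes the caller sits somewhere in the allocation path where
zone->lock is already held, so no extra lock acquisitions are added.
The name, signature, and call site are all just assumptions at this
point.

/*
 * Rough sketch only -- not part of the posted series.  Meant to be
 * called from the page allocation path with zone->lock already held.
 */
static inline void page_reporting_dequeue(struct zone *zone,
                                           struct page *page,
                                           unsigned int order)
{
        unsigned long bitnr;

        /* Only chunks of the reporting granularity are tracked in the bitmap. */
        if (order < PAGE_REPORTING_MIN_ORDER)
                return;

        /* Invert the pfn math used by scan_zone_bitmap() above. */
        bitnr = (page_to_pfn(page) - zone->base_pfn) >> PAGE_REPORTING_MIN_ORDER;
        if (bitnr >= zone->nbits)
                return;

        /*
         * Clear the "might be free" hint so that the reporting workqueue
         * does not bounce the zone lock for a page that has already been
         * handed back out by the allocator.
         */
        clear_bit(bitnr, zone->bitmap);
}

Whether clearing the bits here actually pays for the extra work done on
every large-order allocation is exactly the part I still need to figure
out how to measure.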