On Mon, Aug 5, 2019 at 7:05 AM Nitesh Narayan Lal <nitesh@xxxxxxxxxx> wrote:
>
>
> On 8/1/19 6:33 PM, Alexander Duyck wrote:
> > From: Alexander Duyck <alexander.h.duyck@xxxxxxxxxxxxxxx>
> >
> > In order to pave the way for free page reporting in virtualized
> > environments we will need a way to get pages out of the free lists and
> > identify those pages after they have been returned. To accomplish this,
> > this patch adds the concept of a Reported Buddy, which is essentially
> > meant to just be the Uptodate flag used in conjunction with the Buddy
> > page type.
> >
> > It adds a set of pointers we shall call "boundary" which represents the
> > upper boundary between the unreported and reported pages. The general
> > idea is that in order for a page to cross from one side of the boundary
> > to the other it will need to go through the reporting process.
> > Ultimately a free_list has been fully processed when the boundary has
> > been moved from the tail all the way up to occupying the first entry in
> > the list.
> >
> > Doing this we should be able to make certain that we keep the reported
> > pages as one contiguous block in each free list. This will allow us to
> > efficiently manipulate the free lists whenever we need to go in and
> > start sending reports to the hypervisor that there are new pages that
> > have been freed and are no longer in use.
> >
> > An added advantage to this approach is that we should be reducing the
> > overall memory footprint of the guest as it will be more likely to
> > recycle warm pages versus trying to allocate the reported pages that
> > were likely evicted from the guest memory.
> >
> > Since we will only be reporting one zone at a time we keep the boundary
> > limited to being defined for just the zone we are currently reporting
> > pages from. Doing this we can keep the number of additional pointers
> > needed quite small. To flag that the boundaries are in place we use a
> > single bit in the zone to indicate that reporting and the boundaries
> > are active.
> >
> > The determination of when to start reporting is based on the tracking
> > of the number of free pages in a given area versus the number of
> > reported pages in that area. We keep track of the number of reported
> > pages per free_area in a separate zone specific area. We do this to
> > avoid modifying the free_area structure as this can lead to false
> > sharing for the highest order with the zone lock, which leads to a
> > noticeable performance degradation.
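
To make that last paragraph concrete, the decision described there boils down
to comparing the two counters per order. A minimal sketch, assuming the
reported_pages[] array and PAGE_REPORTING_MIN_ORDER from this patch; the
helper name and the exact threshold test here are illustrative, not the
patch's actual code:

static bool zone_order_needs_reporting(struct zone *zone, unsigned int order)
{
	/* pages currently sitting in the free lists for this order */
	unsigned long nr_free = zone->free_area[order].nr_free;

	/* pages of this order we have already reported to the hypervisor */
	unsigned long nr_reported =
		zone->reported_pages[order - PAGE_REPORTING_MIN_ORDER];

	/* only worth reporting while there are free pages the host has not seen */
	return nr_free > nr_reported;
}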
> >
> > Signed-off-by: Alexander Duyck <alexander.h.duyck@xxxxxxxxxxxxxxx>
> > ---
> >  include/linux/mmzone.h         |   40 +++++
> >  include/linux/page-flags.h     |   11 +
> >  include/linux/page_reporting.h |  138 ++++++++++++++++++
> >  mm/Kconfig                     |    5 +
> >  mm/Makefile                    |    1
> >  mm/memory_hotplug.c            |    1
> >  mm/page_alloc.c                |  136 ++++++++++++++++++
> >  mm/page_reporting.c            |  299 ++++++++++++++++++++++++++++++++++++++++
> >  8 files changed, 623 insertions(+), 8 deletions(-)
> >  create mode 100644 include/linux/page_reporting.h
> >  create mode 100644 mm/page_reporting.c
> >

<snip>

> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 71aadc7d5ff6..69b848e5b83f 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -68,6 +68,7 @@
> >  #include <linux/lockdep.h>
> >  #include <linux/nmi.h>
> >  #include <linux/psi.h>
> > +#include <linux/page_reporting.h>
> >
> >  #include <asm/sections.h>
> >  #include <asm/tlbflush.h>
> > @@ -915,7 +916,7 @@ static inline struct capture_control *task_capc(struct zone *zone)
> >  static inline void __free_one_page(struct page *page,
> >  		unsigned long pfn,
> >  		struct zone *zone, unsigned int order,
> > -		int migratetype)
> > +		int migratetype, bool reported)
> >  {
> >  	struct capture_control *capc = task_capc(zone);
> >  	unsigned long uninitialized_var(buddy_pfn);
> > @@ -990,11 +991,20 @@ static inline void __free_one_page(struct page *page,
> >  done_merging:
> >  	set_page_order(page, order);
> >
> > -	if (is_shuffle_order(order) ? shuffle_add_to_tail() :
> > -	    buddy_merge_likely(pfn, buddy_pfn, page, order))
> > +	if (reported ||
> > +	    (is_shuffle_order(order) ? shuffle_add_to_tail() :
> > +	     buddy_merge_likely(pfn, buddy_pfn, page, order)))
> >  		add_to_free_list_tail(page, zone, order, migratetype);
> >  	else
> >  		add_to_free_list(page, zone, order, migratetype);
> > +
> > +	/*
> > +	 * No need to notify on a reported page as the total count of
> > +	 * unreported pages will not have increased since we have essentially
> > +	 * merged the reported page with one or more unreported pages.
> > +	 */
> > +	if (!reported)
> > +		page_reporting_notify_free(zone, order);
> >  }
> >
> >  /*
> > @@ -1305,7 +1315,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> >  		if (unlikely(isolated_pageblocks))
> >  			mt = get_pageblock_migratetype(page);
> >
> > -		__free_one_page(page, page_to_pfn(page), zone, 0, mt);
> > +		__free_one_page(page, page_to_pfn(page), zone, 0, mt, false);
> >  		trace_mm_page_pcpu_drain(page, 0, mt);
> >  	}
> >  	spin_unlock(&zone->lock);
> > @@ -1321,7 +1331,7 @@ static void free_one_page(struct zone *zone,
> >  			is_migrate_isolate(migratetype))) {
> >  		migratetype = get_pfnblock_migratetype(page, pfn);
> >  	}
> > -	__free_one_page(page, pfn, zone, order, migratetype);
> > +	__free_one_page(page, pfn, zone, order, migratetype, false);
> >  	spin_unlock(&zone->lock);
> >  }
> >
> > @@ -2183,6 +2193,122 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
> >  	return NULL;
> >  }
> >
> > +#ifdef CONFIG_PAGE_REPORTING
> > +/**
> > + * get_unreported_page - Pull an unreported page from the free_list
> > + * @zone: Zone to draw pages from
> > + * @order: Order to draw pages from
> > + * @mt: Migratetype to draw pages from
> > + *
> > + * This function will obtain a page from the free list. It will start by
> > + * attempting to pull from the tail of the free list and if that is already
> > + * reported on it will instead pull the head if that is unreported.
> > + *
> > + * The page will have the migrate type and order stored in the page
> > + * metadata. While being processed the page will not be available for
> > + * allocation.
> > + *
> > + * Return: page pointer if raw page found, otherwise NULL
> > + */
> > +struct page *get_unreported_page(struct zone *zone, unsigned int order, int mt)
> > +{
> > +	struct list_head *tail = get_unreported_tail(zone, order, mt);
> > +	struct free_area *area = &(zone->free_area[order]);
> > +	struct list_head *list = &area->free_list[mt];
> > +	struct page *page;
> > +
> > +	/* zone lock should be held when this function is called */
> > +	lockdep_assert_held(&zone->lock);
> > +
> > +	/* Find a page of the appropriate size in the preferred list */
> > +	page = list_last_entry(tail, struct page, lru);
> > +	list_for_each_entry_from_reverse(page, list, lru) {
> > +		/* If we entered this loop then the "raw" list isn't empty */
> > +
> > +		/* If the page is reported try the head of the list */
> > +		if (PageReported(page)) {
> > +			page = list_first_entry(list, struct page, lru);
> > +
> > +			/*
> > +			 * If both the head and tail are reported then reset
> > +			 * the boundary so that we read as an empty list
> > +			 * next time and bail out.
> > +			 */
> > +			if (PageReported(page)) {
> > +				page_reporting_add_to_boundary(page, zone, mt);
> > +				break;
> > +			}
> > +		}
> > +
> > +		del_page_from_free_list(page, zone, order);
> > +
> > +		/* record migratetype and order within page */
> > +		set_pcppage_migratetype(page, mt);
> > +		set_page_private(page, order);
> > +
> > +		/*
> > +		 * Page will not be available for allocation while we are
> > +		 * processing it so update the freepage state.
> > +		 */
> > +		__mod_zone_freepage_state(zone, -(1 << order), mt);
> > +
> > +		return page;
> > +	}
> > +
> > +	return NULL;
> > +}
> > +
> > +/**
> > + * put_reported_page - Return a now-reported page back where we got it
> > + * @zone: Zone to return pages to
> > + * @page: Page that was reported
> > + *
> > + * This function will pull the migratetype and order information out
> > + * of the page and attempt to return it where it found it. If the page
> > + * is added to the free list without changes we will mark it as being
> > + * reported.
> > + */
> > +void put_reported_page(struct zone *zone, struct page *page)
> > +{
> > +	unsigned int order, mt;
> > +	unsigned long pfn;
> > +
> > +	/* zone lock should be held when this function is called */
> > +	lockdep_assert_held(&zone->lock);
> > +
> > +	mt = get_pcppage_migratetype(page);
> > +	pfn = page_to_pfn(page);
> > +
> > +	if (unlikely(has_isolate_pageblock(zone) || is_migrate_isolate(mt))) {
> > +		mt = get_pfnblock_migratetype(page, pfn);
> > +		set_pcppage_migratetype(page, mt);
> > +	}
> > +
> > +	order = page_private(page);
> > +	set_page_private(page, 0);
> > +
> > +	__free_one_page(page, pfn, zone, order, mt, true);
>
> I don't think we need to hold the zone lock for fetching migratetype and
> other information.
> We can save some lock held time by acquiring and releasing the zone lock
> before and after __free_one_page() respectively. Isn't it?

We could, but acquiring and releasing the lock also takes time. I
thought it better to simply hold the lock while I dump the
scatterlist back into the free_list, and until I have completed
pulling the non-reported pages back out. Otherwise we take the
overhead of acquiring/releasing the spinlock itself, which isn't
necessarily cheap since it will be frequently bounced between CPUs.
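
In other words, one zone lock critical section is meant to cover both halves
of a reporting cycle. A rough sketch of that shape, assuming the sg helpers
from <linux/scatterlist.h>; only put_reported_page()/get_unreported_page()
come from the patch, the function name and the simple per-order/migratetype
refill loop are illustrative:

static void page_reporting_cycle_sketch(struct zone *zone,
					struct scatterlist *sgl,
					unsigned int nents,
					unsigned int order, int mt)
{
	unsigned int i;

	spin_lock(&zone->lock);

	/* return the pages we just finished reporting to their free lists */
	for (i = 0; i < nents; i++)
		put_reported_page(zone, sg_page(&sgl[i]));

	/* refill the scatterlist with unreported pages under the same lock */
	for (i = 0; i < nents; i++) {
		struct page *page = get_unreported_page(zone, order, mt);

		if (!page)
			break;
		sg_set_page(&sgl[i], page, PAGE_SIZE << order, 0);
	}

	spin_unlock(&zone->lock);
}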
> > +
> > +	/*
> > +	 * If page was comingled with another page we cannot consider
> > +	 * the result to be "reported" since part of the page hasn't been.
> > +	 * In this case we will simply exit and not update the "reported"
> > +	 * state. Instead just treat the result as an unreported page.
> > +	 */
> > +	if (!PageBuddy(page) || page_order(page) != order)
> > +		return;
> > +
> > +	/* update reported page accounting */
> > +	zone->reported_pages[order - PAGE_REPORTING_MIN_ORDER]++;
> > +
> > +	/* update boundary of new migratetype and record it */
> > +	page_reporting_add_to_boundary(page, zone, mt);
> > +
> > +	/* flag page as reported */
> > +	__SetPageReported(page);
> > +}
> > +#endif /* CONFIG_PAGE_REPORTING */
> > +
> >  /*
> >   * This array describes the order lists are fallen back to when
> >   * the free lists for the desirable migrate type are depleted
> > diff --git a/mm/page_reporting.c b/mm/page_reporting.c
> > new file mode 100644
> > index 000000000000..971138205ae5
> > --- /dev/null
> > +++ b/mm/page_reporting.c
> > @@ -0,0 +1,299 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include <linux/mm.h>
> > +#include <linux/mmzone.h>
> > +#include <linux/page-isolation.h>
> > +#include <linux/gfp.h>
> > +#include <linux/export.h>

<snip>

> > +int page_reporting_startup(struct page_reporting_dev_info *phdev)
> > +{
> > +	struct zone *zone;
> > +
> > +	/* nothing to do if already in use */
> > +	if (rcu_access_pointer(ph_dev_info))
> > +		return -EBUSY;
> > +
> > +	/* allocate scatterlist to store pages being reported on */
> > +	phdev->sg = kcalloc(phdev->capacity, sizeof(*phdev->sg), GFP_KERNEL);
> > +	if (!phdev->sg)
> > +		return -ENOMEM;
> > +
> > +	/* initialize refcnt and work structures */
> > +	atomic_set(&phdev->refcnt, 0);
> > +	INIT_DELAYED_WORK(&phdev->work, &page_reporting_process);
> > +
> > +	/* assign device, and begin initial flush of populated zones */
> > +	rcu_assign_pointer(ph_dev_info, phdev);
>
>
> Will it not make sense to do this at the top, after the rcu_access_pointer() check?
> Otherwise, there could be a race between two enablers. Am I missing something here?

Placement wouldn't matter as a race would still be possible. Right
now this is safe since there is really only one consumer for this.
However I suppose I should look at adding a mutex so that we cannot
have multiple threads doing the initialization at the same time.

> > +	for_each_populated_zone(zone) {
> > +		spin_lock(&zone->lock);
> > +		__page_reporting_request(zone);
> > +		spin_unlock(&zone->lock);
> > +	}
> > +
> > +	/* enable page reporting notification */
> > +	static_key_slow_inc(&page_reporting_notify_enabled);
> > +
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(page_reporting_startup);
> > +
>
> --
> Thanks
> Nitesh
>
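
Following up on the mutex idea mentioned above, one possible shape of that
serialization could be the following. This is not part of the posted patch;
page_reporting_mutex and the exact structure are assumptions, and the middle
of the function is elided since it is already quoted above:

static DEFINE_MUTEX(page_reporting_mutex);

int page_reporting_startup(struct page_reporting_dev_info *phdev)
{
	int err = 0;

	/* serialize registration so two enablers cannot race past the check */
	mutex_lock(&page_reporting_mutex);

	/* nothing to do if already in use */
	if (rcu_access_pointer(ph_dev_info)) {
		err = -EBUSY;
		goto out_unlock;
	}

	/* ... scatterlist allocation, rcu_assign_pointer(), zone flush as above ... */

out_unlock:
	mutex_unlock(&page_reporting_mutex);
	return err;
}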