On 4/3/21 2:56 AM, Alexander Duyck wrote:
> On Fri, Mar 26, 2021 at 2:45 AM Xunlei Pang <xlpang@xxxxxxxxxxxxxxxxx> wrote:
>>
>> Add new "/sys/kernel/mm/page_reporting/reporting_factor"
>> within [0, 100], and stop page reporting when it reaches
>> the configured threshold. Default is 100 which means no
>> limitation is imposed. Percentile is adopted to reflect
>> the fact that it reports on the per-zone basis.
>>
>> We can control the total number of reporting pages via
>> this knob to avoid EPT violations which may affect the
>> performance of the business, imagine the guest memory
>> allocation burst or host long-tail memory reclaiming
>> really hurt.
>
> I'm not a fan of the concept as I don't think it really does what it
> was meant to do. The way page reporting was meant to work is that when
> we have enough free pages we will cycle through memory a few pages at
> a time reporting what is unused to the hypervisor. It was meant to be
> a scan more than something that just would stop once it touched a
> certain part of the memory.
>
> If you are wanting to truly reserve some amount of memory so that it
> is always left held by the guest then it might make more sense to make
> the value a fixed amount of memory rather than trying to do it as a
> percentage.
>
> Also we may need to look at adding some sort of
> linearization/defragmentation logic for the reported pages. One issue
> is that there are several things that will add pages to the end of the
> free page lists. One of the reasons why I was processing the entire
> list when I was processing reported pages was because the page freeing
> functions will normally cause pages to be interleaved with the
> reported pages on the end of the list. So if you are wanting to
> reserve some pages as being non-reported we may need to add something
> to sort the lists periodically.

Yes, agreed. I also noticed this problem while trying to make the
counter accurate, and I'm going to figure out a way to handle it. For
example, adding a new migratetype for the reported free_list may be a
good choice; this would also help reduce the zone lock latency during
the reporting procedure on large VMs, which can be in the milliseconds
range.

>
>> This knob can help make customized control policies according
>> to VM priority, it is also useful for testing, gray-release, etc.
>
> As far as the knob itself it would make sense to combine this with
> patch 3 since they are just different versions of the same control.
>
>> ---
>>  mm/page_reporting.c | 60 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
>>  1 file changed, 59 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/page_reporting.c b/mm/page_reporting.c
>> index ba195ea..86c6479 100644
>> --- a/mm/page_reporting.c
>> +++ b/mm/page_reporting.c
>> @@ -11,6 +11,8 @@
>>  #include "page_reporting.h"
>>  #include "internal.h"
>>
>> +static int reporting_factor = 100;
>> +
>>  #define PAGE_REPORTING_DELAY (2 * HZ)
>>  static struct page_reporting_dev_info __rcu *pr_dev_info __read_mostly;
>>
>> @@ -134,6 +136,7 @@ void __page_reporting_notify(void)
>>          struct list_head *list = &area->free_list[mt];
>>          unsigned int page_len = PAGE_SIZE << order;
>>          struct page *page, *next;
>> +        unsigned long threshold;
>>          long budget;
>>          int err = 0;
>>
>> @@ -144,6 +147,7 @@ void __page_reporting_notify(void)
>>          if (list_empty(list))
>>                  return err;
>>
>> +        threshold = atomic_long_read(&zone->managed_pages) * reporting_factor / 100;
>
> So at 0 you are setting this threshold to 0, however based on the code
> below you are still pulling at least one page.
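
Right, good catch. What I have in mind for v2 is to do this check once
at the very top of page_reporting_cycle(), before anything is isolated,
so that a factor of 0 really pulls no pages at all. Rough, untested
sketch (the threshold formula is the same one as in this patch):

static int
page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
                     unsigned int order, unsigned int mt,
                     struct scatterlist *sgl, unsigned int *offset)
{
        unsigned long threshold;
        ...

        /* Per-zone limit derived from the configured reporting_factor */
        threshold = atomic_long_read(&zone->managed_pages) *
                    reporting_factor / 100;

        /* Already at (or above) the limit: don't isolate anything */
        if (zone->reported_pages >= threshold)
                return 0;
        ...
}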
>
>>          spin_lock_irq(&zone->lock);
>>
>>          /*
>> @@ -181,6 +185,8 @@ void __page_reporting_notify(void)
>>
>>                  /* Attempt to pull page from list and place in scatterlist */
>>                  if (*offset) {
>> +                        unsigned long nr_pages;
>> +
>>                          if (!__isolate_free_page(page, order)) {
>>                                  next = page;
>>                                  break;
>> @@ -190,6 +196,12 @@ void __page_reporting_notify(void)
>>                          --(*offset);
>>                          sg_set_page(&sgl[*offset], page, page_len, 0);
>>
>> +                        nr_pages = (PAGE_REPORTING_CAPACITY - *offset) << order;
>> +                        if (zone->reported_pages + nr_pages >= threshold) {
>> +                                err = 1;
>> +                                break;
>> +                        }
>> +
>
> So here we are checking the threshold after we have already pulled the
> page. With this being the case it might make more sense to either
> allow for the full capacity of pages to be pulled and then check this
> after they have been reported, or to move this check up to somewhere
> before you start processing the pages. What you want to avoid is
> having to perform this check for every individual page.
>
>>                          continue;
>>                  }
>>
>> @@ -244,9 +256,13 @@ void __page_reporting_notify(void)
>>                             struct scatterlist *sgl, struct zone *zone)
>>  {
>>          unsigned int order, mt, leftover, offset = PAGE_REPORTING_CAPACITY;
>> -        unsigned long watermark;
>> +        unsigned long watermark, threshold;
>>          int err = 0;
>>
>> +        threshold = atomic_long_read(&zone->managed_pages) * reporting_factor / 100;
>> +        if (zone->reported_pages >= threshold)
>> +                return err;
>> +
>
> Rather than having to calculate the threshold in multiple spots it
> might make more sense to move this to the start of
> page_reporting_cycle and have it performed again before we reacquire
> the spinlock and run page_reporting_drain.
>
>>          /* Generate minimum watermark to be able to guarantee progress */
>>          watermark = low_wmark_pages(zone) +
>>                      (PAGE_REPORTING_CAPACITY << PAGE_REPORTING_MIN_ORDER);
>> @@ -267,11 +283,18 @@ void __page_reporting_notify(void)
>>
>>                          err = page_reporting_cycle(prdev, zone, order, mt,
>>                                                     sgl, &offset);
>> +                        /* Exceed threshold go to report leftover */
>> +                        if (err > 0) {
>> +                                err = 0;
>> +                                goto leftover;
>> +                        }
>> +
>>                          if (err)
>>                                  return err;
>>                  }
>>          }
>>
>> +leftover:
>>          /* report the leftover pages before going idle */
>>          leftover = PAGE_REPORTING_CAPACITY - offset;
>>          if (leftover) {
>
> You should optimize for processing full batches rather than chopping
> things up into smaller groupings.
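
Agreed, will do. The plan for v2 is to compute the threshold once at
the start of page_reporting_cycle() and then only re-check it per full
batch, around the existing "exit on error" check after
page_reporting_drain(), so the common path still reports complete
batches of PAGE_REPORTING_CAPACITY pages. Untested sketch:

                /* flush reported pages from the sgl */
                page_reporting_drain(prdev, sgl, PAGE_REPORTING_CAPACITY, !err);

                /* exit on error */
                if (err)
                        break;

                /*
                 * Sketch: also stop once this zone has reported up to
                 * its configured share; the batch above was still a
                 * full one.
                 */
                if (zone->reported_pages >= threshold)
                        break;

That way the per-page check inside the isolation path goes away, which
should also address the concern about checking every individual page.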
>
>> @@ -435,9 +458,44 @@ static ssize_t refault_kbytes_store(struct kobject *kobj,
>>  }
>>  REPORTING_ATTR(refault_kbytes);
>>
>> +static ssize_t reporting_factor_show(struct kobject *kobj,
>> +                                     struct kobj_attribute *attr, char *buf)
>> +{
>> +        return sprintf(buf, "%u\n", reporting_factor);
>> +}
>> +
>> +static ssize_t reporting_factor_store(struct kobject *kobj,
>> +                                      struct kobj_attribute *attr,
>> +                                      const char *buf, size_t count)
>> +{
>> +        int new, old, err;
>> +        struct page *page;
>> +
>> +        err = kstrtoint(buf, 10, &new);
>> +        if (err || (new < 0 || new > 100))
>> +                return -EINVAL;
>> +
>> +        old = reporting_factor;
>> +        reporting_factor = new;
>> +
>> +        if (new <= old)
>> +                goto out;
>> +
>> +        /* Trigger reporting with new larger reporting_factor */
>> +        page = alloc_pages(__GFP_HIGHMEM | __GFP_NOWARN,
>> +                           PAGE_REPORTING_MIN_ORDER);
>> +        if (page)
>> +                __free_pages(page, PAGE_REPORTING_MIN_ORDER);
>> +
>> +out:
>> +        return count;
>> +}
>> +REPORTING_ATTR(reporting_factor);
>> +
>>  static struct attribute *reporting_attrs[] = {
>>          &reported_kbytes_attr.attr,
>>          &refault_kbytes_attr.attr,
>> +        &reporting_factor_attr.attr,
>>          NULL,
>>  };
>>
>> --
>> 1.8.3.1
>>
>>
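
One more change I'm considering for v2: drop the dummy
alloc_pages()/__free_pages() trick in reporting_factor_store() and
simply request a new reporting pass through the existing
__page_reporting_notify() helper when the factor is raised. Untested
sketch:

static ssize_t reporting_factor_store(struct kobject *kobj,
                                      struct kobj_attribute *attr,
                                      const char *buf, size_t count)
{
        int new, old, err;

        err = kstrtoint(buf, 10, &new);
        if (err || new < 0 || new > 100)
                return -EINVAL;

        old = reporting_factor;
        reporting_factor = new;

        /*
         * Raising the limit may allow more pages to be reported, so
         * ask the reporting worker to scan again instead of faking a
         * high-order allocation and free.
         */
        if (new > old)
                __page_reporting_notify();

        return count;
}

With that, e.g. "echo 50 > /sys/kernel/mm/page_reporting/reporting_factor"
would cap reporting at roughly half of each zone's managed pages.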