Re: [PATCH v3 1/3] mm: add thp_utilization metrics to debugfs

"Alex Zhu (Kernel)" <alexlzhu@xxxxxxxx> writes:

>> On Oct 13, 2022, at 4:35 AM, Kirill A. Shutemov <kirill@xxxxxxxxxxxxx> wrote:
>> 
>> On Wed, Oct 12, 2022 at 03:51:45PM -0700, alexlzhu@xxxxxx wrote:
>>> From: Alexander Zhu <alexlzhu@xxxxxx>
>>> 
>>> This change introduces a tool that scans through all of physical
>>> memory for anonymous THPs and groups them into buckets based
>>> on utilization. It also includes an interface under
>>> /sys/kernel/debug/thp_utilization.
>>> 
>>> Sample Output:
>>> 
>>> Utilized[0-50]: 1331 680884
>>> Utilized[51-101]: 9 3983
>>> Utilized[102-152]: 3 1187
>>> Utilized[153-203]: 0 0
>>> Utilized[204-255]: 2 539
>>> Utilized[256-306]: 5 1135
>>> Utilized[307-357]: 1 192
>>> Utilized[358-408]: 0 0
>>> Utilized[409-459]: 1 57
>>> Utilized[460-512]: 400 13
>>> Last Scan Time: 223.98s
>>> Last Scan Duration: 70.65s
>>> 
>>> This indicates that there are 1331 THPs that have between 0 and 50
>>> utilized (non zero) pages. In total there are 680884 zero pages in
>>> this utilization bucket. THPs in the [0-50] bucket compose 76% of total
>>> THPs, and are responsible for 99% of total zero pages across all
>>> THPs. In other words, the least utilized THPs are responsible for almost
>>> all of the memory waste when THP is always enabled. Similar results
>>> have been observed across production workloads.
>>> 
>>> The last two lines indicate the timestamp and duration of the most recent
>>> scan through all of physical memory. Here we see that the last scan
>>> occurred 223.98 seconds after boot time and took 70.65 seconds.
>>> 
>>> Utilization of a THP is defined as the percentage of nonzero
>>> pages in the THP. The worker thread will scan through all
>>> of physical memory and obtain utilization of all anonymous
>>> THPs. It will gather this information by periodically scanning
>>> through all of physical memory for anonymous THPs, group them
>>> into buckets based on utilization, and report utilization
>>> information through debugfs under /sys/kernel/debug/thp_utilization.
>>> 
>>> Signed-off-by: Alexander Zhu <alexlzhu@xxxxxx>
>>> ---
>>> v1 to v2
>>> -reversed ordering of is_transparent_hugepage and PageAnon in is_anon_transparent_hugepage, page->mapping is only meaningful for user pages
>>> 
>>> RFC to v1
>>> -Refactored out the code to obtain the thp_utilization_bucket, as that now has to be used in multiple places.
>>> 
>>> Documentation/admin-guide/mm/transhuge.rst |   9 +
>>> include/linux/huge_mm.h                    |   3 +
>>> mm/huge_memory.c                           | 202 +++++++++++++++++++++
>> 
>> Please, consider putting thp_scan functionality into a separate file.
>> mm/thp_scan.c or something.
>
> I’ll consider it. Do you think this is necessary? It is huge page related, but huge_memory has a lot of code already. 
>> 
>>> 3 files changed, 214 insertions(+)
>>> 
>>> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
>>> index 8ee78ec232eb..21d86303c97e 100644
>>> --- a/Documentation/admin-guide/mm/transhuge.rst
>>> +++ b/Documentation/admin-guide/mm/transhuge.rst
>>> @@ -304,6 +304,15 @@ To identify what applications are mapping file transparent huge pages, it
>>> is necessary to read ``/proc/PID/smaps`` and count the FileHugeMapped fields
>>> for each mapping.
>>> 
>>> +The utilization of transparent hugepages can be viewed by reading
>>> +``/sys/kernel/debug/thp_utilization``. The utilization of a THP is defined
>>> +as the ratio of non zero filled 4kb pages to the total number of pages in a
>>> +THP. The buckets are labelled by the range of total utilized 4kb pages with
>>> +one line per utilization bucket. Each line contains the total number of
>>> +THPs in that bucket and the total number of zero filled 4kb pages summed
>>> +over all THPs in that bucket. The last two lines show the timestamp and
>>> +duration respectively of the most recent scan over all of physical memory.
>>> +
>> 
>> debugfs as a primary interface? Looks wrong to me.
>
> Where would you recommend? We had initially put it under /proc, and then moved to debugfs. 
>
>> 
>>> Note that reading the smaps file is expensive and reading it
>>> frequently will incur overhead.
>>> 
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index a1341fdcf666..13ac7b2f29ae 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>> @@ -178,6 +178,9 @@ bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
>>> unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
>>> 		unsigned long len, unsigned long pgoff, unsigned long flags);
>>> 
>>> +int thp_number_utilized_pages(struct page *page);
>>> +int thp_utilization_bucket(int num_utilized_pages);
>>> +
>>> void prep_transhuge_page(struct page *page);
>>> void free_transhuge_page(struct page *page);
>>> 
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 1cc4a5f4791e..29e97df37c29 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -46,6 +46,16 @@
>>> #define CREATE_TRACE_POINTS
>>> #include <trace/events/thp.h>
>>> 
>>> +/*
>>> + * The number of utilization buckets THPs will be grouped in
>>> + * under /sys/kernel/debug/thp_utilization.
>>> + */
>>> +#define THP_UTIL_BUCKET_NR 10
>>> +/*
>>> + * The number of PFNs (and hence hugepages) to scan through on each periodic
>> 
>> PFNs here is misleading. They usually refer to base-pagesize frames. Just
>> say hugepages.
>
> Sounds good. 
>
>> 
>>> + * run of the scanner that generates /sys/kernel/debug/thp_utilization.
>>> + */
>>> +#define THP_UTIL_SCAN_SIZE 256
>>> /*
>>>  * By default, transparent hugepage support is disabled in order to avoid
>>>  * risking an increased memory footprint for applications that are not
>>> @@ -71,6 +81,25 @@ static atomic_t huge_zero_refcount;
>>> struct page *huge_zero_page __read_mostly;
>>> unsigned long huge_zero_pfn __read_mostly = ~0UL;
>>> 
>>> +static void thp_utilization_workfn(struct work_struct *work);
>>> +static DECLARE_DELAYED_WORK(thp_utilization_work, thp_utilization_workfn);
>>> +
>>> +struct thp_scan_info_bucket {
>>> +	int nr_thps;
>>> +	int nr_zero_pages;
>>> +};
>>> +
>>> +struct thp_scan_info {
>>> +	struct thp_scan_info_bucket buckets[THP_UTIL_BUCKET_NR];
>>> +	struct zone *scan_zone;
>>> +	struct timespec64 last_scan_duration;
>>> +	struct timespec64 last_scan_time;
>>> +	unsigned long pfn;
>>> +};
>>> +
>>> +static struct thp_scan_info thp_scan_debugfs;
>>> +static struct thp_scan_info thp_scan;
>> 
>> Any explanation why there are two of them? It is not obvious to me.
>
> The reason we have two is that one of them is used for debugfs if ‘cat /sys/kernel/debug/thp_utilization’ is called. 
> The other is used to keep track of the current scan. 
>> 
>>> +
>>> bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
>>> 			bool smaps, bool in_pf, bool enforce_sysfs)
>>> {
>>> @@ -485,6 +514,7 @@ static int __init hugepage_init(void)
>>> 	if (err)
>>> 		goto err_slab;
>>> 
>>> +	schedule_delayed_work(&thp_utilization_work, HZ);
>>> 	err = register_shrinker(&huge_zero_page_shrinker, "thp-zero");
>>> 	if (err)
>>> 		goto err_hzp_shrinker;
>>> @@ -599,6 +629,11 @@ static inline bool is_transparent_hugepage(struct page *page)
>>> 	       page[1].compound_dtor == TRANSHUGE_PAGE_DTOR;
>>> }
>>> 
>>> +static inline bool is_anon_transparent_hugepage(struct page *page)
>>> +{
>>> +	return is_transparent_hugepage(page) && PageAnon(page);
>>> +}
>>> +
>>> static unsigned long __thp_get_unmapped_area(struct file *filp,
>>> 		unsigned long addr, unsigned long len,
>>> 		loff_t off, unsigned long flags, unsigned long size)
>>> @@ -649,6 +684,49 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
>>> }
>>> EXPORT_SYMBOL_GPL(thp_get_unmapped_area);
>>> 
>>> +int thp_number_utilized_pages(struct page *page)
>>> +{
>>> +	struct folio *folio;
>>> +	unsigned long page_offset, value;
>>> +	int thp_nr_utilized_pages = HPAGE_PMD_NR;
>>> +	int step_size = sizeof(unsigned long);
>>> +	bool is_all_zeroes;
>>> +	void *kaddr;
>>> +	int i;
>>> +
>>> +	if (!page || !is_anon_transparent_hugepage(page))
>>> +		return -1;
>>> +
>>> +	folio = page_folio(page);
>>> +	for (i = 0; i < folio_nr_pages(folio); i++) {
>>> +		kaddr = kmap_local_folio(folio, i);
>>> +		is_all_zeroes = true;
>>> +		for (page_offset = 0; page_offset < PAGE_SIZE; page_offset += step_size) {
>>> +			value = *(unsigned long *)(kaddr + page_offset);
>>> +			if (value != 0) {
>>> +				is_all_zeroes = false;
>>> +				break;
>>> +			}
>> 
>> Uhmm.. memchr_inv()?
>
> I had considered that at the time, but memchr_inv() used here would return the address of the first nonzero byte.
> Here we are trying to find the utilization percentage of the THP. I do not believe memchr_inv() would be less code
> compared to what we do here.

In general, I think it's better to use library functions where possible.
memchr_inv() can be used here to check whether the subpage is all zero
by checking whether it returns NULL.

memchr_inv() isn't perfect for your purpose, but your code doesn't look
highly optimized either.  Is it necessary to add another library
function to check whether the contents of a range of memory are all
zero?  Then we can optimize the implementation.
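
To make the suggestion concrete, here is a rough userspace sketch of such
a helper and how the per-subpage loop could use it.  The names
range_is_zero(), count_utilized_pages() and SKETCH_PAGE_SIZE are made up
for illustration and are not part of the patch.  Userspace has no
memchr_inv(), so the sketch substitutes a memcmp() trick; in the kernel
the body would presumably just be "memchr_inv(p, 0, len) == NULL".

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096	/* assumed base page size */

/*
 * Hypothetical helper: true if every byte in [p, p + len) is zero.
 * The first byte is checked directly; memcmp() of the buffer against
 * itself shifted by one byte then proves all later bytes equal it.
 */
static bool range_is_zero(const void *p, size_t len)
{
	const unsigned char *buf = p;

	if (len == 0)
		return true;

	return buf[0] == 0 && memcmp(buf, buf + 1, len - 1) == 0;
}

/* Count the subpages that contain at least one nonzero byte. */
static int count_utilized_pages(const void *thp, int nr_pages)
{
	const unsigned char *base = thp;
	int i, utilized = 0;

	for (i = 0; i < nr_pages; i++) {
		if (!range_is_zero(base + (size_t)i * SKETCH_PAGE_SIZE,
				   SKETCH_PAGE_SIZE))
			utilized++;
	}

	return utilized;
}
```

With a helper like this the caller no longer needs the open-coded inner
loop or the is_all_zeroes flag, and any later optimization of the helper
benefits every user.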

>> 
>>> +		}
>>> +		if (is_all_zeroes)
>>> +			thp_nr_utilized_pages--;
>>> +
>>> +		kunmap_local(kaddr);
>>> +	}
>>> +	return thp_nr_utilized_pages;
>>> +}
>>> +
>>> +int thp_utilization_bucket(int num_utilized_pages)
>>> +{
>>> +	int bucket;
>>> +
>>> +	if (num_utilized_pages < 0 || num_utilized_pages > HPAGE_PMD_NR)
>> 
>> Shouldn't it be WARN() or something?
>> 
>>> +		return -1;
>> 
>> <newline>
>> 
>>> +	/* Group THPs into utilization buckets */
>>> +	bucket = num_utilized_pages * THP_UTIL_BUCKET_NR / HPAGE_PMD_NR;
>>> +	return min(bucket, THP_UTIL_BUCKET_NR - 1);
>>> +}
>>> +
>>> static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
>>> 			struct page *page, gfp_t gfp)
>>> {
>>> @@ -3174,6 +3252,42 @@ static int __init split_huge_pages_debugfs(void)
>>> 	return 0;
>>> }
>>> late_initcall(split_huge_pages_debugfs);
>>> +
>>> +static int thp_utilization_show(struct seq_file *seqf, void *pos)
>>> +{
>>> +	int i;
>>> +	int start;
>>> +	int end;
>>> +
>>> +	for (i = 0; i < THP_UTIL_BUCKET_NR; i++) {
>>> +		start = i * HPAGE_PMD_NR / THP_UTIL_BUCKET_NR;
>>> +		end = (i + 1 == THP_UTIL_BUCKET_NR)
>>> +			   ? HPAGE_PMD_NR
>>> +			   : ((i + 1) * HPAGE_PMD_NR / THP_UTIL_BUCKET_NR - 1);
>>> +		/* The last bucket will need to contain 100 */
>>> +		seq_printf(seqf, "Utilized[%d-%d]: %d %d\n", start, end,
>>> +			   thp_scan_debugfs.buckets[i].nr_thps,
>>> +			   thp_scan_debugfs.buckets[i].nr_zero_pages);
>>> +	}
>> 
>> <newline>, again. Here and in many places below. Seriously, they are
>> cheap. :P
>> 
>>> +	seq_printf(seqf, "Last Scan Time: %lu.%02lus\n",
>>> +		   (unsigned long)thp_scan_debugfs.last_scan_time.tv_sec,
>>> +		   (thp_scan_debugfs.last_scan_time.tv_nsec / (NSEC_PER_SEC / 100)));
>>> +
>>> +	seq_printf(seqf, "Last Scan Duration: %lu.%02lus\n",
>>> +		   (unsigned long)thp_scan_debugfs.last_scan_duration.tv_sec,
>>> +		   (thp_scan_debugfs.last_scan_duration.tv_nsec / (NSEC_PER_SEC / 100)));
>>> +
>>> +	return 0;
>>> +}
>>> +DEFINE_SHOW_ATTRIBUTE(thp_utilization);
>>> +
>>> +static int __init thp_utilization_debugfs(void)
>>> +{
>>> +	debugfs_create_file("thp_utilization", 0200, NULL, NULL,
>>> +			    &thp_utilization_fops);
>>> +	return 0;
>>> +}
>>> +late_initcall(thp_utilization_debugfs);
>>> #endif
>>> 
>>> #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
>>> @@ -3269,3 +3383,91 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
>>> 	trace_remove_migration_pmd(address, pmd_val(pmde));
>>> }
>>> #endif
>>> +
>>> +static void thp_scan_next_zone(void)
>>> +{
>>> +	struct timespec64 current_time;
>>> +	int i;
>>> +	bool update_debugfs;
>>> +	/*
>>> +	 * THP utilization worker thread has reached the end
>>> +	 * of the memory zone. Proceed to the next zone.
>>> +	 */
>>> +	thp_scan.scan_zone = next_zone(thp_scan.scan_zone);
>>> +	update_debugfs = !thp_scan.scan_zone;
>>> +	thp_scan.scan_zone = update_debugfs ? (first_online_pgdat())->node_zones
>>> +			: thp_scan.scan_zone;
>> 
>> I don't follow what is going on. thp_scan vs thp_scan_debugfs looks
>> confusing.
>> 
>>> +	thp_scan.pfn = (thp_scan.scan_zone->zone_start_pfn + HPAGE_PMD_NR - 1)
>>> +			& ~(HPAGE_PMD_SIZE - 1);
>>> +	if (!update_debugfs)
>>> +		return;
>>> +	/*
>>> +	 * If the worker has scanned through all of physical
>>> +	 * memory. Then update information displayed in /sys/kernel/debug/thp_utilization
>>> +	 */
>>> +	ktime_get_ts64(&current_time);
>>> +	thp_scan_debugfs.last_scan_duration = timespec64_sub(current_time,
>>> +							     thp_scan_debugfs.last_scan_time);
>>> +	thp_scan_debugfs.last_scan_time = current_time;
>>> +
>>> +	for (i = 0; i < THP_UTIL_BUCKET_NR; i++) {
>>> +		thp_scan_debugfs.buckets[i].nr_thps = thp_scan.buckets[i].nr_thps;
>>> +		thp_scan_debugfs.buckets[i].nr_zero_pages = thp_scan.buckets[i].nr_zero_pages;
>>> +		thp_scan.buckets[i].nr_thps = 0;
>>> +		thp_scan.buckets[i].nr_zero_pages = 0;
>>> +	}
>>> +}
>>> +
>>> +static void thp_util_scan(unsigned long pfn_end)
>>> +{
>>> +	struct page *page = NULL;
>>> +	int bucket, num_utilized_pages, current_pfn;
>>> +	int i;
>>> +	/*
>>> +	 * Scan through each memory zone in chunks of THP_UTIL_SCAN_SIZE
>>> +	 * PFNs every second looking for anonymous THPs.
>>> +	 */
>>> +	for (i = 0; i < THP_UTIL_SCAN_SIZE; i++) {
>>> +		current_pfn = thp_scan.pfn;
>>> +		thp_scan.pfn += HPAGE_PMD_NR;
>>> +		if (current_pfn >= pfn_end)
>>> +			return;
>>> +
>>> +		if (!pfn_valid(current_pfn))
>>> +			continue;
>>> +
>>> +		page = pfn_to_page(current_pfn);
>> 
>> pfn_valid() + pfn_to_page() has to be replaced with pfn_to_online_page().
>
> Ah k thanks. 
>> 
>>> +		num_utilized_pages = thp_number_utilized_pages(page);
>>> +		bucket = thp_utilization_bucket(num_utilized_pages);
>>> +		if (bucket < 0)
>>> +			continue;
>>> +
>>> +		thp_scan.buckets[bucket].nr_thps++;
>>> +		thp_scan.buckets[bucket].nr_zero_pages += (HPAGE_PMD_NR - num_utilized_pages);
>>> +	}
>>> +}
>>> +
>>> +static void thp_utilization_workfn(struct work_struct *work)
>>> +{
>>> +	unsigned long pfn_end;
>>> +
>>> +	if (!thp_scan.scan_zone)
>>> +		thp_scan.scan_zone = (first_online_pgdat())->node_zones;
>>> +	/*
>>> +	 * Worker function that scans through all of physical memory
>>> +	 * for anonymous THPs.
>>> +	 */
>>> +	pfn_end = (thp_scan.scan_zone->zone_start_pfn +
>>> +			thp_scan.scan_zone->spanned_pages + HPAGE_PMD_NR - 1)
>>> +			& ~(HPAGE_PMD_SIZE - 1);
>>> +	/* If we have reached the end of the zone or end of physical memory
>>> +	 * move on to the next zone. Otherwise, scan the next PFNs in the
>>> +	 * current zone.
>>> +	 */
>>> +	if (!populated_zone(thp_scan.scan_zone) || thp_scan.pfn >= pfn_end)
>>> +		thp_scan_next_zone();
>>> +	else
>>> +		thp_util_scan(pfn_end);
>>> +
>>> +	schedule_delayed_work(&thp_utilization_work, HZ);
>> 
>> Why HZ?
>
> Scanning 256 PFNs per second is just what we have found to not have any noticeable effect on our hosts. 

It would be better to show some performance data here.  For example,
the ftrace function graph tracer can be used to collect the run time of
thp_utilization_workfn() on a system full of THPs.

Also, at this rate it will take about 512s to scan all memory on a
system with 256 GB of memory.  That seems long too.
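
For reference, the 512s estimate can be reproduced with the small sketch
below.  It assumes 2 MiB PMD-sized THPs (x86-64 with 4 KiB base pages)
and one worker run per second, and it ignores the runs spent switching
zones.  full_scan_seconds() and SKETCH_HPAGE_BYTES are names made up for
this illustration.

```c
#include <stdint.h>

/* Assumption: 2 MiB PMD-sized THPs (x86-64 with 4 KiB base pages). */
#define SKETCH_HPAGE_BYTES (2ULL << 20)

/*
 * Estimate the seconds needed for one full pass over physical memory
 * when each worker run visits ranges_per_run hugepage-sized ranges and
 * the worker runs runs_per_sec times per second.  Rounds up.
 */
static uint64_t full_scan_seconds(uint64_t mem_bytes,
				  uint64_t ranges_per_run,
				  uint64_t runs_per_sec)
{
	uint64_t ranges = mem_bytes / SKETCH_HPAGE_BYTES;
	uint64_t per_sec = ranges_per_run * runs_per_sec;

	return (ranges + per_sec - 1) / per_sec;
}
```

For 256 GB this gives 131072 ranges at 256 per second, i.e.
full_scan_seconds(256ULL << 30, 256, 1) == 512, and the time grows
linearly with memory size.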

>> 
>>> +}

Best Regards,
Huang, Ying




