Re: Pages doesn't belong to same large order folio in block IO path

On 05/02/2024 06:33, Kundan Kumar wrote:
> 
> Hi All,
> 
> I am using the patch "Multi-size THP for anonymous memory"
> https://lore.kernel.org/all/20231214160251.3574571-1-ryan.roberts@xxxxxxx/T/#u

Thanks for trying this out!

> 
> 
> I enabled the mTHP using the sysfs interface :
> echo always >/sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
> echo always >/sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
> echo always >/sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
> echo always >/sys/kernel/mm/transparent_hugepage/hugepages-128kB/enabled
> echo always >/sys/kernel/mm/transparent_hugepage/hugepages-256kB/enabled
> echo always >/sys/kernel/mm/transparent_hugepage/hugepages-512kB/enabled
> echo always >/sys/kernel/mm/transparent_hugepage/hugepages-1024kB/enabled
> echo always >/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
> 
> I can see this patch allocates multi-order folio for anonymous memory.
> 
> With the large order folios getting allocated I tried direct block IO using fio.
> 
> fio -iodepth=1 -rw=write -ioengine=io_uring -direct=1 -bs=16K
> -numjobs=1 -size=16k -group_reporting -filename=/dev/nvme0n1
> -name=io_uring_test

I'm not familiar with fio, but my best guess is that this is an alignment issue.
mTHP will only allocate a large folio if it can be naturally aligned in virtual
memory. Assuming you are on a system with 4K base pages, mmap will hand back a
16K portion of VA space that is only guaranteed to be 4K-aligned, so there is a
3/4 chance that it won't be 16K-aligned, in which case the kernel falls back to
small folios. A quick grep of the fio manual suggests that -iomem_align=16K
should solve this.

If that doesn't solve it, then there are a couple of other (less likely)
possibilities:

The -iomem option defaults to malloc() when not explicitly provided. Is it
possible that your malloc implementation is using MADV_NOHUGEPAGE? That would
prevent the allocation of large folios. This seems unlikely, though, because I
would have thought that malloc passes a 16K allocation straight through to
mmap, in which case this wouldn't apply.
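
One rough way to rule this out (assuming the fio process lives long enough to
poke at): check whether any of its VMAs carry the "nh" (MADV_NOHUGEPAGE) flag
in smaps, e.g.

  grep VmFlags /proc/$(pidof -s fio)/smaps | grep -w nh

If that prints nothing, malloc isn't advising against huge pages and you can
cross this one off the list.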

The only other possibility that springs to mind: since you have enabled all of
the sizes, if you are running on a very memory-constrained device then physical
memory may be so fragmented that the kernel can't allocate a large folio. This
also feels unlikely, though.
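
If you want to rule out fragmentation anyway, /proc/buddyinfo shows the free
block counts per order; with 4K base pages a 16K folio is an order-2
allocation, so as long as the order >= 2 columns aren't all close to zero,
this isn't your problem:

  cat /proc/buddyinfo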

If -iomem_align=16K doesn't solve it on its own, I'd suggest disabling all mTHP
sizes except 16K (after a reboot, just to be extra safe) and using the
-iomem=mmap option, which the manual suggests uses mmap with MAP_ANONYMOUS.
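
Roughly this, reusing the sysfs paths from your setup above (untested):

  echo never > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
  (repeat for 64kB through 2048kB, leaving only hugepages-16kB at always)

  fio -iodepth=1 -rw=write -ioengine=io_uring -direct=1 -bs=16K \
      -iomem=mmap -iomem_align=16K -numjobs=1 -size=16k \
      -group_reporting -filename=/dev/nvme0n1 -name=io_uring_test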

> 
> The fio malloced memory is allocated from a multi-order folio in
> function alloc_anon_folio().
> Block I/O path takes the fio allocated memory and maps it in kernel in
> function iov_iter_extract_user_pages()
> As the pages are mapped using large folios, I try to see if the pages
> belong to same folio using page_folio(page) in function
> __bio_iov_iter_get_pages().
> 
> To my surprise I see that the pages belong to different folios.
> 
> Feb  5 10:34:33 kernel: [244413.315660] 1603
> iov_iter_extract_user_pages addr = 5593b252a000
> Feb  5 10:34:33 kernel: [244413.315680] 1610
> iov_iter_extract_user_pages nr_pages = 4
> Feb  5 10:34:33 kernel: [244413.315700] 1291 __bio_iov_iter_get_pages
> page = ffffea000d4bb9c0 folio = ffffea000d4bb9c0
> Feb  5 10:34:33 kernel: [244413.315749] 1291 __bio_iov_iter_get_pages
> page = ffffea000d796200 folio = ffffea000d796200
> Feb  5 10:34:33 kernel: [244413.315796] 1291 __bio_iov_iter_get_pages
> page = ffffea000d796240 folio = ffffea000d796240
> Feb  5 10:34:33 kernel: [244413.315852] 1291 __bio_iov_iter_get_pages
> page = ffffea000d7b2b80 folio = ffffea000d7b2b80
> 
> I repeat the same experiment with fio using HUGE pages
> fio -iodepth=1 -iomem=mmaphuge -rw=write -ioengine=io_uring -direct=1
> -bs=16K -numjobs=1 -size=16k -group_reporting -filename=/dev/nvme0n1
> -name=io_uring_test

According to the manual, -iomem=mmaphuge uses hugetlb, so the allocation will
default to 2M and always be naturally aligned in virtual address space. It
makes sense that you see pages belonging to the same folio here.
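
(You can see both effects in your trace below: the buffer address 7f66e4c00000
is 2M-aligned, and all four pages resolve to the same folio, ffffea0005bc8000.)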

> 
> This time when the memory is mmapped from HUGE pages I see that pages belong
> to the same folio.
> 
> Feb  5 10:51:50 kernel: [245450.439817] 1603
> iov_iter_extract_user_pages addr = 7f66e4c00000
> Feb  5 10:51:50 kernel: [245450.439825] 1610
> iov_iter_extract_user_pages nr_pages = 4
> Feb  5 10:51:50 kernel: [245450.439834] 1291 __bio_iov_iter_get_pages
> page = ffffea0005bc8000 folio = ffffea0005bc8000
> Feb  5 10:51:50 kernel: [245450.439858] 1291 __bio_iov_iter_get_pages
> page = ffffea0005bc8040 folio = ffffea0005bc8000
> Feb  5 10:51:50 kernel: [245450.439880] 1291 __bio_iov_iter_get_pages
> page = ffffea0005bc8080 folio = ffffea0005bc8000
> Feb  5 10:51:50 kernel: [245450.439903] 1291 __bio_iov_iter_get_pages
> page = ffffea0005bc80c0 folio = ffffea0005bc8000
> 
> Please let me know if you have any clue as to why the pages for malloced memory
> of fio don't belong to the same folio.

Let me know if -iomem_align=16K solves it for you!

Thanks,
Ryan

> 
> --
> Kundan Kumar




