On Monday, February 5, 2024, Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
On 05/02/2024 06:33, Kundan Kumar wrote:
>
> Hi All,
>
> I am using the patch "Multi-size THP for anonymous memory"
> https://lore.kernel.org/all/20231214160251.3574571-1-ryan.roberts@xxxxxxx/T/#u
Thanks for trying this out!
>
>
> I enabled the mTHP using the sysfs interface :
> echo always >/sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
> echo always >/sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
> echo always >/sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
> echo always >/sys/kernel/mm/transparent_hugepage/hugepages-128kB/enabled
> echo always >/sys/kernel/mm/transparent_hugepage/hugepages-256kB/enabled
> echo always >/sys/kernel/mm/transparent_hugepage/hugepages-512kB/enabled
> echo always >/sys/kernel/mm/transparent_hugepage/hugepages-1024kB/enabled
> echo always >/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
>
> I can see this patch allocates multi-order folio for anonymous memory.
>
> With the large order folios getting allocated I tried direct block IO using fio.
>
> fio -iodepth=1 -rw=write -ioengine=io_uring -direct=1 -bs=16K
> -numjobs=1 -size=16k -group_reporting -filename=/dev/nvme0n1
> -name=io_uring_test
I'm not familiar with fio, but my best guess is that this is an alignment issue.
mTHP will only allocate a large folio if it can be naturally aligned in virtual
memory. Assuming you are on a system with 4K base pages, then mmap will allocate
a 16K portion of the VA space aligned to 4K, so there is a 3/4 chance that it
won't be 16K aligned and then the system will have to allocate small folios to
it. A quick grep of the manual suggests that -iomem_align=16K should solve this.
If that doesn't solve it, then there are a couple of other (less likely)
possibilities:
The -iomem option defaults to malloc() when not explicitly provided. Is it
possible that your malloc implementation is using MADV_NOHUGEPAGE? This would
prevent the allocation of large folios. This seems unlikely though, because I
would have thought that malloc would pass 16K allocations straight through to
mmap, in which case this wouldn't apply.
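If it did turn out to be that, an (untested) way to rule it out would be to opt
the buffer back in before submitting IO, something like the following, where
"buf" stands for whatever fio hands to io_uring:

#include <sys/mman.h>

/* Sketch only: undo a possible MADV_NOHUGEPAGE set by the allocator
 * on the 16K IO buffer. */
if (madvise(buf, 16 * 1024, MADV_HUGEPAGE) != 0)
        perror("madvise");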
The only other possible reason that springs to mind is that if you have enabled
all of the possible sizes and you are running on a very memory constrained
device then perhaps the physical memory is so fragmented that it can't allocate
a large folio. This also feels unlikely though.
If -iomem_align=16K doesn't solve it on its own, I'd suggest trying with all
mTHP sizes disabled except for 16K (after a reboot just to be extra safe), then
use the -iomem=mmap option, which the manual suggests will use mmap with
MAP_ANONYMOUS.
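In other words, something along these lines (adapting your command line, and
leaving only the 16K size enabled):

echo always >/sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
echo never  >/sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled   # likewise for the other sizes

fio -iodepth=1 -iomem=mmap -iomem_align=16K -rw=write -ioengine=io_uring \
    -direct=1 -bs=16K -numjobs=1 -size=16k -group_reporting \
    -filename=/dev/nvme0n1 -name=io_uring_test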
>
> The fio malloced memory is allocated from a multi-order folio in
> function alloc_anon_folio().
> The block I/O path takes the fio-allocated memory and maps it in the kernel
> in function iov_iter_extract_user_pages().
> As the pages are mapped using large folios, I try to see whether the pages
> belong to the same folio using page_folio(page) in function
> __bio_iov_iter_get_pages().
>
> To my surprise I see that the pages belong to different folios.
>
> Feb 5 10:34:33 kernel: [244413.315660] 1603
> iov_iter_extract_user_pages addr = 5593b252a000
> Feb 5 10:34:33 kernel: [244413.315680] 1610
> iov_iter_extract_user_pages nr_pages = 4
> Feb 5 10:34:33 kernel: [244413.315700] 1291 __bio_iov_iter_get_pages
> page = ffffea000d4bb9c0 folio = ffffea000d4bb9c0
> Feb 5 10:34:33 kernel: [244413.315749] 1291 __bio_iov_iter_get_pages
> page = ffffea000d796200 folio = ffffea000d796200
> Feb 5 10:34:33 kernel: [244413.315796] 1291 __bio_iov_iter_get_pages
> page = ffffea000d796240 folio = ffffea000d796240
> Feb 5 10:34:33 kernel: [244413.315852] 1291 __bio_iov_iter_get_pages
> page = ffffea000d7b2b80 folio = ffffea000d7b2b80
>
> I repeat the same experiment with fio using HUGE pages
> fio -iodepth=1 -iomem=mmaphuge -rw=write -ioengine=io_uring -direct=1
> -bs=16K -numjobs=1 -size=16k -group_reporting -filename=/dev/nvme0n1
> -name=io_uring_test
According to the manual, -iomem=mmaphuge uses hugetlb. So that will default
to 2M and always be naturally aligned in virtual space, so it makes sense that
you are seeing pages that belong to the same folio here.
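(Indeed, the address in your trace below, 7f66e4c00000, is 2M-aligned: its low
21 bits are zero, i.e. 0x7f66e4c00000 & 0x1fffff == 0.)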
>
> This time, when the memory is mmapped from HUGE pages, I see that the pages
> belong to the same folio.
>
> Feb 5 10:51:50 kernel: [245450.439817] 1603
> iov_iter_extract_user_pages addr = 7f66e4c00000
> Feb 5 10:51:50 kernel: [245450.439825] 1610
> iov_iter_extract_user_pages nr_pages = 4
> Feb 5 10:51:50 kernel: [245450.439834] 1291 __bio_iov_iter_get_pages
> page = ffffea0005bc8000 folio = ffffea0005bc8000
> Feb 5 10:51:50 kernel: [245450.439858] 1291 __bio_iov_iter_get_pages
> page = ffffea0005bc8040 folio = ffffea0005bc8000
> Feb 5 10:51:50 kernel: [245450.439880] 1291 __bio_iov_iter_get_pages
> page = ffffea0005bc8080 folio = ffffea0005bc8000
> Feb 5 10:51:50 kernel: [245450.439903] 1291 __bio_iov_iter_get_pages
> page = ffffea0005bc80c0 folio = ffffea0005bc8000
>
> Please let me know if you have any clue as to why the pages for malloced memory
> of fio don't belong to the same folio.
Let me know if -iomem_align=16K solves it for you!
Thanks,
Ryan
Thanks Ryan for the help and the elaborate reply.
I tried various combinations. The good news is that aligned mmap memory gets a large folio allocated and solves the issue.
Let's look at the various cases one by one:
==============
Aligned malloc
==============
Alignment alone didn't solve the issue. The command I used:
fio -iodepth=1 -iomem_align=16K -rw=write -ioengine=io_uring -direct=1 -hipri -bs=16K -numjobs=1 -size=16k -group_reporting -filename=/dev/nvme0n1 -name=io_uring_test
The block IO path still sees separate pages belonging to separate folios.
Logs:
Feb 5 15:27:32 kernel: [261992.075752] 1603 iov_iter_extract_user_pages addr = 55b2a0542000
Feb 5 15:27:32 kernel: [261992.075762] 1610 iov_iter_extract_user_pages nr_pages = 4
Feb 5 15:27:32 kernel: [261992.075786] 1291 __bio_iov_iter_get_pages page = ffffea000d9461c0 folio = ffffea000d9461c0
Feb 5 15:27:32 kernel: [261992.075812] 1291 __bio_iov_iter_get_pages page = ffffea000d7ef7c0 folio = ffffea000d7ef7c0
Feb 5 15:27:32 kernel: [261992.075836] 1291 __bio_iov_iter_get_pages page = ffffea000d7d30c0 folio = ffffea000d7d30c0
Feb 5 15:27:32 kernel: [261992.075861] 1291 __bio_iov_iter_get_pages page = ffffea000d7f2680 folio = ffffea000d7f2680
==============
Non aligned mmap
==============
Unaligned mmap does somewhat better; we see 3 pages from the same folio:
fio -iodepth=1 -iomem=mmap -rw=write -ioengine=io_uring -direct=1 -hipri -bs=16K -numjobs=1 -size=16k -group_reporting -filename=/dev/nvme0n1 -name=io_uring_test
Feb 5 15:31:08 kernel: [262208.082789] 1603 iov_iter_extract_user_pages addr = 7f72bc711000
Feb 5 15:31:08 kernel: [262208.082808] 1610 iov_iter_extract_user_pages nr_pages = 4
Feb 5 15:24:31 kernel: [261811.086973] 1291 __bio_iov_iter_get_pages page = ffffea000aed36c0 folio = ffffea000aed36c0
Feb 5 15:24:31 kernel: [261811.087010] 1291 __bio_iov_iter_get_pages page = ffffea000d2d0200 folio = ffffea000d2d0200
Feb 5 15:24:31 kernel: [261811.087044] 1291 __bio_iov_iter_get_pages page = ffffea000d2d0240 folio = ffffea000d2d0200
Feb 5 15:24:31 kernel: [261811.087078] 1291 __bio_iov_iter_get_pages page = ffffea000d2d0280 folio = ffffea000d2d0200
==============
Aligned mmap
==============
Aligned mmap ("-iomem_align=16K -iomem=mmap") solves the issue!
Even with all the mTHP sizes enabled, I see one folio backing the 4 pages.
fio -iodepth=1 -iomem_align=16K -iomem=mmap -rw=write -ioengine=io_uring -direct=1 -hipri -bs=16K -numjobs=1 -size=16k -group_reporting -filename=/dev/nvme0n1 -name=io_uring_test
Feb 5 15:29:36 kernel: [262115.791589] 1603 iov_iter_extract_user_pages addr = 7f5c9087b000
Feb 5 15:29:36 kernel: [262115.791611] 1610 iov_iter_extract_user_pages nr_pages = 4
Feb 5 15:29:36 kernel: [262115.791635] 1291 __bio_iov_iter_get_pages page = ffffea000e0116c0 folio = ffffea000e011600
Feb 5 15:29:36 kernel: [262115.791696] 1291 __bio_iov_iter_get_pages page = ffffea000e011700 folio = ffffea000e011600
Feb 5 15:29:36 kernel: [262115.791755] 1291 __bio_iov_iter_get_pages page = ffffea000e011740 folio = ffffea000e011600
Feb 5 15:29:36 kernel: [262115.791814] 1291 __bio_iov_iter_get_pages page = ffffea000e011780 folio = ffffea000e011600
So it looks like normal malloc, even if aligned, doesn't get backed by
large-order folios. Only if we do an mmap that sets the flags
"OS_MAP_ANON | MAP_PRIVATE" do we get the same folio.
I was under the assumption that malloc would internally use mmap with
MAP_ANON and we would get the same folio.
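For reference, what I mean by mmap here is roughly this (a sketch of what
fio's -iomem=mmap presumably boils down to; OS_MAP_ANON is fio's portability
name for MAP_ANONYMOUS/MAP_ANON):

#include <sys/mman.h>

/* Sketch only: a private anonymous mapping, as opposed to a malloc'd buffer.
 * Together with -iomem_align=16K, fio then places the IO buffer on a 16K
 * boundary inside this mapping, so a single 16K folio can back it. */
void *buf = mmap(NULL, 16 * 1024, PROT_READ | PROT_WRITE,
                 MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
if (buf == MAP_FAILED)
        perror("mmap");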
On another front, I have logs in alloc_anon_folio(). For just the malloc case
(without align and without mmap) I see an allocation of 64 pages;
"addr = 5654feac0000" is the address malloced by fio.
Feb 5 15:56:56 kernel: [263756.413095] alloc_anon_folio comm=fio order = 6 folio = ffffea000e044000 addr = 5654feac0000 vma = ffff88814cfc7c20
Feb 5 15:56:56 kernel: [263756.413110] alloc_anon_folio comm=fio folio_nr_pages = 64
64 pages is 0x40000 bytes, which added to 5654feac0000 gives 5654feb00000.
So this user-space address range should be covered by this one folio.
And after this, when IO is issued, I see a user-space address in this range
passed down the block IO path. But the code in iov_iter_extract_user_pages()
doesn't fetch pages from that same folio.
Feb 5 15:56:57 kernel: [263756.678586] 1603 iov_iter_extract_user_pages addr = 5654fead4000
Feb 5 15:56:57 kernel: [263756.678606] 1610 iov_iter_extract_user_pages nr_pages = 4
Feb 5 15:56:57 kernel: [263756.678629] 1291 __bio_iov_iter_get_pages page = ffffea000dfc2b80 folio = ffffea000dfc2b80
Feb 5 15:56:57 kernel: [263756.678684] 1291 __bio_iov_iter_get_pages page = ffffea000dfc2bc0 folio = ffffea000dfc2bc0
Feb 5 15:56:57 kernel: [263756.678738] 1291 __bio_iov_iter_get_pages page = ffffea000d7b9100 folio = ffffea000d7b9100
Feb 5 15:56:57 kernel: [263756.678790] 1291 __bio_iov_iter_get_pages page = ffffea000d7b9140 folio = ffffea000d7b9140
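For completeness, the debug print behind the "__bio_iov_iter_get_pages page =
... folio = ..." lines is essentially the following, modulo the exact format:

        struct folio *folio = page_folio(page);

        pr_info("%d __bio_iov_iter_get_pages page = %px folio = %px\n",
                __LINE__, page, folio);

i.e. in the plain malloc case each of the 4 extracted pages resolves to a
different folio, while in the aligned-mmap case all 4 resolve to the same one.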
Please let me know your thoughts on this.
--
Kundan Kumar