On Wed, Dec 20, 2023 at 08:53:38PM -0800, Fangrui Song wrote:
> Thanks for the comment. Frankly, I am not familiar with huge pages...
> I noticed this VM_EXEC condition when I was writing this
> hugepage-related section in
> https://maskray.me/blog/2023-12-17-exploring-the-section-layout-in-linker-output#transparent-huge-pages-for-mapped-files
> (Thanks to Alexander Monakov's comment about
> CONFIG_READ_ONLY_THP_FOR_FS in
> https://mazzo.li/posts/check-huge-page.html).

CONFIG_READ_ONLY_THP_FOR_FS is a preliminary hack which solves some
problems. The real solution is using large folios, which at the moment
means that you should test on XFS or AFS; filesystem authors have not
been enthusiastic about adding support to their filesystems so far.

In your blog, you write:

: In -z noseparate-code layouts, the file content starts somewhere at
: the first page, potentially wasting half a huge page on unrelated
: content. Switching to -z separate-code allows reclaiming the benefits
: of the half huge page but increases the file size. Balancing
: these aspects poses a challenge. One potential solution is using
: fallocate(FALLOC_FL_PUNCH_HOLE), which introduces complexity into the
: linker. However, this approach feels like a workaround to address a
: kernel limitation. It would be preferable if a file-backed huge page
: didn't necessitate a file offset aligned to a huge page boundary.

You should distinguish between file size (ie st_size in stat(3)) and
amount of space occupied on storage (st_blocks). The linker should be
fine with creating a sparse file. If it doesn't, cp --sparse will do
the trick.

Yes, it's a kernel limitation that folios have to be aligned within
the file as well as in both virtual and physical address space. It's
a huge complexity win to do that; I don't think we'd be able to tile
the page cache effectively if we allowed folios to be placed at
arbitrary offsets (I think it turns into a knapsack problem at that
point).
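
To make the st_size / st_blocks distinction concrete, here is a minimal
sketch in Python. The helper name sparse_file_stats is made up for
illustration, the 2MB huge page size is an assumption (the usual x86-64
PMD size), and the multiplication by 512 assumes Linux's st_blocks unit.
It seeks past a huge-page-sized gap before writing, roughly what a linker
padding to an aligned offset could do, and reports both numbers.

```python
import os
import tempfile

HUGE_PAGE = 2 * 1024 * 1024  # assumption: 2MB huge pages, as on x86-64


def sparse_file_stats(hole_size=HUGE_PAGE, payload=b"x" * 4096):
    """Write `payload` at offset `hole_size` in a fresh temporary file.

    Returns (st_size, allocated_bytes). Whether the skipped range really
    stays unallocated depends on the filesystem, so the second value is
    only informative.
    """
    fd, path = tempfile.mkstemp()
    try:
        # Seeking past EOF and then writing leaves a hole: no data is
        # written for the skipped range.
        os.lseek(fd, hole_size, os.SEEK_SET)
        os.write(fd, payload)
        os.fsync(fd)
        st = os.fstat(fd)
        # st_blocks is counted in 512-byte units on Linux.
        return st.st_size, st.st_blocks * 512
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    size, allocated = sparse_file_stats()
    print(f"st_size   = {size} bytes")
    print(f"allocated = {allocated} bytes")
```

On a filesystem that supports holes, the allocated figure should come out
far smaller than st_size, which is the point: padding to a huge page
boundary need not cost storage.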

> As dTLB for read-only data is also an important optimization of
> file-backed THP, it seems straightforward that we should drop the
> VM_EXEC condition :)

I'm not particularly enthusiastic about making
CONFIG_READ_ONLY_THP_FOR_FS better. Large folios are the future.
Indeed, I'd like to see CONFIG_READ_ONLY_THP_FOR_FS go away in the
next year or two once btrfs and ext4 have support for large folios.