On Mon Oct 21, 2024 at 10:54 AM CEST, Kirill A. Shutemov wrote:
> On Mon, Oct 21, 2024 at 02:24:18PM +0800, Baolin Wang wrote:
>>
>>
>> On 2024/10/17 19:26, Kirill A. Shutemov wrote:
>> > On Thu, Oct 17, 2024 at 05:34:15PM +0800, Baolin Wang wrote:
>> > > + Kirill
>> > >
>> > > On 2024/10/16 22:06, Matthew Wilcox wrote:
>> > > > On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote:
>> > > > > Considering that tmpfs already has the 'huge=' option to control the THP
>> > > > > allocation, it is necessary to maintain compatibility with the 'huge='
>> > > > > option, as well as considering the 'deny' and 'force' option controlled
>> > > > > by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>> > > >
>> > > > No, it's not. No other filesystem honours these settings. tmpfs would
>> > > > not have had these settings if it were written today. It should simply
>> > > > ignore them, the way that NFS ignores the "intr" mount option now that
>> > > > we have a better solution to the original problem.
>> > > >
>> > > > To reiterate my position:
>> > > >
>> > > > - When using tmpfs as a filesystem, it should behave like other
>> > > >   filesystems.
>> > > > - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED, it should
>> > > >   behave like anonymous memory.
>> > >
>> > > I do agree with your point to some extent, but the ‘huge=’ option has
>> > > existed for nearly 8 years, and the huge orders based on write size may not
>> > > achieve the performance of PMD-sized THP in some scenarios, such as when the
>> > > write length is consistently 4K. So, I am still concerned that ignoring the
>> > > 'huge' option could lead to compatibility issues.
>> >
>> > Yeah, I don't think we are there yet to ignore the mount option.
>>
>> OK.
>>
>> > Maybe we need to get a new generic interface to request the semantics
>> > tmpfs has with huge= on per-inode level on any fs. Like a set of FADV_*
>> > handles to make kernel allocate PMD-size folio on any allocation or on
>> > allocations within i_size. I think this behaviour is useful beyond tmpfs.
>> >
>> > Then huge= implementation for tmpfs can be re-defined to set these
>> > per-inode FADV_ flags by default. This way we can keep tmpfs compatible
>> > with current deployments and less special comparing to rest of
>> > filesystems on kernel side.
>>
>> I did a quick search, and I didn't find any other fs that require PMD-sized
>> huge pages, so I am not sure if FADV_* is useful for filesystems other than
>> tmpfs. Please correct me if I missed something.
>
> What do you mean by "require"? THPs are always opportunistic.
>
> IIUC, we don't have a way to hint kernel to use huge pages for a file on
> read from backing storage. Readahead is not always the right way.
>
>> > If huge= is not set, tmpfs would behave the same way as the rest of
>> > filesystems.
>>
>> So if 'huge=' is not set, tmpfs write()/fallocate() can still allocate large
>> folios based on the write size? If yes, that means it will change the
>> default huge behavior for tmpfs. Because previously having 'huge=' is not
>> set means the huge option is 'SHMEM_HUGE_NEVER', which is similar to what I
>> mentioned:
>> "Another possible choice is to make the huge pages allocation based on write
>> size as the *default* behavior for tmpfs, ..."
>
> I am more worried about breaking existing users of huge pages. So changing
> behaviour of users who don't specify huge is okay to me.

I think moving tmpfs to allocate large folios opportunistically by
default (as was proposed initially) doesn't necessarily conflict with
the default behaviour (huge=never). We just need to clarify that in the
documentation.

However, IIRC, one of the requests from Hugh was to have a way to
disable large folios, which is something other filesystems have no
control over today. Ryan sent a proposal to control that globally, but
I think it didn't move forward.

So, what are we missing to go back to implementing large folios in tmpfs
in the default case, like any other fs leveraging large folios?
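
For anyone following along, the global sysfs control quoted at the top of
the thread can be inspected from userspace. A minimal sketch, assuming the
usual sysfs convention that the active value is the one shown in square
brackets (e.g. "always within_size advise [never] deny force"). The helper
names parse_bracketed() and current_shmem_policy() are made up for
illustration; they are not kernel or library APIs.

```python
from pathlib import Path

# Path taken from the thread; it only exists on kernels built with THP support.
SHMEM_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/shmem_enabled")


def parse_bracketed(text):
    """Split a sysfs option line into (active, all_options).

    The active option is assumed to be the one wrapped in square brackets.
    Returns (None, options) if no option is bracketed.
    """
    active = None
    options = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        options.append(token)
    return active, options


def current_shmem_policy(path=SHMEM_ENABLED):
    """Return the active shmem_enabled value, or None if it cannot be read."""
    try:
        return parse_bracketed(Path(path).read_text())[0]
    except OSError:
        return None


if __name__ == "__main__":
    print("shmem_enabled:", current_shmem_policy())
```

This only reads the global policy. The per-mount 'huge=' option is a
separate setting and is not covered by this sketch.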