On Fri, Dec 06, 2024 at 02:11:26PM -0800, Joanne Koong wrote:
> On Fri, Dec 6, 2024 at 12:36 PM Shakeel Butt <shakeel.butt@xxxxxxxxx> wrote:
> >
> > On Fri, Dec 06, 2024 at 09:41:25AM -0800, Joanne Koong wrote:
> > > On Fri, Dec 6, 2024 at 1:50 AM Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
> > [...]
> > > >
> > > > >
> > > > > Writes are still effectively one page size. Benchmarks showed that trying to get
> > > > > the largest folios possible from __filemap_get_folio() is an over-optimization
> > > > > and ends up being significantly more expensive. Fine-tuning for the optimal
> > > > > order size for the __filemap_get_folio() calls can be done in a future patchset.
> > >
> > > This is the behavior I noticed as well when running some benchmarks on
> > > v1 [1]. I think it's because when we call into __filemap_get_folio(),
> > > we hit the FGP_CREAT path and if the order we set is too high, the
> > > internal call to filemap_alloc_folio() will repeatedly fail until it
> > > finds an order it's able to allocate (eg the do { ... } while (order--
> > > > min_order) loop).
> > >
> >
> > What is the mapping_min_folio_order(mapping) for fuse? One thing we can
>
> The mapping_min_folio_order used is 0. The folio order range gets set here [1]
>
> [1] https://lore.kernel.org/linux-fsdevel/20241125220537.3663725-13-joannelkoong@xxxxxxxxx/
>
> > do is decide for which range of orders we want a cheap failure i.e. without
> > __GFP_DIRECT_RECLAIM and then the range where we are fine with some
> > effort and work. I see __GFP_NORETRY is being used for orders larger
>
> The gfp flags we pass into __filemap_get_folio() are the gfp flags of
> the mapping, and that gets set in inode_init_always_gfp() to
> GFP_HIGHUSER_MOVABLE, which does include __GFP_RECLAIM.
>
> If __GFP_RECLAIM is set and the filemap_alloc_folio() call can't find
> enough space, does this automatically trigger a round of reclaim and
> compaction as well?

Yes, it will trigger reclaim/compaction rounds, and depending on the order (order <= PAGE_ALLOC_COSTLY_ORDER) they can be very aggressive. The __GFP_NORETRY flag limits this to one iteration, but a single iteration can still be expensive depending on system conditions. For anon memory, specifically THP allocation, we can tune this through sysctls to be less aggressive, but there is also infrastructure like khugepaged which converts small pages into THPs in the background. I can imagine that we might want something similar for filesystems as well.
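
To make the "cheap failure for some orders, real effort for others" idea a bit more concrete, here is a rough userspace sketch in C. It is not kernel code and not what the patchset does: the SK_* bits are stand-ins rather than the real gfp values, and sk_gfp_for_order(), sk_mock_alloc(), sk_alloc_with_fallback() and the cheap_above threshold are hypothetical names made up for illustration. The fallback loop only mirrors the do { ... } while (order-- > min_order) shape quoted above.

```c
/* Stand-in flag bits for illustration only; these are NOT the kernel's values. */
#define SK_DIRECT_RECLAIM  0x1u
#define SK_KSWAPD_RECLAIM  0x2u
#define SK_NORETRY         0x4u
#define SK_NOWARN          0x8u
#define SK_RECLAIM         (SK_DIRECT_RECLAIM | SK_KSWAPD_RECLAIM)

/*
 * Hypothetical policy: choose the flags for a single attempt at `order`.
 * Orders above min_order get NORETRY/NOWARN (at most one reclaim pass);
 * orders above the assumed `cheap_above` threshold additionally drop
 * direct reclaim so that they fail fast.
 */
static unsigned int sk_gfp_for_order(unsigned int base, unsigned int order,
                                     unsigned int min_order,
                                     unsigned int cheap_above)
{
    unsigned int gfp = base;

    if (order > min_order)
        gfp |= SK_NORETRY | SK_NOWARN;
    if (order > cheap_above)
        gfp &= ~SK_DIRECT_RECLAIM;
    return gfp;
}

/* Mock allocator: pretends only orders <= max_ok can be satisfied. */
static int sk_mock_alloc(unsigned int order, unsigned int max_ok)
{
    return order <= max_ok;
}

/*
 * Try `order` first and fall back one order at a time down to min_order.
 * Returns the order that succeeded, or -1 if none did. Counts how many
 * attempts were allowed to enter direct reclaim.
 */
static int sk_alloc_with_fallback(unsigned int base, unsigned int order,
                                  unsigned int min_order,
                                  unsigned int cheap_above,
                                  unsigned int max_ok,
                                  unsigned int *direct_reclaim_attempts)
{
    do {
        unsigned int gfp = sk_gfp_for_order(base, order, min_order,
                                            cheap_above);

        if (gfp & SK_DIRECT_RECLAIM)
            (*direct_reclaim_attempts)++;
        if (sk_mock_alloc(order, max_ok))
            return (int)order;
    } while (order-- > min_order);

    return -1;
}
```

With a starting order of 5, min_order 0 and a mock allocator that only succeeds at order 2 or below, cheap_above = 3 means only the order-3 and order-2 attempts are allowed into direct reclaim. If the threshold is set above the starting order, all four attempts are.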