On Thu, 12 Sept 2024 at 15:30, Jens Axboe <axboe@xxxxxxxxx> wrote: > > It might be an iomap thing... Other file systems do use it, but to > various degrees, and XFS is definitely the primary user. I have to say, I looked at the iomap code, and it's disgusting. The "I don't support large folios" check doesn't even say "don't do large folios". That's what the regular __filemap_get_folio() code does for reads, and that's the sane thing to do. But that's not what the iomap code does. AT ALL. No, the iomap code limits "len" of a write in iomap_write_begin() to be within one page, and then magically depends on (a) __iomap_get_folio() using that length to decide how big a folio to allocate (b) iomap_write_begin() doing its own "what is the real length:" based on that. (c) the *caller* then having to do the same thing, to see what length iomap_write_begin() _actually_ used (because it wasn't the 'bytes' that was passed in). Honestly, the iomap code is just odd. Having these kinds of subtle interdependencies doesn't make sense. The two code sequences don't even use the same logic, with iomap_write_begin() doing if (!mapping_large_folio_support(iter->inode->i_mapping)) len = min_t(size_t, len, PAGE_SIZE - offset_in_page(pos)); [... alloc folio ...] if (pos + len > folio_pos(folio) + folio_size(folio)) len = folio_pos(folio) + folio_size(folio) - pos; and the caller (iomap_write_iter) doing offset = offset_in_folio(folio, pos); if (bytes > folio_size(folio) - offset) bytes = folio_size(folio) - offset; and yes, the two completely different ways of picking 'len' (called 'bytes' in the second case) had *better* match. I do think they match, but code shouldn't be organized this way. It's not just the above kind of odd thing either, it's things like iomap_get_folio() using that fgf_set_order(len), which does unsigned int shift = ilog2(size); if (shift <= PAGE_SHIFT) return 0; so now it has done that potentially expensive ilog2() for the common case of "len < PAGE_SIZE", but dammit, it should never have even bothered looking at 'len' if the inode didn't support large folios in the first place, and we shouldn't have had that special odd 'len = min_t(..)" magic rule to force an order-0 thing, because Yeah, yeah, modern CPU's all have reasonably cheap bit finding instructions. But the code simply shouldn't have this kind of thing in the first place. The folio should have been allocated *before* iomap_write_begin(), the "no large folios" should just have fixed the order to zero there, and the actual real-life length of the write should have been limited in *one* piece of code after the allocation point instead of then having two different pieces of code depending on matching (subtle and undocumented) logic. Put another way: I most certainly don't see the bug here - it may look _odd_, but not wrong - but at the same time, looking at that code doesn't make me get the warm and fuzzies about the iomap large-folio situation either. Linus