Hi all,

I would like to propose a discussion topic about improving large folio writeback performance. As more filesystems adopt large folios, it becomes increasingly important that writeback is as performant as possible. There are two areas I'd like to discuss:

== Granularity of dirty pages writeback ==

Currently, the granularity of writeback is at the folio level. If one byte in a folio is dirty, the entire folio is written back. This does not scale for larger folios and significantly degrades performance, especially for workloads that employ random writes.

One idea is to track dirtiness at a smaller granularity using a 64-bit bitmap stored inside the folio struct, where each bit tracks a smaller chunk of the folio (eg for 2 MB folios, each bit would track a 32 KB chunk), and only write back the dirty chunks rather than the entire folio. A rough sketch of this idea is appended at the end of this mail.

== Balancing dirty pages ==

It was observed that the dirty page balancing logic in balance_dirty_pages() fails to scale for large folios [1]. For example, fuse saw around a 125% regression in write throughput when using large folios vs small folios with 1 MB block sizes, which was attributed to scheduled io waits in the dirty page balancing logic.

In generic_perform_write(), dirty pages are balanced after every write to the page cache by the filesystem. With large folios, each write dirties a larger number of pages, which can grossly exceed the ratelimit, whereas with small folios each write dirties one page, so pages are balanced more incrementally and adhere more closely to the ratelimit. (A small sketch of this effect is also appended below.) In order to accommodate large folios, the dirty page balancing logic likely needs to be reworked.

Thanks,
Joanne

[1] https://lore.kernel.org/linux-fsdevel/Z1N505RCcH1dXlLZ@xxxxxxxxxxxxxxxxxxxx/T/#m9e3dd273aa202f9f4e12eb9c96602b5fec2d383d
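
Appendix: rough sketch of per-folio dirty chunk tracking. This is only a userspace toy to illustrate the bitmap idea under the assumptions above (2 MB folio, 64 bits, so 32 KB per bit); all names here (struct folio_sketch, dirty_chunks, folio_mark_range_dirty, folio_writeback_dirty_chunks) are hypothetical and not an existing kernel API.

#include <stdint.h>
#include <stdio.h>

#define FOLIO_SIZE	(2UL << 20)		/* 2 MB folio */
#define CHUNK_SIZE	(FOLIO_SIZE / 64)	/* 32 KB tracked per bitmap bit */

struct folio_sketch {
	uint64_t dirty_chunks;	/* bit N set => chunk N of the folio is dirty */
};

/* Mark the chunks covering byte range [off, off + len) dirty. */
static void folio_mark_range_dirty(struct folio_sketch *f, size_t off, size_t len)
{
	size_t first = off / CHUNK_SIZE;
	size_t last = (off + len - 1) / CHUNK_SIZE;

	for (size_t i = first; i <= last; i++)
		f->dirty_chunks |= 1ULL << i;
}

/* Write back only the dirty chunks instead of the whole folio. */
static void folio_writeback_dirty_chunks(struct folio_sketch *f)
{
	for (size_t i = 0; i < 64; i++) {
		if (f->dirty_chunks & (1ULL << i))
			printf("writeback chunk %zu: bytes [%zu, %zu)\n",
			       i, i * CHUNK_SIZE, (i + 1) * CHUNK_SIZE);
	}
	f->dirty_chunks = 0;
}

int main(void)
{
	struct folio_sketch f = { 0 };

	/* A 1-byte random write dirties a single 32 KB chunk, not 2 MB. */
	folio_mark_range_dirty(&f, 1234567, 1);
	folio_writeback_dirty_chunks(&f);
	return 0;
}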
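
Appendix: toy model of the ratelimit overshoot. This is not kernel code, just a simplified stand-in for the per-task check done after each copy in generic_perform_write(); the 32-page ratelimit is a made-up value for illustration. The point is only the arithmetic: one-page copies trip the check with no overshoot, while a single 2 MB (512-page) folio dirtying blows far past it in one go.

#include <stdio.h>

#define RATELIMIT_PAGES	32	/* illustrative value, not the real ratelimit */

static unsigned long nr_dirtied;

/* Simplified stand-in for the balance-dirty-pages ratelimit check. */
static void balance_check(unsigned long pages_dirtied)
{
	nr_dirtied += pages_dirtied;
	if (nr_dirtied >= RATELIMIT_PAGES) {
		printf("balance: overshot ratelimit by %lu pages -> longer throttle\n",
		       nr_dirtied - RATELIMIT_PAGES);
		nr_dirtied = 0;
	}
}

int main(void)
{
	/* Small folios: a 2 MB buffered write is 512 one-page copies; the
	 * check fires every 32 pages with essentially no overshoot. */
	for (int i = 0; i < 512; i++)
		balance_check(1);

	/* Large folios: the same 2 MB write dirties 512 pages in one copy,
	 * overshooting the 32-page ratelimit by 480 pages in a single check. */
	balance_check(512);
	return 0;
}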