On Mon, Dec 9, 2024 at 4:31 PM Joanne Koong <joannelkoong@xxxxxxxxx> wrote:
>
> On Fri, Dec 6, 2024 at 2:25 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> >
> > On Fri, Dec 06, 2024 at 09:41:25AM -0800, Joanne Koong wrote:
> > > On Fri, Dec 6, 2024 at 1:50 AM Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
> > > > -	folio = __filemap_get_folio(mapping, index, FGP_WRITEBEGIN,
> > > > +	folio = __filemap_get_folio(mapping, index, FGP_WRITEBEGIN |
> > > > 			fgf_set_order(len),
> > > >
> > > > Otherwise the large folio is not enabled on the buffer write path.
> > > >
> > > > Besides, when applying the above diff, the large folio is indeed enabled
> > > > but it suffers severe performance regression:
> > > >
> > > > fio 1 job buffer write:
> > > > 2GB/s BW w/o large folio, and 200MB/s BW w/ large folio
> > >
> > > This is the behavior I noticed as well when running some benchmarks on
> > > v1 [1]. I think it's because when we call into __filemap_get_folio(),
> > > we hit the FGP_CREAT path and if the order we set is too high, the
> > > internal call to filemap_alloc_folio() will repeatedly fail until it
> > > finds an order it's able to allocate (eg the
> > > do { ... } while (order-- > min_order) loop).
> >
> > But this is very different from what other filesystems have measured
> > when allocating large folios during writes. eg:
> >
> > https://lore.kernel.org/linux-fsdevel/20240527163616.1135968-1-hch@xxxxxx/
>
> Ok, this seems like something particular to FUSE then, if all the
> other filesystems are seeing 2x throughput improvements for buffered
> writes. If someone doesn't get to this before me, I'll look deeper
> into this.
>
> Thanks,
> Joanne
>
> > So we need to understand what's different about fuse. My suspicion is
> > that it's disabling some other optimisation that is only done on
> > order 0 folios, but that's just wild speculation. Needs someone to
> > dig into it and look at profiles to see what's really going on.

I got a chance to look more into this.

This is happening because with large folios, a large number of pages
gets dirtied per write, and when the kernel balances dirty pages it
uses "HZ * pages_dirtied / task_ratelimit" to determine whether an io
timeout needs to be scheduled while writeback happens in the
background. For large folios, where lots of pages are dirtied at once,
this usually results in an io timeout, while small folios skirt it
because they balance / write back pages incrementally. The io wait is
what's incurring the extra cost for large folios.

The entry point into this is generic_perform_write(), which fuse
writeback caching reaches through

  fuse_cache_write_iter()
    generic_file_write_iter()
      __generic_file_write_iter()
        generic_perform_write()

In generic_perform_write(), balance_dirty_pages_ratelimited() is
called per folio that's written. If we're doing a 1GB write where the
block size is 1MB, then for small folios we write 1 page, call
balance_dirty_pages_ratelimited(), write the next page, call
balance_dirty_pages_ratelimited(), etc. In
balance_dirty_pages_ratelimited(), we only actually go balance / write
back the pages if the number of dirtied pages exceeds the ratelimit
(on my running system that's 16 pages), so effectively for small
folios the number of accumulated dirty pages per balancing attempt is
the ratelimit. Whereas with large folios, we write 256 pages at a
time, call balance_dirty_pages_ratelimited(), exceed the ratelimit, go
balance the pages with balance_dirty_pages(), and then have to
schedule an io wait.
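
To make the mechanism concrete, here's a rough sketch of the path
(heavily simplified from mm/page-writeback.c, not the exact code; the
16-page ratelimit is just what I see on my system):

	/* folio_account_dirtied(): dirtying a folio bumps the task's
	 * counter by the folio size, so a 1MB folio adds 256 pages in
	 * one shot while an order-0 folio adds 1. */
	current->nr_dirtied += folio_nr_pages(folio);

	/* balance_dirty_pages_ratelimited(): only enter the balancing
	 * path once the task has dirtied ~nr_dirtied_pause pages (16
	 * here). Small folios take ~16 writes to get here; a large
	 * folio blows past it on every single write. */
	if (current->nr_dirtied >= ratelimit)
		balance_dirty_pages(wb, flags, current->nr_dirtied);

	/* balance_dirty_pages(): the pause scales with how much was
	 * dirtied since the last call. */
	pause = HZ * pages_dirtied / task_ratelimit;
	if (pause < min_pause)
		break;			/* small folios usually bail out here */
	...
	io_schedule_timeout(pause);	/* large folios end up sleeping here */
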
Small folios avoid scheduling this io wait: they take the
"if (pause < min_pause) { ... break; }" exit in balance_dirty_pages().

Without the io wait, I'm seeing a significant improvement in large
folio performance, eg running fio with bs=1M size=1G:

  small folios:                  ~1300 MB/s
  large folios (w/ io waits):     ~300 MB/s
  large folios (w/out io waits): ~2400 MB/s

Also fwiw, nfs also calls into generic_perform_write() for handling
writeback writes (eg nfs_file_write()). Running nfs on my localhost, I
see a perf drop for size=1G bs=1M writes (~430 MB/s with large folios
vs ~550 MB/s with small folios), though it's nowhere near as large as
the perf drop for fuse.

Matthew, what are your thoughts on the best way to address this? Do
you think we should increase the min_pause threshold?

Thanks,
Joanne
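
P.S. In case it's useful, here's a tiny userspace model of the pause
arithmetic above. The task_ratelimit and min_pause values are made up
for illustration (the kernel computes both dynamically), but it shows
the shape of the problem: a ratelimit-sized batch of dirtied pages
rounds down below min_pause and skips the sleep, while a 256-page
folio doesn't.

	#include <stdio.h>

	#define HZ 1000				/* assumes CONFIG_HZ=1000 */

	int main(void)
	{
		long task_ratelimit = 100000;	/* pages/sec, made-up value */
		long min_pause = 1;		/* jiffies, made-up value */
		long dirtied[] = { 16, 256 };	/* ratelimit batch vs 1MB folio */

		for (int i = 0; i < 2; i++) {
			long pause = HZ * dirtied[i] / task_ratelimit;

			printf("pages_dirtied=%3ld -> pause=%ld jiffies: %s\n",
			       dirtied[i], pause,
			       pause < min_pause ? "pause < min_pause, no sleep"
						 : "io_schedule_timeout()");
		}
		return 0;
	}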