On Mon, Feb 10, 2025 at 3:12 PM Raphael S. Carvalho <raphaelsc@xxxxxxxxxxxx> wrote: > > While running scylladb test suite, which uses io_uring + buffered > writes + XFS, the system was spuriously returning ENOMEM, despite > there being plenty of available memory to be reclaimed from the page > cache. FWIW, I am running: 6.12.9-100.fc40.x86_64 > > Tracing showed io_uring_complete failing the request with ENOMEM: > # cat /sys/kernel/debug/tracing/trace | grep "result -12" -B 100 | > grep "0000000065b91cd1" > reactor-1-707139 [000] ..... 46737.358518: > io_uring_submit_req: ring 00000000e52339b8, req 0000000065b91cd1, > user_data 0x50f0001e4000, opcode WRITE, flags 0x200000, sq_thread 0 > reactor-1-707139 [000] ..... 46737.358526: io_uring_file_get: > ring 00000000e52339b8, req 0000000065b91cd1, user_data 0x50f0001e4000, > fd 45 > reactor-1-707139 [000] ...1. 46737.358560: io_uring_complete: > ring 00000000e52339b8, req 0000000065b91cd1, user_data 0x50f0001e4000, > result -12, cflags 0x0 extra1 0 extra2 0 > > That puzzled me. > > Using retsnoop, it pointed to iomap_get_folio: > > 00:34:16.180612 -> 00:34:16.180651 TID/PID 253786/253721 > (reactor-1/combined_tests): > > entry_SYSCALL_64_after_hwframe+0x76 > do_syscall_64+0x82 > __do_sys_io_uring_enter+0x265 > io_submit_sqes+0x209 > io_issue_sqe+0x5b > io_write+0xdd > xfs_file_buffered_write+0x84 > iomap_file_buffered_write+0x1a6 > 32us [-ENOMEM] iomap_write_begin+0x408 > iter=&{.inode=0xffff8c67aa031138,.len=4096,.flags=33,.iomap={.addr=0xffffffffffffffff,.length=4096,.type=1,.flags=3,.bdev=0x… > pos=0 len=4096 foliop=0xffffb32c296b7b80 > ! 4us [-ENOMEM] iomap_get_folio > iter=&{.inode=0xffff8c67aa031138,.len=4096,.flags=33,.iomap={.addr=0xffffffffffffffff,.length=4096,.type=1,.flags=3,.bdev=0x… > pos=0 len=4096 > > Another trace shows iomap_file_buffered_write with ki_flags 2359304, > which translate into (IOCB_WRITE & IOCB_ALLOC_CACHE & IOCB_NOWAIT) > And flags 33 in iomap_get_folio means IOMAP_NOWAIT, which makes sense > since XFS translates IOCB_NOWAIT into IOMAP_NOWAIT for performing the > buffered write through iomap subsystem: > > fs/iomap/buffered-io.c- if (iocb->ki_flags & IOCB_NOWAIT) > fs/iomap/buffered-io.c: iter.flags |= IOMAP_NOWAIT; > > > We know io_uring works by first attempting to write with IOCB_NOWAIT, > and if it fails with EAGAIN, it falls back to worker thread without > the NOWAIT semantics. > > iomap_get_folio(), once called with IOMAP_NOWAIT, will request the > allocation to follow GFP_NOWAIT behavior, so allocation can > potentially fail under pressure. > > Coming across 'iomap: Add async buffered write support', I see Darrick wrote: > > "FGP_NOWAIT can cause __filemap_get_folio to return a NULL folio, which > makes iomap_write_begin return -ENOMEM. If nothing has been written > yet, won't that cause the ENOMEM to escape to userspace? Why do we want > that instead of EAGAIN?" > > In the patch ''mm: return an ERR_PTR from __filemap_get_folio', I see > the following changes: > --- a/fs/iomap/buffered-io.c > +++ b/fs/iomap/buffered-io.c > @@ -468,19 +468,12 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate); > struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos) > { > unsigned fgp = FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE | FGP_NOFS; > - struct folio *folio; > > if (iter->flags & IOMAP_NOWAIT) > fgp |= FGP_NOWAIT; > > - folio = __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT, > + return __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT, > fgp, mapping_gfp_mask(iter->inode->i_mapping)); > - if (folio) > - return folio; > - > - if (iter->flags & IOMAP_NOWAIT) > - return ERR_PTR(-EAGAIN); > - return ERR_PTR(-ENOMEM); > } > > This leads to me believe we have a regression in this area, after that > patch, since iomap_get_folio() is no longer returning EAGAIN with > IOMAP_NOWAIT, if __filemap_get_folio() failed to get a folio. Now it > returns ENOMEM unconditionally. > > Since we pushed the error picking decision to __filemap_get_folio, I > think it makes sense for us to patch it such that it returns EAGAIN if > allocation failed (under pressure) because IOMAP_NOWAIT was requested > by its caller and allocation is not allowed to block waiting for > reclaimer to do its thing. > > A possible way to fix it is this one-liner, but I am not well versed > in this area, so someone may end up suggesting a better fix: > diff --git a/mm/filemap.c b/mm/filemap.c > index 804d7365680c..9e698a619545 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -1964,7 +1964,7 @@ struct folio *__filemap_get_folio(struct > address_space *mapping, pgoff_t index, > do { > gfp_t alloc_gfp = gfp; > > - err = -ENOMEM; > + err = (fgp_flags & FGP_NOWAIT) ? -ENOMEM : -EAGAIN; Sorry, I actually meant this: + err = (fgp_flags & FGP_NOWAIT) ? -EAGAIN : -ENOMEM;