On 8/16/23 8:33 AM, Jens Axboe wrote:
>> However I didn't see any io_uring related callback inside btrfs code,
>> any advice on the io_uring part would be appreciated.
>
> io_uring doesn't do anything special here, it uses the normal page cache
> read/write parts for buffered IO. But you may get extra parallelism
> with io_uring here. For example, with the buffered write that this most
> likely is, libaio would be exactly the same as a pwrite(2) on the file.
> If this would've blocked, io_uring would offload this to a helper
> thread. Depending on the workload, you could have multiple of those in
> progress at the same time.

I poked a bit at fsstress, and it's a bit odd imho. For example, any
aio read/write seems to hardcode O_DIRECT, while the io_uring side will
be buffered. Not sure why those differ, or why buffered/dio isn't a
variable. But it does mean that these are certainly buffered writes
with io_uring.

Are any of the writes overlapping? You could have a situation where
writeA and writeB overlap, where writeA gets punted to io-wq for
execution and writeB completes inline. In other words: writeA is
issued, then writeB is issued; writeA goes to io-wq, writeB completes
inline, and only then does writeA finish and complete. That may be
exposing issues in btrfs.

You can try the below patch, which should serialize all the writes to
a given file. If this makes a difference for you, then I'd strongly
suspect that the issue is deeper than the delivery mechanism of the
write.

diff --git a/ltp/fsstress.c b/ltp/fsstress.c
index 6641a525fe5d..034cbba27c6e 100644
--- a/ltp/fsstress.c
+++ b/ltp/fsstress.c
@@ -2317,6 +2317,7 @@ do_uring_rw(opnum_t opno, long r, int flags)
 		off %= maxfsize;
 		memset(buf, nameseq & 0xff, len);
 		io_uring_prep_writev(sqe, fd, &iovec, 1, off);
+		sqe->flags |= IOSQE_ASYNC;
 	} else {
 		off = (off64_t)(lr % stb.st_size);
 		io_uring_prep_readv(sqe, fd, &iovec, 1, off);

-- 
Jens Axboe
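
For reference, the overlap scenario described above boils down to
something like the sketch below: two buffered writes to overlapping
ranges of the same file, the first forced to an io-wq worker via
IOSQE_ASYNC so its completion can land after the second, inline one.
This is a minimal illustration, not fsstress code; the file name,
offsets, and sizes are made up, and it assumes liburing is available
(build with -luring).

/* overlap.c: two overlapping io_uring buffered writes, writeA forced
 * to io-wq so it may complete after writeB. Illustrative only. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <liburing.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	char bufa[8192], bufb[8192];
	int fd, i;

	fd = open("testfile", O_RDWR | O_CREAT, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (io_uring_queue_init(8, &ring, 0) < 0)
		return 1;

	memset(bufa, 'A', sizeof(bufa));
	memset(bufb, 'B', sizeof(bufb));

	/* writeA: offset 0, punted to an io-wq worker */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_write(sqe, fd, bufa, sizeof(bufa), 0);
	sqe->flags |= IOSQE_ASYNC;
	sqe->user_data = 1;

	/* writeB: overlaps writeA at offset 4096, may complete inline
	 * during submit, i.e. before writeA */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_write(sqe, fd, bufb, sizeof(bufb), 4096);
	sqe->user_data = 2;

	io_uring_submit(&ring);

	/* completion order need not match submission order */
	for (i = 0; i < 2; i++) {
		if (io_uring_wait_cqe(&ring, &cqe) < 0)
			break;
		printf("write%c completed, res %d\n",
		       cqe->user_data == 1 ? 'A' : 'B', cqe->res);
		io_uring_cqe_seen(&ring, cqe);
	}

	io_uring_queue_exit(&ring);
	return 0;
}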