On Fri 24-05-24 15:17:55, Jan Kara wrote:
> On Thu 23-05-24 18:00:01, Trond Myklebust wrote:
> > On Thu, 2024-05-23 at 18:54 +0200, Jan Kara wrote:
> > > Hello!
> > >
> > > I've been debugging an NFS performance regression with recent
> > > kernels. It seems to be at least partially related to the following
> > > behavior of NFS (which has been there for a long time AFAICT).
> > > Suppose the following workload:
> > >
> > > fio --direct=0 --ioengine=sync --thread --directory=/test --invalidate=1 \
> > >     --group_reporting=1 --runtime=100 --fallocate=posix --ramp_time=10 \
> > >     --name=RandomWrites-async --new_group --rw=randwrite --size=32000m \
> > >     --numjobs=4 --bs=4k --fsync_on_close=1 --end_fsync=1 \
> > >     --filename_format='FioWorkloads.$jobnum'
> > >
> > > So we do 4k buffered random writes from 4 threads into 4 different
> > > files. Now the interesting behavior comes on the final fsync(2).
> > > What I observe is that the NFS server is getting a stream of 4-8k
> > > writes which have the 'stable' flag set. What the server does for
> > > each such write is that it performs the write and calls fsync(2).
> > > Since by the time fio calls fsync(2) on the NFS client there is
> > > like 6-8 GB worth of dirty pages to write, and the server
> > > effectively ends up writing each individual 4k page as an O_SYNC
> > > write, the throughput is not great...
> > >
> > > The reason why the client sets the 'stable' flag for each page
> > > write seems to be that nfs_writepages() issues writes with
> > > FLUSH_COND_STABLE for WB_SYNC_ALL writeback and
> > > nfs_pgio_rpcsetup() has this logic:
> > >
> > > 	switch (how & (FLUSH_STABLE | FLUSH_COND_STABLE)) {
> > > 	case 0:
> > > 		break;
> > > 	case FLUSH_COND_STABLE:
> > > 		if (nfs_reqs_to_commit(cinfo))
> > > 			break;
> > > 		fallthrough;
> > > 	default:
> > > 		hdr->args.stable = NFS_FILE_SYNC;
> > > 	}
> > >
> > > but since this is the final fsync(2), there are no more requests to
> > > commit, so we set the NFS_FILE_SYNC flag.
> > >
> > > Now I'd think the client is stupid in submitting so many
> > > NFS_FILE_SYNC writes instead of submitting them all as async and
> > > then issuing a commit (i.e., the switch above in
> > > nfs_pgio_rpcsetup() could gain something like:
> > >
> > > 	if (count > <small_magic_number>)
> > > 		break;
> > >
> > > But I'm not 100% sure this is the correct thing to do since I'm not
> > > 100% sure about the FLUSH_COND_STABLE requirements. On the other
> > > hand it could also be argued that the NFS server could be more
> > > clever and batch the fsync(2)s for many sync writes to the same
> > > file. But there the heuristic is less clear.
> > >
> > > So what do people think?
> >
> > We can probably remove that case FLUSH_COND_STABLE in
> > nfs_pgio_rpcsetup() altogether, since we have the following just
> > before the call to nfs_pgio_rpcsetup():
> >
> > 	if ((desc->pg_ioflags & FLUSH_COND_STABLE) &&
> > 	    (desc->pg_moreio || nfs_reqs_to_commit(&cinfo)))
> > 		desc->pg_ioflags &= ~FLUSH_COND_STABLE;
>
> Yeah, I was somewhat wondering about that as well. But my point was
> that the client should apparently be dropping FLUSH_COND_STABLE in
> more cases because currently fsync(2) submits several GB worth of
> dirty pages, each page with NFS_FILE_SYNC set. Which is suboptimal...
> I'll try to understand better why none of the above conditions is met.

OK, I think I've tracked the problem down. Patch submitted [1].
								Honza

[1] https://lore.kernel.org/r/20240524161419.18448-1-jack@xxxxxxx

-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR
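
For illustration, here is a minimal sketch of how the size cutoff
proposed in the thread could slot into the quoted switch in
nfs_pgio_rpcsetup(). This is not the submitted patch from [1];
NFS_SYNC_WRITE_CUTOFF is a hypothetical name and value standing in for
the <small_magic_number> above, not an existing kernel constant:

	/*
	 * Sketch only, not the patch referenced in [1].
	 * NFS_SYNC_WRITE_CUTOFF is a hypothetical tunable standing in
	 * for the <small_magic_number> discussed in the thread.
	 */
	#define NFS_SYNC_WRITE_CUTOFF	(128 * 1024)	/* hypothetical: 128 KiB */

	switch (how & (FLUSH_STABLE | FLUSH_COND_STABLE)) {
	case 0:
		break;
	case FLUSH_COND_STABLE:
		if (nfs_reqs_to_commit(cinfo))
			break;
		/*
		 * A large batch is cheaper as unstable writes followed
		 * by one COMMIT than as many NFS_FILE_SYNC page writes,
		 * so only force a stable write for small amounts of data.
		 */
		if (count > NFS_SYNC_WRITE_CUTOFF)
			break;
		fallthrough;
	default:
		hdr->args.stable = NFS_FILE_SYNC;
	}

With such a cutoff, anything larger than the threshold would leave
hdr->args.stable untouched (an unstable write) and be made durable by
the subsequent COMMIT, which is the batching behavior the thread argues
the final fsync(2) should get.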