Hello! I've been debugging NFS performance regression with recent kernels. It seems to be at least partially related to the following behavior of NFS (which is there for a long time AFAICT). Suppose the following workload: fio --direct=0 --ioengine=sync --thread --directory=/test --invalidate=1 \ --group_reporting=1 --runtime=100 --fallocate=posix --ramp_time=10 \ --name=RandomWrites-async --new_group --rw=randwrite --size=32000m \ --numjobs=4 --bs=4k --fsync_on_close=1 --end_fsync=1 \ --filename_format='FioWorkloads.$jobnum' So we do 4k buffered random writes from 4 threads into 4 different files. Now the interesting behavior comes on the final fsync(2). What I observe is that the NFS server getting a stream of 4-8k writes which have 'stable' flag set. What the server does for each such write is that performs the write and calls fsync(2). Since by the time fio calls fsync(2) on the NFS client there is like 6-8 GB worth of dirty pages to write and the server effectively ends up writing each individual 4k page as O_SYNC write, the throughput is not great... The reason why the client sets 'stable' flag for each page write seems to be because nfs_writepages() issues writes with FLUSH_COND_STABLE for WB_SYNC_ALL writeback and nfs_pgio_rpcsetup() has this logic: switch (how & (FLUSH_STABLE | FLUSH_COND_STABLE)) { case 0: break; case FLUSH_COND_STABLE: if (nfs_reqs_to_commit(cinfo)) break; fallthrough; default: hdr->args.stable = NFS_FILE_SYNC; } but since this is final fsync(2), there are no more requests to commit so we set NFS_FILE_SYNC flag. Now I'd think the client is stupid in submitting so many NFS_FILE_SYNC writes instead of submitting all as async and then issuing commit (i.e., the switch above in nfs_pgio_rpcsetup() could gain something like: if (count > <small_magic_number>) break; But I'm not 100% sure this is a correct thing to do since I'm not 100% sure about the FLUSH_COND_STABLE requirements. On the other hand it could be also argued that the NFS server could be more clever and batch the fsync(2)s for many sync writes to the same file. But there the heuristic is less clear. So what do people think? Honza -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR