Bad NFS performance for fsync(2)

Jan Kara <jack@xxxxxxx> · Thu, 23 May 2024 18:54:36 +0200

Hello!

I've been debugging an NFS performance regression with recent kernels. It
seems to be at least partially related to the following behavior of NFS
(which has been there for a long time AFAICT). Consider the following workload:

fio --direct=0 --ioengine=sync --thread --directory=/test --invalidate=1 \
  --group_reporting=1 --runtime=100 --fallocate=posix --ramp_time=10 \
  --name=RandomWrites-async --new_group --rw=randwrite --size=32000m \
  --numjobs=4 --bs=4k --fsync_on_close=1 --end_fsync=1 \
  --filename_format='FioWorkloads.$jobnum'

So we do 4k buffered random writes from 4 threads into 4 different files.
The interesting behavior comes on the final fsync(2). What I observe is
that the NFS server gets a stream of 4-8k writes which have the 'stable'
flag set. For each such write the server performs the write and then calls
fsync(2). By the time fio calls fsync(2) on the NFS client there are about
6-8 GB worth of dirty pages to write, and the server effectively ends up
writing each individual 4k page as an O_SYNC write, so the throughput is
not great...
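
To put rough numbers on the above, here is a back-of-the-envelope sketch in
C. The function names are made up for illustration, and the model assumes
that the server does exactly one flush per stable write RPC and one flush
per COMMIT, with no batching on the server side.

```c
#include <stdint.h>

/*
 * Model only: number of server-side flushes if every write RPC carries
 * the 'stable' flag. Assumes one flush per RPC.
 */
static uint64_t flushes_if_all_stable(uint64_t dirty_bytes, uint64_t write_size)
{
	if (write_size == 0)
		return 0;
	/* Round up: a partial trailing write still costs one flush. */
	return (dirty_bytes + write_size - 1) / write_size;
}

/*
 * Model only: unstable writes followed by a single COMMIT cost one
 * flush, provided there is anything to write at all.
 */
static uint64_t flushes_if_unstable_then_commit(uint64_t dirty_bytes)
{
	return dirty_bytes ? 1 : 0;
}
```

For 6 GiB of dirty data sent as 4 KiB writes, flushes_if_all_stable(6ULL << 30, 4096)
gives 1572864 flushes, against a single flush for the unstable + COMMIT case.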

The client seems to set the 'stable' flag for each page write because
nfs_writepages() issues writes with FLUSH_COND_STABLE for WB_SYNC_ALL
writeback and nfs_pgio_rpcsetup() has this logic:

        switch (how & (FLUSH_STABLE | FLUSH_COND_STABLE)) {
        case 0:
                break;
        case FLUSH_COND_STABLE:
                if (nfs_reqs_to_commit(cinfo))
                        break;
                fallthrough;
        default:
                hdr->args.stable = NFS_FILE_SYNC;
        }

but since this is the final fsync(2), there are no more requests to commit,
so nfs_reqs_to_commit() returns 0 and we set the NFS_FILE_SYNC flag.
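
The decision made by that switch can be modelled in userspace as below. This
is only a sketch: model_pick_stable() is a made-up name, the FLUSH_* values
are placeholders for the model rather than something to rely on, and it
assumes args.stable starts out as unstable before the switch runs. The
stable_how values are the ones from the NFSv3 protocol.

```c
/* Stability levels as defined by the NFSv3 protocol (stable_how). */
enum stable_how {
	NFS_UNSTABLE  = 0,
	NFS_DATA_SYNC = 1,
	NFS_FILE_SYNC = 2,
};

/* Placeholder flag bits for this model only. */
#define FLUSH_STABLE		0x04
#define FLUSH_COND_STABLE	0x20

/*
 * Userspace model of the switch quoted above. 'reqs_to_commit' stands
 * in for the value returned by nfs_reqs_to_commit(cinfo).
 */
static enum stable_how model_pick_stable(int how, unsigned long reqs_to_commit)
{
	switch (how & (FLUSH_STABLE | FLUSH_COND_STABLE)) {
	case 0:
		/* No stability requested: stay unstable. */
		return NFS_UNSTABLE;
	case FLUSH_COND_STABLE:
		/* Other requests await COMMIT, so this one can join them. */
		if (reqs_to_commit)
			return NFS_UNSTABLE;
		/* fall through */
	default:
		return NFS_FILE_SYNC;
	}
}
```

In this model, model_pick_stable(FLUSH_COND_STABLE, 0) returns NFS_FILE_SYNC,
which is the case hit by every write of the final fsync(2) described above.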

Now I'd think the client is stupid in submitting so many NFS_FILE_SYNC
writes instead of submitting them all as async and then issuing a commit.
For example, the switch above in nfs_pgio_rpcsetup() could gain something like:

		if (count > <small_magic_number>)
			break;

But I'm not 100% sure this is the correct thing to do, since I'm not 100% sure
about the FLUSH_COND_STABLE requirements. On the other hand, it could also
be argued that the NFS server could be more clever and batch the
fsync(2)s for many sync writes to the same file. But there the heuristic is
less clear.
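
For discussion, the client-side idea could look like this in a userspace
model. Everything here is an assumption: SMALL_WRITE_LIMIT is a hypothetical
stand-in for <small_magic_number>, the check is placed in the
FLUSH_COND_STABLE case only, 'count' is taken to be the size in bytes of the
write being set up, and the FLUSH_* values are placeholders for the model.

```c
#include <stddef.h>

/* Stability levels as defined by the NFSv3 protocol (stable_how). */
enum stable_how {
	NFS_UNSTABLE  = 0,
	NFS_DATA_SYNC = 1,
	NFS_FILE_SYNC = 2,
};

/* Placeholder flag bits for this model only. */
#define FLUSH_STABLE		0x04
#define FLUSH_COND_STABLE	0x20

/* Hypothetical threshold; the right value (if any) is an open question. */
#define SMALL_WRITE_LIMIT	4096

/*
 * Model of the switch with the extra size check: a conditional stable
 * write is only upgraded to NFS_FILE_SYNC when it is small.
 */
static enum stable_how model_pick_stable_limited(int how,
						 unsigned long reqs_to_commit,
						 size_t count)
{
	switch (how & (FLUSH_STABLE | FLUSH_COND_STABLE)) {
	case 0:
		return NFS_UNSTABLE;
	case FLUSH_COND_STABLE:
		if (reqs_to_commit)
			return NFS_UNSTABLE;
		/* Proposed addition: larger writes stay unstable. */
		if (count > SMALL_WRITE_LIMIT)
			return NFS_UNSTABLE;
		/* fall through */
	default:
		return NFS_FILE_SYNC;
	}
}
```

With this placement an unconditional FLUSH_STABLE write is unaffected; only
the FLUSH_COND_STABLE path changes behavior.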

So what do people think?

								Honza

-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR