Re: Bad NFS performance for fsync(2)

Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> · Thu, 23 May 2024 18:00:01 +0000

On Thu, 2024-05-23 at 18:54 +0200, Jan Kara wrote:
> Hello!
> 
> I've been debugging NFS performance regression with recent kernels.
> It
> seems to be at least partially related to the following behavior of
> NFS
> (which is there for a long time AFAICT). Suppose the following
> workload:
> 
> fio --direct=0 --ioengine=sync --thread --directory=/test --
> invalidate=1 \
>   --group_reporting=1 --runtime=100 --fallocate=posix --ramp_time=10
> \
>   --name=RandomWrites-async --new_group --rw=randwrite --size=32000m
> \
>   --numjobs=4 --bs=4k --fsync_on_close=1 --end_fsync=1 \
>   --filename_format='FioWorkloads.$jobnum'
> 
> So we do 4k buffered random writes from 4 threads into 4 different
> files.
> Now the interesting behavior comes on the final fsync(2). What I
> observe is
> that the NFS server getting a stream of 4-8k writes which have
> 'stable'
> flag set. What the server does for each such write is that performs
> the
> write and calls fsync(2). Since by the time fio calls fsync(2) on the
> NFS
> client there is like 6-8 GB worth of dirty pages to write and the
> server
> effectively ends up writing each individual 4k page as O_SYNC write,
> the
> throughput is not great...
> 
> The reason why the client sets 'stable' flag for each page write
> seems to
> be because nfs_writepages() issues writes with FLUSH_COND_STABLE for
> WB_SYNC_ALL writeback and nfs_pgio_rpcsetup() has this logic:
> 
>         switch (how & (FLUSH_STABLE | FLUSH_COND_STABLE)) {
>         case 0:
>                 break;
>         case FLUSH_COND_STABLE:
>                 if (nfs_reqs_to_commit(cinfo))
>                         break;
>                 fallthrough;
>         default:
>                 hdr->args.stable = NFS_FILE_SYNC;
>         }
> 
> but since this is final fsync(2), there are no more requests to
> commit so
> we set NFS_FILE_SYNC flag.
> 
> Now I'd think the client is stupid in submitting so many
> NFS_FILE_SYNC
> writes instead of submitting all as async and then issuing commit
> (i.e.,
> the switch above in nfs_pgio_rpcsetup() could gain something like:
> 
> 		if (count > <small_magic_number>)
> 			break;
> 
> But I'm not 100% sure this is a correct thing to do since I'm not
> 100% sure
> about the FLUSH_COND_STABLE requirements. On the other hand it could
> be
> also argued that the NFS server could be more clever and batch the
> fsync(2)s for many sync writes to the same file. But there the
> heuristic is
> less clear.
> 
> So what do people think?

We can probably remove that case FLUSH_COND_STABLE in
nfs_pgio_rpcsetup() altogether, since we have the following just before
the call to nfs_pgio_rpcsetup()

        if ((desc->pg_ioflags & FLUSH_COND_STABLE) &&
            (desc->pg_moreio || nfs_reqs_to_commit(&cinfo)))
                desc->pg_ioflags &= ~FLUSH_COND_STABLE;

The above is telling you that if we're flushing because we cannot
coalesce any more in __nfs_pageio_add_request(), then we do an unstable
write. Ditto if there are already unstable requests waiting for a
COMMIT.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx