On Fri 24-05-24 15:17:55, Jan Kara wrote:
> On Thu 23-05-24 18:00:01, Trond Myklebust wrote:
> > On Thu, 2024-05-23 at 18:54 +0200, Jan Kara wrote:
> > > Hello!
> > >
> > > I've been debugging an NFS performance regression with recent
> > > kernels. It seems to be at least partially related to the following
> > > behavior of NFS (which has been there for a long time AFAICT).
> > > Suppose the following workload:
> > >
> > > fio --direct=0 --ioengine=sync --thread --directory=/test --invalidate=1 \
> > >     --group_reporting=1 --runtime=100 --fallocate=posix --ramp_time=10 \
> > >     --name=RandomWrites-async --new_group --rw=randwrite --size=32000m \
> > >     --numjobs=4 --bs=4k --fsync_on_close=1 --end_fsync=1 \
> > >     --filename_format='FioWorkloads.$jobnum'
> > >
> > > So we do 4k buffered random writes from 4 threads into 4 different
> > > files. Now the interesting behavior comes on the final fsync(2).
> > > What I observe is that the NFS server is getting a stream of 4-8k
> > > writes which have the 'stable' flag set. What the server does for
> > > each such write is that it performs the write and calls fsync(2).
> > > Since by the time fio calls fsync(2) on the NFS client there is
> > > like 6-8 GB worth of dirty pages to write, and the server
> > > effectively ends up writing each individual 4k page as an O_SYNC
> > > write, the throughput is not great...
> > >
> > > The reason why the client sets the 'stable' flag for each page
> > > write seems to be that nfs_writepages() issues writes with
> > > FLUSH_COND_STABLE for WB_SYNC_ALL writeback and
> > > nfs_pgio_rpcsetup() has this logic:
> > >
> > > 	switch (how & (FLUSH_STABLE | FLUSH_COND_STABLE)) {
> > > 	case 0:
> > > 		break;
> > > 	case FLUSH_COND_STABLE:
> > > 		if (nfs_reqs_to_commit(cinfo))
> > > 			break;
> > > 		fallthrough;
> > > 	default:
> > > 		hdr->args.stable = NFS_FILE_SYNC;
> > > 	}
> > >
> > > but since this is the final fsync(2), there are no more requests to
> > > commit, so we set the NFS_FILE_SYNC flag.
> > >
> > > Now I'd think the client is stupid in submitting so many
> > > NFS_FILE_SYNC writes instead of submitting them all as async and
> > > then issuing a commit (i.e., the switch above in
> > > nfs_pgio_rpcsetup() could gain something like:
> > >
> > > 	if (count > <small_magic_number>)
> > > 		break;
> > >
> > > But I'm not 100% sure this is the correct thing to do since I'm not
> > > 100% sure about the FLUSH_COND_STABLE requirements. On the other
> > > hand it could also be argued that the NFS server could be more
> > > clever and batch the fsync(2)s for many sync writes to the same
> > > file. But there the heuristic is less clear.
> > >
> > > So what do people think?
> >
> > We can probably remove that case FLUSH_COND_STABLE in
> > nfs_pgio_rpcsetup() altogether, since we have the following just
> > before the call to nfs_pgio_rpcsetup():
> >
> > 	if ((desc->pg_ioflags & FLUSH_COND_STABLE) &&
> > 	    (desc->pg_moreio || nfs_reqs_to_commit(&cinfo)))
> > 		desc->pg_ioflags &= ~FLUSH_COND_STABLE;
>
> Yeah, I was somewhat wondering about that as well. But my point was
> that the client should apparently be dropping FLUSH_COND_STABLE in
> more cases because currently fsync(2) submits several GB worth of
> dirty pages, each page with NFS_FILE_SYNC set. Which is suboptimal...
> I'll try to understand better why none of the above conditions is met.

OK, I think I've tracked the problem down. Patch submitted [1].
								Honza

[1] https://lore.kernel.org/r/20240524161419.18448-1-jack@xxxxxxx

-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR
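
For illustration, here is a minimal sketch of how the size cutoff
proposed in the thread could slot into the quoted switch in
nfs_pgio_rpcsetup(). This is not the submitted patch from [1];
NFS_SYNC_WRITE_CUTOFF is a hypothetical name and value standing in for
the <small_magic_number> above, not an existing kernel constant:

	/*
	 * Sketch only, not the patch referenced in [1].
	 * NFS_SYNC_WRITE_CUTOFF is a hypothetical tunable standing in
	 * for the <small_magic_number> discussed in the thread.
	 */
	#define NFS_SYNC_WRITE_CUTOFF	(128 * 1024)	/* hypothetical: 128 KiB */

	switch (how & (FLUSH_STABLE | FLUSH_COND_STABLE)) {
	case 0:
		break;
	case FLUSH_COND_STABLE:
		if (nfs_reqs_to_commit(cinfo))
			break;
		/*
		 * A large batch is cheaper as unstable writes followed
		 * by one COMMIT than as many NFS_FILE_SYNC page writes,
		 * so only force a stable write for small amounts of data.
		 */
		if (count > NFS_SYNC_WRITE_CUTOFF)
			break;
		fallthrough;
	default:
		hdr->args.stable = NFS_FILE_SYNC;
	}

With such a cutoff, anything larger than the threshold would leave
hdr->args.stable untouched (an unstable write) and be made durable by
the subsequent COMMIT, which is the batching behavior the thread argues
the final fsync(2) should get.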