Re: Small O_SYNC writes are no longer NFS_DATA_SYNC

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, 2011-03-18 at 14:52 +1100, NeilBrown wrote:
> On Thu, 17 Mar 2011 22:25:08 -0400 Trond Myklebust
> <Trond.Myklebust@xxxxxxxxxx> wrote:
> 
> > On Fri, 2011-03-18 at 13:12 +1100, NeilBrown wrote:
> > > On Thu, 17 Mar 2011 21:49:26 -0400 Trond Myklebust
> > > <Trond.Myklebust@xxxxxxxxxx> wrote:
> > > 
> > > > However we could adopt the Solaris convention of always starting
> > > > writebacks with a FILE_SYNC, and then falling back to UNSTABLE for the
> > > > second rpc call and all subsequent calls...
> > > > 
> > > 
> > > That approach certainly has merit.
> > > 
> > > However, as we know from the wbc info whether the write is small and sync -
> > > which is the only case where I think a STABLE write is needed - I cannot see
> > > why you don't want to just use that information to guide the choice of
> > > 'stable' or not ???
> > 
> > By far the most common case we would want to optimise for is the sync at
> > close() or fsync() when you have written a small file (<= wsize). If we
> > can't optimise for that case, then the optimisation isn't worth doing at
> > all.
> 
> Fair point.  I hadn't thought of that.
> 
> > 
> > The point is that in that particular case, the wbc doesn't help you at
> > all since the limits are set at 0 and LLONG_MAX (see nfs_wb_all(),
> > write_inode_now(),...)
> > 
> 
> I would be trivial to use min(wbc->range_end, i_size_read(inode)) as the
> upper bound when assessing the size of the range to compare with 'wsize'.
> 
> However that wouldn't address the case of a small append to a large file
> which would also be good to optimise.
> 
> If you can detect the 'first' RPC reliably at the same time that you still
> have access to the wbc information, then:
> 
>  if this is the first request in a writeback, and the difference beween
>   the address of this page, and min(wbc->range_end, i_size_read(inode))
>   is less than wsize, then make it a STABLE write
> 
> might be a heuristic that catches most interesting cases.
> It might be a bit complex though.
> 
> I think we should in general err on the size of not using a STABLE write
> when it might be useful rather than using a STABLE write when it is not
> necessary as, while there a costs each way, I think the cost of incorrectly
> using STABLE would be higher.

How about something like the following (as of yet untested) patch?

Cheers
  Trond

8<-------------------------------------------------------------------------------
>From f959f78a80185f9a8aab1aadc90874c59a118a3c Mon Sep 17 00:00:00 2001
From: Trond Myklebust <Trond.Myklebust@xxxxxxxxxx>
Date: Mon, 21 Mar 2011 16:57:32 -0400
Subject: [PATCH] NFS: Use stable writes when not doing a bulk flush

If we're only doing a single write, and there are no other unstable
writes being queued up, we might want to just flip to using a stable
write RPC call.

Signed-off-by: Trond Myklebust <Trond.Myklebust@xxxxxxxxxx>
---
 fs/nfs/pagelist.c        |    3 +++
 fs/nfs/write.c           |   14 ++++++++++++--
 include/linux/nfs_fs.h   |    1 +
 include/linux/nfs_page.h |    1 +
 4 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
index 23e7944..fd85618 100644
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -223,6 +223,7 @@ void nfs_pageio_init(struct nfs_pageio_descriptor *desc,
 	desc->pg_count = 0;
 	desc->pg_bsize = bsize;
 	desc->pg_base = 0;
+	desc->pg_moreio = 0;
 	desc->pg_inode = inode;
 	desc->pg_doio = doio;
 	desc->pg_ioflags = io_flags;
@@ -335,9 +336,11 @@ int nfs_pageio_add_request(struct nfs_pageio_descriptor *desc,
 			   struct nfs_page *req)
 {
 	while (!nfs_pageio_do_add_request(desc, req)) {
+		desc->pg_moreio = 1;
 		nfs_pageio_doio(desc);
 		if (desc->pg_error < 0)
 			return 0;
+		desc->pg_moreio = 0;
 	}
 	return 1;
 }
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 47a3ad6..9af52f1 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -179,8 +179,8 @@ static int wb_priority(struct writeback_control *wbc)
 	if (wbc->for_reclaim)
 		return FLUSH_HIGHPRI | FLUSH_STABLE;
 	if (wbc->for_kupdate || wbc->for_background)
-		return FLUSH_LOWPRI;
-	return 0;
+		return FLUSH_LOWPRI | FLUSH_STABLE | FLUSH_COND_STABLE;
+	return FLUSH_STABLE | FLUSH_COND_STABLE;
 }
 
 /*
@@ -912,6 +912,12 @@ static int nfs_flush_multi(struct nfs_pageio_descriptor *desc)
 
 	nfs_list_remove_request(req);
 
+	if ((desc->pg_ioflags & FLUSH_COND_STABLE) &&
+	    (desc->pg_moreio || NFS_I(desc->pg_inode)->ncommit ||
+	     desc->pg_count > wsize))
+		desc->pg_ioflags &= ~(FLUSH_COND_STABLE|FLUSH_STABLE);
+
+
 	nbytes = desc->pg_count;
 	do {
 		size_t len = min(nbytes, wsize);
@@ -1002,6 +1008,10 @@ static int nfs_flush_one(struct nfs_pageio_descriptor *desc)
 	if ((!lseg) && list_is_singular(&data->pages))
 		lseg = pnfs_update_layout(desc->pg_inode, req->wb_context, IOMODE_RW);
 
+	if ((desc->pg_ioflags & FLUSH_COND_STABLE) &&
+	    (desc->pg_moreio || NFS_I(desc->pg_inode)->ncommit))
+		desc->pg_ioflags &= ~(FLUSH_COND_STABLE|FLUSH_STABLE);
+
 	/* Set up the argument struct */
 	ret = nfs_write_rpcsetup(req, data, &nfs_write_full_ops, desc->pg_count, 0, lseg, desc->pg_ioflags);
 out:
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index f88522b..13cb07e 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -33,6 +33,7 @@
 #define FLUSH_STABLE		4	/* commit to stable storage */
 #define FLUSH_LOWPRI		8	/* low priority background flush */
 #define FLUSH_HIGHPRI		16	/* high priority memory reclaim flush */
+#define FLUSH_COND_STABLE	32	/* conditional stable write */
 
 #ifdef __KERNEL__
 
diff --git a/include/linux/nfs_page.h b/include/linux/nfs_page.h
index 90907ad..92d54c8 100644
--- a/include/linux/nfs_page.h
+++ b/include/linux/nfs_page.h
@@ -57,6 +57,7 @@ struct nfs_pageio_descriptor {
 	size_t			pg_count;
 	size_t			pg_bsize;
 	unsigned int		pg_base;
+	char			pg_moreio;
 
 	struct inode		*pg_inode;
 	int			(*pg_doio)(struct nfs_pageio_descriptor *);
-- 
1.7.4


-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@xxxxxxxxxx
www.netapp.com

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Linux Filesystem Development]     [Linux USB Development]     [Linux Media Development]     [Video for Linux]     [Linux NILFS]     [Linux Audio Users]     [Yosemite Info]     [Linux SCSI]

  Powered by Linux