On Fri, 2011-03-18 at 14:52 +1100, NeilBrown wrote: > On Thu, 17 Mar 2011 22:25:08 -0400 Trond Myklebust > <Trond.Myklebust@xxxxxxxxxx> wrote: > > > On Fri, 2011-03-18 at 13:12 +1100, NeilBrown wrote: > > > On Thu, 17 Mar 2011 21:49:26 -0400 Trond Myklebust > > > <Trond.Myklebust@xxxxxxxxxx> wrote: > > > > > > > However we could adopt the Solaris convention of always starting > > > > writebacks with a FILE_SYNC, and then falling back to UNSTABLE for the > > > > second rpc call and all subsequent calls... > > > > > > > > > > That approach certainly has merit. > > > > > > However, as we know from the wbc info whether the write is small and sync - > > > which is the only case where I think a STABLE write is needed - I cannot see > > > why you don't want to just use that information to guide the choice of > > > 'stable' or not ??? > > > > By far the most common case we would want to optimise for is the sync at > > close() or fsync() when you have written a small file (<= wsize). If we > > can't optimise for that case, then the optimisation isn't worth doing at > > all. > > Fair point. I hadn't thought of that. > > > > > The point is that in that particular case, the wbc doesn't help you at > > all since the limits are set at 0 and LLONG_MAX (see nfs_wb_all(), > > write_inode_now(),...) > > > > I would be trivial to use min(wbc->range_end, i_size_read(inode)) as the > upper bound when assessing the size of the range to compare with 'wsize'. > > However that wouldn't address the case of a small append to a large file > which would also be good to optimise. > > If you can detect the 'first' RPC reliably at the same time that you still > have access to the wbc information, then: > > if this is the first request in a writeback, and the difference beween > the address of this page, and min(wbc->range_end, i_size_read(inode)) > is less than wsize, then make it a STABLE write > > might be a heuristic that catches most interesting cases. > It might be a bit complex though. > > I think we should in general err on the size of not using a STABLE write > when it might be useful rather than using a STABLE write when it is not > necessary as, while there a costs each way, I think the cost of incorrectly > using STABLE would be higher. How about something like the following (as of yet untested) patch? Cheers Trond 8<------------------------------------------------------------------------------- >From f959f78a80185f9a8aab1aadc90874c59a118a3c Mon Sep 17 00:00:00 2001 From: Trond Myklebust <Trond.Myklebust@xxxxxxxxxx> Date: Mon, 21 Mar 2011 16:57:32 -0400 Subject: [PATCH] NFS: Use stable writes when not doing a bulk flush If we're only doing a single write, and there are no other unstable writes being queued up, we might want to just flip to using a stable write RPC call. Signed-off-by: Trond Myklebust <Trond.Myklebust@xxxxxxxxxx> --- fs/nfs/pagelist.c | 3 +++ fs/nfs/write.c | 14 ++++++++++++-- include/linux/nfs_fs.h | 1 + include/linux/nfs_page.h | 1 + 4 files changed, 17 insertions(+), 2 deletions(-) diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c index 23e7944..fd85618 100644 --- a/fs/nfs/pagelist.c +++ b/fs/nfs/pagelist.c @@ -223,6 +223,7 @@ void nfs_pageio_init(struct nfs_pageio_descriptor *desc, desc->pg_count = 0; desc->pg_bsize = bsize; desc->pg_base = 0; + desc->pg_moreio = 0; desc->pg_inode = inode; desc->pg_doio = doio; desc->pg_ioflags = io_flags; @@ -335,9 +336,11 @@ int nfs_pageio_add_request(struct nfs_pageio_descriptor *desc, struct nfs_page *req) { while (!nfs_pageio_do_add_request(desc, req)) { + desc->pg_moreio = 1; nfs_pageio_doio(desc); if (desc->pg_error < 0) return 0; + desc->pg_moreio = 0; } return 1; } diff --git a/fs/nfs/write.c b/fs/nfs/write.c index 47a3ad6..9af52f1 100644 --- a/fs/nfs/write.c +++ b/fs/nfs/write.c @@ -179,8 +179,8 @@ static int wb_priority(struct writeback_control *wbc) if (wbc->for_reclaim) return FLUSH_HIGHPRI | FLUSH_STABLE; if (wbc->for_kupdate || wbc->for_background) - return FLUSH_LOWPRI; - return 0; + return FLUSH_LOWPRI | FLUSH_STABLE | FLUSH_COND_STABLE; + return FLUSH_STABLE | FLUSH_COND_STABLE; } /* @@ -912,6 +912,12 @@ static int nfs_flush_multi(struct nfs_pageio_descriptor *desc) nfs_list_remove_request(req); + if ((desc->pg_ioflags & FLUSH_COND_STABLE) && + (desc->pg_moreio || NFS_I(desc->pg_inode)->ncommit || + desc->pg_count > wsize)) + desc->pg_ioflags &= ~(FLUSH_COND_STABLE|FLUSH_STABLE); + + nbytes = desc->pg_count; do { size_t len = min(nbytes, wsize); @@ -1002,6 +1008,10 @@ static int nfs_flush_one(struct nfs_pageio_descriptor *desc) if ((!lseg) && list_is_singular(&data->pages)) lseg = pnfs_update_layout(desc->pg_inode, req->wb_context, IOMODE_RW); + if ((desc->pg_ioflags & FLUSH_COND_STABLE) && + (desc->pg_moreio || NFS_I(desc->pg_inode)->ncommit)) + desc->pg_ioflags &= ~(FLUSH_COND_STABLE|FLUSH_STABLE); + /* Set up the argument struct */ ret = nfs_write_rpcsetup(req, data, &nfs_write_full_ops, desc->pg_count, 0, lseg, desc->pg_ioflags); out: diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h index f88522b..13cb07e 100644 --- a/include/linux/nfs_fs.h +++ b/include/linux/nfs_fs.h @@ -33,6 +33,7 @@ #define FLUSH_STABLE 4 /* commit to stable storage */ #define FLUSH_LOWPRI 8 /* low priority background flush */ #define FLUSH_HIGHPRI 16 /* high priority memory reclaim flush */ +#define FLUSH_COND_STABLE 32 /* conditional stable write */ #ifdef __KERNEL__ diff --git a/include/linux/nfs_page.h b/include/linux/nfs_page.h index 90907ad..92d54c8 100644 --- a/include/linux/nfs_page.h +++ b/include/linux/nfs_page.h @@ -57,6 +57,7 @@ struct nfs_pageio_descriptor { size_t pg_count; size_t pg_bsize; unsigned int pg_base; + char pg_moreio; struct inode *pg_inode; int (*pg_doio)(struct nfs_pageio_descriptor *); -- 1.7.4 -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@xxxxxxxxxx www.netapp.com -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html