Re: [PATCH] improve the performance of large sequential write NFS workloads

Wu Fengguang <fengguang.wu@xxxxxxxxx> · Thu, 24 Dec 2009 10:52:28 +0800

Trond,

On Thu, Dec 24, 2009 at 03:12:54AM +0800, Trond Myklebust wrote:
> On Wed, 2009-12-23 at 19:05 +0100, Jan Kara wrote: 
> > On Wed 23-12-09 15:21:47, Trond Myklebust wrote:
> > > @@ -474,6 +482,18 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
> > >  	}
> > >  
> > >  	spin_lock(&inode_lock);
> > > +	/*
> > > +	 * Special state for cleaning NFS unstable pages
> > > +	 */
> > > +	if (inode->i_state & I_UNSTABLE_PAGES) {
> > > +		int err;
> > > +		inode->i_state &= ~I_UNSTABLE_PAGES;
> > > +		spin_unlock(&inode_lock);
> > > +		err = commit_unstable_pages(inode, wait);
> > > +		if (ret == 0)
> > > +			ret = err;
> > > +		spin_lock(&inode_lock);
> > > +	}
> >   I don't quite understand this chunk: We've called writeback_single_inode
> > because it had some dirty pages. Thus it has I_DIRTY_DATASYNC set and a few
> > lines above your chunk, we've called nfs_write_inode which sent commit to
> > the server. Now here you sometimes send the commit again? What's the
> > purpose?
> 
> We no longer set I_DIRTY_DATASYNC. We only set I_DIRTY_PAGES (and later
> I_UNSTABLE_PAGES).
> 
> The point is that we now do the commit only _after_ we've sent all the
> dirty pages, and waited for writeback to complete, whereas previously we
> did it in the wrong order.

Sorry I still don't get it. The timing used to be:

write 4MB   ==> WRITE block 0 (ie. first 512KB)
                WRITE block 1
                WRITE block 2
                WRITE block 3         ack from server for WRITE block 0 => mark 0 as unstable (inode marked need-commit)
                WRITE block 4         ack from server for WRITE block 1 => mark 1 as unstable
                WRITE block 5         ack from server for WRITE block 2 => mark 2 as unstable
                WRITE block 6         ack from server for WRITE block 3 => mark 3 as unstable
                WRITE block 7         ack from server for WRITE block 4 => mark 4 as unstable
                                      ack from server for WRITE block 5 => mark 5 as unstable
write_inode ==> COMMIT blocks 0-5
                                      ack from server for WRITE block 6 => mark 6 as unstable (inode marked need-commit)
                                      ack from server for WRITE block 7 => mark 7 as unstable 

                                      ack from server for COMMIT blocks 0-5 => mark 0-5 as clean

write_inode ==> COMMIT blocks 6-7

                                      ack from server for COMMIT blocks 6-7 => mark 6-7 as clean

Note that the first COMMIT is submitted before receiving all ACKs for
the previous writes, hence the second COMMIT is necessary. It seems
that your patch does not improve the timing at all.

Thanks,
Fengguang

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html