RE: [PATCH] pnfsblock: bail out partial page IO

> -----Original Message-----
> From: Boaz Harrosh [mailto:bharrosh@xxxxxxxxxxx]
> Sent: Tuesday, May 29, 2012 3:07 PM
> To: Peng Tao
> Cc: Trond.Myklebust@xxxxxxxxxx; linux-nfs@xxxxxxxxxxxxxxx; stable@xxxxxxxxxxxxxxx; Peng, Tao
> Subject: Re: [PATCH] pnfsblock: bail out partial page IO
> 
> On 05/28/2012 09:07 PM, Peng Tao wrote:
> 
> > Current block layout driver read/write code assumes page
> > aligned IO in many places. Add a checker to validate the assumption.
> >
> > Cc: stable@xxxxxxxxxxxxxxx
> > Signed-off-by: Peng Tao <tao.peng@xxxxxxx>
> > ---
> > This is the minimal patch for buffer IO case. I will cook DIO alignment checker
> > based on it, since DIO isn't needed for stable.
> >
> >  fs/nfs/blocklayout/blocklayout.c |   39 +++++++++++++++++++++++++++++++++++--
> >  1 files changed, 36 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
> > index 7ae8a60..447f8d1 100644
> > --- a/fs/nfs/blocklayout/blocklayout.c
> > +++ b/fs/nfs/blocklayout/blocklayout.c
> > @@ -228,6 +228,14 @@ bl_end_par_io_read(void *data, int unused)
> >  	schedule_work(&rdata->task.u.tk_work);
> >  }
> >
> > +static bool
> > +bl_nfspage_aligned(u64 offset, u32 len, u32 blkmask)
> > +{
> > +	if ((offset & blkmask) || (len & blkmask))
> > +		return false;
> > +	return true;
> > +}
> > +
> >  static enum pnfs_try_status
> >  bl_read_pagelist(struct nfs_read_data *rdata)
> >  {
> > @@ -244,6 +252,9 @@ bl_read_pagelist(struct nfs_read_data *rdata)
> >  	dprintk("%s enter nr_pages %u offset %lld count %u\n", __func__,
> >  	       rdata->pages.npages, f_offset, (unsigned int)rdata->args.count);
> >
> > +	if (!bl_nfspage_aligned(f_offset, rdata->args.count, PAGE_CACHE_MASK))
> > +		goto use_mds;
> > +
> 
> 
> Please put a dprint in the true case so we can see if this hits a lot.
> What you are saying is that this should happen very rarely.
> 
> After you put the print. Does it hit? what is the test case that causes it?
> 
> >  	par = alloc_parallel(rdata);
> >  	if (!par)
> >  		goto use_mds;
> > @@ -552,7 +563,7 @@ bl_write_pagelist(struct nfs_write_data *wdata, int sync)
> >  	struct bio *bio = NULL;
> >  	struct pnfs_block_extent *be = NULL, *cow_read = NULL;
> >  	sector_t isect, last_isect = 0, extent_length = 0;
> > -	struct parallel_io *par;
> > +	struct parallel_io *par = NULL;
> >  	loff_t offset = wdata->args.offset;
> >  	size_t count = wdata->args.count;
> >  	struct page **pages = wdata->args.pages;
> > @@ -563,6 +574,10 @@ bl_write_pagelist(struct nfs_write_data *wdata, int sync)
> >  	    NFS_SERVER(header->inode)->pnfs_blksize >> PAGE_CACHE_SHIFT;
> >
> >  	dprintk("%s enter, %Zu@%lld\n", __func__, count, offset);
> > +	/* Check for alignment first */
> > +	if (!bl_nfspage_aligned(offset, count, PAGE_CACHE_MASK))
> > +		goto out_mds;
> > +
> 
> 
> Also here
> 
> >  	/* At this point, wdata->pages is a (sequential) list of nfs_pages.
> >  	 * We want to write each, and if there is an error set pnfs_error
> >  	 * to have it redone using nfs.
> > @@ -996,14 +1011,32 @@ bl_clear_layoutdriver(struct nfs_server *server)
> >  	return 0;
> >  }
> >
> > +static void
> > +bl_pg_init_read(struct nfs_pageio_descriptor *pgio, struct nfs_page *req)
> > +{
> > +	if (!bl_nfspage_aligned(req->wb_offset, req->wb_bytes, PAGE_CACHE_MASK))
> > +		nfs_pageio_reset_read_mds(pgio);
> 
> 
> Also here
> 
> > +	else
> > +		pnfs_generic_pg_init_read(pgio, req);
> > +}
> > +
> > +static void
> > +bl_pg_init_write(struct nfs_pageio_descriptor *pgio, struct nfs_page *req)
> > +{
> > +	if (!bl_nfspage_aligned(req->wb_offset, req->wb_bytes, PAGE_CACHE_MASK))
> > +		nfs_pageio_reset_write_mds(pgio);
> 
> 
> And here
> 
> > +	else
> > +		pnfs_generic_pg_init_write(pgio, req);
> > +}
> > +
> >  static const struct nfs_pageio_ops bl_pg_read_ops = {
> > -	.pg_init = pnfs_generic_pg_init_read,
> > +	.pg_init = bl_pg_init_read,
> >  	.pg_test = pnfs_generic_pg_test,
> >  	.pg_doio = pnfs_generic_pg_readpages,
> >  };
> >
> >  static const struct nfs_pageio_ops bl_pg_write_ops = {
> > -	.pg_init = pnfs_generic_pg_init_write,
> > +	.pg_init = bl_pg_init_write,
> >  	.pg_test = pnfs_generic_pg_test,
> >  	.pg_doio = pnfs_generic_pg_writepages,
> >  };
> 
> 
> If the prints actually never hit except with some rare applications
> then that's fine. But if they do, perhaps a fix is not that hard.
> 
Boaz, hi,

Thank you very much for reviewing these patches. Sorry that I've been having some issues with my testing environment. I will come back to this once I sort out the environment issues.

I haven't tested yet, but I tend to believe you are correct. Even if generic NFS initiates a read-modify-write cycle, the nfs_page passed to the LD doesn't always cover the entire page but only the range written by the application, because in the flock or O_DSYNC case nfs_updatepage() doesn't extend the write to the entire page. So the test in pg_init will fail even though it is safe for the LD to write out the entire page in these cases.

To summarize, IMO the alignment tests will fail in the following cases when a partial page write is issued:
1. nfs_want_read_modify_write() returns false (e.g., file opened O_WRONLY)
2. byte range locking is used
3. file opened with O_DSYNC

And only the first case will cause data corruption if we do not bail out the write. So the bailout logic in the patch is much stricter than it really needs to be.

> Specially the read case should not be that hard to fix. Because you
> are reading into pages, and page boundaries are aligned to page boundaries
> in the stream, so all you need to do is read the reminders (at both ends)
> as well.
Yes. The current read code is already sufficient.

> 
> Also the write case is not that hard. You just need to call a read-for-write()
> routine that does a sync read, when done continues with today's write.
In the write case, a read-for-write routine is only needed when the application does a partial page write to an uninitialized extent. There is already some read-for-write logic there, but it assumes page aligned IO, so it needs to be extended to handle partial pages as well. The IO submit routine also needs to be changed to handle partial writes, which is easy BTW.

However, even with the above change, the direct write case still needs to be block size aligned, because otherwise we would need to allocate extra pages to zero out partially uninitialized extents, and that would require serialization between concurrent writers. One possible optimization is to pre-check in bl_write_pagelist whether the write-out range contains uninitialized extents when the IO is sector aligned but block size unaligned: if the range doesn't contain any uninitialized extents it can be issued, otherwise it still needs to be rejected.

Thanks,
Tao



