Re: [RFC] Dataloss on NFS-clients who modify an mmap()'d area after closing the file-descriptor

On Fri, 2012-05-04 at 18:03 +0200, Niels de Vos wrote:
> On 05/03/2012 07:26 PM, Myklebust, Trond wrote:
>  > On Thu, 2012-05-03 at 19:07 +0200, Niels de Vos wrote:
>  >> On 05/03/2012 05:43 PM, Myklebust, Trond wrote:
>  >>   >  On Thu, 2012-05-03 at 17:34 +0200, Niels de Vos wrote:
>  >>   >>  When an application on an NFS-client (tested with NFSv3) executes the
>  >>   >>  following steps, data written after the close() is never flushed to the
>  >>   >>  server:
>  >>   >>
>  >>   >>  1. open()
>  >>   >>  2. mmap()
>  >>   >>  3. close()
>  >>   >>  4.<modify data in the mmap'ed area>
>  >>   >>  5. munmap()
>  >>   >>
>  >>   >>  Dropping the caches (via /proc/sys/vm/drop_caches) or unmounting does not
>  >>   >>  result in the data being sent to the server.
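>  >>   >>
>  >>   >>  For reference, a minimal user-space reproducer for the steps above
>  >>   >>  (sketch only; the path and the one-page size are made up, and error
>  >>   >>  handling is omitted) could look like this:
>  >>   >>
>  >>   >>  #include <fcntl.h>
>  >>   >>  #include <string.h>
>  >>   >>  #include <sys/mman.h>
>  >>   >>  #include <unistd.h>
>  >>   >>
>  >>   >>  int main(void)
>  >>   >>  {
>  >>   >>          /* 1. open() a pre-existing file on the NFS mount (example path) */
>  >>   >>          int fd = open("/mnt/nfs/testfile", O_RDWR);
>  >>   >>
>  >>   >>          /* 2. mmap() the first page, shared and writable */
>  >>   >>          char *map = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
>  >>   >>                           MAP_SHARED, fd, 0);
>  >>   >>
>  >>   >>          /* 3. close() the file descriptor */
>  >>   >>          close(fd);
>  >>   >>
>  >>   >>          /* 4. modify data in the mmap'ed area */
>  >>   >>          memcpy(map, "written after close\n", 20);
>  >>   >>
>  >>   >>          /* 5. munmap() without msync(); the modification never
>  >>   >>           *    reaches the server */
>  >>   >>          munmap(map, 4096);
>  >>   >>          return 0;
>  >>   >>  }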
>  >>   >>
>  >>   >>  The man-page for mmap (man 2 mmap) does mention that closing the file-
>  >>   >>  descriptor does not munmap() the area. Using the mmap'ed area after a
>  >>   >>  close() sounds valid to me (even if it may be bad practice).
>  >>   >>
>  >>   >>  Investigation and checking showed that the NFS-client does not handle
>  >>   >>  munmap(), and only flushes on close(). To solve this problem, at least two
>  >>   >>  solutions can be proposed:
>  >>   >>
>  >>   >>  a. f_ops->release() is called on munmap() as well as on close(),
>  >>   >>       therefore release() can be used to flush data as well.
>  >>   >>  b. Add a .close handler to the 'struct vm_operations_struct' that
>  >>   >>       nfs_file_mmap() attaches to the 'struct vm_area_struct', and flush
>  >>   >>       the data in that new close() function.
>  >>   >>
>  >>   >>  Solution a. currently requires very few code changes:
>  >>   >>
>  >>   >>  --- a/fs/nfs/inode.c
>  >>   >>  +++ b/fs/nfs/inode.c
>  >>   >>  @@ -713,6 +713,8 @@ int nfs_open(struct inode *inode, struct file *filp)
>  >>   >>
>  >>   >>     int nfs_release(struct inode *inode, struct file *filp)
>  >>   >>     {
>  >>   >>  +       if (S_ISREG(inode->i_mode) && inode->i_mapping->nrpages != 0)
>  >>   >>  +               nfs_sync_mapping(inode->i_mapping);
>  >>   >>            nfs_file_clear_open_context(filp);
>  >>   >>            return 0;
>  >>   >>     }
>  >>   >>
>  >>   >>  The disadvantage is that nfs_release() is called on close() too. That
>  >>   >>  means dirty pages are flushed there, and just after that
>  >>   >>  nfs_file_clear_open_context() might flush again. The advantage is that
>  >>   >>  it would be possible (though not done at the moment) to return an error
>  >>   >>  in case flushing fails.
>  >>   >>
>  >>   >>  Solution b. does not provide a way to return an error, but is not
>  >>   >>  called on every close():
>  >>   >>
>  >>   >>  --- a/fs/nfs/file.c
>  >>   >>  +++ b/fs/nfs/file.c
>  >>   >>  @@ -547,9 +547,17 @@ out:
>  >>   >>     	return ret;
>  >>   >>     }
>  >>   >>
>  >>   >>  +static void nfs_vm_close(struct vm_area_struct * vma)
>  >>   >>  +{
>  >>   >>  +	struct file *filp = vma->vm_file;
>  >>   >>  +
>  >>   >>  +	nfs_file_flush(filp, (fl_owner_t)filp);
>  >>   >>  +}
>  >>   >>  +
>  >>   >>     static const struct vm_operations_struct nfs_file_vm_ops = {
>  >>   >>     	.fault = filemap_fault,
>  >>   >>     	.page_mkwrite = nfs_vm_page_mkwrite,
>  >>   >>  +	.close = nfs_vm_close,
>  >>   >>     };
>  >>   >>
>  >>   >>     static int nfs_need_sync_write(struct file *filp, struct inode *inode)
>  >>   >>
>  >>   >>  I would like some feedback on which solution is most acceptable, or any
>  >>   >>  other suggestions.
>  >>   >
>  >>   >  Neither solution is acceptable. This isn't a close-to-open cache
>  >>   >  consistency issue.
>  >>   >
>  >>   >  The semantics of mmap() for both block and NFS mounts are the same: writes
>  >>   >  are not guaranteed to hit the disk until your application explicitly
>  >>   >  calls msync().
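>  >>   >
>  >>   >  For a mapping like the one above (the name 'map' and the one-page
>  >>   >  length are just an example), that means flushing explicitly before
>  >>   >  unmapping, e.g.:
>  >>   >
>  >>   >          /* write modified pages in [map, map + 4096) back to the file */
>  >>   >          msync(map, 4096, MS_SYNC);
>  >>   >          munmap(map, 4096);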
>  >>   >
>  >>
>  >> Okay, that makes sense. But if the application never calls msync(), and
>  >> just munmap()'s the area, when should the changes be written? I did not
>  >> expect that unmounting just disregards the data.
>  >
>  > That suggests that the VM is failing to dirty the pages on munmap()
>  > before releasing the vma->vm_file. If so, then that would be a VM bug...
>  >
> 
> I've checked if the VM tags the pages as dirty:
> - f_ops->release() is called on munmap(). An added printk there shows
>    that I_DIRTY_PAGES is set in inode->i_state.
> - mapping_tagged(filp->f_mapping, PAGECACHE_TAG_DIRTY) also returns true
> 
> From my understanding this is what the VM is expected to do, and the
> pages are marked dirty correctly.
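> 
> The check was a simple debugging printk in nfs_release(), roughly like
> this (sketch, not the exact code that was added):
> 
>         printk(KERN_DEBUG "nfs_release: i_state=0x%lx dirty=%d\n",
>                inode->i_state,
>                mapping_tagged(filp->f_mapping, PAGECACHE_TAG_DIRTY));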
> 
> However, nfs_inode->ndirty and nfs_inode->ncommit are both 0. It is
> unclear to me how the VM is supposed to interact with the nfs_inode.
> Some clarification or suggestion what to look into would be much
> appreciated.

The first time the page is touched, it will trigger a ->page_mkwrite(),
which in the case of NFS will set up the necessary tracking structures
to ensure that the page is written out using the correct credentials
etc. In the case of NFSv4, it will also ensure that the file doesn't get
closed on the server until the page is written out to disk.

When the page is cleaned (i.e. something calls clear_page_dirty_for_io()
as part of a write to disk), the call to page_mkclean() is supposed to
re-write-protect the pte, ensuring that any future changes will
re-trigger page_mkwrite().
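
To make the cycle concrete, a stripped-down, generic ->page_mkwrite()
handler (an illustrative sketch; the real nfs_vm_page_mkwrite() also
attaches credentials and, for NFSv4, open state to the request) looks
roughly like this:

	#include <linux/mm.h>		/* set_page_dirty(), VM_FAULT_LOCKED */
	#include <linux/pagemap.h>	/* lock_page() */

	static int example_page_mkwrite(struct vm_area_struct *vma,
					struct vm_fault *vmf)
	{
		struct page *page = vmf->page;

		lock_page(page);

		/* First store to a clean, write-protected page: mark it
		 * dirty so writeback will pick it up later. */
		set_page_dirty(page);

		/* Return with the page locked; the VM then makes the pte
		 * writable. Once writeback calls clear_page_dirty_for_io()
		 * / page_mkclean(), the pte is write-protected again and
		 * the next store faults back into this handler. */
		return VM_FAULT_LOCKED;
	}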

You should be able to check if/when nfs_vm_page_mkwrite() is triggered
using 'rpcdebug -m nfs -s pagecache' to turn on the NFS page cache
debugging printks.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@xxxxxxxxxx
www.netapp.com



