On Thu, Jul 07, 2022 at 01:48:48PM -0400, Jeff Layton wrote:
> On Thu, 2022-07-07 at 15:51 +0100, Matthew Wilcox wrote:
> > There are some latent bugs that I fix here (eg, you can't call
> > thp_size() on a tail page), but the real question is how Ceph in
> > particular (and FS in general) want to handle mkwrite in a world
> > of multi-page folios.
> > 
> > If we have a multi-page folio which is occupying an entire PMD, then
> > no question, we have to mark all 2MB (or whatever) as dirty. But
> > if it's being mapped with PTEs, either because it's mapped misaligned,
> > or it's smaller than a PMD, then we have a choice. We can either
> > work in 4kB chunks, marking each one dirty (and storing the sub-folio
> > dirty state in the fs private data) like a write might. Or we can
> > just say "Hey, the whole folio is dirty now" and not try to track
> > dirtiness on a per-page granularity.
> > 
> > The latter course seems to have been taken, modulo the bugs, but I
> > don't know if any thought was taken or whether it was done by rote.
> 
> Done by rote, I'm pretty sure.
> 
> If each individual page retains its own dirty bit, what does
> folio_test_dirty return when its pages are only partially dirty? I guess
> the folio is still dirty even if some of its pages are clean?

Ah, each *PTE* retains its own dirty bit. So we can track which pages
within the folio are actually dirty. But the folio as a whole has a
single dirty bit (filesystems, of course, are welcome to keep track of
sub-folio dirtiness at whatever granularity makes sense for them).

> Ceph can do a vectored write if a folio has disjoint dirty regions that
> we need to flush. Hashing out an API to handle that with the netfs layer
> is going to be "interesting" though.

Yes; I'm sure Dave has ideas about that ...
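
To make "whatever granularity makes sense for them" a little more
concrete, here is a rough, untested sketch of a filesystem hanging a
dirty bitmap off folio->private. Every name below is invented for
illustration (iomap keeps similar per-folio state for sub-folio
uptodate tracking); none of this is ceph code:

#include <linux/bitmap.h>
#include <linux/pagemap.h>
#include <linux/spinlock.h>

/*
 * Invented per-folio state; a filesystem would attach it with
 * folio_attach_private() when it first touches the folio and free it
 * from its release_folio hook.
 */
struct fs_folio_state {
	spinlock_t lock;
	unsigned long dirty[];	/* one bit per page in the folio */
};

static void fs_dirty_folio_range(struct folio *folio, size_t off, size_t len)
{
	struct fs_folio_state *ffs = folio->private;
	unsigned int first = off / PAGE_SIZE;
	unsigned int nr = DIV_ROUND_UP(off + len, PAGE_SIZE) - first;

	spin_lock(&ffs->lock);
	bitmap_set(ffs->dirty, first, nr);
	spin_unlock(&ffs->lock);

	/* As far as the VM is concerned, the folio has one dirty bit. */
	filemap_dirty_folio(folio->mapping, folio);
}

And the flush side of that vectored write could then, equally roughly,
gather just the dirty pages into a bio_vec array (a real version would
obviously want to coalesce adjacent dirty pages into larger segments):

#include <linux/bvec.h>

static unsigned int fs_gather_dirty_bvecs(struct folio *folio,
					  struct fs_folio_state *ffs,
					  struct bio_vec *bvecs,
					  unsigned int max)
{
	unsigned int i, n = 0;

	for (i = 0; i < folio_nr_pages(folio) && n < max; i++) {
		if (!test_bit(i, ffs->dirty))
			continue;
		bvecs[n].bv_page = folio_page(folio, i);
		bvecs[n].bv_offset = 0;
		bvecs[n].bv_len = PAGE_SIZE;
		n++;
	}
	return n;
}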
> > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> > index 6dee88815491..fb346b929f65 100644
> > --- a/fs/ceph/addr.c
> > +++ b/fs/ceph/addr.c
> > @@ -1503,8 +1503,8 @@ static vm_fault_t ceph_page_mkwrite(struct vm_fault *vmf)
> >  	struct ceph_inode_info *ci = ceph_inode(inode);
> >  	struct ceph_file_info *fi = vma->vm_file->private_data;
> >  	struct ceph_cap_flush *prealloc_cf;
> > -	struct page *page = vmf->page;
> > -	loff_t off = page_offset(page);
> > +	struct folio *folio = page_folio(vmf->page);
> > +	loff_t pos = folio_pos(folio);
> >  	loff_t size = i_size_read(inode);
> >  	size_t len;
> >  	int want, got, err;
> > @@ -1521,50 +1521,50 @@ static vm_fault_t ceph_page_mkwrite(struct vm_fault *vmf)
> >  	sb_start_pagefault(inode->i_sb);
> >  	ceph_block_sigs(&oldset);
> > 
> > -	if (off + thp_size(page) <= size)
> > -		len = thp_size(page);
> > +	if (pos + folio_size(folio) <= size)
> > +		len = folio_size(folio);
> >  	else
> > -		len = offset_in_thp(page, size);
> > +		len = offset_in_folio(folio, size);
> > 
> >  	dout("page_mkwrite %p %llx.%llx %llu~%zd getting caps i_size %llu\n",
> > -	     inode, ceph_vinop(inode), off, len, size);
> > +	     inode, ceph_vinop(inode), pos, len, size);
> >  	if (fi->fmode & CEPH_FILE_MODE_LAZY)
> >  		want = CEPH_CAP_FILE_BUFFER | CEPH_CAP_FILE_LAZYIO;
> >  	else
> >  		want = CEPH_CAP_FILE_BUFFER;
> > 
> >  	got = 0;
> > -	err = ceph_get_caps(vma->vm_file, CEPH_CAP_FILE_WR, want, off + len, &got);
> > +	err = ceph_get_caps(vma->vm_file, CEPH_CAP_FILE_WR, want, pos + len, &got);
> >  	if (err < 0)
> >  		goto out_free;
> > 
> >  	dout("page_mkwrite %p %llu~%zd got cap refs on %s\n",
> > -	     inode, off, len, ceph_cap_string(got));
> > +	     inode, pos, len, ceph_cap_string(got));
> > 
> > -	/* Update time before taking page lock */
> > +	/* Update time before taking folio lock */
> >  	file_update_time(vma->vm_file);
> >  	inode_inc_iversion_raw(inode);
> > 
> >  	do {
> >  		struct ceph_snap_context *snapc;
> > 
> > -		lock_page(page);
> > +		folio_lock(folio);
> > 
> > -		if (page_mkwrite_check_truncate(page, inode) < 0) {
> > -			unlock_page(page);
> > +		if (folio_mkwrite_check_truncate(folio, inode) < 0) {
> > +			folio_unlock(folio);
> >  			ret = VM_FAULT_NOPAGE;
> >  			break;
> >  		}
> > 
> > -		snapc = ceph_find_incompatible(page);
> > +		snapc = ceph_find_incompatible(&folio->page);
> >  		if (!snapc) {
> > -			/* success. we'll keep the page locked. */
> > -			set_page_dirty(page);
> > +			/* success. we'll keep the folio locked. */
> > +			folio_mark_dirty(folio);
> >  			ret = VM_FAULT_LOCKED;
> >  			break;
> >  		}
> > 
> > -		unlock_page(page);
> > +		folio_unlock(folio);
> > 
> >  		if (IS_ERR(snapc)) {
> >  			ret = VM_FAULT_SIGBUS;
> > @@ -1588,7 +1588,7 @@ static vm_fault_t ceph_page_mkwrite(struct vm_fault *vmf)
> >  	}
> > 
> >  	dout("page_mkwrite %p %llu~%zd dropping cap refs on %s ret %x\n",
> > -	     inode, off, len, ceph_cap_string(got), ret);
> > +	     inode, pos, len, ceph_cap_string(got), ret);
> >  	ceph_put_cap_refs_async(ci, got);
> > out_free:
> >  	ceph_restore_sigs(&oldset);
> 
> -- 
> Jeff Layton <jlayton@xxxxxxxxxx>