Re: [RFC] Convert ceph_page_mkwrite to use a folio

Xiubo Li <xiubli@xxxxxxxxxx> · Fri, 8 Jul 2022 10:10:24 +0800

On 7/8/22 1:48 AM, Jeff Layton wrote:
> On Thu, 2022-07-07 at 15:51 +0100, Matthew Wilcox wrote:
>> There are some latent bugs that I fix here (eg, you can't call
>> thp_size() on a tail page), but the real question is how Ceph in
>> particular (and FS in general) want to handle mkwrite in a world
>> of multi-page folios.
>>
>> If we have a multi-page folio which is occupying an entire PMD, then
>> no question, we have to mark all 2MB (or whatever) as dirty.  But
>> if it's being mapped with PTEs, either because it's mapped misaligned,
>> or it's smaller than a PMD, then we have a choice.  We can either
>> work in 4kB chunks, marking each one dirty (and storing the sub-folio
>> dirty state in the fs private data) like a write might.  Or we can
>> just say "Hey, the whole folio is dirty now" and not try to track
>> dirtiness on a per-page granularity.
>>
>> The latter course seems to have been taken, modulo the bugs, but I
>> don't know if any thought was taken or whether it was done by rote.
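
The two options described above can be contrasted with a small userspace model. This is only a sketch, not kernel code: struct sketch_folio, the sketch_* helpers, the 4kB page size and the 64-page limit are all assumptions made for illustration, and the bitmap stands in for whatever sub-folio dirty state a filesystem might keep in its private data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12u	/* assumption: 4kB base pages */
#define SKETCH_MAX_PAGES  64u	/* assumption: one 64-bit word of per-page bits */

/* Hypothetical stand-in for a folio plus fs-private sub-folio dirty state. */
struct sketch_folio {
	unsigned int nr_pages;	/* 1..SKETCH_MAX_PAGES */
	bool dirty;		/* the single folio-wide dirty flag */
	uint64_t subpage_dirty;	/* bit i set => page i of the folio is dirty */
};

/* Option 1: the fault dirties only the 4kB page that was written to. */
static void sketch_mkwrite_fine(struct sketch_folio *f, size_t offset_in_folio)
{
	unsigned int idx = (unsigned int)(offset_in_folio >> SKETCH_PAGE_SHIFT);

	assert(idx < f->nr_pages);
	f->subpage_dirty |= UINT64_C(1) << idx;
	f->dirty = true;	/* folio flag is set as soon as any page is dirty */
}

/* Option 2: "the whole folio is dirty now", no per-page tracking. */
static void sketch_mkwrite_coarse(struct sketch_folio *f)
{
	assert(f->nr_pages >= 1 && f->nr_pages <= SKETCH_MAX_PAGES);
	f->subpage_dirty = f->nr_pages == SKETCH_MAX_PAGES ?
		UINT64_MAX : (UINT64_C(1) << f->nr_pages) - 1;
	f->dirty = true;
}

/* Number of bytes writeback would have to flush under this model. */
static size_t sketch_dirty_bytes(const struct sketch_folio *f)
{
	size_t pages = 0;

	for (unsigned int i = 0; i < f->nr_pages; i++)
		if (f->subpage_dirty & (UINT64_C(1) << i))
			pages++;
	return pages << SKETCH_PAGE_SHIFT;
}
```

In this model a single fault at offset 0x3000 in a 16-page folio leaves 4kB to flush with the fine-grained variant and 64kB with the coarse one; the folio-wide flag is set in both cases.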

> Done by rote, I'm pretty sure.
>
> If each individual page retains its own dirty bit, what does
> folio_test_dirty return when its pages are only partially dirty? I guess
> the folio is still dirty even if some of its pages are clean?
>
> Ceph can do a vectored write if a folio has disjoint dirty regions that
> we need to flush. Hashing out an API to handle that with the netfs layer
> is going to be "interesting" though.
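
To make the "disjoint dirty regions" case concrete, here is a rough userspace sketch of turning a per-page dirty bitmap into byte extents that a vectored write could consume. Nothing here is an existing Ceph or netfs interface: struct sketch_extent, sketch_dirty_extents() and the 4kB page size are hypothetical names and assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4kB base pages */

/* Hypothetical byte extent within a folio. */
struct sketch_extent {
	size_t off;	/* byte offset from the start of the folio */
	size_t len;	/* length in bytes */
};

/*
 * Coalesce each run of consecutive dirty pages into one extent.
 * Returns the number of extents written to out[] (at most max_out).
 */
static size_t sketch_dirty_extents(uint64_t bitmap, unsigned int nr_pages,
				   struct sketch_extent *out, size_t max_out)
{
	size_t n = 0;
	unsigned int i = 0;

	assert(nr_pages <= 64);
	while (i < nr_pages && n < max_out) {
		unsigned int start;

		/* Skip clean pages. */
		while (i < nr_pages && !(bitmap & (UINT64_C(1) << i)))
			i++;
		if (i == nr_pages)
			break;

		/* Walk to the end of this dirty run. */
		start = i;
		while (i < nr_pages && (bitmap & (UINT64_C(1) << i)))
			i++;

		out[n].off = (size_t)start * SKETCH_PAGE_SIZE;
		out[n].len = (size_t)(i - start) * SKETCH_PAGE_SIZE;
		n++;
	}
	return n;
}
```

With pages 1-2 and 6-7 of an 8-page folio dirty (bitmap 0xC6), this yields two extents, 8kB at offset 4kB and 8kB at offset 24kB, instead of one 32kB write.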

Yeah, sounds reasonable to me.

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 6dee88815491..fb346b929f65 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1503,8 +1503,8 @@ static vm_fault_t ceph_page_mkwrite(struct vm_fault *vmf)
  	struct ceph_inode_info *ci = ceph_inode(inode);
  	struct ceph_file_info *fi = vma->vm_file->private_data;
  	struct ceph_cap_flush *prealloc_cf;
-	struct page *page = vmf->page;
-	loff_t off = page_offset(page);
+	struct folio *folio = page_folio(vmf->page);
+	loff_t pos = folio_pos(folio);
  	loff_t size = i_size_read(inode);
  	size_t len;
  	int want, got, err;
@@ -1521,50 +1521,50 @@ static vm_fault_t ceph_page_mkwrite(struct vm_fault *vmf)
  	sb_start_pagefault(inode->i_sb);
  	ceph_block_sigs(&oldset);
  
-	if (off + thp_size(page) <= size)
-		len = thp_size(page);
+	if (pos + folio_size(folio) <= size)
+		len = folio_size(folio);
  	else
-		len = offset_in_thp(page, size);
+		len = offset_in_folio(folio, size);
  
  	dout("page_mkwrite %p %llx.%llx %llu~%zd getting caps i_size %llu\n",
-	     inode, ceph_vinop(inode), off, len, size);
+	     inode, ceph_vinop(inode), pos, len, size);
  	if (fi->fmode & CEPH_FILE_MODE_LAZY)
  		want = CEPH_CAP_FILE_BUFFER | CEPH_CAP_FILE_LAZYIO;
  	else
  		want = CEPH_CAP_FILE_BUFFER;
  
  	got = 0;
-	err = ceph_get_caps(vma->vm_file, CEPH_CAP_FILE_WR, want, off + len, &got);
+	err = ceph_get_caps(vma->vm_file, CEPH_CAP_FILE_WR, want, pos + len, &got);
  	if (err < 0)
  		goto out_free;
  
  	dout("page_mkwrite %p %llu~%zd got cap refs on %s\n",
-	     inode, off, len, ceph_cap_string(got));
+	     inode, pos, len, ceph_cap_string(got));
  
-	/* Update time before taking page lock */
+	/* Update time before taking folio lock */
  	file_update_time(vma->vm_file);
  	inode_inc_iversion_raw(inode);
  
  	do {
  		struct ceph_snap_context *snapc;
  
-		lock_page(page);
+		folio_lock(folio);
  
-		if (page_mkwrite_check_truncate(page, inode) < 0) {
-			unlock_page(page);
+		if (folio_mkwrite_check_truncate(folio, inode) < 0) {
+			folio_unlock(folio);
  			ret = VM_FAULT_NOPAGE;
  			break;
  		}
  
-		snapc = ceph_find_incompatible(page);
+		snapc = ceph_find_incompatible(&folio->page);
  		if (!snapc) {
-			/* success.  we'll keep the page locked. */
-			set_page_dirty(page);
+			/* success.  we'll keep the folio locked. */
+			folio_mark_dirty(folio);
  			ret = VM_FAULT_LOCKED;
  			break;
  		}
  
-		unlock_page(page);
+		folio_unlock(folio);
  
  		if (IS_ERR(snapc)) {
  			ret = VM_FAULT_SIGBUS;
@@ -1588,7 +1588,7 @@ static vm_fault_t ceph_page_mkwrite(struct vm_fault *vmf)
  	}
  
  	dout("page_mkwrite %p %llu~%zd dropping cap refs on %s ret %x\n",
-	     inode, off, len, ceph_cap_string(got), ret);
+	     inode, pos, len, ceph_cap_string(got), ret);
  	ceph_put_cap_refs_async(ci, got);
  out_free:
  	ceph_restore_sigs(&oldset);
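
For reference, the length clamp near the top of the patch (the folio_size() / offset_in_folio() pair) can be modelled in userspace as follows. This is only a sketch: sketch_mkwrite_len() is a hypothetical name, and it assumes the folio size is a power of two and that pos is aligned to it, which is what allows the in-folio offset to be computed with a mask.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Model of the "len" computation: cover the whole folio if it lies
 * entirely below i_size, otherwise stop at i_size.
 */
static size_t sketch_mkwrite_len(uint64_t pos, size_t folio_size,
				 uint64_t i_size)
{
	/* Assumptions: power-of-two folio size, naturally aligned pos. */
	assert(folio_size && (folio_size & (folio_size - 1)) == 0);
	assert((pos & (folio_size - 1)) == 0);

	if (pos + folio_size <= i_size)
		return folio_size;

	/* Stands in for offset_in_folio(folio, size) under the assumptions above. */
	return (size_t)(i_size & (folio_size - 1));
}
```

For a 16kB folio at pos 16384 and an i_size of 20000, the model gives len = 3616, i.e. only the bytes up to EOF; with i_size at 32768 or beyond it gives the full 16384.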