On 14 Oct 2020, at 11:38, Matthew Wilcox wrote:
On Wed, Oct 14, 2020 at 10:50:51AM -0400, Chris Mason wrote:
On 14 Oct 2020, at 9:49, Matthew Wilcox wrote:
Our handling of PagePrivate, page->private and PagePrivate2 is a
giant
mess. Let's recap.
Filesystems which use bufferheads (ie most of them) set
page->private
to point to a buffer_head, set the PagePrivate bit and increment the
refcount on the page.
The vmscan pageout code (*) needs to know whether a page is
freeable:
if (!is_page_cache_freeable(page))
return PAGE_KEEP;
... where is_page_cache_freeable() contains ...
return page_count(page) - page_has_private(page) == 1 +
page_cache_pins;
That's a little inscrutable, but the important thing is that if
page_has_private() is true, then the page's reference count is
supposed
to be one higher than it would be otherwise. And that makes sense
given
how "having bufferheads" means "page refcount ges incremented".
But page_has_private() doesn't actually mean "PagePrivate is set".
It means "PagePrivate or PagePrivate2 is set". And I don't
understand
how filesystems are supposed to keep that straight -- if we're
setting
PagePrivate2, and PagePrivate is clear, increment the refcount?
If we're clearing PagePrivate, decrement the refcount if
PagePrivate2
is also clear?
At least for btrfs, only PagePrivate elevates the refcount on the
page.
PagePrivate2 means:
This page has been properly setup for COW’d IO, and it went through
the
normal path of page_mkwrite() or file_write() instead of being
silently
dirtied by a deep corner of the MM.
What's not clear to me is whether btrfs can be in the situation where
PagePrivate2 is set and PagePrivate is clear. If so, then we have a
bug
to fix.
I don’t think it’ll happen. Everyone in btrfs setting PagePrivate2
seems to have already called attach_page_private(). It’s possible I
missed a corner but I don’t think so.
We introduced attach_page_private() and detach_page_private()
earlier
this year to help filesystems get the refcount right. But we still
have a few filesystems using PagePrivate themselves (afs, btrfs,
ceph,
crypto, erofs, f2fs, jfs, nfs, orangefs & ubifs) and I'm not
convinced
they're all getting it right.
Here's a bug I happened on while looking into this:
if (page_has_private(page))
attach_page_private(newpage,
detach_page_private(page));
if (PagePrivate2(page)) {
ClearPagePrivate2(page);
SetPagePrivate2(newpage);
}
The aggravating thing is that this doesn't even look like a bug.
You have to be in the kind of mood where you're thinking "What if
page
has Private2 set and Private clear?" and the answer is that newpage
ends up with PagePrivate set, but page->private set to NULL.
Do you mean PagePrivate2 set but page->private NULL?
Sorry, I skipped a step of the explanation.
page_has_private returns true if Private or Private2 is set. So if
page has PagePrivate clear and PagePrivate2 set, newpage will end up
with both PagePrivate and PagePrivate2 set -- attach_page_private()
doesn't check whether the pointer is NULL (and IMO, it shouldn't).
I see what you mean, given how often we end up calling
btrfs_migratepage() in the fleet, I’m willing to say this is probably
fine. But it’s easy enough to toss in a warning.
Given our current macros, what was _meant_ here was:
if (PagePrivate(page))
attach_page_private(newpage,
detach_page_private(page));
but that's not obviously right.
Btrfs should only hage
PagePrivate2 set on pages that are formally in our writeback state
machine,
so it’ll get cleared as we unwind through normal IO or truncate
etc. For
data pages, btrfs page->private is simply set to 1 so the MM will
kindly
call releasepage for us.
That's not what I'm seeing here:
static void attach_extent_buffer_page(struct extent_buffer *eb,
struct page *page)
{
if (!PagePrivate(page))
attach_page_private(page, eb);
else
WARN_ON(page->private != (unsigned long)eb);
}
Or is that not a data page?
Correct, extent_buffers are metadata only. They wouldn’t be Private2.
So what shold we do about all this? First, I want to make the code
snippet above correct, because it looks right. So
page_has_private()
needs to test just PagePrivate and not PagePrivate2. Now we need a
new function to call to determine whether the filesystem needs its
invalidatepage callback invoked. Not sure what that should be
named.
I haven’t checked all the page_has_private() callers, but maybe
page_has_private() should stay the same and add page_private_count()
for
times where we need to get out our fingers and toes for the refcount
math.
I was thinking about page_expected_count() which returns the number of
references from the page cache plus the number of references from
the various page privates. So is_page_cache_freeable() becomes:
return page_count(page) == page_expected_count(page) + 1;
can_split_huge_page() becomes:
if (page_has_private(page))
return false;
return page_count(page) == page_expected_count(page) +
total_mapcount(page) + 1;
I think I also want to rename PG_private_2 to PG_owner_priv_2.
There's a clear relationship between PG_private and page->private.
There is no relationship between PG_private_2 and page->private, so
it's
a misleading name. Or maybe it should just be PG_fscache and btrfs
can
find some other way to mark the pages?
Btrfs should be able to flip bits in page->private to cover our
current
usage of PG_private_2. If we ever attach something real to
page->private,
we can flip bits in that instead. It’s kinda messy though and
we’d have to
change attach_page_private a little to reflect its new life as a bit
setting
machine.
It's not great, but with David wanting to change how PageFsCache is
used,
it may be unavoidable (I'm not sure if he's discussed that with you
yet)
https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/commit/?h=fscache-iter&id=6f10fd7766ed6d87c3f696bb7931281557b389f5
shows part of it
-- essentially he wants to make PagePrivate2 mean that I/O is
currently
ongoing to an fscache, and so truncate needs to wait on it being
finished.
It’s fine-ish for btrfs since we’re setting Private2 only on pages
that are either writeback or will be writeback really soon. But, I’d
prefer this end up in a callback so that we don’t have magic meaning
for magic flags in magic places.
-chris