On 5/13/22 8:53 PM, Matthew Wilcox wrote:
On Fri, May 13, 2022 at 10:40:05AM +0100, Luís Henriques wrote:
Matthew Wilcox <willy@xxxxxxxxxxxxx> writes:
On Thu, May 12, 2022 at 10:41:41PM -0400, Josef Bacik wrote:
On Thu, May 12, 2022 at 09:54:59PM +0100, Matthew Wilcox wrote:
The LWN writeup [1] on merging the MGLRU reminded me that I need to send
out a plan for removing page flags that we can do without.
1. PG_error. It's basically useless. If the page was read successfully,
PG_uptodate is set. If not, PG_uptodate is clear. The page cache
doesn't use PG_error. Some filesystems do, and we need to transition
them away from using it.
What about writes? A cursory look shows we don't clear Uptodate if we fail to
write, which I think is correct. The only way to indicate a write error for
checking later is the page error flag.
On encountering a write error, we're supposed to call mapping_set_error(),
not SetPageError().
2. PG_private. This tells us whether we have anything stored at
page->private. We can just check if page->private is NULL or not.
No need to have this extra bit. Again, there may be some filesystems
that are a bit wonky here, but I'm sure they're fixable.
At least for Btrfs we serialize the page->private with the private_lock, so we
could probably just drop PG_private, but it's kind of nice to check first before
we have to take the spin lock. I suppose we can just do
if (page->private)
	/* do lock and check thingy */
That's my hope! I think btrfs is already using folio_attach_private() /
attach_page_private(), which makes everything easier. Some filesystems
still manipulate page->private and PagePrivate by hand.
In ceph we recently [1] spent some time debugging a bug related to
->private not being NULL even though we expected it to be. The fix we
found was to replace the NULL check with folio_test_private() instead,
but we _may_ not have figured the whole thing out.
We assumed that folios were being recycled without being cleaned up. The
values we were seeing in ->private looked like some sort of flags, as
only a few bits were set (e.g. 0x0200000):
[ 1672.578313] page:00000000e23868c1 refcount:2 mapcount:0 mapping:0000000022e0d3b4 index:0xd8 pfn:0x74e83
[ 1672.581934] aops:ceph_aops [ceph] ino:10000016c9e dentry name:"faed"
[ 1672.584457] flags: 0x4000000000000015(locked|uptodate|lru|zone=1)
[ 1672.586878] raw: 4000000000000015 ffffea0001d3a108 ffffea0001d3a088 ffff888003491948
[ 1672.589894] raw: 00000000000000d8 0000000000200000 00000002ffffffff 0000000000000000
[ 1672.592935] page dumped because: VM_BUG_ON_FOLIO(1)
[1] https://lore.kernel.org/all/20220508061543.318394-1-xiubli@xxxxxxxxxx/
I remember Jeff asking me about this problem a few days ago. A folio
passed to you in ->dirty_folio() or ->invalidate_folio() belongs to
your filesystem. Nobody else should be storing to the ->private field;
there's no race that could lead to it being freed while you see it.
There may, of course, be bugs that are overwriting folio->private, but
it's definitely not supposed to happen. I agree with you that it looks
like a bit has been set (is it possibly bad RAM?)
I don't think so.
Please see the values I saw from my tests locally below.
We do use page->private in the buddy allocator, but that stores the order
of the page; it wouldn't be storing 1<<21. PG flag 21 is PG_mlocked,
which seems like a weird one to be setting in the wrong field, so probably
not that.
Is it always bit 21 that gets set?
No, from my tests I can reproduce it locally very easily, and almost every
time the values were different, apparently random, e.g.:
0000000000040000,0000000000070000,000000000002e0000...
For more details, please see https://tracker.ceph.com/issues/55421.
I am sure that for all the non-zero values the lower 16 bits were 0000.
-- Xiubo