Re: Freeing page flags

Jeff Layton <jlayton@xxxxxxxxxx> · Fri, 13 May 2022 09:21:11 -0400

On Fri, 2022-05-13 at 13:53 +0100, Matthew Wilcox wrote:
> On Fri, May 13, 2022 at 10:40:05AM +0100, Luís Henriques wrote:
> > Matthew Wilcox <willy@xxxxxxxxxxxxx> writes:
> > 
> > > On Thu, May 12, 2022 at 10:41:41PM -0400, Josef Bacik wrote:
> > > > On Thu, May 12, 2022 at 09:54:59PM +0100, Matthew Wilcox wrote:
> > > > > The LWN writeup [1] on merging the MGLRU reminded me that I need to send
> > > > > out a plan for removing page flags that we can do without.
> > > > > 
> > > > > 1. PG_error.  It's basically useless.  If the page was read successfully,
> > > > > PG_uptodate is set.  If not, PG_uptodate is clear.  The page cache
> > > > > doesn't use PG_error.  Some filesystems do, and we need to transition
> > > > > them away from using it.
> > > > > 
> > > > 
> > > > What about writes?  A cursory look shows we don't clear Uptodate if we fail to
> > > > write, which is correct I think.  The only way to indicate we had a write error
> > > > to check later is the page error.
> > > 
> > > On encountering a write error, we're supposed to call mapping_set_error(),
> > > not SetPageError().
> > > 
> > > > > 2. PG_private.  This tells us whether we have anything stored at
> > > > > page->private.  We can just check if page->private is NULL or not.
> > > > > No need to have this extra bit.  Again, there may be some filesystems
> > > > > that are a bit wonky here, but I'm sure they're fixable.
> > > > > 
> > > > 
> > > > At least for Btrfs we serialize the page->private with the private_lock, so we
> > > > could probably just drop PG_private, but it's kind of nice to check first before
> > > > we have to take the spin lock.  I suppose we can just do
> > > > 
> > > > if (page->private)
> > > > 	// do lock and check thingy
> > > 
> > > That's my hope!  I think btrfs is already using folio_attach_private() /
> > > attach_page_private(), which makes everything easier.  Some filesystems
> > > still manipulate page->private and PagePrivate by hand.
> > 
> > In ceph we've recently [1] spent a bit of time debugging a bug related
> > with ->private not being NULL even though we expected it to be.  The
> > solution found was to replace the check for NULL and use
> > folio_test_private() instead, but we _may_ have not figured the whole
> > thing out.
> > 
> > We assumed that folios were being recycled and not cleaned-up.  The values
> > we were seeing in ->private looked like they were some sort of flags as
> > only a few bits were set (e.g. 0x0200000):
> > 
> > [ 1672.578313] page:00000000e23868c1 refcount:2 mapcount:0 mapping:0000000022e0d3b4 index:0xd8 pfn:0x74e83
> > [ 1672.581934] aops:ceph_aops [ceph] ino:10000016c9e dentry name:"faed"
> > [ 1672.584457] flags: 0x4000000000000015(locked|uptodate|lru|zone=1)
> > [ 1672.586878] raw: 4000000000000015 ffffea0001d3a108 ffffea0001d3a088 ffff888003491948
> > [ 1672.589894] raw: 00000000000000d8 0000000000200000 00000002ffffffff 0000000000000000
> > [ 1672.592935] page dumped because: VM_BUG_ON_FOLIO(1)
> > 
> > [1] https://lore.kernel.org/all/20220508061543.318394-1-xiubli@xxxxxxxxxx/
> 
> I remember Jeff asking me about this problem a few days ago.  A folio
> passed to you in ->dirty_folio() or ->invalidate_folio() belongs to
> your filesystem.  Nobody else should be storing to the ->private field;
> there's no race that could lead to it being freed while you see it.
> There may, of course, be bugs that are overwriting folio->private, but
> it's definitely not supposed to happen.  I agree with you that it looks
> like a bit has been set (is it possibly bad RAM?)
> 
> We do use page->private in the buddy allocator, but that stores the order
> of the page; it wouldn't be storing 1<<21.  PG flag 21 is PG_mlocked,
> which seems like a weird one to be setting in the wrong field, so probably
> not that.
> 
> Is it always bit 21 that gets set?

No, it varies, but it was always just a few bits in the field that end
up being set. I was never able to reproduce it locally, but saw it in a
run in ceph's teuthology lab a few times. Xiubo did the most digging
here, so he may be able to add more info.

Basically though, we call __filemap_get_folio in netfs_write_begin and
it will sometimes give us a folio that has PG_private clear, but the
->private field has just a few bits that aren't zeroed out. I'm pretty
sure we zero out that field in ceph, so the theory was that the page was
traveling through some other subsystem before coming to us.

He wasn't able to ascertain the cause, and just decided to check for
PG_private instead since you (presumably) shouldn't trust ->private
unless that's set anyway.
-- 
Jeff Layton <jlayton@xxxxxxxxxx>