Re: [patch][rfc] mm: new address space calls

Nick Piggin <npiggin@xxxxxxx> · Sat, 28 Feb 2009 06:52:21 +0100

On Fri, Feb 27, 2009 at 08:52:47AM -0500, Chris Mason wrote:
> On Fri, 2009-02-27 at 12:26 +0100, Nick Piggin wrote:
> > Well I don't see how that limits us? Either we prefer to keep the
> > metadata, or we throw it away and it is inevitable that we lose
> > information. 
> > 
> 
> We can't have metadata that isn't freed by releasepage unless we want to
> pin the page completely.  There was a time when the btrfs metadata had a
> bit for 'this block needs defrag', and I ended up not being able to use
> it because releasepage was consistently freeing my extra data while the
> page was still around.

Hmm, it sounds like that data perhaps is more a property of the
filesystem / block management rather than the pagecache (OK, it's
a blurry line)...

But I mean 'this block needs defrag' sounds like important metadata
even if the page is *not* still around? (but the block is)

Having your own private metadata, perhaps with a ->shrinker callback,
is an option. In fsblock, for the block mapping cache tree, I actually
don't use a shrinker, because (I'm lazy and) reclaim will eventually
reclaim the inode, in which case the tree will be taken down with the
new aop->release callback.

But in theory even when the in-memory inode goes away, the block mapping
is still valid metadata, so you could keep it around somewhere (in which
case it would need a shrinker callback).
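
To make the "private metadata plus shrinker" idea a bit more concrete,
here is a rough userspace sketch. It is a toy model only, not kernel code
and not what fsblock or btrfs does: every toy_* name is hypothetical, and
the shrink function just imitates the general shape of a shrinker (asked
to scan some number of objects, it reports how many it freed). The point
is that the 'needs defrag' bit is keyed by block number, so it survives
the page going away and is dropped only under memory pressure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-block metadata, kept outside the pagecache. */
struct toy_block_meta {
	unsigned long block;
	bool needs_defrag;
	struct toy_block_meta *next;
};

struct toy_meta_cache {
	struct toy_block_meta *head;
	size_t nr;	/* number of cached entries */
};

/* Mark a block as needing defrag; returns false on allocation failure. */
static bool toy_meta_set_defrag(struct toy_meta_cache *c, unsigned long block)
{
	struct toy_block_meta *m;

	for (m = c->head; m; m = m->next) {
		if (m->block == block) {
			m->needs_defrag = true;
			return true;
		}
	}
	m = malloc(sizeof(*m));
	if (!m)
		return false;
	m->block = block;
	m->needs_defrag = true;
	m->next = c->head;
	c->head = m;
	c->nr++;
	return true;
}

/* Unknown blocks read as "no defrag needed": losing an entry is safe. */
static bool toy_meta_needs_defrag(const struct toy_meta_cache *c,
				  unsigned long block)
{
	const struct toy_block_meta *m;

	for (m = c->head; m; m = m->next)
		if (m->block == block)
			return m->needs_defrag;
	return false;
}

/*
 * Shrinker-style callback: free up to nr_to_scan entries, starting from
 * the most recently added one, and return how many were freed.
 */
static size_t toy_meta_shrink(struct toy_meta_cache *c, size_t nr_to_scan)
{
	size_t freed = 0;

	while (c->head && freed < nr_to_scan) {
		struct toy_block_meta *m = c->head;

		assert(c->nr > 0);
		c->head = m->next;
		free(m);
		c->nr--;
		freed++;
	}
	return freed;
}
```

Nothing here is tied to a page or an inode, which is why the cache needs
its own way of being trimmed.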

> > > I'd like a form of releasepage that knows if the vm is going to really
> > > get rid of the page.  Or another callback that happens when the VM is
> > > sure the page will be freed so we can drop extra metadata that doesn't
> > > pin the page, but we always want to stay with the page.
> > 
> > Well, for page reclaim/invalidate/truncate, we have releasepage that you
> > can use even if the metadata is stored outside the page, just set PagePrivate
> > and it will still get called when the page is about to be freed.
> > 
> 
> For clean pages, shrink_page_list seems to check the page count after
> the releasepage call.  It was a big enough window for me to see it in
> practice under normal workloads.

Oh yes, you would see it, but it just shouldn't be *too* common I think.
It's a hard race to close. You would need to effectively take a spinlock
to prevent pagecache lookup over the releasepage call (OK, with lockless
pagecache it is no longer really tree_lock, but setting page->_count to
0, which causes lookup to basically do equivalent spinning anyway).
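
The window can be sketched in userspace like this. It is only a toy model
under loud assumptions: the toy_* names are made up, the struct is not
struct page, and the "expected count of 2" (one reference for the
pagecache, one held by reclaim) is how I am modelling the refcount check
rather than a quote of the reclaim code. A concurrent lookup is simulated
with a flag instead of a real second thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-in for a page; not the kernel's struct page. */
struct toy_page {
	int count;		/* pagecache ref + reclaim ref + users */
	void *private_data;	/* filesystem metadata attached to the page */
};

/* Models ->releasepage: drop the metadata and report success. */
static bool toy_releasepage(struct toy_page *page)
{
	free(page->private_data);
	page->private_data = NULL;
	return true;
}

/* Models the refcount freeze: succeeds only if nobody else holds a ref. */
static bool toy_freeze_refs(struct toy_page *page, int expected)
{
	if (page->count != expected)
		return false;
	page->count = 0;	/* lookups would now spin or retry */
	return true;
}

/*
 * Models reclaim of a clean page. racing_lookup simulates a pagecache
 * lookup taking a reference between releasepage and the count check.
 * Returns true if the page would really be freed.
 */
static bool toy_reclaim(struct toy_page *page, bool racing_lookup)
{
	assert(page->count >= 2);

	if (page->private_data && !toy_releasepage(page))
		return false;

	if (racing_lookup)
		page->count++;	/* someone found the page in the meantime */

	return toy_freeze_refs(page, 2);
}
```

With racing_lookup set, toy_reclaim() returns false but private_data is
already gone: the page is still around and its metadata is not, which is
the case Chris is describing.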

Of course it still may be closed with a new callback at pagecache
removal time... but I'm not convinced you need one yet ;) Maybe I don't
understand the requirements properly yet.

> > There are *some* races that can result in the page subsequently not being
> > freed, but I don't think that should be a big deal. I don't want to add
> > a callback in the pagecache remove path if possible, but we can try to
> > rework or improve things if btrfs needs something specific..
> 
> Btrfs doesn't need it today, but it should help once I finally get
> subpage blocks going again (and metadata defrag as well).