Re: No one seems to be using AOP_WRITEPAGE_ACTIVATE?

Chris Mason <chris.mason@xxxxxxxxxx> · Mon, 26 Apr 2010 13:24:42 -0400

On Mon, Apr 26, 2010 at 10:50:45AM -0400, Theodore Tso wrote:
> 
> On Apr 26, 2010, at 6:18 AM, KOSAK
> > AOP_WRITEPAGE_ACTIVATE was introduced for ramdisk and tmpfs thing
> > (and later rd choosed to use another way).
> > Then, It assume writepage refusing aren't happen on majority pages.
> > IOW, the VM assume other many pages can writeout although the page can't.
> > Then, the VM only make page activation if AOP_WRITEPAGE_ACTIVATE is returned.
> > but now ext4 and btrfs refuse all writepage(). (right?)
> 
> No, not exactly.   Btrfs refuses the writepage() in the direct reclaim
> cases (i.e., if PF_MEMALLOC is set), but will do writepage() in the
> case of zone scanning.  I don't want to speak for Chris, but I assume
> it's due to stack depth concerns --- if it was just due to worrying
> about fs recursion issues, i assume all of the btrfs allocations could
> be done GFP_NOFS.
> 

Btrfs refuses all PF_MEMALLOC writepage.  It will go ahead and process a
regular writepage but in practice that never happens...everyone else
except a few internal btrfs callers use writepages.

I wish I had thought of stack depth back then, but really this was to
keep kswapd out of the heavy work done by delalloc.  From a locking
point of view we're properly GPF_NOFS, so its safe, but it just isn't a
great way to use precious PF_MEMALLOC cycles.

> Ext4 is slightly different; it refuses writepages() if the inode
> blocks for the page haven't yet been allocated.  (Regardless of
> whether it's happening for direct reclaim or zone scanning.)  However,
> if the on-disk block has been assigned (i.e., this isn't a delalloc
> case), ext4 will honor the writepage().   (i.e., if this is an mmap of
> an already existing file, or if the space has been pre-allocated using
> fallocate()).    The reason for ext4's concern is lock ordering,
> although I'm investigating whether I can fix this.   If we call
> set_page_writeback() to set PG_writeback (plus set the various bits of
> magic fs accounting), and then drop the page_lock, does that protect
> us from random changes happening to the page (i.e., from vmtruncate,
> etc.)?

PG_writeback will protect you from vmtruncate, but may also want to
have page_mkwrite wait for pages in flight.

> 
> > 
> > IOW, I don't think such documentation suppose delayed allocation issue ;)
> > 
> > The point is, Our dirty page accounting only account per-system-memory
> > dirty ratio and per-task dirty pages. but It doesn't account per-numa-node
> > nor per-zone dirty ratio. and then, to refuse write page and fake numa
> > abusing can make confusing our vm easily. if _all_ pages in our VM LRU
> > list (it's per-zone), page activation doesn't help. It also lead to OOM.
> > 
> > And I'm sorry. I have to say now all vm developers fake numa is not
> > production level quority yet. afaik, nobody have seriously tested our
> > vm code on such environment. (linux/arch/x86/Kconfig says "This is only 
> > useful for debugging".)
> 

> So I'm sorry I mentioned the fake numa bit, since I think this is a
> bit of a red herring.   That code is in production here, and we've
> made all sorts of changes so ti can be used for more than just
> debugging.  So please ignore it, it's our local hack, and if it breaks
> that's our problem.    More importantly, just two weeks ago I talked
> to soeone in the financial sector, who was testing out ext4 on an
> upstream kernel, and not using our hacks that force 128MB zones, and
> he ran into the ext4/OOM problem while using an upstream kernel.  It
> involved Oracle pinning down 3G worth of pages, and him trying to do a
> huge streaming backup (which of course wasn't using fallocate or
> direct I/O) under ext4, and he had the same issue --- an OOM, that I'm
> pretty sure was caused by the fact that ext4_writepage() was refusing
> the writepage() and most of the pages weren't nailed down by Oracle
> were delalloc.    The same test scenario using ext3 worked just fine,
> of course.
> 
> Under normal cases it's not a problem since statistically there should
> be enough other pages in the system compared to the number of pages
> that are subject to delalloc, such that pages can usually get pushed
> out until the writeback code can get around to writing out the pages.
> But in cases where the zones have been made artificially small, or you
> have a big program like Oracle pinning down a large number of pages,
> then of course we have problems. 
> 
> I'm trying to fix things from the file system side, which means trying
> to understand magic flags like AOP_WRITEPAGE_ACTIVATE, which is
> described in Documentation/filesystems/Locking as something which MUST
> be used if writepage() is going refuse a page.  And then I discovered
> no one is actually using it.   So that's why I was asking with respect
> whether the Locking documentation file was out of date, or whether all
> of the file systems are doing it wrong.
> 

> On a related example of how file system code isnt' necessarily
> following what is required/recommended by the Locking documentation,
> ext2 and ext3 are both NOT using
> set_page_writeback()/end_page_writeback(), but are rather keeping the
> page locked until after they call block_write_full_page(), because of
> concerns of truncate coming in and screwing things up.

block_write_full_page takes a locked page and if all goes well produces
a writeback page without the page locked.  Basically it needs the page
locked until after it has the writeback bit set to protect against
truncate and make sure the page buffers don't go away while it is
looping over them.

So, I don't think ext23 are breaking the rules here.

>But now looking at Locking, it appears that set_page_writeback() is as
>good as page_lock() for preventing the truncate code from coming in and
>screwing everything up?   It's not clear to me exactly what locking
>guarantees are provided against truncate by set_page_writeback().   And
>suppose we are writing out a whole cluster of pages, say 4MB worth of
>pages; do we need to call set_page_writeback() on every single page in
>the cluster before we do the I/O to make sure things don't change out
>from under us?  (I'm pretty sure at least some of the other filesystems
>that are submitting huge numbers of pages using bio instead of 4k at a
>time like ext2/3/4 aren' t calling set_page_writeback() on all of the
>pages first.)
> 
> Part of the problem is that the writeback Locking semantics aren't
> well documented, and where they are documented, it's not clear they
> are up to date --- and all of the file systems that are doing delayed
> allocation writeback are doing things slightly differently, or in some
> cases very differently.    (And even without delalloc, as I've pointed
> out ext2/3 don't use set_page_writeback() --- if this is a MUST USE as
> implied by the Locking file, why did whoever added this requirement
> didn't go in and modify common filesystems like ext2 and ext3 to use
> the set_page_writeback/end_page_writeback calls?)
> 
> I'm happy to change things in ext4; in fact I'm pretty sure ext4
> probably isn't completely right here.   But it's not clear what
> "right" actually is, and when I look to see what protects writepage()
> racing with vmtruncate(), it's enough to give me a headache.  :-(    
> 
> Hence my question about wouldn't it be simpler if we simply added more
> high-level locking to prevent truncate from racing against
> writepage/writeback.  

My understanding of the current scheme is that truncate will wait on
both locked and writeback pages.  The page lock is used while setting
up the page for writeback, which is true both for writepages and
writepage.

I don't think we need a new lock on top of the page lock and the
writeback bit, but maybe I don't see exactly which problem you're
solving.  A given range of pages is either:

1) allocated but not under IO.  ext4 must write these pages to disk
before truncate can finish for data=ordered reasons, unless it manages
to log the orphan item.  Figuring out dependency between the orphan
item, which i_size is on disk right now, and holes is pretty tricky, so
I'd go with the less complex: just wait for all the allocated delalloc
pages to hit the disk.

2) Allocated and under IO.  These pages go to disk.

3) Delalloc and not under IO.  Truncate (or notify_change if you lean
toward the xfs crowd) should be able to clean these up
without waiting for the IO.

Of the three, #3 is probably the most common, which #1 a close second.
Is this a case that we really need to optimize for?

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html