Re: [PATCH 2/3] writeback: allow for dirty metadata accounting

Dave Chinner <david@xxxxxxxxxxxxx> · Tue, 13 Sep 2016 09:01:52 +1000

On Mon, Sep 12, 2016 at 10:24:12AM -0400, Josef Bacik wrote:
> Dave your reply got eaten somewhere along the way for me, so all i
> have is this email.  I'm going to respond to your stuff here.

No worries, I'll do a 2-in-1 reply :P

> On 09/12/2016 03:34 AM, Jan Kara wrote:
> >On Mon 12-09-16 10:46:56, Dave Chinner wrote:
> >>On Fri, Sep 09, 2016 at 10:17:43AM +0200, Jan Kara wrote:
> >>>On Mon 22-08-16 13:35:01, Josef Bacik wrote:
> >>>>Provide a mechanism for file systems to indicate how much dirty metadata they
> >>>>are holding.  This introduces a few things
> >>>>
> >>>>1) Zone stats for dirty metadata, which is the same as the NR_FILE_DIRTY.
> >>>>2) WB stat for dirty metadata.  This way we know if we need to try and call into
> >>>>the file system to write out metadata.  This could potentially be used in the
> >>>>future to make balancing of dirty pages smarter.
> >>>
> >>>So I'm curious about one thing: In the previous posting you have mentioned
> >>>that the main motivation for this work is to have a simple support for
> >>>sub-pagesize dirty metadata blocks that need tracking in btrfs. However you
> >>>do the dirty accounting at page granularity. What are your plans to handle
> >>>this mismatch?
> >>>
> >>>The thing is you actually shouldn't miscount by too much as that could
> >>>upset some checks in mm checking how much dirty pages a node has directing
> >>>how reclaim should be done... But it's a question whether NR_METADATA_DIRTY
> >>>should be actually used in the checks in node_limits_ok() or in
> >>>node_pagecache_reclaimable() at all because once you start accounting dirty
> >>>slab objects, you are really on a thin ice...
> >>
> >>The other thing I'm concerned about is that it's a btrfs-only thing,
> >>which means having dirty btrfs metadata on a system with different
> >>filesystems (e.g. btrfs root/home, XFS data) is going to affect how
> >>memory balance and throttling is run on other filesystems. i.e. it's
> >>going ot make a filesystem specific issue into a problem that
> >>affects global behaviour.
> >
> >Umm, I don't think it will be different than currently. Currently btrfs
> >dirty metadata is accounted as dirty page cache because they have this
> >virtual fs inode which owns all metadata pages.

Yes, so effectively they are treated the same as file data pages
w.r.t. throttling, writeback and reclaim....

> >It is pretty similar to
> >e.g. ext2 where you have bdev inode which effectively owns all metadata
> >pages and these dirty pages account towards the dirty limits. For ext4
> >things are more complicated due to journaling and thus ext4 hides the fact
> >that a metadata page is dirty until the corresponding transaction is
> >committed.  But from that moment on dirty metadata is again just a dirty
> >pagecache page in the bdev inode.

Yeah, though those filesystems don't suffer from the uncontrolled
explosion of metadata that btrfs is suffering from, so simply
treating them as another dirty inode that needs flushing works just
fine.

> >So current Josef's patch just splits the counter in which btrfs metadata
> >pages would be accounted but effectively there should be no change in the
> >behavior.

Yup, I missed the addition to the node_pagecache_reclaimable() that
ensures reclaim sees the same number or dirty pages...

> >It is just a question whether this approach is workable in the
> >future when they'd like to track different objects than just pages in the
> >counter.

I don't think it can. Because the counters directly influences the
page lru reclaim scanning algorithms, it can only be used to
account for pages that are in the LRUs. Other objects like slab
objects need to be accounted for and reclaimed by the shrinker
infrastructure.

Accounting for metadata writeback is a different issue - it could
track slab objects if we wanted to, but the issue is that these are
often difficult to determine the amount of IO needed to clean them
so generic balancing is hard. (e.g. effect of inode write
clustering).

> +1 to what Jan said.  Btrfs's dirty metadata is always going to
> affect any other file systems in the system, no matter how we deal
> with it.  In fact it's worse with our btree_inode approach as the
> dirtied_when thing will likely screw somebody and make us skip
> writing out dirty metadata when we want to.

XFS takes care of metadata flushing with a periodic background work
controlled by /proc/sys/fs/xfs/xfssyncd_centisecs. We trigger both
background async inode reclaim and background dirty metadata
flushing from this (run on workqueues) if the system is idle or
hasn't had some other trigger fire to run these sooner.  It works
well enough that I can't remember the last time someone asked a
question about needing to tune this parameter, or had a problem that
required tuning it to fix....

> At least with this
> framework in place we can start to make the throttling smarter, so
> say make us flush metadata if that is the bigger % of the dirty
> pages in the system.  All I do now is move the status quo around, we
> are no worse, and arguably better with these patches than we were
> without them.

Agreed - I misread part of the patch, so I was a little off-target.

> >OK, thanks for providing the details about XFS. So my idea was (and Josef's
> >patches seem to be working towards this), that filesystems that decide to
> >use the generic throttling, would just account the dirty metadata in some
> >counter. That counter would be included in the limits checked by
> >balance_dirty_pages(). Filesystem that don't want to use generic throttling
> >would have the counter 0 and thus for them there'd be no difference. And in
> >appropriate points after metadata was dirtied, filesystems that care could
> >call balance_dirty_pages() to throttle the process creating dirty
> >metadata.
> >
> >>I can see how tracking of information such as the global amount of
> >>dirty metadata is useful for diagnostics, but I'm not convinced we
> >>should be using it for globally scoped external control of deeply
> >>integrated and highly specific internal filesystem functionality.
> >
> >You are right that when journalling comes in, things get more complex. But
> >btrfs already uses a scheme similar to the above and I believe ext4 could
> >be made to use a similar scheme as well. If you have something better for
> >XFS, then that's good for you and we should make sure we don't interfere
> >with that. Currently I don't think we will.
> 
> XFS doesn't have the problem that btrfs has, so I don't expect it to
> take advantage of this.  Your writeback throttling is tied to your
> journal transactions, which are already limited in size.  Btrfs on
> the other hand is only limited by the amount of memory in the
> system, which is why I want to leverage the global throttling code.
> Btrfs doesn't have the benefit of a arbitrary journal size
> constraint on it's dirty metadata, so we have to rely on the global
> limits to make sure we're kept in check.

Though you do have a transaction reservation system, so I would
expect that be the place where a dirty metadata throttling could be
applied sanely. i.e. if there's too much dirty metadata in the
system, then you can't get a reservation to dirty more until some
of the metadata has been cleaned.

> The only thing my patches
> do is allow us to account for this separately and trigger writeback
> specifically.  Thanks,

I'm not sure that global metadata accounting makes sense if there is
no mechanism provided for per-sb throttling based purely on metadata
demand. Sure, global dirty metadata might be a way of measuring how
much metadata all the active filesystems have created, but then
you're going to throttle filesystems with no dirty metadata because
some other filesystem is being a massive hog.

I suspect that for a well balanced system we are going to need some
kind of "in-between" solution that shares the dirty metadata limits
across all bdis somewhat fairly similar to file data. How that ties
into the bdi writeback rate estimates and other IO/dirty page
throttling feedback loops is an interesting question - to me this
isn't as obviously simple as it simply separating out metadata
accounting...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html