On Wed, Jan 03, 2018 at 05:26:03PM +0100, Jan Kara wrote:
> On Wed 03-01-18 10:49:33, Josef Bacik wrote:
> > On Wed, Jan 03, 2018 at 02:59:21PM +0100, Jan Kara wrote:
> > > On Wed 03-01-18 13:32:19, Dave Chinner wrote:
> > > > On Tue, Jan 02, 2018 at 11:13:06AM -0500, Josef Bacik wrote:
> > > > > On Wed, Dec 20, 2017 at 03:30:55PM +0100, Jan Kara wrote:
> > > > > > On Wed 20-12-17 08:35:05, Dave Chinner wrote:
> > > > > > > On Tue, Dec 19, 2017 at 01:07:09PM +0100, Jan Kara wrote:
> > > > > > > > On Wed 13-12-17 09:20:04, Dave Chinner wrote:
> > > > > > > > > IOWs, treating metadata like it's one great big data inode doesn't seem to me to be the right abstraction to use for this - in most filesystems it's a bunch of objects with a complex dependency tree and unknown write ordering, not an inode full of data that can be sequentially written.
> > > > > > > > >
> > > > > > > > > Maybe we need multiple ops with well-defined behaviours, e.g. ->writeback_metadata() for background writeback and ->sync_metadata() for sync-based operations. That way different filesystems can ignore the parts they don't need simply by not implementing those operations, and the writeback code doesn't need to try to cater for all operations through the one op. The writeback code should be cleaner, the filesystem code should be cleaner, and we can tailor the work guidelines for each operation separately so there's less mismatch between what writeback is asking and how filesystems track dirty metadata...

> > > > > > > > I agree that writeback for memory cleaning and writeback for data integrity are two very different things, especially for metadata. In fact, for data integrity writeback we already have the ->sync_fs operation, so there the functionality gets duplicated. What we could do is that in writeback_sb_inodes() we'd call ->write_metadata only when work->for_kupdate or work->for_background is set. That way ->write_metadata would be called only for memory cleaning purposes.

> > > > > > > That makes sense, but I still think we need a better indication of how much writeback we need to do than just "writeback this chunk of pages". That "writeback a chunk" interface is necessary to share writeback bandwidth across numerous data inodes so that we don't starve any one inode of writeback bandwidth. That's unnecessary for metadata writeback on a superblock - we don't need to share that bandwidth around hundreds or thousands of inodes. What we actually need to know is how much writeback we need to do as a total of all the dirty metadata on the superblock.
> > > > > > >
> > > > > > > Sure, that's not ideal for btrfs and maybe ext4, but we can write a simple generic helper that converts "flush X percent of dirty metadata" to a page/byte chunk as the current code does. Doing it this way allows filesystems to completely internalise the accounting that needs to be done, rather than trying to hack around a writeback accounting interface with large impedance mismatches to how the filesystem accounts for dirty metadata and/or tracks writeback progress.
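A rough sketch of what such a generic helper could look like, for illustration only: the per-sb counter sb_metadata_dirty_bytes() is an assumption (the series only adds the account_metadata_dirtied()-style accounting discussed further down), and the helper name and the 1024-page cap are made up to mirror the per-inode data chunking mentioned below, not code from the patches. Assumes linux/fs.h and linux/math64.h.

/*
 * Illustrative only: turn "flush @percent of this sb's dirty metadata"
 * into a page count that ->write_metadata() implementations working in
 * pages (ext4, btrfs) could use directly.  Overflow is ignored here.
 */
static long metadata_writeback_chunk(struct super_block *sb,
				     unsigned int percent)
{
	u64 dirty_bytes = sb_metadata_dirty_bytes(sb);	/* assumed per-sb counter */
	u64 bytes = div_u64(dirty_bytes * percent, 100);
	long pages = DIV_ROUND_UP_ULL(bytes, PAGE_SIZE);

	/*
	 * Cap the work done in one invocation, similar to the 1024-page
	 * per-inode chunk used for data inodes.
	 */
	return min_t(long, pages, 1024);
}

ext4 and btrfs could then work off the returned page count, while XFS would be free to ignore it and only target its own log/AIL state.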
> > > > > > Let me think out loud about how we could tie this into how memory cleaning writeback currently works - the one with for_background == 1, which is generally used to get the amount of dirty pages in the system under control. We have a queue of inodes to write, we iterate over this queue and ask each inode to write some amount (e.g. 64 MB - the exact amount depends on measured

> > > > It's a maximum of 1024 pages per inode.

> > > That's actually a minimum, not a maximum, if I read the code in writeback_chunk_size() right.

> > > > > > writeback bandwidth etc.). Some amount from that inode gets written and we continue with the next inode in the queue (putting this one at the end of the queue if it still has dirty pages). We do this until:
> > > > > >
> > > > > > a) the number of dirty pages in the system is below the background dirty limit and the number of dirty pages for this device is below the background dirty limit for this device,
> > > > > > b) we run out of dirty inodes on this device, or
> > > > > > c) someone queues a different type of writeback.
> > > > > >
> > > > > > And we need to somehow incorporate metadata writeback into this loop. I see two questions here:
> > > > > >
> > > > > > 1) When / how often should we ask for metadata writeback?
> > > > > > 2) How much should we ask to write in one go?
> > > > > >
> > > > > > The second question is especially tricky in the presence of completely async metadata flushing in XFS - we can ask to write, say, half of the dirty metadata, but then we have no idea whether the next observation of the dirty metadata counters is with that part of metadata already under writeback / cleaned, or whether xfsaild didn't even start working and pushing more makes no sense.

> > > > Well, like with ext4, we've also got to consider that a bunch of the recently dirtied metadata (e.g. from delalloc, EOF updates on IO completion, etc.) is still pinned in memory because the journal has not been flushed/checkpointed. Hence we should not be attempting to write back metadata we've dirtied as a result of writing data in the background writeback loop.

> > > Agreed. Actually, for ext4 I would not expose 'pinned' buffers as dirty to the VM - the journalling layer currently already works that way and it works well for us. But that's just a small technical detail and different filesystems can decide differently.

> > > > That greatly simplifies what we need to consider here. That is, we just need to sample the ratio of dirty metadata to clean metadata before we start data writeback, and we calculate the amount of metadata writeback we should trigger from there. We only need to do this *once* per background writeback scan for a superblock as there is no need for sharing bandwidth between lots of data inodes - there's only one metadata inode for ext4/btrfs, and XFS is completely async....

> > > OK, agreed again.
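To make that concrete, a sketch of the gating and the once-per-scan sampling from the writeback side. struct wb_writeback_work and its for_background/for_kupdate flags are the existing fs-writeback internals and ->write_metadata is the op this series adds, but the helper below and sb_metadata_dirty_bytes() are illustrative assumptions; the actual call into ->write_metadata() is left out because its signature is whatever the series defines.

/*
 * Illustrative only (would sit in fs/fs-writeback.c): decide whether a
 * memory-cleaning scan of @sb should also push metadata.  Integrity
 * writeback keeps going through ->sync_fs() as today, and the dirty
 * metadata counters are sampled once per scan, not once per inode.
 */
static bool want_metadata_writeback(struct super_block *sb,
				    struct wb_writeback_work *work)
{
	if (!work->for_background && !work->for_kupdate)
		return false;		/* not memory-cleaning writeback */
	if (!sb->s_op->write_metadata)
		return false;		/* fs has no metadata hook */

	return sb_metadata_dirty_bytes(sb) > 0;	/* assumed per-sb counter */
}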
> > > > > > Partly, this could be dealt with by telling the filesystem a "metadata dirty target" - i.e. "get your dirty metadata counters below X" - and whether we communicate that in bytes, pages, or a fraction of the current dirty metadata counter value is a detail I don't have a strong opinion on now. And the fact is the amount written by the filesystem doesn't have to be very accurate anyway - we basically just want to make some forward progress with writing metadata, we don't want that to take too long (so that other writeback from the thread isn't stalled), and if the writeback code is unhappy about the state of the counters the next time it looks, it will ask the filesystem again...

> > > > Right. The problem is communicating "how much" to the filesystem in a useful manner....

> > > Yep. I'm fine with communication in the form of 'write X% of your dirty metadata'. That should be useful for XFS and, as you mentioned in some previous email, we can provide a helper function to compute the number of pages to write (including some reasonable upper limit to bound the time spent in one ->write_metadata invocation) for ext4 and btrfs.

> > > > > > This gets me directly to another problem with the async nature of XFS metadata writeback. That is that it could get the writeback thread into a busyloop - we are supposed to terminate memory cleaning writeback only once the dirty counters are below the limit, and if dirty metadata is causing the counters to be over the limit, we would just ask XFS in a loop to get metadata below the target. I suppose XFS could just return "nothing written" from its ->write_metadata operation and in such a case we could sleep a bit before going for another writeback loop (the same thing happens when a filesystem reports all inodes are locked / busy and it cannot write back anything). But it's getting a bit ugly, and is it really better than somehow waiting inside XFS for metadata writeback to occur? Any idea Dave?

> > > > I tend to think that the whole point of background writeback is to do it asynchronously and keep the IO pipe full by avoiding blocking on any specific object. i.e. if we can't do writeback from this object, then skip it and do it from the next....

> > > Agreed.

> > > > I think we could probably block ->write_metadata if necessary via a completion/wakeup style notification when a specific LSN is reached by the log tail, but realistically if there's any amount of data needing to be written it'll throttle data writes because the IO pipeline is being kept full by background metadata writes....

> > > So the problem I'm concerned about is a corner case. Consider a situation where you have no dirty data, only dirty metadata, but enough of it to trigger background writeback. How should metadata writeback behave for XFS in this case? Who should be responsible for making sure that wb_writeback() does not just loop invoking ->write_metadata() as fast as the CPU allows until xfsaild makes enough progress?
> > >
> > > Thinking about this today, I think this looping prevention belongs in wb_writeback(). Sadly we don't have much info to decide how long to sleep before trying more writeback, so we'd have to just sleep for <some_magic_amount> if we found no writeback happened in the last writeback round before going through the whole writeback loop again. And ->write_metadata() for XFS would need to always return 0 (as in "no progress made") to make sure this busyloop avoidance logic in wb_writeback() triggers. ext4 and btrfs would return the number of bytes written from ->write_metadata (or just 1 would be enough to indicate some progress in metadata writeback was made and busyloop avoidance is not needed).
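The backoff in wb_writeback() could be as dumb as the sketch below - purely illustrative, with 100ms standing in for <some_magic_amount>, and "progress" being whatever the round's writeback_sb_inodes() plus ->write_metadata() reported.

/*
 * Illustrative only: called at the end of a memory-cleaning round in
 * wb_writeback().  If nothing was written - e.g. only metadata is dirty
 * and XFS's ->write_metadata() reported 0 because it merely nudged
 * xfsaild - sleep briefly instead of spinning on the dirty counters.
 */
static void wb_metadata_backoff(long progress)
{
	if (progress)
		return;
	/* 100ms is a placeholder for <some_magic_amount>, not a tuned value */
	schedule_timeout_interruptible(msecs_to_jiffies(100));
}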
> > > So overall I think I have a pretty clear idea of how this all should work to make ->write_metadata useful for btrfs, XFS, and ext4, and we agree on the plan.

> > I'm glad you do, I'm still confused. I'm totally fine with sending a % to the fs to figure out what it wants; what I'm confused about is how to get that % for xfs. Since xfs doesn't mark its actual buffers dirty, and so wouldn't use account_metadata_dirtied() and its family, how do we generate this % for xfs? Or am I misunderstanding and you do plan to use those helpers?

> AFAIU he plans to use account_metadata_dirtied() & co. in XFS.

> > If you do plan to use them, then we just need to figure out what we want the ratio to be of, and then you'll be happy, Dave?

> A reasonably natural dirty target would be dirty_background_ratio of the total metadata amount being dirty. We would have to be somewhat creative if dirty_background_bytes is actually set instead of dirty_background_ratio, and use a ratio like dirty_background_bytes / (data + metadata amount), but it's doable...

Oh OK, well if that's the case then I'll fix this up to be a ratio, test everything, and send it along probably early next week. Thanks,

Josef
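Roughly what that ratio computation could look like, for illustration only: md_dirty/md_total/data_total stand for whatever the account_metadata_dirtied() accounting ends up exposing (the names here are made up), the dirty_background_bytes branch just follows the "data + metadata" suggestion above, and overflow handling is ignored.

/*
 * Illustrative only: how many bytes of metadata are above the background
 * target.  dirty_background_ratio / dirty_background_bytes are the
 * existing sysctls from linux/writeback.h.
 */
static u64 metadata_over_bg_target(u64 md_dirty, u64 md_total, u64 data_total)
{
	u64 target;

	if (dirty_background_bytes)
		target = div64_u64((u64)dirty_background_bytes * md_total,
				   data_total + md_total);
	else
		target = div64_u64(md_total * dirty_background_ratio, 100);

	return md_dirty > target ? md_dirty - target : 0;
}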