Re: [PATCH 0/6] writeback time order/delay fixes take 3

Fengguang Wu <wfg@xxxxxxxxxxxxxxxx> · Fri, 24 Aug 2007 21:24:58 +0800

On Wed, Aug 22, 2007 at 08:42:01AM -0400, Chris Mason wrote:
> > My vague idea is to
> > - keep the s_io/s_more_io as a FIFO/cyclic writeback dispatching
> >   queue.
> > - convert s_dirty to some radix-tree/rbtree based data structure.
> >   It would have dual functions: delayed-writeback and
> >   clustered-writeback.
> >
> > clustered-writeback:
> > - Use inode number as clue of locality, hence the key for the sorted
> >   tree.
> > - Drain some more s_dirty inodes into s_io on every kupdate wakeup,
> >   but do it in the ascending order of inode number instead of
> >   ->dirtied_when. 
> > 
> > delayed-writeback:
> > - Make sure that a full scan of the s_dirty tree takes <=30s, i.e.
> >   dirty_expire_interval.
> 
> I think we should assume a full scan of s_dirty is impossible in the
> presence of concurrent writers.  We want to be able to pick a start
> time (right now) and find all the inodes older than that start time.
> New things will come in while we're scanning.  But perhaps that's what
> you're saying...

Yeah, I was thinking about elevators :)
Or call it sweeping based on an address hint (the inode number).
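
To make the elevator analogy concrete, here is a toy user-space sketch
(Python, not kernel code) of a one-directional sweep over dirty inodes
keyed by inode number. The names sweep_order() and cursor are made up
for illustration; nothing like them exists in the current writeback
code.

```python
import bisect


def sweep_order(dirty_inos, cursor):
    """Return dirty inode numbers in one-directional elevator order.

    Inodes at or above `cursor` are visited first, in ascending order;
    the sweep then wraps around to the lowest inode numbers.
    `cursor` is a hypothetical "where the last sweep stopped" marker.
    """
    inos = sorted(set(dirty_inos))
    split = bisect.bisect_left(inos, cursor)
    return inos[split:] + inos[:split]
```

For example, sweep_order([40, 7, 19, 3, 88], cursor=19) gives
[19, 40, 88, 3, 7]: ascending from the cursor, then wrapping around.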

> At any rate, we've got two types of lists now.  One keeps track of age
> and the other two keep track of what is currently being written.  I
> would try two things:
> 
> 1) s_dirty stays a list for FIFO.  s_io becomes a radix tree that
> indexes by inode number (or some arbitrary field the FS can set in the
> inode).  Radix tree tags are used to indicate which things in s_io are
> already in progress or are pending (hand waving because I'm not sure
> exactly).
> 
> inodes are pulled off s_dirty and the corresponding slot in s_io is
> tagged to indicate IO has started.  Any nearby inodes in s_io are also
> sent down.
> 
> 2) s_dirty and s_io both become radix trees.  s_dirty is indexed by a
> sequence number that corresponds to age.  It is treated as a big
> circular indexed list that can wrap around over time.  Radix tree tags
> are used both on s_dirty and s_io to flag which inodes are in progress.

Converting s_io to a radix tree would be pointless, because inodes on
s_io will normally be sent to the block layer elevators at the same time.

Also, s_dirty holds 30 seconds' worth of inodes, while s_io holds only
5 seconds' worth. The more inodes there are, the better the chances of
good clustering. That's the general rule.
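
A toy illustration of that rule (Python, user space, with made-up
numbers): write inodes out in batches, each batch sorted by inode
number, and add up how far the "head" travels in inode-number space.
head_travel() is a hypothetical helper, and inode-number distance is
only a rough stand-in for real seek cost.

```python
def head_travel(arrivals, batch):
    """Total |delta inode number| when each batch is written sorted.

    `arrivals` lists inode numbers in the order they were dirtied;
    `batch` is how many of them are collected before sorting.
    """
    order = []
    for i in range(0, len(arrivals), batch):
        order.extend(sorted(arrivals[i:i + batch]))
    return sum(abs(b - a) for a, b in zip(order, order[1:]))


arrivals = [50, 10, 40, 20, 30, 60]
small = head_travel(arrivals, 1)   # sort nothing: pure arrival order
large = head_travel(arrivals, 6)   # sort everything in one batch
```

With this input, small is 130 and large is 50. It is one example, not
a proof, but it shows why the larger pool is the better place to sort.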

s_dirty is the right place to do address-clustering.
As for the dirty_expire_interval parameter on dirty age,
we can apply a simple rule: do one full scan/sweep over the
fs-address-space every 30s, syncing all inodes encountered,
but sparing those dirtied less than 5s ago. With that rule,
any inode will get synced 5-35 seconds after being dirtied.
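
The 5-35 second bound can be checked with a small model (Python, user
space). It assumes an idealized sweep that moves at constant speed and
passes each position exactly once per 30s; sync_delay(), SWEEP_PERIOD
and MIN_AGE are names invented for this sketch only.

```python
import math

SWEEP_PERIOD = 30.0  # seconds per full sweep, as in the rule above
MIN_AGE = 5.0        # inodes dirtied more recently than this are spared


def sync_delay(dirtied_at, position):
    """Seconds between dirtying an inode and the pass that syncs it.

    `position` in [0, 1) is where the inode sits in the swept address
    space. The sweep passes it at (position + k) * SWEEP_PERIOD for
    every integer k (steady state). The inode is synced on the first
    pass at which it is at least MIN_AGE old.
    """
    first_pass = position * SWEEP_PERIOD
    earliest = dirtied_at + MIN_AGE
    k = math.ceil((earliest - first_pass) / SWEEP_PERIOD)
    return first_pass + k * SWEEP_PERIOD - dirtied_at
```

An inode at position 0.5 is passed at t=15, 45, ... If it was dirtied
at t=10 it is exactly 5s old at t=15 and gets synced then (delay 5s).
If it was dirtied at t=14 it is spared at t=15 and synced at t=45
(delay 31s). The delay approaches, but never reaches, 35s.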

-fengguang
