Re: [PATCH v3 1/5] add metadata_incore ioctl in vfs

Dave Chinner <david@xxxxxxxxxxxxx> · Mon, 24 Jan 2011 15:29:59 +1100

On Thu, Jan 20, 2011 at 01:44:57PM +0800, Shaohua Li wrote:
> On Thu, 2011-01-20 at 12:41 +0800, Dave Chinner wrote:
> > On Wed, Jan 19, 2011 at 08:10:14PM -0800, Andrew Morton wrote:
> > > On Thu, 20 Jan 2011 11:21:49 +0800 Shaohua Li <shaohua.li@xxxxxxxxx> wrote:
> > > 
> > > > > It seems to return a single offset/length tuple which refers to the
> > > > > btrfs metadata "file", with the intent that this tuple later be fed
> > > > > into a btrfs-specific readahead ioctl.
> > > > > 
> > > > > I can see how this might be used with say fatfs or ext3 where all
> > > > > metadata resides within the blockdev address_space.  But how is a
> > > > > filesystem which keeps its metadata in multiple address_spaces supposed
> > > > > to use this interface?
> > > > Oh, this looks like a big problem, thanks for letting me know such
> > > > filesystems. is it possible specific filesystem mapping multiple
> > > > address_space ranges to a virtual big ranges? the new ioctls handle the
> > > > mapping.
> > > 
> > > I'm not sure what you mean by that.
> > > 
> > > ext2, minix and probably others create an address_space for each
> > > directory.  Heaven knows what xfs does (for example).
> > 
> > In 2.6.39 it won't even use address spaces for metadata caching.
> > 
> > Besides, XFS already has pretty sophisticated metadata readahead
> > built in - it's one of the reasons why the XFS directory code scales
> > so well on cold cache lookups of large directories - so I don't see
> > much need for such an interface for XFS.
> > 
> > Perhaps btrfs would be better served by implementing speculative
> > metadata readahead in the places where it makes sense (e.g. readdir)
> > because it will improve cold-cache performance on a much wider range
> > of workloads than at just boot-time....
> I don't know about xfs. A sophisticated metadata readahead might make
> metadata async, but I thought it's impossible for it to remove the disk
> seek.

Nothing you do will remove the disk seek. What readahead is supposed
to do is _minimise the latency_ of the disk seek.

> Since metadata and data usually live in different disk block
> ranges, doing data readahead will unavoidably read metadata and cause
> disk seeks between reading data and metadata.

Which comes back to how well the filesystem lays out the metadata
related to the data that needs to be read. In the case of XFS, the
metadata it needs is already in the inode, so once the inodes are
read into memory, there are no extra metadata seeks between data
seeks.

That is, if you are using XFS all you need to do in terms of
metadata readahead is stat every file needed by the boot process.
The optimal order for doing this is simply ascending inode number.
IOWs, the problem can be optimised without
any special kernel interfaces to do metadata readahead, especially
if you multithread the stat() walk to avoid blocking on IO that
metadata readahead hasn't already brought into cache....
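
As a rough userspace sketch of the stat() walk described above (not
taken from any existing tool): collect inode numbers from readdir,
sort ascending, then stat from several threads. The function names
collect_entries and warm_metadata, the worker count, and the choice
to walk a whole directory tree rather than a recorded list of boot
files are all assumptions made for illustration.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def collect_entries(root):
    """Return (inode, path) pairs for everything under root.

    Inode numbers come from the directory entries themselves via
    DirEntry.inode(), so on most Unix filesystems no stat() is needed
    at this stage. (is_dir() may still stat when the filesystem does
    not report entry types.)
    """
    entries = []
    stack = [root]
    while stack:
        directory = stack.pop()
        try:
            with os.scandir(directory) as it:
                for entry in it:
                    entries.append((entry.inode(), entry.path))
                    if entry.is_dir(follow_symlinks=False):
                        stack.append(entry.path)
        except OSError:
            # Unreadable directory: skip it, this is only cache warming.
            continue
    return entries


def warm_metadata(root, workers=8):
    """lstat() every entry under root in ascending inode order.

    The worker pool (size is an arbitrary assumption) keeps several
    lookups in flight so one blocked stat does not stall the walk.
    Returns the number of entries successfully stat'ed.
    """
    entries = sorted(collect_entries(root))  # ascending inode number

    def touch(item):
        try:
            os.lstat(item[1])
            return 1
        except OSError:
            return 0

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(touch, entries))
```

Something like warm_metadata("/usr/lib") early in boot would be the
intended use; whether it helps depends on how closely inode number
tracks on-disk location for the filesystem in question.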

IIRC, btrfs tends to keep all its per-inode metadata close together
like XFS does, so it should be read at the same time the inode is
read.

Indeed, the dependencies of readahead are pretty well understood.
Optimising the reading of file data across a complex directory
hierarchy is well demonstrated by this little tool from
Chris Mason:

http://oss.oracle.com/~mason/acp/

I suspect that applying such a technique to the problem of optimising
boot-time IO pattern will net you the same gains as this new kernel
API will. And it will do it in a manner that is filesystem
agnostic...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx