On Thu, 20 Jan 2011 14:12:33 +0800 Shaohua Li <shaohua.li@xxxxxxxxx> wrote:

> On Thu, 2011-01-20 at 13:55 +0800, Andrew Morton wrote:
> > On Thu, 20 Jan 2011 13:38:18 +0800 Shaohua Li <shaohua.li@xxxxxxxxx> wrote:
> >
> > > > ext2, minix and probably others create an address_space for each
> > > > directory.  Heaven knows what xfs does (for example).
> > >
> > > Yes, this is for one directory, but all the files' metadata are in
> > > the block_dev address_space.  I thought you meant there are several
> > > block_dev-like address_spaces in some filesystems, which wouldn't
> > > fit well in my implementation.  For ext-like filesystems there is
> > > only one address_space; for filesystems with several address_spaces,
> > > my proposal is to map them to one big virtual address_space in the
> > > new ioctls.
> >
> > ext2 and minixfs (and I think sysv and ufs) have a separate
> > address_space for each directory.  I don't see how those can be
> > represented with a single "virtual big address_space" - we also need
> > identifiers in there so each directory's address_space can be created
> > and appropriately populated.
>
> Oh, I misunderstood your comment.  You are right: the ioctl methods
> don't work for ext2, and the directories' address_spaces can't be read
> ahead either.  It looks like we can only do the metadata readahead in
> a filesystem-specific way.

Another way of doing all this would be to implement some sort of
lookaside cache at the vfs->block boundary.  At boot time, load that
cache up with all the disk blocks which we know the boot will need (a
single ascending pass across the disk), and then when the vfs/fs goes
to read a disk block, take a peek in that cache first; if it's a hit,
either steal the page or memcpy it.

This has an obvious coherence problem, which would be pretty simple to
solve by hooking into the block write path as well.  The list of needed
blocks can be generated very simply with the existing blktrace
infrastructure.

It does add permanent runtime overhead: once the cache has been
invalidated and disabled, every IO operation still incurs a
test-and-not-taken branch.  Maybe not too bad.

We'd also need to handle small-memory systems somehow, where the cache
simply ooms the machine or becomes ineffective because it causes
eviction elsewhere.

It could perhaps all be implemented as an md or dm driver, or even as
an IO scheduler.  I say that because IO schedulers can be replaced
on-the-fly, so the caching layer could be unloaded from the stack once
it is finished with.
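
For the sake of argument, here's a toy userspace model of that
lookaside cache.  Every name in it is invented for this sketch; a real
implementation would hang the read and write hooks off the block layer
(somewhere around submit_bio(), say) rather than use malloc'ed buckets:

/*
 * Userspace model of the boot-time lookaside cache sketched above.
 * All names here are made up for illustration; this is not kernel code.
 * Cached blocks are keyed by block number in a fixed-size hash table.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_SIZE	4096
#define HASH_BUCKETS	1024

struct cached_block {
	unsigned long long blocknr;
	char data[BLOCK_SIZE];
	struct cached_block *next;
};

static struct cached_block *buckets[HASH_BUCKETS];
static int cache_enabled;

static unsigned hash(unsigned long long blocknr)
{
	return blocknr % HASH_BUCKETS;
}

/* Boot-time load phase: populate the cache in one ascending disk pass. */
static void cache_insert(unsigned long long blocknr, const char *data)
{
	struct cached_block *b = malloc(sizeof(*b));

	b->blocknr = blocknr;
	memcpy(b->data, data, BLOCK_SIZE);
	b->next = buckets[hash(blocknr)];
	buckets[hash(blocknr)] = b;
	cache_enabled = 1;
}

/*
 * Read-path hook: peek in the cache first.  Returns 1 and copies the
 * block on a hit, 0 on a miss (caller falls back to real disk IO).
 */
static int cache_read(unsigned long long blocknr, char *out)
{
	struct cached_block *b;

	if (!cache_enabled)	/* the permanent test-and-not-taken branch */
		return 0;
	for (b = buckets[hash(blocknr)]; b; b = b->next) {
		if (b->blocknr == blocknr) {
			memcpy(out, b->data, BLOCK_SIZE);
			return 1;
		}
	}
	return 0;
}

/* Write-path hook: drop any cached copy so stale data is never served. */
static void cache_invalidate(unsigned long long blocknr)
{
	struct cached_block **p = &buckets[hash(blocknr)];

	while (*p) {
		if ((*p)->blocknr == blocknr) {
			struct cached_block *dead = *p;
			*p = dead->next;
			free(dead);
			return;
		}
		p = &(*p)->next;
	}
}

int main(void)
{
	char block[BLOCK_SIZE] = "superblock contents";
	char out[BLOCK_SIZE];

	cache_insert(0, block);				/* boot-time preload */
	printf("hit: %d\n", cache_read(0, out));	/* 1: served from cache */
	cache_invalidate(0);				/* block was rewritten */
	printf("hit: %d\n", cache_read(0, out));	/* 0: fall back to disk */
	return 0;
}

The cache_enabled test up front is the permanent per-IO cost mentioned
above: once boot has finished and the cache is torn down, clearing that
flag reduces the whole scheme to one predictable branch per request.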