Lever, Charles <Charles.Lever@xxxxxxxxxx> wrote:

> 1. attributes cached on the disk are either up to date, or clearly out
>    of date (otherwise there's no way to tell whether cached data is stale
>    or not), and

We have to trust the netfs to know when an inode is obsolete. That's why
cachefs calls back to the netfs to validate the inodes it finds. With AFS
this checks the vnode version number and the data version number.

> in fact, you don't need to maintain data coherency up to the very last
> moment, since the client is pushing data to the server for permanent
> storage. cached data in the local backing FS can be out of date after a
> client reboot without any harm whatever, so it doesn't matter a wit that
> the on-disk state of the backing FS trails the page cache.

True, but also consider the fact that if a netfs wants to throw a page into
the cache, it must keep that page around long enough for us to write it to
disk. So if the user is grabbing a file, say, twice the size of the maximum
pagecache size, being too lazy will hold up the read as the VM then tries
to eject pages that are pending being written to the cache.

Actually, the best way to do this would be to get the VM involved in the
caching, I think. Currently, the netfs has to issue a write to the disk,
and there're only certain points at which it's able to do that:

 - readpage completion
 - page release
 - writepage (if the page is altered locally)

The one that's going to impact performance least is when the netfs finishes
reading a page.

Getting the VM involved would allow the VM to batch up writes to the cache
and to predict better when to do the writes. One of the problems I've got
is that I'd like to be able to gang up writes to the cache, but that's
difficult as the pages tend to be read individually across the network, and
thus complete individually.

Furthermore, consider the fact that the netfs records state tracking
information in the cache (such as AFS's data version). This must be
modified after the changed pages are written to the cache (or deleted from
it), lest you end up with old data for the version specified.

> (of course you do need to sync up completely with the server if you
> intend to use CacheFS for disconnected operation, but that can be
> handled by "umount" rather than keeping strict data coherency all the
> time).

Disconnected operation is a whole 'nother kettle of miscellaneous swimming
things.

One of the reasons I'd like to move to a wandering tree is that it makes
data journalling almost trivial; and if the tree is arranged correctly, it
makes it possible to get a free inode update too - thus allowing the netfs
coherency data to be updated simultaneously.

> it also doesn't matter if the backing FS can't keep up with the server.
> the failure mode can be graceful, so that as the backing FS becomes
> loaded, it passes more requests back to the server and caches less data
> and fewer requests. this is how it works when there is more data to
> cache than there is space to cache it; it should work the same way if
> the I/O rate is higher than the backing FS can handle.

True. I've defined the interface to return -ENOBUFS if we can't cache a
file right now, or just to silently drop the thing and tell the netfs we
did it. The read operation would then return -ENODATA next time, thus
indicating that we need to fetch it from the server again.
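To make the intended semantics of that concrete, here's a rough sketch of
how a netfs read path might consume such an interface. The function names
and the cookie type are invented for the illustration - they aren't the
actual cachefs API, just a picture of the error handling described above:

/*
 * Illustrative sketch only: cachefs_read_page(), cachefs_write_page(),
 * netfs_fetch_from_server() and struct cache_cookie are invented names
 * standing in for whatever the real hooks end up being called.
 */
#include <linux/errno.h>

struct page;
struct cache_cookie;

extern int cachefs_read_page(struct cache_cookie *cookie, struct page *page);
extern int cachefs_write_page(struct cache_cookie *cookie, struct page *page);
extern int netfs_fetch_from_server(struct page *page);

static int netfs_readpage_via_cache(struct cache_cookie *cookie,
				    struct page *page)
{
	int ret;

	ret = cachefs_read_page(cookie, page);
	switch (ret) {
	case 0:
		/* Cache hit: the page was filled from the backing cache */
		return 0;

	case -ENODATA:
		/* The cache has no copy (perhaps it silently dropped an
		 * earlier store), so fetch from the server and offer the
		 * page to the cache again; a refusal (-ENOBUFS) on the
		 * store is simply ignored */
		ret = netfs_fetch_from_server(page);
		if (ret == 0)
			cachefs_write_page(cookie, page);
		return ret;

	case -ENOBUFS:
		/* The cache can't service us right now; behave as if
		 * there were no cache at all */
		return netfs_fetch_from_server(page);

	default:
		return ret;
	}
}

The point being that the netfs always has the server to fall back on, so
the cache is free to refuse or quietly discard things whenever it's under
pressure.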
The best way to do it is probably to have hysteresis on allocation for
insertion: suspend insertion if the number of free blocks falls below a
lower limit, and re-enable insertion if the number of free blocks rises
above a higher limit. Then set the culler running if we drop below the
higher limit. And then, if insertion is suspended, we start returning
-ENOBUFS on requests to cache something.

Not only that, but if a netfs wants to update a block, we can also return
-ENOBUFS and steal the block that held the old data (with a wandering tree
that's fairly easy to do). The stolen block can then be laundered and made
available to the allocator again.

> > Actually, probably the biggest bottleneck is the disk block allocator.
>
> in my experience with the AFS cache manager, this is exactly the
> problem. the ideal case is where the backing FS behaves a lot like swap
> -- just get the bits down onto disk in any location, without any
> sophisticated management of free space. the problem is keeping track of
> the data blocks during a client crash/reboot.

Which is something swap space doesn't need to worry about: it's
reinitialised on boot, so filesystem integrity is not an issue. If we don't
care about integrity, life is easy.

The main problem in the allocator is one of tentative allocation versus the
journal update that moves the free list pointer. If I can hold off on the
latter, or just discard the former and send the tentative block for
relaundering, then I can probably reduce the serialisation problems.

> the real problem arises when the cache is full and you want to cache a
> new file. the cache manager must choose a file to reclaim, release all
> the blocks for that file, then immediately reallocate them for the new
> file. all of this is synchronous activity.

Not exactly. I plan to have cachefs anticipate the need by keeping a float
of free blocks. Whilst this reduces the utilisation of the cache, it should
decrease the allocator latency.

> are there advantages to a log-structured file system for this purpose?

Yes, but there are a lot more problems, and the problems increase with
cache size:

 (1) You need to know what's where in the cache; that means scanning the
     cache on mount. You could offset this by storing your map in the block
     device and only scanning on power failure (when the blockdev wasn't
     properly unmounted).

 (2) You need to keep a map in memory. I suppose you could keep the map on
     disk and rebuild it on umount and mount after power failure, but this
     does mean scanning the entire block device.

 (3) When the current point in the cache catches up with the tail, what do
     you do? Do you just discard the block at the tail? Or do you move it
     down if it's not something you want to delete yet? (And how do you
     decide which?)

     This will potentially have the effect of discarding regularly used
     items from the cache at regular intervals; particularly if someone
     uses a data set larger than the size of the cache.

 (4) How do you decide where the current point is? This depends on whether
     you're willing to allow pages to overlap "page boundaries" or not.

You could take a leaf out of JFFS2's book and divide the cache into erase
blocks, each of several pages. This would then cut down on the amount of
scanning you need to do, and would make handling small files trivial.

If you can get this right it would be quite cute, but it would make
handling of pinned files awkward: you can't throw away anything that's
pinned, but must slide it down instead.
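To make the pinning problem concrete, here's a very rough sketch of what
the tail-compaction step ends up looking like once pinned data has to be
copied forward rather than discarded. All the structures and helpers are
invented for the example - this is just an illustration of the three cases,
not anything that exists in cachefs:

/*
 * Illustrative sketch: tail compaction for a hypothetical log-structured
 * cache split into JFFS2-style erase blocks.
 */
struct cache;

struct erase_block {
	unsigned int	nr_pages;	/* pages held in this erase block */
	unsigned int	nr_pinned;	/* how many of them are pinned */
};

extern int page_is_pinned(struct erase_block *blk, unsigned int n);
extern void copy_page_to_head(struct cache *cache, struct erase_block *blk,
			      unsigned int n);
extern void recycle_block(struct cache *cache, struct erase_block *blk);
extern void skip_block(struct cache *cache, struct erase_block *blk);

static void compact_tail_block(struct cache *cache, struct erase_block *blk)
{
	unsigned int n;

	if (blk->nr_pinned == 0) {
		/* Nothing pinned: discard the whole block and reuse it at
		 * the head of the log */
		recycle_block(cache, blk);
		return;
	}

	if (blk->nr_pinned == blk->nr_pages) {
		/* Fully pinned: don't pay to copy data that can't be
		 * discarded anyway - just step over the block */
		skip_block(cache, blk);
		return;
	}

	/* Partially pinned: the pinned pages must be slid down to the head
	 * of the log before the block can be reused - the expensive case */
	for (n = 0; n < blk->nr_pages; n++)
		if (page_is_pinned(blk, n))
			copy_page_to_head(cache, blk, n);

	recycle_block(cache, blk);
}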
Now imagine half your cache is pinned - you're potentially going to end up
spending a lot of time shovelling stuff down, unless you can skip blocks
that are fully pinned.

> is there a good way to trade disk space for the performance of your
> block allocator?

Potentially. If I can keep a list of tentative allocations and add that to
the journal, then it's easy to zap them during replay. It does, however,
complicate journalling.

> in fact, with an infinitely fast server and network, there would be no
> need for local caching at all. so maybe that's not such an interesting
> thing to consider.

Not really. The only thing it guards against is the server becoming
unavailable.

> it might be more appropriate to design, configure, and measure CacheFS
> with real typical network and server latency numbers in mind.

Yes. What I'm currently using as the basis of my design is accessing a
kernel source tree over the network. That's on the order of 22000 files
these days and 320MB of disk space; that's an average occupancy of about
14.5KB of space per file.

As an alternative load, I consider what it would take to cache /usr. That's
about 373000 files on my box and 11GB of disk space; that's about 29KB per
file.

> david, what is the behavior when the file that needs to be cached is
> larger than the backing file system? for example, what happens when
> some client application starts reading a large media file that won't fit
> entirely in the cache?

Well, it depends. If it's a large sparse file that we're only going to grab
a few blocks from, then it's not a problem; but if it's something we're
going to read all of, then obviously we've got a problem.

I think I have to set a limit on the maximum number of blocks a file can
occupy in the cache. Beyond that, if a file occupies all the blocks of
cache it can, then we have to refuse allocation of more blocks for it.

What happens then depends. If the file is officially pinned by the user, we
can't actually get rid of it, and all we can do is give them rude error
messages. If it's merely pinned by virtue of being held by an fscache
cookie, then we could just keep it around until the cookie is released, or
we could just start silently recycling the space it's currently occupying
whilst returning -ENOBUFS to all further attempts to cache more of that
file.

But unless the user gives us a hint, we can't judge in advance what the
best action is. I'd consider O_DIRECT as being a hint, of course, but we
may want to make other options available.

David