David Masover <ninja@xxxxxxxxxxxx> wrote:

> Does the cache call sync/fsync overly often?

Not at all.

> If not, we can gain something by using an underlying FS with lazy writes.

Yes, to some extent. There's still the problem of filesystem integrity to
deal with, and lazy writes hold up journal closure. This isn't necessarily a
problem, except when you want to delete and launder a block that has a write
hanging over it. It's not unsolvable, just tricky. Besides, what do you mean
by lazy?

Also consider: you probably want to start netfs data writes as soon as
possible, since not having cached the page yet restricts the netfs's
activities on that page; but you want to defer metadata writes as long as
possible, because they may become obsolete, it may be possible to batch them,
and it may be possible to merge them.

> I think the caching should be done asynchronously. As stuff comes in,
> it should be handed off both to the app requesting it and to a queue to
> write it to the cache. If the queue gets too full, start dropping stuff
> from it the same way you do from cache -- probably LRU or LFU or
> something similar.

That's not a bad idea; we need a rate limit on throwing stuff at the cache
when there's not much disk space available.

Actually, probably the biggest bottleneck is the disk block allocator. Given
that I'm using lists of free blocks, it's difficult to place a tentative
reservation on a block, and it very much favours allocating blocks for one
transaction at a time. However, free lists make block recycling a lot easier.

I could use a bitmap instead, but that requires every block allocated or
deleted to be listed in the journal. Not only that, it complicates deletion
and journal replay. Also, under worst-case conditions it's really nasty:
you could end up with a whole set of bitmaps, each with one free block,
which means you've got to read a whole lot of bitmaps to allocate the blocks
you require, and you have to modify several of them to seal an allocation.
Furthermore, you end up losing a chunk of space statically allocated to the
maintenance of these things, unless you want to allocate the bitmaps
dynamically also...

> Another question -- how much performance do we lose by caching, assuming
> that both the network/server and the local disk are infinitely fast?
> That is, how many cycles do we lose vs. local disk access? Basically,
> I'm looking for something that does what InterMezzo was supposed to --
> make cache access almost as fast as local access, so that I can replace
> all local stuff with a cache.

Well, with an infinitely fast disk and network, very little - you can afford
to be profligate in your turnover of disk space, and that affects the options
you might choose in designing your cache.

The real-world case is more interesting, as you have to compromise. CacheFS
as it stands attempts not to lose any data blocks and attempts not to return
uninitialised data, and these two constraints work counter to each other.

There's a second journal (the validity journal) to record blocks that have
been allocated but that don't yet have data stored in them. This permits
advance allocation, but requires a second update journal entry to clear the
validity journal entry after the data has been stored. It also requires the
validity journal to be replayed upon mounting.
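To make that ordering concrete, here's a minimal, self-contained sketch of
the constraint the validity journal enforces. All of the names and data
structures are invented for illustration and bear no relation to the real
CacheFS on-disk format; the point is simply that an allocation is journalled
as pending before any data is written, a second journal update retires the
entry afterwards, and replay at mount time launders anything still pending
back onto the free list:

/* Hypothetical userspace model of the validity journal ordering. */
#include <stdio.h>
#include <string.h>

#define NBLOCKS 8

enum vj_state { VJ_EMPTY, VJ_PENDING };		/* per-block validity state */

static enum vj_state vjournal[NBLOCKS];
static int block_free[NBLOCKS];
static char block_data[NBLOCKS][16];

static int alloc_block(void)
{
	for (int i = 0; i < NBLOCKS; i++) {
		if (block_free[i]) {
			block_free[i] = 0;
			/* journalled as pending before any data lands */
			vjournal[i] = VJ_PENDING;
			return i;
		}
	}
	return -1;
}

static void store_data(int blk, const char *data)
{
	strncpy(block_data[blk], data, sizeof(block_data[blk]) - 1);
	/* second journal update: the block now really holds data */
	vjournal[blk] = VJ_EMPTY;
}

/* mount-time replay: any entry still pending means the data write never
 * completed, so the block is recycled rather than exposed as cached data */
static void replay_validity_journal(void)
{
	for (int i = 0; i < NBLOCKS; i++) {
		if (vjournal[i] == VJ_PENDING) {
			vjournal[i] = VJ_EMPTY;
			block_free[i] = 1;
		}
	}
}

int main(void)
{
	for (int i = 0; i < NBLOCKS; i++)
		block_free[i] = 1;

	int a = alloc_block();
	store_data(a, "complete write");	/* both journal steps done */

	int b = alloc_block();			/* "crash" before its data write */

	replay_validity_journal();		/* b goes back to the free list */
	printf("block %d valid, block %d recycled\n", a, b);
	return 0;
}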
Reading one really big file (bigger than the memory available) over AFS took,
with a cold cache, very roughly 107% of the time it took with no cache; but
with a warm cache it took 14% of the time it took with no cache. However,
this is on my particular test box, and it varies a lot from box to box.

This doesn't really demonstrate the latency of indexing, however, which we
have to do before we even consider touching the network. I don't have numbers
on that, but in the worst case they're going to be quite bad.

I'm currently working on mark II CacheFS, using a wandering tree to maintain
the index. I'm not entirely sure whether I want to include the data pointers
in this tree. There are advantages to doing so, namely that I can use the
same tree maintenance routines for everything, but also disadvantages, namely
that it complicates deletion a lot.

Using a wandering tree will cut the latency of index lookups (because it's a
tree), simplify journalling (because it wanders) and mean I can just grab a
block, write to it and then connect it (again because it wanders). Block
allocation is still unpleasant, though...
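For illustration, here's a rough, self-contained sketch of that "grab a
block, write to it, then connect it" property of a wandering (copy-on-write)
tree. The names and structures are invented, not the mark II code: a modified
node is written into a freshly allocated block and only becomes reachable
once its parent - itself rewritten the same way, up to the root - is switched
across, so an interrupted update leaves the old tree intact:

/* Hypothetical userspace model of a wandering-tree update. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct node {
	int blocknr;		/* where this version of the node lives */
	struct node *child[2];	/* a tiny 2-way tree for illustration */
	char payload[16];
};

static int next_free_block = 100;

/* write a new version of a node into a fresh block; the old block is never
 * touched, so a crash at this point leaves the old tree fully usable */
static struct node *cow_node(const struct node *old, const char *payload)
{
	struct node *new = malloc(sizeof(*new));

	*new = *old;
	new->blocknr = next_free_block++;	/* grab a block */
	strncpy(new->payload, payload, sizeof(new->payload) - 1);
	return new;				/* ...and write to it */
}

/* update child 'idx' under *rootp: copy the leaf, copy the root so that it
 * points at the new leaf, then swing the root pointer across - "connect it"
 * is the last step (the superseded versions would eventually be laundered
 * back onto the free list; here they just leak) */
static void wander_update(struct node **rootp, int idx, const char *payload)
{
	struct node *new_leaf = cow_node((*rootp)->child[idx], payload);
	struct node *new_root = cow_node(*rootp, (*rootp)->payload);

	new_root->child[idx] = new_leaf;
	*rootp = new_root;		/* single pointer switch commits it */
}

int main(void)
{
	struct node leaf0 = { 1, { NULL, NULL }, "old-a" };
	struct node leaf1 = { 2, { NULL, NULL }, "old-b" };
	struct node rootn = { 3, { &leaf0, &leaf1 }, "root" };
	struct node *root = &rootn;

	wander_update(&root, 0, "new-a");
	printf("root now in block %d, child 0 in block %d\n",
	       root->blocknr, root->child[0]->blocknr);
	return 0;
}

David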