Jörn Engel wrote:
On Wed, 29 October 2008 01:49:56 +0000, Phillip Lougher wrote:
+/*
+ * Blocks in Squashfs are compressed. To avoid repeatedly decompressing
+ * recently accessed data Squashfs uses two small metadata and fragment caches.
+ *
+ * This file implements a generic cache implementation used for both caches,
+ * plus functions layered ontop of the generic cache implementation to
+ * access the metadata and fragment caches.
+ */
I tend to agree with Andrew that a lot of this should be done by the
page cache instead. One of the problems seems to be that your blocksize
can exceed page size and there really isn't any infrastructure to deal
with such cases yet. Bufferheads deal with blocks smaller than a page,
not the other way around.
I thought about using the page cache, but, the fact that blocksizes
exceed the page cache size is only one of a number of reasons why I
prefer the current solution, there is also simplicity and speed to consider.
There are three types of compressed block in Squashfs: datablocks,
fragments, and metadata blocks. Of these datablocks (by far the largest
number of blocks) are decompressed and pushed into the page cache, and
are not otherwise cached by Squashfs. This, obviously (?), is because
they contain data for only one file, and so at time of access there is a
readily available address space to push the data into.
Metadata and fragment blocks are different in that when accessed and
decompressed (say for an inode or for a particular tail-end block) they
will contain metadata or tail-ends for other files. This data could be
thrown away but due to locality of reference it makes sense to
temporarily cache it in-case a near future file access references the
data. But it doesn't make much sense to cache it more than temporarily,
much of the data will probably not be reused, and it exists compressed
in the buffer cache.
The squashfs cache is therefore designed to cache only the last couple
of metadata and fragment blocks. It is also designed to be simple and
extremely fast. The maximum size of the metadata cache is only 64 KiB.
Simplicity and speed is extremely important. The
squashfs_metadata_read() wrapper around the cache is designed to step
through the metadata a structure at a time (20 or so bytes), updating
the read position in the metadata each call, with more metadata cache
blocks being read and decompressed as necessary. The common case where
the metadata is already in the cache (because we're stepping through it
20 or so bytes at a time), is designed to be extremely fast - a spinlock
and array search only. I recently optimised the cache to use spinlocks
rather than mutexes and reduced the lock holding time (necessary to move
to spinlocks), and this resulted in a 20%+ speed improvement in reading
squashfs filesystems.
Given the above using an address space in the page cache will result in
greater complexity, more memory overhead, and much slower operation.
There's a number of reasons for this.
1. The amount and life-span of the data stored in the page cache is
outside of Squashfs' control. As explained above it only makes sense to
temporarily cache the last couple of metadata and fragment blocks.
Using the page cache (if a 'large' address space is used) for these
keeps more of them around for longer than necessary, and will
potentially cause more worthy datablocks to be flushed on memory pressure.
2. The address space will be caching uncompressed data, the squashfs
references to this data are the compressed locations within the
filesystem. There doesn't exist a one-to-one linear mapping from
compressed location to decompressed location in the address space. This
means a lookup table still needs to be used to store the mapping from
compressed location to decompressed location in the address space. Now
this lookup table (however implemented) is itself at least as complex as
my current cache implementation.
3. Once the locations of the decompressed pages in the address space
have been found, they'll need to be looked up in the page cache, and
this has to be done for every 4K page. With the default fragment size
of 128 KiB this means 32 separate lookups. Somewhat slower than one
spinlock and array search per 128 KiB block in the squashfs cache
implementation.
Comments, especially those of the form "you've got this completely
wrong, and you can use the page cache like this, which will be simpler
and faster than your current implementation" welcome :) I'm not adverse
to using the page cache, but I can't see how it will be simpler or
faster than the current implementation.
Phillip
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html