Re: [RFC] TileFS - a proposal for scalable integrity checking

On Tue, 8 May 2007 22:56:09 -0700, Valerie Henson wrote:
> 
> I like it too, especially the rmap stuff, but I don't think it solves
> some of the problems chunkfs solves.  The really nice thing about
> chunkfs is that it tries hard to isolate each chunk from all the other
> chunks.  You can think of regular file systems as an OS with one big
> shared address space - any process can potentially modify any other
> process's address space, including the kernel's - and chunkfs as the
> modern UNIX private address space model.  Except in rare worst cases
> (the equivalent of a kernel bug or writing /dev/mem), the only way one
> chunk can affect another chunk is through the narrow little interface
> of the continuation inode.  This severely limits the ability of one
> chunk to corrupt another - the worst you can do is end up with the
> wrong link count on an inode pointed to from another chunk.

This leaves me a bit confused.  Imo the filesystem equivalent of a
process's address space would be permissions and quotas.  Indeed there
is no guarantee where any address space's pages may physically reside.
They can be in any zone, on any node, or even in swap or regular files.

Otoh, each physical page does have an rmap of sorts - enough to
figure out who currently owns the page.  Does your own analogy work
against you?

Back to chunkfs, the really smart idea behind it imo is to take just a
small part of the filesystem, assume that everything else is flawless,
and check that small part under this assumption.  The assumption may be
wrong.  If that wrongness would affect the minimal fsck, it should get
detected as well.  Otherwise it doesn't matter right now.
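
To make the idea concrete, a rough sketch in pseudo-kernel C - none of
these helpers exist, the names are made up, it just shows the shape of
such a check:

	/* Illustration only: check one chunk in isolation, trusting
	 * everything outside it.  All helpers are invented. */
	int check_chunk(struct chunk *c)
	{
		int err;

		if (!chunk_is_dirty(c))		/* clean chunks are trusted outright */
			return 0;

		err = check_local_metadata(c);	/* bitmaps, inodes, extents inside c */
		if (err)
			return err;

		/* References crossing the chunk boundary are only recorded
		 * here; a later pass compares the records of both sides. */
		return record_boundary_refs(c);
	}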

What I never liked about chunkfs were two things.  First, it splits the
filesystem into a flat array of chunks.  With sufficiently large devices,
either the number or the size of the chunks comes close to being
problematic again.  Some sort of tree arrangement intuitively makes more
sense.

Secondly, the cnodes are... weird, complicated, not well understood, a
hack.  Pick a term.  Avoiding cnodes is harder than avoiding regular
fragmentation, and the recent defragmentation patches seem to imply
we're already doing a bad job at the latter.  Linked lists of cnodes -
yuck.

Not directly a chunkfs problem, but still unfortunate: it cannot detect
medium errors.  There are no checksums.  Checksums cost performance, so
they obviously have to be optional at the user's choice.  But not even
having the option is quite 80's.
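
Just to illustrate what "optional" means in practice - a made-up
read-path check, where the cost disappears entirely when the feature is
off (sb_has_checksums() and the stored checksum are invented names,
crc32c() is merely one possible choice of algorithm):

	static int verify_block(struct super_block *sb, const void *data,
				size_t len, u32 stored_csum)
	{
		if (!sb_has_checksums(sb))	/* feature disabled: zero cost */
			return 0;
		if (crc32c(0, data, len) != stored_csum)
			return -EIO;		/* medium error detected */
		return 0;
	}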

Matt's proposal is an alternative solution that addresses all of my
concerns.  Instead of cnodes it has the rmap.  That is a very simple
structure I could explain to my nephews.  It allows for checksums, which
is nice as well.  And it allows for a tree structure of tiles.
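
Something along these lines for the rmap, if I read the proposal right
(field names are mine, not Matt's):

	/* One rmap entry per allocated block: enough to tell who owns a
	 * block without scanning every inode in the filesystem. */
	struct rmap_entry {
		__u64	owner_ino;	/* inode referencing this block */
		__u64	offset;		/* logical offset within that inode */
		__u32	checksum;	/* optional data checksum */
	};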

Tree structure means that each tile can have free space counters.  A
supertile (or whatever one may call it) can have a free space counter
that is the sum of all member free space counters.  And so forth
upwards.  Same for dirty bits and anything else I've forgotten.
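
In code the aggregation would look something like this (again invented
names, just to show how cheap the propagation is):

	struct tile {
		struct tile	*parent;	/* NULL for the root */
		__u64		free_blocks;	/* leaf: own count, inner: sum of children */
		__u8		dirty;		/* set on the whole path to the root */
	};

	/* Updating a leaf only touches one path up to the root. */
	static void tile_add_free(struct tile *t, __s64 delta)
	{
		for (; t; t = t->parent)
			t->free_blocks += delta;
	}

Checking then only needs to walk down the branches whose dirty bit is
set.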

So individual tiles can be significantly smaller than chunks in chunkfs,
and checking one is significantly faster than checking a chunk.  There
will be more dirty tiles at any given time, but a better way to look at
it is that for any dirty chunk in chunkfs, tilefs has some dirty and
some clean tiles.  So the overall ratio of dirty space is never higher
and almost always lower.

Overall I almost envy Matt for having this idea.  In hindsight it should
have been obvious to me.  But then again, in hindsight the fsck problem
and using divide and conquer should have been obvious to everyone, and
iirc you were the only one who seriously pursued the idea and got all
this frenzy started. :)

Jörn

-- 
Rules of Optimization:
Rule 1: Don't do it.
Rule 2 (for experts only): Don't do it yet.
-- M.A. Jackson