Re: topics for the file system mini-summit

Matthew Wilcox wrote:

> On Thu, May 25, 2006 at 02:44:50PM -0700, Ric Wheeler wrote:
>> The obvious alternative to this is to break up these big disks into
>> multiple small file systems, but there again we hit several issues.
>>
>> As an example, in one of the boxes that I work with, we have 4 drives,
>> each 500GB, with limited memory and CPU resources. To address the
>> issues above, we break each drive into 100GB chunks, which gives us 20
>> (reiserfs) file systems per box.  The set of new problems that arise
>> from this includes:
>>
>>   (1) no forced unmount - if one file system goes down, you have to
>> reboot the box to recover.
>>   (2) worst-case memory consumption for the journal scales linearly
>> with the number of file systems (32MB per file system, so roughly
>> 640MB for our 20).
>>   (3) we take away the ability of the file system to do intelligent
>> head movement on the drives (i.e., I end up begging the application
>> team to please only use one file system per drive at a time for
>> ingest ;-)).  The same goes for allocation - we basically have to push
>> this up to the application to use the capacity in an even way.
>>   (4) the pain of administering multiple file systems.
>>
>> I know that other file systems deal with scale better, but the
>> question is really how to move the mass of Linux users onto these
>> large and increasingly common storage devices in a way that handles
>> these challenges.
>
> How do you handle the inode number space?  Do you partition it across
> the sub-filesystems, or do you prohibit hardlinks between the sub-fses?

I think that the namespace needs to present a normal set of file system
operations - support for hardlinks, no magic directories, etc. - so that
applications don't need to load-balance across (or even be aware of) the
sub-units that provide the storage. If we removed that requirement, we
would be back to today's collection of various file systems mounted on a
single host.
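To make that concrete: tools like tar and rsync detect hardlinks by
comparing (st_dev, st_ino) pairs, so the aggregate has to look like a
single device with a single inode space. A toy illustration (the paths
here are hypothetical):

#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
	struct stat a, b;

	/* Hypothetical paths: two names expected to be hardlinks
	 * to the same file. */
	if (stat("/data/a", &a) != 0 || stat("/data/b", &b) != 0) {
		perror("stat");
		return 1;
	}

	/* The test tar/rsync-style tools rely on: same device and
	 * same inode number means "same file".  If the two names
	 * sat on different sub-filesystems with independent inode
	 * spaces, this check could never identify the link. */
	if (a.st_dev == b.st_dev && a.st_ino == b.st_ino)
		printf("same file\n");
	else
		printf("different files\n");

	return 0;
}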

I know that Lustre aggregates full file systems, but you could build a
single file system on top of a collection of disk partitions/LUNs, and
then your inode numbers could be extended to encode the partition number
along with the internal mapping. You could even harden the block groups
to the point that fsck could heal one group while the file system stayed
(mostly?) online, backed by the rest of the block groups...
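A rough sketch of the kind of encoding I mean - the 16/48-bit split is
an arbitrary choice for illustration, not anything implemented:

#include <stdint.h>

/* Illustrative field widths: 16 bits of partition index leaves
 * 48 bits of per-partition inode space. */
#define LOCAL_INO_BITS	48
#define LOCAL_INO_MASK	((UINT64_C(1) << LOCAL_INO_BITS) - 1)

/* Pack a partition index and a partition-local inode number into
 * one global 64-bit inode number. */
static inline uint64_t make_global_ino(uint16_t part, uint64_t local_ino)
{
	return ((uint64_t)part << LOCAL_INO_BITS) |
	       (local_ino & LOCAL_INO_MASK);
}

/* Which partition/LUN owns this global inode? */
static inline uint16_t ino_to_part(uint64_t global_ino)
{
	return (uint16_t)(global_ino >> LOCAL_INO_BITS);
}

/* The inode number local to that partition. */
static inline uint64_t ino_to_local(uint64_t global_ino)
{
	return global_ino & LOCAL_INO_MASK;
}

With that split, stat() can keep returning a single st_dev for the whole
aggregate while each sub-unit manages its own inode allocation, and a
repair pass on one partition knows exactly which range of global inode
numbers is affected.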

ric

