Re: topics for the file system mini-summit

Andreas Dilger <adilger@xxxxxxxxxxxxx> · Fri, 26 May 2006 10:48:56 -0600

On May 25, 2006  14:44 -0700, Ric Wheeler wrote:
> With both ext3 and with reiserfs, running a single large file system
> translates into several practical limitations before we even hit the
> existing size limitations:
> 
>    (1) repair/fsck time can take hours or even days depending on the
> health of the file system and its underlying disk as well as the number
> of files.  This does not work well for large servers and is a disaster
> for "appliances" that need to run these commands buried deep in some
> data center without a person watching...
>    (2) most file system performance testing is done on "pristine" file
> systems with very few files.  On very large file systems, performance
> degrades very noticeably over time, especially with very high file
> counts.
>    (3) very poor fault containment for these very large devices - it
> would be great to be able to ride through a failure of a segment of the
> underlying storage without taking down the whole file system.
> 
> The obvious alternative to this is to break up these big disks into
> multiple small file systems, but there again we hit several issues.
> 
> As an example, in one of the boxes that I work with we have 4 drives
> of 500GB each, with limited memory and CPU resources. To address the
> issues above, we break each drive into 100GB chunks which gives us 20
> (reiserfs) file systems per box.  The set of new problems that arise
> from this include:
> 
>    (1) no forced unmount - one file system goes down, you have to
> reboot the box to recover.
>    (2) worst case memory consumption for the journal scales linearly
> with the number of file systems (32MB per file system).
>    (3) we take away the ability of the file system to do intelligent
> head movement on the drives (i.e., I end up begging the application team
> to please only use one file system per drive at a time for ingest ;-)).
> The same goes for allocation - we basically have to push this up to the
> application to use the capacity in an even way.
>    (4) pain of administration of multiple file systems.
> 
> I know that other file systems deal with scale better, but the question
> is really how to move the mass of linux users onto these large and
> increasingly common storage devices in a way that handles these challenges.
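
To put rough numbers on the split-up configuration quoted above, here is
a small sketch of the arithmetic.  The drive count, drive size, chunk
size and 32MB-per-file-system journal figure are taken from the quoted
text; the function and constant names are made up for illustration only.

```python
# Figures from the quoted message: 4 drives of 500GB each, split into
# 100GB chunks, with a worst-case journal footprint of 32MB per file system.
DRIVES = 4
DRIVE_GB = 500
CHUNK_GB = 100
JOURNAL_MB_PER_FS = 32


def filesystems_per_box(drives, drive_gb, chunk_gb):
    """Number of file systems when every drive is cut into equal chunks."""
    return drives * (drive_gb // chunk_gb)


def worst_case_journal_mb(n_fs, journal_mb_per_fs):
    """Worst-case journal memory grows linearly with the file system count."""
    return n_fs * journal_mb_per_fs


n_fs = filesystems_per_box(DRIVES, DRIVE_GB, CHUNK_GB)
total_mb = worst_case_journal_mb(n_fs, JOURNAL_MB_PER_FS)
print(n_fs, "file systems,", total_mb, "MB worst-case journal memory")
```

With these inputs that is 20 file systems and 640MB of worst-case
journal memory, which is a lot on a box with limited memory.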

In a way, what you describe is Lustre - it aggregates multiple "smaller"
filesystems into a single large filesystem from the application POV
(though in many cases the "smaller" filesystems are 2TB).  It runs e2fsck
in parallel if needed, has smart object allocation (clients do delayed
allocation, can load balance across storage targets, etc.), and can keep
running when some storage targets are down.
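
As a toy illustration of the idea (balancing new objects across storage
targets and carrying on when one of them is down), something like the
sketch below.  This is NOT Lustre's actual allocator; the Target class,
the pick_target/place_object helpers, the target names and the
"most free space wins" policy are all assumptions made up for the
example.

```python
from dataclasses import dataclass


@dataclass
class Target:
    """Hypothetical storage target: one backing file system."""
    name: str
    free_gb: int
    up: bool = True


def pick_target(targets):
    """Return the available target with the most free space, or None."""
    candidates = [t for t in targets if t.up]
    if not candidates:
        return None
    return max(candidates, key=lambda t: t.free_gb)


def place_object(targets, size_gb):
    """Place one object on the chosen target and account for its space."""
    target = pick_target(targets)
    if target is None or target.free_gb < size_gb:
        raise RuntimeError("no available target with enough free space")
    target.free_gb -= size_gb
    return target.name


# Three targets, one of which is down; placement simply skips it.
targets = [
    Target("ost0", 100),
    Target("ost1", 100),
    Target("ost2", 100, up=False),
]
placed = [place_object(targets, 10) for _ in range(4)]
print(placed)
```

The point is only that the application sees one namespace while the
placement logic spreads load over whatever targets are currently up.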

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html