IRON filesystem papers - development of robust and fault tolerant filesystems

sftf <sftf-misc@xxxxxxx> · Fri, 22 Jun 2007 10:28:30 +0600

Hello!
I suggest that developers consider the ext4 design from the point of view of these papers:
IRON FILE SYSTEMS -
http://www.cs.wisc.edu/wind/Publications/vijayan-thesis06.pdf
IMHO this is a very impressive paper, and developers of near-future filesystems
can't ignore these problems and solutions.

and "Failure Analysis of SGI XFS File System"
http://www.cs.wisc.edu/~vshree/xfs.pdf  

From IRON FILE SYSTEMS:
"Disk drives are widely used as a primary medium for storing information.
While commodity file systems trust disks to either work or fail completely, modern
disks exhibit complex failure modes such as latent sector faults and block corruptions,
where only portions of a disk fail.
...
First, we design new low-level redundancy techniques that a file
system can use to handle disk faults.
We begin by qualitatively and quantitatively
evaluating various redundancy information such as checksum, parity, and replica.
Finally, we describe
two update strategies: an overwrite and a no-overwrite approach that a file system can
use to update its data and parity blocks atomically without NVRAM support.
Overall, we show that low-level redundant information can greatly enhance file system
robustness while incurring modest time and space overheads.

Second, to remedy the problem of failure handling diffusion, we develop a modified
ext3 that unifies all failure handling in a Centralized Failure Handler (CFH).
We then showcase the power of centralized failure handling in ext3c, a modified
IRON version of ext3 that uses CFH by demonstrating its support for flexible, consistent,
and fine-grained policies. By carefully separating policy from mechanism,
ext3c demonstrates how a file system can provide a thorough, comprehensive, and
easily understandable failure-handling policy.
...
The importance of building dependable systems cannot be overstated. One of the
fundamental requirements in computer systems is to store and retrieve information
reliably.
...
The fault model presented by modern disk drives, however, is much more complex.
For example, modern drives can exhibit latent sector faults [14, 28, 45, 60,
100], where a block or set of blocks are inaccessible. Under latent sector fault,
the sector fault occurs sometime in the past but the fault is detected only when the
sector is accessed for storing or retrieving information [59]. Blocks sometimes become
corrupted [16] and worse, this can happen silently without the disk being able
to detect it [47, 74, 126]. Finally, disks sometimes exhibit transient performance
problems [11, 115].

There are several reasons for these complex disk failure modes. First, a trend
that is common in the drive industry is to pack more bits per square inch (BPS)
as the areal densities of disk drives are growing at a rapid rate [48].
...
In addition, increased density can also increase the complexity of the
logic, that is the firmware that manages the data [7], which can result in increased
number of bugs. For example, buggy firmwares are known to issue misdirected
writes [126], where correct data is placed on disk but in the wrong location.

Second, increased use of low-end desktop drives such as the IDE/ATA drives
worsens the reliability problem. Low cost dominates the design of personal storage
drives [7] and therefore, they are less tested and have less machinery to handle
disk errors [56].

Finally, amount of software used on the storage stack has increased. Firmware
on a desktop drive contains about 400 thousand lines of code [33]. Moreover,
the storage stack consists of several layers of low-level device driver code that
have been considered to have more bugs than the rest of the operating system
code [38, 113]. As Jim Gray points out in his study of Tandem Availability, “As the
other components of the system become increasingly reliable, software necessarily
becomes the dominant cause of outages” [44].
...
Our study focuses on four important and substantially different open-source file
systems, ext3 [121], ReiserFS [89], IBM’s JFS [19], and XFS [112] and one closed-source
file system, Windows NTFS [109]. From our analysis results, we find that
the technology used by high-end systems (e.g., checksumming, disk scrubbing, and
so on) has not filtered down to the realm of commodity file systems. Across all platforms,
we find ad hoc failure handling and a great deal of illogical inconsistency in
failure policy, often due to the diffusion of failure handling code through the kernel;
such inconsistency leads to substantially different detection and recovery strategies
under similar fault scenarios, resulting in unpredictable and often undesirable
fault-handling strategies. Moreover, failure handling diffusion makes it difficult to
examine any one or few portions of the code and determine how failure handling
is supposed to behave. Diffusion also implies that failure handling is inflexible;
policies that are spread across so many locations within the code base are hard to
change. In addition, we observe that failure handling is quite coarse-grained; it is
challenging to implement nuanced policies in the current system.

We also discover that most systems implement portions of their failure policy
incorrectly; the presence of bugs in the implementations demonstrates the difficulty
and complexity of correctly handling certain classes of disk failure.
...
We show that none of the file
systems can recover from partial disk failures, due to a lack of in-disk redundancy.
...

We found a number of bugs and inconsistencies in the ext3 failure policy.
First, errors are not always propagated to the user (e.g., truncate and rmdir fail
silently).

Second, ext3 does not always perform sanity checking; for example,
unlink does not check the linkscount field before modifying it and therefore
a corrupted value can lead to a system crash.

Third, although ext3 has redundant
copies of the superblock (R_Redundancy), these copies are never updated after file
system creation and hence are not useful. Finally, there are important cases when
ext3 violates the journaling semantics, committing or checkpointing invalid transactions."
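
To make the redundancy idea from the first excerpt concrete, here is a minimal
sketch (mine, not code from the thesis or from ext3/ext4) of how a checksum can
detect a silently corrupted block and how a replica can then repair it. The
class name RedundantStore, the in-memory dictionaries standing in for a disk,
and the choice of CRC32 are all assumptions made for illustration only.

```python
import zlib


class RedundantStore:
    """Toy block store (hypothetical): one checksum and one replica per block."""

    def __init__(self):
        self.primary = {}    # block number -> bytes (main copy)
        self.replica = {}    # block number -> bytes (second copy)
        self.checksums = {}  # block number -> CRC32 of the intended contents

    def write(self, blkno, data):
        # Store both copies and remember the checksum of what was written.
        self.primary[blkno] = data
        self.replica[blkno] = data
        self.checksums[blkno] = zlib.crc32(data)

    def read(self, blkno):
        expected = self.checksums[blkno]
        data = self.primary[blkno]
        if zlib.crc32(data) == expected:
            return data
        # Detection: the primary copy does not match its checksum.
        copy = self.replica[blkno]
        if zlib.crc32(copy) == expected:
            # Recovery: repair the primary copy from the replica.
            self.primary[blkno] = copy
            return copy
        # Both copies are bad: report the error instead of returning garbage.
        raise OSError(f"block {blkno}: both copies fail checksum verification")


store = RedundantStore()
store.write(7, b"inode table")
store.primary[7] = b"inode tablf"   # simulate silent corruption on "disk"
assert store.read(7) == b"inode table"
```

In a real file system the checksums and replicas would themselves live on disk
and would have to be updated atomically with the data, which is the problem
the update strategies mentioned in the excerpt address; this sketch ignores
that entirely.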

Thanks for your attention!
