Re: ext2/3: document conditions when reliable operation is possible

Pavel Machek <pavel@xxxxxx> · Mon, 23 Mar 2009 11:45:25 +0100

On Mon 2009-03-16 14:26:23, Rob Landley wrote:
> On Monday 16 March 2009 07:28:47 Pavel Machek wrote:
> > Hi!
> > > > +	Fortunately writes failing are very uncommon on traditional
> > > > +	spinning disks, as they have spare sectors they use when write
> > > > +	fails.
> > >
> > > I vaguely recall that the behavior of when a write error _does_ occur is
> > > to remount the filesystem read only?  (Is this VFS or per-fs?)
> >
> > Per-fs.
> 
> Might be nice to note that in the doc.

Ok, can you suggest a patch? I believe remount-ro is already
documented ... somewhere :-).

> > > I'm aware write errors shouldn't happen, and by the time they do it's too
> > > late to gracefully handle them, and all we can do is fail.  So how do we
> > > fail?
> >
> > Well, even remount-ro may be too late, IIRC.
> 
> Care to elaborate?  (When a filesystem is mounted RO, I'm not sure what 
> happens to the pages that have already been dirtied...)

Well, fsync() error reporting does not really work properly, but I
guess it will save you for the remount-ro case. So the data will be in
the journal, but it will be impossible to replay it...

> > > (Writes aren't always cleanly at the start of an erase block, so critical
> > > data _before_ what you touch is endangered too.)
> >
> > Well, flashes do remap, so it is actually "random blocks".
> 
> Fun.

Yes.

> > > > +	otherwise, disks may write garbage during powerfail.
> > > > +	Not sure how common that problem is on generic PC machines.
> > > > +
> > > > +	Note that atomic write is very hard to guarantee for RAID-4/5/6,
> > > > +	because it needs to write both changed data, and parity, to
> > > > +	different disks.
> > >
> > > These days instead of "atomic" it's better to think in terms of
> > > "barriers".
> >
> > This is not about barriers (that should be different topic). Atomic
> > write means that either whole sector is written, or nothing at all is
> > written. Because raid5 needs to update both master data and parity at
> > the same time, I don't think it can guarantee this during powerfail.
> 
> Good point, but I thought that's what journaling was for?

I believe journaling operates on assumption that "either whole sector
is written, or nothing at all is written".

> I'm aware that any flash filesystem _must_ be journaled in order to work 
> sanely, and must be able to view the underlying erase granularity down to the 
> bare metal, through any remapping the hardware's doing.  Possibly what's 
> really needed is a "flash is weird" section, since flash filesystems can't be 
> mounted on arbitrary block devices.

> Although an "-O erase_size=128" option so they _could_ would be nice.  There's 
> "mtdram" which seems to be the only remaining use for ram disks, but why there 
> isn't an "mtdwrap" that works with arbitrary underlying block devices, I have 
> no idea.  (Layering it on top of a loopback device would be most
> useful.)

I don't think that works. Compactflash (etc) cards basically randomly
remap the data, so you can't really run flash filesystem over
compactflash/usb/SD card -- you don't know the details of remapping.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html