Re: Atomic non-durable file write API

Neil Brown <neilb@xxxxxxx> · Sun, 26 Dec 2010 08:40:07 +1100

On Fri, 24 Dec 2010 12:17:46 +0100 Olaf van der Spek <olafvdspek@xxxxxxxxx>
wrote:

> On Thu, Dec 23, 2010 at 10:51 PM, Neil Brown <neilb@xxxxxxx> wrote:
> > You are asking for something that doesn't exist, which is why no-one can tell
> > you what the answer is.
> 
> It seems like a very common and basic operation. If it doesn't exist
> IMO it should be created.
> 
> > The only mechanism for synchronising different filesystem operations is
> > fsync.  You should use that.
> >
> > If it is too slow, use data journalling, and place your journal on a
> > small low-latency device (NVRAM??)
> 
> This isn't about some DB-like app, it's about normal file writes, like
> archive extractions, compiling, editors, etc.
> 

Yes, it might be nice to have a very low cost way to make those safer against
corruption during a crash.
It would have to be *very* low cost as in most cases the cost of cleaning up
after the crash instead (e.g. 'make clean') is quite low.  But people do
sometimes edit /etc/init.d files with an ordinary editor, and it would be
rather embarrassing if a crash at just the wrong time left some critical file
incomplete.  Then again, maybe it would be easier to teach editors to fsync
before rename for files in /etc .....
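
For reference, the fsync-before-rename pattern an editor could use today
looks roughly like the sketch below.  It is in Python only for brevity; the
function name replace_file_durably and the ".tmp-" prefix are made up for
illustration, and it ignores details a real editor would care about, such as
preserving ownership and permissions of the original file.

```python
import os
import tempfile


def replace_file_durably(path, data):
    """Replace the contents of path with data using write, fsync, rename.

    Hypothetical helper for illustration only.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live in the same directory (same filesystem)
    # so that the rename is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            # Force the new contents to stable storage before the rename.
            os.fsync(tmp.fileno())
        # Atomically switch the name over to the fully written file.
        os.replace(tmp_path, path)
    except BaseException:
        # On failure, do not leave the temporary file behind.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Optionally fsync the directory so the rename itself is on disk too.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The fsync on the file is the expensive part, and it is exactly the cost that
the mechanism discussed in this thread is trying to avoid.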

So what would this mechanism really look like?  I think the proposal is to
delay committing the rename until the writeout of the file is complete,
without accelerating the writeout.
That would probably require delaying all updates to the directory until the
writeout was complete, as trying to reason about which changes were dependent
and which were independent is unlikely to be easy.

So as soon as you rename a file, you create a dependency between the file and
the directory such that no update for the directory may be written while any
page in the file is dirty.  Conversely, any fsync of the directory would
fsync the file as well.

Any write to the file should probably break the dependency as you can no
longer be sure what exactly the rename was supposed to protect.
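
To make those rules concrete, here is a toy user-space model of them.  It is
only a sketch of the bookkeeping, not kernel code: the names Inode,
rename_into, writeout and so on are invented for illustration, and
"writeout" here simply flips a flag.

```python
class Inode:
    """Toy stand-in for a file or directory inode (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.dirty = False
        self.waits_for = set()   # directory side: files that must be clean first
        self.holds_back = set()  # file side: directories waiting on this file


def _drop_dependencies(file):
    # Remove every directory -> file dependency involving this file.
    for directory in file.holds_back:
        directory.waits_for.discard(file)
    file.holds_back.clear()


def write(file):
    # A write dirties the file and breaks any dependency created by an
    # earlier rename, since we no longer know what the rename protected.
    file.dirty = True
    _drop_dependencies(file)


def rename_into(file, directory):
    # The rename dirties the directory; if the file still has dirty pages,
    # the directory update must wait for them.
    directory.dirty = True
    if file.dirty:
        directory.waits_for.add(file)
        file.holds_back.add(directory)


def writeout(inode):
    """Background writeout.  Returns False if it must be deferred."""
    if inode.waits_for:
        return False  # directory blocked by a dirty dependent file
    inode.dirty = False
    _drop_dependencies(inode)  # completed writeout releases any waiters
    return True


def fsync(inode):
    # fsync of a directory also fsyncs every file it depends on.
    for file in list(inode.waits_for):
        fsync(file)
    inode.dirty = False
    _drop_dependencies(inode)
```

In this model a directory that has just had a dirty file renamed into it
refuses writeout until the file has been written, while a later write to the
file releases the directory immediately.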

I suspect that much of the infrastructure for this could be implemented in
the VFS/VM.  Certainly the dependency linkage between inodes, created on
rename, destroyed on write or fsync or when writeout on the inode completes,
and the fsync dependency could be common code.  Preventing writeout of
directories with dependent files would need some fs interaction.  You could
probably prototype in ext2 quite easily to do some testing and collect
some numbers on overhead.

I think this would be an interesting project for someone to do and I would be
happy to review any patches.  Whether it ever got further than an interesting
project would depend very much on how intrusive it was to other filesystems,
how much overhead it caused, and what actual benefits resulted.
If anyone wanted to pursue this idea, they would certainly need to address
each of those in their final proposal.

I think there could be room for improved transactional semantics in Linux
filesystems.  This might be what they should look like ... don't know yet.

NeilBrown