Re: Atomic non-durable file write API

Christian Stroetmann <stroetmann@xxxxxxxxxxxxx> · Fri, 24 Dec 2010 12:21:30 +0100

On 24.12.2010 10:51, Ted Ts'o wrote:
On Fri, Dec 24, 2010 at 02:00:13AM +0100, Christian Stroetmann wrote:
I really do understand your point, even though this example is
based on a bug in a system other than the FS. But there will be
other examples, for sure.
Sure, but this thread started because someone wanted an "atomic
non-durable file write API", apparently because it was too slow to use
fsync().  If people use databases, it's not a problem; databases use
fsync(), but they use it properly and they provide the proper
transactional interfaces that people want.

That's why I agreed with you at this technical, operating-system level.
I would only add that the database management system (DBMS) handles
this in interplay with the FS, and that a database is often stored in
a single file in a proprietary format for efficiency.

The problem comes when people try to implement their own databases
using small files for each row and column of the database, or for each
registry variable.  Then they complain when fsync() is too expensive,
because they need to use fsync() for every single 3 bytes of data they
store in their badly implemented database.

Yes, agreed (see above).
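
To illustrate the point about per-value files, here is a minimal sketch
of the alternative: collect many small records in memory and append them
to one file, paying for a single fdatasync() per batch instead of one
fsync() per 3-byte value. The names batch_add() and batch_flush(), the
"key=value" record format and the fixed buffer size are assumptions made
for this example only; a real store would also need record framing or
checksums to detect a torn final record after a crash.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BATCH_CAP 4096

/* Hypothetical in-memory batch of "key=value\n" records. */
struct batch {
    char   buf[BATCH_CAP];
    size_t used;
};

/* Queue one record in memory; no I/O happens here. */
int batch_add(struct batch *b, const char *key, const char *value)
{
    size_t room = BATCH_CAP - b->used;
    int n = snprintf(b->buf + b->used, room, "%s=%s\n", key, value);

    if (n < 0 || (size_t)n >= room) {
        errno = ENOSPC;         /* batch full: caller should flush first */
        return -1;
    }
    b->used += (size_t)n;
    return 0;
}

/* Append the whole batch to one file, then sync once for all records. */
int batch_flush(struct batch *b, const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;

    size_t off = 0;
    while (off < b->used) {
        ssize_t n = write(fd, b->buf + off, b->used - off);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted: retry the write */
            close(fd);
            return -1;
        }
        off += (size_t)n;
    }

    int rc = fdatasync(fd);     /* one sync covers every queued record */
    if (close(fd) < 0)
        rc = -1;
    if (rc == 0)
        b->used = 0;            /* batch is on disk, start a new one */
    return rc;
}
```

A caller would invoke batch_add() for every changed value and
batch_flush() on a timer or when batch_add() reports that the buffer is
full, so the sync cost is amortized over the whole batch.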

The bottom line is that if you want atomic updates of state
information, you need to use fsync() or fdatasync().  If this is a
performance bottleneck, then you're doing something wrong.  Maybe you
shouldn't be writing a third of a megabyte on every URL click, on the
main GUI thread; maybe the user doesn't need to remember every single
URL that was visited even if the power suddenly fails (maybe it's
enough if you write that information to disk every 3-5 minutes, and
less if you're running on battery).  Or maybe you shouldn't be using
hundreds of small state files, and screw up the dirty flag handling.
But regardless, you're doing something wrong/stupid.

Here we are at the application level. And this is where I say that
using an FS as a DBMS is not the real problem.
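
For reference, the fsync()-based atomic update of state information
discussed above is commonly done by writing a temporary file, syncing
it, and renaming it over the old one. The sketch below assumes a POSIX
system; atomic_replace(), the ".tmp" suffix and the fixed path buffers
are hypothetical choices for illustration, not an API from this thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write all len bytes, retrying on short writes and EINTR. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Replace dir/name with data; readers see the old or the new file. */
int atomic_replace(const char *dir, const char *name,
                   const char *data, size_t len)
{
    char tmp[4096], dst[4096];
    int n1 = snprintf(tmp, sizeof tmp, "%s/%s.tmp", dir, name);
    int n2 = snprintf(dst, sizeof dst, "%s/%s", dir, name);
    if (n1 < 0 || n2 < 0 ||
        (size_t)n1 >= sizeof tmp || (size_t)n2 >= sizeof dst) {
        errno = ENAMETOOLONG;
        return -1;
    }

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* 1. Write the new contents and force them to stable storage. */
    if (write_all(fd, data, len) < 0 || fsync(fd) < 0) {
        int saved = errno;
        close(fd);
        unlink(tmp);
        errno = saved;
        return -1;
    }
    if (close(fd) < 0) {
        int saved = errno;
        unlink(tmp);
        errno = saved;
        return -1;
    }

    /* 2. Atomically switch the name over to the new contents. */
    if (rename(tmp, dst) < 0) {
        int saved = errno;
        unlink(tmp);
        errno = saved;
        return -1;
    }

    /* 3. Sync the directory so the rename itself survives a crash. */
    int dfd = open(dir, O_RDONLY);
    if (dfd < 0)
        return -1;
    int rc = fsync(dfd);
    close(dfd);
    return rc;
}
```

Calling this once per batch of state changes, rather than on every
single change, keeps the fsync() cost off the hot path.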

Potentially off-topic:
And while we are at it: from my point of view, what is wrong/stupid is
how an FS is used from the operating system level. As said above, a
database is stored in a file, and the only functionality missing from
an FS management system is exactly what the DBMS adds in this case. If
you program this in a clever way, it must be faster than the standard
concept, in which a file that represents a database is stored in an FS,
because some FS functions don't really have to be called.
Such special FS handling, seen from the kernel level, is not uncommon:
backup systems already do it, and an FS that you don't like does it as
well, and it already provides the A of ACID. The rest can be handled by
an appropriate FS plug-in system. So we come back again to the question
of where this functionality belongs: in the FS or in the VFS. You say
VFS, I say FS, like R4, and OntoFS #1 (R4- and ontology-based) and #2
(ext2/3-, sqlite- and ontology-based conversion from fuse-sqlite).

						- Ted

Have fun
Christian Stroetmann