Re: Atomic non-durable file write API

Nick Piggin <npiggin@xxxxxxxxx> · Mon, 27 Dec 2010 03:43:00 +1100

On Mon, Dec 27, 2010 at 2:08 AM, Olaf van der Spek <olafvdspek@xxxxxxxxx> wrote:
> On Sat, Dec 25, 2010 at 6:25 PM, Nick Piggin <npiggin@xxxxxxxxx> wrote:
>>> No, not arbitrary writes. It's about complete file writes.
>>
>> You still haven't defined exactly what you want.
>
> Do you not understand what is meant by a complete file write?

It is not a rigorous definition. What I understand it to mean may be
different from what you understand it to mean, particularly when you
consider what the actual API should look like and how it should interact
with the rest of the APIs.

>>> Atomic semantics are not (that) complex.
>>
>> That is something to be argued over patches. What is not in question
>> is that an atomic API is more complex than none :)
>
> That's implementation complexity, not concept/semantics complexity.

It is both. "Atomic complete file write" is not a sufficient
specification at all.

>>> Like I said before, it's not about DB-like functionality but about
>>> complete file writes/updates. For example, I've got a file in an
>>> editor and I want to save it.
>>
>> I don't understand your example, because in that case you surely
>> want durability.
>
> Hmm, true, bad example, although it depends on editor/user.
> Let's take archive extraction instead.

OK, so please show how it helps.

If it is a multi-file/dir archive, then you could equally well come back in
an inconsistent state after crashing, with some files extracted and
some not, unless you also had an atomic-write-multiple-files-and-directories
API.

>>> Let me copy the original post:
>>> Writing a temp file, fsync, rename is often proposed. However, the
>>> durable aspect of fsync isn't always required
>>
>> So you want a way to atomically replace the contents of a file with
>> new contents, in a way which completes asynchronously and lazily,
>> and your new contents will eventually just appear sometime after
>> they are guaranteed to be on disk?
>
> Almost. Visibility to other process should be normal (I don't know the
> exact rules), but commit to disk may be deferred.

That's a pretty important detail. What is "normal"? Will a process
see old or new data from the atomic write before the atomic write has
committed to disk?

Is the atomic write guaranteed to take an atomic snapshot of the file
and apply only the specified updates?

What happens to subsequent atomic and non-atomic writes to the
file?

>> You would need to create an unlinked inode with dirty data, and then
>> have callbacks from pagecache writeback checking when the inode
>> is cleaned, and then call appropriate filesystem routines to sync and
>> issue barriers etc, and rename the old name to the new inode.
>
> That's an implementation detail, but yes, something like that.
>
>> You will also need to have a chain of inodes representing ordering of
>> the updates so the renames can be performed in the right order. And
>> add some hooks to solve the metadata issue.
>>
>> Then what happens when you fsync the original file? What if the
>> original file is renamed or unlinked? How do you sync the outstanding
>> queue of updates?
>
> Logically those actions would happen after the atomic data update.
> The fsync would be done on a now unlinked file (if done via fd). The
> rename would be done on the new file. Same for unlink.
>
>> Once you solve all those problems, then people will ask you to now
>> solve them for multiple files at once because they also have some
>> great use-case that is surely nothing like databases.
>
> I don't want to play the what if game.

You must if you want to design a sane API.

>> Please tell us what for. If you have immediate need to replace the
>> name, then you need the durability of fsync. If you don't have
>> immediate need, then you can use another name, surely (until it
>> comes time you want to switch names, at that point you want
>> durability so you fsync then rename).
>
> Temp file, rename has issues with losing meta-data.

How about solving that easier issue?
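
For illustration, here is a minimal sketch of the temp file + rename
pattern that carries over the metadata it can. It is Python rather than
anything kernel-side, the function name replace_keep_metadata is made up
for this example, and it deliberately skips fsync, so it is atomic with
respect to the name but not durable. Ownership can only be restored if you
are the owner or privileged, and xattrs, ACLs and hard links are not
handled at all.

```python
import os
import shutil
import tempfile


def replace_keep_metadata(path, data):
    """Replace the contents of `path` via temp file + rename.

    Hypothetical helper: copies permission bits and, where permitted,
    owner/group from the old file. No fsync, so not durable.
    """
    old = os.stat(path)
    dirname = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory (same filesystem),
    # otherwise the final rename cannot be atomic.
    fd, tmp = tempfile.mkstemp(dir=dirname, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        shutil.copymode(path, tmp)  # permission bits
        try:
            os.chown(tmp, old.st_uid, old.st_gid)
        except (OSError, AttributeError):
            pass  # not permitted, or no chown on this platform
        os.replace(tmp, path)  # rename: readers see old or new, never half
    except BaseException:
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
```

Usage would be something like replace_keep_metadata("some.conf", b"new
contents\n"), where "some.conf" is an existing file you can write next to.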

>>> and this way has other
>>> issues, like losing file meta-data.
>>
>> Yes that's true, if you're not owner you may not be able to recreate
>> most of it. Did you need to?
>
> Yes
>
>>
>>> What is the recommended way for atomic non-durable (complete) file writes?
>>
>> There really isn't one. Like I said, there is not much atomicity
>> semantics in the API, which works really well because it is simple
>> to implement and to use (although apparently still far too complex
>> for some programmers to get right).
>
> It's simple to implement but it's not simple to use right.

You do not have the ability to have arbitrary atomic transactions to the
filesystem. If you show a problem of a half-completed write after a crash,
then I can show you a problem of any half-completed multi-syscall
operation after a crash.

The simple thing is to properly clean up such things after a crash, and
just use an atomic commit somewhere to say whether the file operations
that just completed are now in a durable state. Either that or use
existing code that does it right.
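
A rough sketch of that idea, assuming a POSIX system where fsync on a
directory fd works and where the destination directory does not exist yet.
The ".partial" suffix and the function names are invented for the example.
Everything is written into a staging directory, and a single rename is the
commit point. Anything still carrying the staging suffix after a crash was
never committed and can simply be removed.

```python
import os
import shutil

STAGING_SUFFIX = ".partial"  # hypothetical naming convention


def _fsync_dir(path):
    """Flush a directory's entries to disk (POSIX only)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def cleanup_after_crash(parent):
    """Remove staging directories left behind by an interrupted run."""
    for name in os.listdir(parent):
        if name.endswith(STAGING_SUFFIX):
            shutil.rmtree(os.path.join(parent, name))


def commit_files(final_dir, files):
    """Write `files` (name -> bytes) so that `final_dir` either does not
    exist or contains all of them."""
    parent = os.path.dirname(os.path.abspath(final_dir))
    staging = final_dir + STAGING_SUFFIX
    os.mkdir(staging)
    for name, data in files.items():
        with open(os.path.join(staging, name), "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # file data durable before the commit
    _fsync_dir(staging)            # directory entries durable
    os.rename(staging, final_dir)  # the single atomic commit point
    _fsync_dir(parent)             # make the rename itself durable
```

An application would call cleanup_after_crash() on startup and
commit_files() for each unit of work. Note that this does need fsync: the
rename only tells you the work is complete if the data is on disk first.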

>> If we start adding atomicity beyond fundamental requirement of
>> namespace operations, then where does it end? Why would it make
>> sense to add atomicity for writes to one file, but not writes to 2 files?
>> What if you require atomic multiple modifications to directory
>> structure as well as file updates? And why only writes? What about
>> atomic reads of several things? What isolation level should all of that
>> have, and how to solve deadlocks?
>>
>>
>>> I'm also wondering why FSs commit after open/truncate but before
>>> write/close. AFAIK this isn't necessary and thus suboptimal.
>>
>> I don't know, can you expand on this? What fses are you talking
>> about, and what behaviour.
>
> The zero size issues of ext4 (before some patch). Presumably because
> some apps do open, truncate, write, close on a file. I'm wondering why
> an FS commits between truncate and write.

I'm still not clear what you mean. Filesystem state may get updated
between any 2 syscalls because the kernel has no oracle or infinite
storage.
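
To make that concrete, a small sketch (Python, and the helper name
truncate_then_write is hypothetical) of the open, truncate, write, close
sequence. The file is already empty after the first syscall, so any
writeback or journal commit that lands before the new data reaches disk
can leave a zero-length file after a crash.

```python
import os


def truncate_then_write(path, data):
    """Rewrite `path` in place and report its size between the two steps."""
    with open(path, "wb") as f:  # open with O_TRUNC: old contents gone now
        size_between = os.fstat(f.fileno()).st_size
        f.write(data)            # new contents only exist after this point
    return size_between
```

The returned size is 0: that intermediate state is real, and nothing stops
the filesystem from committing it.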