Re: Atomic non-durable file write API

Olaf van der Spek <olafvdspek@xxxxxxxxx> · Sun, 26 Dec 2010 16:08:12 +0100

On Sat, Dec 25, 2010 at 6:25 PM, Nick Piggin <npiggin@xxxxxxxxx> wrote:
>> No, not arbitrary writes. It's about complete file writes.
>
> You still haven't defined exactly what you want.

Do you not understand what is meant by a complete file write?

>> Atomic semantics are not (that) complex.
>
> That is something to be argued over patches. What is not in question
> is that an atomic API is more complex than none :)

That's implementation complexity, not concept/semantics complexity.

>> Like I said before, it's not about DB-like functionality but about
>> complete file writes/updates. For example, I've got a file in an
>> editor and I want to save it.
>
> I don't understand your example, because in that case you surely
> want durability.

Hmm, true, bad example, although it depends on editor/user.
Let's take archive extraction instead.

>> Let me copy the original post:
>> Writing a temp file, fsync, rename is often proposed. However, the
>> durable aspect of fsync isn't always required
>
> So you want a way to atomically replace the contents of a file with
> new contents, in a way which completes asynchronously and lazily,
> and your new contents will eventually just appear sometime after
> they are guaranteed to be on disk?

Almost. Visibility to other processes should be normal (I don't know the
exact rules), but the commit to disk may be deferred.

> You would need to create an unlinked inode with dirty data, and then
> have callbacks from pagecache writeback checking when the inode
> is cleaned, and then call appropriate filesystem routines to sync and
> issue barriers etc, and rename the old name to the new inode.

That's an implementation detail, but yes, something like that.

> You will also need to have a chain of inodes representing ordering of
> the updates so the renames can be performed in the right order. And
> add some hooks to solve the metadata issue.
>
> Then what happens when you fsync the original file? What if the
> original file is renamed or unlinked? How do you sync the outstanding
> queue of updates?

Logically those actions would happen after the atomic data update.
The fsync would be done on a now unlinked file (if done via fd). The
rename would be done on the new file. Same for unlink.
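
For comparison, today's rename() already behaves this way for an fd that
was opened before the replacement. A minimal sketch, assuming Python on a
POSIX system; the function name and file names are made up for the
illustration, and it shows existing rename semantics, not the proposed API:

```python
import os
import tempfile


def demo_fd_after_replace():
    """Show what an already-open fd refers to after rename-over."""
    d = tempfile.mkdtemp()
    target = os.path.join(d, "data")
    tmp = os.path.join(d, "data.tmp")
    with open(target, "wb") as f:
        f.write(b"old contents")
    with open(tmp, "wb") as f:
        f.write(b"new contents")

    old_fd = os.open(target, os.O_RDWR)  # fd to the original inode
    try:
        os.replace(tmp, target)          # the name now points at the new inode
        os.fsync(old_fd)                 # syncs the old, now unlinked, inode
        via_fd = os.read(old_fd, 100)    # still the old contents
    finally:
        os.close(old_fd)

    with open(target, "rb") as f:
        via_name = f.read()              # the new contents
    return via_fd, via_name
```

Operations done by name after the replacement (rename, unlink) act on the
new file; operations done via the old fd act on the old one.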

> Once you solve all those problems, then people will ask you to now
> solve them for multiple files at once because they also have some
> great use-case that is surely nothing like databases.

I don't want to play the what if game.

> Please tell us what for. If you have immediate need to replace the
> name, then you need the durability of fsync. If you don't have
> immediate need, then you can use another name, surely (until it
> comes time you want to switch names, at that point you want
> durability so you fsync then rename).

The temp file + rename approach has the issue of losing file meta-data.
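
A minimal sketch of temp file + rename without the fsync, assuming Python
on a POSIX system. The function name is hypothetical. It is atomic with
respect to the name, not durable, and it only manages to carry over the
permission bits:

```python
import os
import stat
import tempfile


def replace_file_nondurable(path, data):
    """Replace the contents of path with data via temp file + rename.

    No fsync is issued, so nothing here makes the new contents durable.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file has to live on the same filesystem as the target,
    # otherwise the rename cannot be atomic.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        try:
            # Only the permission bits are copied. Ownership, ACLs, xattrs
            # and hard links are not, and a non-owner may be unable to
            # restore some of them at all.
            os.chmod(tmp, stat.S_IMODE(os.stat(path).st_mode))
        except FileNotFoundError:
            pass  # no previous file: keep mkstemp's default mode
        os.replace(tmp, path)  # atomic rename over the old name
    except BaseException:
        try:
            os.unlink(tmp)  # do not leave the temp file behind
        except FileNotFoundError:
            pass
        raise
```

The result is a new inode under the old name, which is why everything that
was attached to the old inode has to be recreated by hand. What a crash
shortly after the rename leaves behind depends on the filesystem.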

>
>> and this way has other
>> issues, like losing file meta-data.
>
> Yes that's true, if you're not owner you may not be able to recreate
> most of it. Did you need to?

Yes.

>
>> What is the recommended way for atomic non-durable (complete) file writes?
>
> There really isn't one. Like I said, there is not much atomicity
> semantics in the API, which works really well because it is simple
> to implement and to use (although apparently still far too complex
> for some programmers to get right).

It's simple to implement but it's not simple to use correctly.

> If we start adding atomicity beyond fundamental requirement of
> namespace operations, then where does it end? Why would it make
> sense to add atomicity for writes to one file, but not writes to 2 files?
> What if you require atomic multiple modifications to directory
> structure as well as file updates? And why only writes? What about
> atomic reads of several things? What isolation level should all of that
> have, and how to solve deadlocks?
>
>
>> I'm also wondering why FSs commit after open/truncate but before
>> write/close. AFAIK this isn't necessary and thus suboptimal.
>
> I don't know, can you expand on this? What fses are you talking
> about, and what behaviour.

The zero-size file issue of ext4 (before some patch), presumably showing
up because some apps do open, truncate, write, close on a file. I'm
wondering why an FS commits between the truncate and the write.
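
The pattern in question, as a minimal Python sketch. The function name and
the `between` hook are hypothetical and only there to make the window
visible. It shows the zero-size state as seen by another reader of the
file, it does not reproduce the crash behaviour:

```python
import os


def overwrite_in_place(path, data, between=None):
    """open + truncate, then write, then close: not atomic."""
    with open(path, "wb") as f:  # "wb" opens with O_TRUNC: size is now 0
        if between is not None:
            between()            # hypothetical hook: runs inside the window
        f.write(data)
    # Leaving the with-block flushes and closes the file.
```

Between the truncate and the write the old contents are already gone and
the new contents are not there yet, so anything that captures that state
sees an empty file.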

Olaf