RFC: O_PONIES semantics (well O_REWRITE)

Rik van Riel <riel@xxxxxxxxxx> · Wed, 10 Jun 2009 21:03:25 -0400

The ext4 automatic-fsync-on-rename discussion has shown that
many applications simply Do It Wrong when it comes to rewriting
configuration files.

Some of the common failures are:
- program overwrites the old config file in place, so a crash
  mid-write leaves a truncated or half-written file
- program writes a new file, but forgets to fsync before rename
- program writes the new file in /tmp, so the rename fails with
  EXDEV when /tmp is on a different filesystem
- program writes a new file and fsyncs, but forgets to give the
  new file the same ownership, permissions and/or extended
  attributes as the old file
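
For comparison, here is a minimal userspace sketch of a sequence that
avoids all four failures.  It is an illustration in Python under
stated assumptions, not a proposed implementation: replace_file is a
hypothetical helper name, the xattr copy is Linux-only and best
effort, and ownership is only preserved when the caller is allowed
to chown.

```python
import os
import stat
import tempfile


def replace_file(path, data):
    """Atomically replace `path` with `data` (bytes). Hypothetical helper."""
    dirname = os.path.dirname(os.path.abspath(path))
    old = os.stat(path)

    # Create the temporary file next to the original, not in /tmp, so the
    # rename below never crosses a filesystem boundary.
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "wb") as f:
            # Carry over ownership first, then permissions (chown may
            # clear setuid/setgid bits), then extended attributes.
            try:
                os.fchown(f.fileno(), old.st_uid, old.st_gid)
            except PermissionError:
                pass  # assumption: unprivileged caller, keep our own uid/gid
            os.fchmod(f.fileno(), stat.S_IMODE(old.st_mode))
            if hasattr(os, "listxattr"):  # only available on Linux
                try:
                    for name in os.listxattr(path):
                        os.setxattr(f.fileno(), name, os.getxattr(path, name))
                except OSError:
                    pass  # best effort in this sketch only

            f.write(data)
            f.flush()
            # Force the new contents to disk before the rename makes
            # them visible under the old name.
            os.fsync(f.fileno())

        # Atomic switch: readers see either the old or the new file.
        os.rename(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise

    # Sync the directory so the rename itself survives a crash.
    dfd = os.open(dirname, os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```

A caller would use it as replace_file("/etc/example.conf", new_bytes),
where the path is only an example.  The number of steps needed to get
this right is the argument for doing it once, in one place.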

Magically taking care of filesystem semantics for every use may
not be possible (no O_PONIES for you!), but I believe we can
help the applications that just want to completely rewrite a
file and atomically replace it.

The semantics for O_REWRITE would be:

1) When a file is opened with O_REWRITE, the file descriptor
   points at a freshly allocated, empty file.  The original file
   is still available to programs that open the file without
   O_REWRITE.

2) O_REWRITE can only be used in conjunction with O_WRONLY,
   because the file descriptor is not associated with the
   original file (which has data), but with an empty inode.

3) The code that implements O_REWRITE (kernel?  glibc?)
   makes sure that:
   - the new file is on the same filesystem as the original file
   - the new file is not linked (so it is automatically freed
     after a process or system crash)
   - the new file's ownership, permissions and extended attributes
     match those of the original file

4) The application that opens a file with O_REWRITE is required
   to rewrite the entire file, since none of the old contents
   are carried over into the new one.

5) On close(), the code that implements O_REWRITE makes sure that
   the file is atomically renamed, so that if a system crash happens,
   the user will see either the old or the new file contents, but
   never an empty file.

6) After close(), processes that open the file will get the new
   content.  Processes that previously opened the file will hold
   on to the old inode and get old contents.
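
The visibility rules in (1) and (6) are what rename() already gives
userspace today, and can be observed with a small sketch.  This is
only an emulation in Python: mkstemp plus rename stand in for opening
with O_REWRITE and for close(), the file name is made up, and unlike
point (3) the temporary file here has a visible name until it is
renamed.

```python
import os
import tempfile


def demo_old_and_new_readers():
    """Return (bytes seen by an early reader, bytes seen by a late reader)."""
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "example.conf")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(b"old contents\n")

    # A reader opens the file before the rewrite, as in point (6).
    early_reader = open(path, "rb")

    # Stand-in for opening with O_REWRITE: a fresh, empty file on the
    # same filesystem as the original.
    fd, tmp = tempfile.mkstemp(dir=workdir)
    with os.fdopen(fd, "wb") as f:
        f.write(b"new contents\n")
        f.flush()
        os.fsync(f.fileno())

    # Stand-in for close(): the atomic switch to the new inode.
    os.rename(tmp, path)

    # The early reader still holds the old inode; a fresh open gets
    # the new one.
    with early_reader:
        seen_by_early = early_reader.read()
    with open(path, "rb") as late_reader:
        seen_by_late = late_reader.read()
    return seen_by_early, seen_by_late
```

Calling demo_old_and_new_readers() returns
(b"old contents\n", b"new contents\n"): the process that opened the
file before the rename keeps reading the old inode, and any later
open sees the new content.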

Here are my questions:

- Are these semantics useful for programs that want to replace
  config (or other) files with new content?

- Are these semantics sane?

- What would be the best place to implement these semantics?

Relying on application developers to get it right does not
seem to have worked out well, so I'm thinking kernel or glibc.
Glibc has the advantage of not being in the kernel, but
implementing it in-kernel might give us the opportunity for
performance enhancements, like reducing step (5) to merely
enforcing ordering between filesystem operations, instead
of requiring an fsync.

--
All rights reversed.