Re: write() semantics (Re: Orangefs ABI documentation)

Mike Marshall <hubcap@xxxxxxxxxxxx> · Thu, 3 Mar 2016 17:25:08 -0500

Here is what I have come up with to try and make our return to
interrupted writes more acceptable... in my tests it seems to
work. My test involved running the client-core with a tiny IO
buffer size (4k) and a C program with a large write buffer(32M).
That way there was plenty of time for me to fire off the C program
and hit Ctrl-C while the write was chugging along.

The return value from do_readv_writev always matches the size
of the file written by the aborted C program in my tests.

I changed the C program around so that sometimes I ran it with
(O_CREAT | O_RDWR) on open and sometimes with (O_CREAT | O_RDWR | O_APPEND)
and it seemed to do the right thing. I didn't try to set up a
signal handler so that the signal wasn't fatal to the process,
I guess that would be a test to actually see and verify the correct
short return code to write...

Do you all think this looks like it should work in principle?

BTW: in the distant past someone else attempted to solve this problem
the "nfs intr" way - we have an intr mount option, and that's why there's
all that sweating over whether or not stuff is "interruptible" in
waitqueue.c... I'm not sure if our intr mount option is relevant anymore
given the way the op handling code has evolved...

diff --git a/fs/orangefs/file.c b/fs/orangefs/file.c
index 6f2e0f7..4349c9d 100644
--- a/fs/orangefs/file.c
+++ b/fs/orangefs/file.c
@@ -180,21 +180,54 @@ populate_shared_memory:
  }

  if (ret < 0) {
- /*
- * don't write an error to syslog on signaled operation
- * termination unless we've got debugging turned on, as
- * this can happen regularly (i.e. ctrl-c)
- */
- if (ret == -EINTR)
+ if (ret == -EINTR) {
+ /*
+ * We can't return EINTR if any data was written,
+ * it's not POSIX. It is minimally acceptable
+ * to give a partial write, the way NFS does.
+ *
+ * It would be optimal to return all or nothing,
+ * but if a userspace write is bigger than
+ * an IO buffer, and the interrupt occurs
+ * between buffer writes, that would not be
+ * possible.
+ */
+ switch (new_op->op_state - OP_VFS_STATE_GIVEN_UP) {
+ /*
+ * If the op was waiting when the interrupt
+ * occurred, then the client-core did not
+ * trigger the write.
+ */
+ case OP_VFS_STATE_WAITING:
+ ret = 0;
+ break;
+ /*
+ * If the op was in progress when the interrupt
+ * occurred, then the client-core was able to
+ * trigger the write.
+ */
+ case OP_VFS_STATE_INPROGR:
+ ret = total_size;
+ break;
+ default:
+ gossip_err("%s: unexpected op state :%d:.\n",
+   __func__,
+   new_op->op_state);
+ ret = 0;
+ break;
+ }
  gossip_debug(GOSSIP_FILE_DEBUG,
-     "%s: returning error %ld\n", __func__,
-     (long)ret);
- else
+     "%s: got EINTR, state:%d: %p\n",
+     __func__,
+     new_op->op_state,
+     new_op);
+ } else {
  gossip_err("%s: error in %s handle %pU, returning %zd\n",
  __func__,
  type == ORANGEFS_IO_READ ?
  "read from" : "write to",
  handle, ret);
+ }
  if (orangefs_cancel_op_in_progress(new_op))
  return ret;


On Sat, Jan 23, 2016 at 6:35 PM, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Sat, Jan 23, 2016 at 2:46 PM, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
>>
>> What should we get?  -EINTR, despite having written some data?
>
> No, that's not acceptable.
>
> Either all or nothing (which is POSIX) or the NFS 'intr' mount
> behavior (partial write return, -EINTR only when nothing was written
> at all). And, like NFS, a mount option might be a good thing.
>
> And of course, for the usual reasons, fatal signals are special in
> that for them we generally say "screw posix, nobody sees the return
> value anyway", but even there the filesystem might as well still
> return the partial return value (just to not introduce yet another
> special case).
>
> In fact, I think that with our "fatal signals interrupt" behavior,
> nobody should likely use the "intr" mount option on NFS. Even if the
> semantics may be "better", there are likely simply just too many
> programs that don't check the return value of "write()" at all, much
> less handle partial writes correctly.
>
> (And yes, our "screw posix" behavior wrt fatal signals is strictly
> wrong even _despite_ the fact that nobody sees the return value -
> other processes can still obviously see that the whole write wasn't
> done. But blocking on a fatal signal is _so_ annoying that it's one of
> those things where we just say "posix was wrong on this one, and if we
> squint a bit we look _almost_ like we're compliant").
>
>               Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html