Re: write() semantics (Re: Orangefs ABI documentation)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



I added a signal handler to my test program today, and now I see the
giant difference between wait_for_completion_interruptible_timeout,
which is what you get when you use the -intr mount option, and
wait_for_completion_killable_timeout, which is what you get when
you don't, so I retract what I said about the -intr mount option
not being relevant <g>...

It also seems like it would be easy (and correct) to modify the
patch a little so that -EINTR would still be returned on the off-chance
that the interrupt was caught before any data was written...

-Mike

On Thu, Mar 3, 2016 at 5:25 PM, Mike Marshall <hubcap@xxxxxxxxxxxx> wrote:
> Here is what I have come up with to try and make our return to
> interrupted writes more acceptable... in my tests it seems to
> work. My test involved running the client-core with a tiny IO
> buffer size (4k) and a C program with a large write buffer(32M).
> That way there was plenty of time for me to fire off the C program
> and hit Ctrl-C while the write was chugging along.
>
> The return value from do_readv_writev always matches the size
> of the file written by the aborted C program in my tests.
>
> I changed the C program around so that sometimes I ran it with
> (O_CREAT | O_RDWR) on open and sometimes with (O_CREAT | O_RDWR | O_APPEND)
> and it seemed to do the right thing. I didn't try to set up a
> signal handler so that the signal wasn't fatal to the process,
> I guess that would be a test to actually see and verify the correct
> short return code to write...
>
> Do you all think this looks like it should work in principle?
>
> BTW: in the distant past someone else attempted to solve this problem
> the "nfs intr" way - we have an intr mount option, and that's why there's
> all that sweating over whether or not stuff is "interruptible" in
> waitqueue.c... I'm not sure if our intr mount option is relevant anymore
> given the way the op handling code has evolved...
>
> diff --git a/fs/orangefs/file.c b/fs/orangefs/file.c
> index 6f2e0f7..4349c9d 100644
> --- a/fs/orangefs/file.c
> +++ b/fs/orangefs/file.c
> @@ -180,21 +180,54 @@ populate_shared_memory:
>   }
>
>   if (ret < 0) {
> - /*
> - * don't write an error to syslog on signaled operation
> - * termination unless we've got debugging turned on, as
> - * this can happen regularly (i.e. ctrl-c)
> - */
> - if (ret == -EINTR)
> + if (ret == -EINTR) {
> + /*
> + * We can't return EINTR if any data was written,
> + * it's not POSIX. It is minimally acceptable
> + * to give a partial write, the way NFS does.
> + *
> + * It would be optimal to return all or nothing,
> + * but if a userspace write is bigger than
> + * an IO buffer, and the interrupt occurs
> + * between buffer writes, that would not be
> + * possible.
> + */
> + switch (new_op->op_state - OP_VFS_STATE_GIVEN_UP) {
> + /*
> + * If the op was waiting when the interrupt
> + * occurred, then the client-core did not
> + * trigger the write.
> + */
> + case OP_VFS_STATE_WAITING:
> + ret = 0;
> + break;
> + /*
> + * If the op was in progress when the interrupt
> + * occurred, then the client-core was able to
> + * trigger the write.
> + */
> + case OP_VFS_STATE_INPROGR:
> + ret = total_size;
> + break;
> + default:
> + gossip_err("%s: unexpected op state :%d:.\n",
> +   __func__,
> +   new_op->op_state);
> + ret = 0;
> + break;
> + }
>   gossip_debug(GOSSIP_FILE_DEBUG,
> -     "%s: returning error %ld\n", __func__,
> -     (long)ret);
> - else
> +     "%s: got EINTR, state:%d: %p\n",
> +     __func__,
> +     new_op->op_state,
> +     new_op);
> + } else {
>   gossip_err("%s: error in %s handle %pU, returning %zd\n",
>   __func__,
>   type == ORANGEFS_IO_READ ?
>   "read from" : "write to",
>   handle, ret);
> + }
>   if (orangefs_cancel_op_in_progress(new_op))
>   return ret;
>
>
> On Sat, Jan 23, 2016 at 6:35 PM, Linus Torvalds
> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>> On Sat, Jan 23, 2016 at 2:46 PM, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
>>>
>>> What should we get?  -EINTR, despite having written some data?
>>
>> No, that's not acceptable.
>>
>> Either all or nothing (which is POSIX) or the NFS 'intr' mount
>> behavior (partial write return, -EINTR only when nothing was written
>> at all). And, like NFS, a mount option might be a good thing.
>>
>> And of course, for the usual reasons, fatal signals are special in
>> that for them we generally say "screw posix, nobody sees the return
>> value anyway", but even there the filesystem might as well still
>> return the partial return value (just to not introduce yet another
>> special case).
>>
>> In fact, I think that with our "fatal signals interrupt" behavior,
>> nobody should likely use the "intr" mount option on NFS. Even if the
>> semantics may be "better", there are likely simply just too many
>> programs that don't check the return value of "write()" at all, much
>> less handle partial writes correctly.
>>
>> (And yes, our "screw posix" behavior wrt fatal signals is strictly
>> wrong even _despite_ the fact that nobody sees the return value -
>> other processes can still obviously see that the whole write wasn't
>> done. But blocking on a fatal signal is _so_ annoying that it's one of
>> those things where we just say "posix was wrong on this one, and if we
>> squint a bit we look _almost_ like we're compliant").
>>
>>               Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [Samba]     [Device Mapper]     [CEPH Development]
  Powered by Linux