Re: Problematic interaction of io_uring and CIFS

Fiona Ebner <f.ebner@xxxxxxxxxxx> · Fri, 26 Aug 2022 10:21:41 +0200

On 11.07.22 at 15:40, Fabian Ebner wrote:
> On 09.07.22 at 05:39, Shyam Prasad N wrote:
>> On Sat, Jul 9, 2022 at 9:00 AM Shyam Prasad N <nspmangalore@xxxxxxxxx> wrote:
>>>
>>> On Fri, Jul 8, 2022 at 11:22 PM Enzo Matsumiya <ematsumiya@xxxxxxx> wrote:
>>>>
>>>> On 07/08, Fabian Ebner wrote:
>>>>> (Re-sending without the log from the older kernel, because the mail hit
>>>>> the 100000 char limit with that)
>>>>>
>>>>> Hi,
>>>>> it seems that in kernels >= 5.15, io_uring and CIFS don't interact
>>>>> nicely sometimes, leading to IO errors. Unfortunately, my reproducer is
>>>>> a QEMU VM with a disk on CIFS (original report by one of our users [0]),
>>>>> but I can try to cook up something simpler if you want.
>>>>>
>>>>> Bisecting got me to 8ef12efe26c8 ("io_uring: run regular file
>>>>> completions from task_work") being the first bad commit.
>>>>>

I finally got around to taking another look at this issue (still present
in 5.19.3) and I think I've figured out the root cause:

After commit 8ef12efe26c8, for my reproducer, the write completion is
added to task_work with notify_method being TWA_SIGNAL and thus
TIF_NOTIFY_SIGNAL is set for the task.

After that, if we end up in sk_stream_wait_memory() via sock_sendmsg(),
signal_pending(current) will evaluate to true and thus -EINTR is
returned all the way up to sock_sendmsg() in smb_send_kvec().

Related: __smb_send_rqst() also has a signal_pending(current) check,
which leads to the -ERESTARTSYS return value.
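
As a rough userspace model of the difference between signal_pending()
and task_sigpending() (this is not the kernel code itself; the MODEL_*
bit values and all model_* names are made up for illustration), the
behavior described above could be sketched as:

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical bit values for this model only; the real TIF_* values
 * are architecture-specific and defined by the kernel. */
#define MODEL_TIF_SIGPENDING    (1u << 0)
#define MODEL_TIF_NOTIFY_SIGNAL (1u << 1)

/* Hypothetical stand-in for the thread flags of a task. */
struct model_task {
    unsigned int flags;
};

/* Models task_sigpending(): true only if a real signal is pending. */
static bool model_task_sigpending(const struct model_task *t)
{
    return (t->flags & MODEL_TIF_SIGPENDING) != 0;
}

/* Models signal_pending(): additionally true if the notify flag is set,
 * e.g. because task_work was queued with TWA_SIGNAL. */
static bool model_signal_pending(const struct model_task *t)
{
    if (t->flags & MODEL_TIF_NOTIFY_SIGNAL)
        return true;
    return model_task_sigpending(t);
}

/* Models the check inside a wait loop: returns -EINTR if the chosen
 * predicate reports something pending, 0 otherwise. */
static int model_wait_check(const struct model_task *t,
                            bool use_task_sigpending)
{
    bool pending = use_task_sigpending ? model_task_sigpending(t)
                                       : model_signal_pending(t);
    return pending ? -EINTR : 0;
}
```

With only MODEL_TIF_NOTIFY_SIGNAL set, model_wait_check(&t, false)
returns -EINTR, while model_wait_check(&t, true) returns 0.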

To verify that this is the cause, I applied the following hack (i.e.
excluding the TIF_NOTIFY_SIGNAL check). With it, I wasn't able to
trigger the issue anymore:

> diff --git a/net/core/stream.c b/net/core/stream.c
> index 06b36c730ce8..58e3825930bb 100644
> --- a/net/core/stream.c
> +++ b/net/core/stream.c
> @@ -134,7 +134,7 @@ int sk_stream_wait_memory(struct sock *sk, long *timeo_p)
>                         goto do_error;
>                 if (!*timeo_p)
>                         goto do_eagain;
> -               if (signal_pending(current))
> +               if (task_sigpending(current))
>                         goto do_interrupted;
>                 sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk);
>                 if (sk_stream_memory_free(sk) && !vm_wait)

In __cifs_writev() we have

>     /*
>      * If at least one write was successfully sent, then discard any rc
>      * value from the later writes. If the other write succeeds, then
>      * we'll end up returning whatever was written. If it fails, then
>      * we'll get a new rc value from that.
>      */

so it can happen that collect_uncached_write_data() will (correctly)
report a short write when calling ctx->iocb->ki_complete().

But QEMU's io_uring backend treats a short write as an -ENOSPC error,
which might also be a bug? Or does the kernel give any guarantees
against short writes here?
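
If there is no such guarantee, a caller would have to resubmit the
remainder after a short write. A minimal sketch of that idea using plain
pwrite() rather than io_uring (pwrite_all() is a hypothetical helper,
not QEMU code, and mapping a zero-byte write to -ENOSPC is an assumption
made only for this example):

```c
#define _POSIX_C_SOURCE 200809L /* for pwrite() under strict C modes */

#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write exactly len bytes from buf at offset off,
 * resubmitting the remainder whenever a write comes back short.
 * Returns 0 on success or a negative errno value on failure. */
static int pwrite_all(int fd, const void *buf, size_t len, off_t off)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, off);

        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before any data: retry */
            return -errno;
        }
        if (n == 0)
            return -ENOSPC;     /* assumption: no progress means no space */

        /* Short or full write: advance past what was written. */
        p += n;
        len -= (size_t)n;
        off += n;
    }
    return 0;
}
```

The point is only that a short write is handled as partial progress, and
an error is reported only once a write makes no progress at all or fails
outright.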

Still, it doesn't seem ideal that the "interrupt" happens at all. In
fact, __smb_send_rqst() tries to avoid it, but fails to do so because of
the unexpected TIF_NOTIFY_SIGNAL:
>     /*
>      * We should not allow signals to interrupt the network send because
>      * any partial send will cause session reconnects thus increasing
>      * latency of system calls and overload a server with unnecessary
>      * requests.
>      */
> 
>     sigfillset(&mask);
>     sigprocmask(SIG_BLOCK, &mask, &oldmask);

Do you have any suggestions for how to proceed?

Best Regards,
Fiona