Re: The sk_err mechanism is infuriating in userspace

Andy Lutomirski <luto@xxxxxxxxxxxxxx> · Mon, 5 Feb 2024 15:22:15 -0800

> On Feb 5, 2024, at 3:03 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
> 
> Hi all-
> 
> I encounter this issue every couple of years, and it still seems to be
> an issue, and it drives me nuts every time I see it.
> 
> I write software that uses unconnected datagram-style sockets.  Errors
> happen for all kinds of reasons, and my software knows it.  My
> software even handles the errors and moves on with its life.  I use
> MSG_ERRQUEUE to understand the errors.  But the kernel fights back:
> 
> struct sk_buff *__skb_try_recv_datagram(struct sock *sk,
>                                        struct sk_buff_head *queue,
>                                        unsigned int flags, int *off, int *err,
>                                        struct sk_buff **last)
> {
>        struct sk_buff *skb;
>        unsigned long cpu_flags;
>        /*
>         * Caller is allowed not to check sk->sk_err before skb_recv_datagram()
>         */
>        int error = sock_error(sk);
> 
>        if (error)
>                goto no_packet;
>        ^^^^^^^^^^ <----- EXCUSE ME?
> 
> The kernel even fights back on the *send* path?!?
> 
> static long sock_wait_for_wmem(struct sock *sk, long timeo)
> {
>        DEFINE_WAIT(wait);
> 
>        sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk);
>        for (;;) {
>                if (!timeo)
>                        break;
>                if (signal_pending(current))
>                        break;
>                set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
>                ...
>                if (READ_ONCE(sk->sk_err))
>                        break;  <-- KERNEL HATES UNCONNECTED SOCKETS!
> 
> This is IMO just broken.  I realize it's legacy behavior, but it's
> BROKEN legacy behavior.  sk_err does not (at least for an unconnected
> socket) indicate that anything is wrong with the socket.  It indicates
> that something is worthy of notice, and it wants to tell me.
> 
> So:
> 
> 1. sock_wait_for_wmem should IMO just not do that on an unconnected
> socket.  AFAICS it's simply a bug.
> 
> 2. How, exactly, am I supposed to call recvmsg() and, unambiguously,
> find out whether recvmsg() actually failed?  There are actual errors
> (something that indicates that the kernel malfunctioned or the socket
> is broken), errors indicating that the packet being received is busted
> (skb_copy_datagram_msg, for example), and also errors indicating that
> there's an error queued up.
> 
> I would like to know that there's an error queued up.  That's what
> poll and epoll are for, right?  Or a hint from recvmsg() that I should
> call it again with MSG_ERRQUEUE too.  Or it could have a mode where it
> returns a normal datagram *or* an error as appropriate.  But the
> current state of affairs is just brittle and racy.
> 
> Are there any reasonably implementable, non-breaking ways to improve
> the API so that programs that understand socket errors can actually
> function fully correctly without gnarly retry loops in userspace and
> silly heuristics about what errors are actually errors?
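
For concreteness, the kind of userspace retry heuristic the quoted message objects to might look roughly like the sketch below. This is only an illustration: recv_once(), drain_one_error() and the rx_result enum are made-up names, and only recv(), recvmsg(), MSG_ERRQUEUE and MSG_DONTWAIT are real. It assumes a Linux datagram socket, presumably with IP_RECVERR or similar enabled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical names; only the syscalls and MSG_* flags are real. */
enum rx_result {
    RX_DATAGRAM,      /* a datagram was copied into buf */
    RX_NOTHING,       /* EAGAIN: nothing queued */
    RX_QUEUED_ERROR,  /* recv() failed and the error queue had an entry */
    RX_AMBIGUOUS      /* recv() failed and nothing explains why */
};

/* Pop one entry off the socket error queue; returns 1 if there was one. */
static int drain_one_error(int fd)
{
    char data[2048];
    char ctrl[512];
    struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
    struct msghdr msg;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl;
    msg.msg_controllen = sizeof(ctrl);

    /* A real program would walk the cmsgs here to find the error details. */
    return recvmsg(fd, &msg, MSG_ERRQUEUE | MSG_DONTWAIT) >= 0;
}

enum rx_result recv_once(int fd, void *buf, size_t len, ssize_t *out_len)
{
    ssize_t n;
    int saved;

    assert(len == 0 || buf != NULL);

    n = recv(fd, buf, len, MSG_DONTWAIT);
    if (n >= 0) {
        *out_len = n;
        return RX_DATAGRAM;
    }
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return RX_NOTHING;

    /* Heuristic: guess that the failure was sk_err and go look for details. */
    saved = errno;
    if (drain_one_error(fd))
        return RX_QUEUED_ERROR;

    /* Nothing queued: the caller cannot tell a broken socket from a stale sk_err. */
    errno = saved;
    return RX_AMBIGUOUS;
}
```

The RX_AMBIGUOUS branch is the sore spot: nothing ties the errno from recv() to whatever was (or was not) on the error queue, so the caller is left guessing.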

Contemplating this, recvmsg() can send status information back via msg_flags.  Maybe we could characterize a recvmsg() call as doing one of the following things:

1. Actually fails, via -EFAULT or otherwise.  Userspace can get an errno but doesn’t know beyond that what actually went wrong. Should never happen in a correct program. ENOMEM is not in this category.

2. There is nothing to receive. This is -EAGAIN.

3. Received an sk_err error. This is a *success*, and it comes with an error code. Users of RECVERR can’t reliably correlate this with an ERRQUEUE message.  Maybe they don’t care.

4. Received a datagram.

5. Received a queued error message a la ERRQUEUE.

6. Dequeued a datagram (or ERRQUEUE) but did *not* receive it due to a checksum error or other error. (And there should be a clear indication of whether the call succeeded but something was wrong with the message or whether the call *failed* for an unexpected reason but the offending message was nonetheless removed from the socket’s queue.)

Maybe 7: Received a message (or ERRQUEUE), and the checksum was wrong, but the data is being returned anyway.

I suppose that a flag could enable this mode and then all but #1 would return a *success* code from the syscall.  And msg_flags would contain an indication as to what actually happened.
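
To make that concrete, here is a rough sketch of what the userspace side might look like under such a mode. Everything named MSG_X_* is invented for illustration and exists in no kernel; classify_rx() and the rx_kind enum are likewise hypothetical, and where the error code for case 3 would be delivered is left open. The enum values follow the numbering of the list above.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* HYPOTHETICAL msg_flags bits: none of these exist in any kernel. */
#define MSG_X_EMPTY        0x01  /* case 2: nothing to receive */
#define MSG_X_SKERR        0x02  /* case 3: sk_err delivered as a success */
#define MSG_X_ERRQUEUE_MSG 0x04  /* case 5: queued error message */
#define MSG_X_DROPPED      0x08  /* case 6: dequeued but not delivered */
#define MSG_X_BADCSUM      0x10  /* case 7: delivered despite bad checksum */

enum rx_kind {
    RX_FAILED = 1,  /* case 1: the syscall itself failed */
    RX_EMPTY,       /* case 2 */
    RX_SK_ERR,      /* case 3 */
    RX_DATAGRAM,    /* case 4 */
    RX_ERRQUEUE,    /* case 5 */
    RX_DROPPED,     /* case 6 */
    RX_BAD_CSUM     /* case 7 */
};

/* Map a recvmsg() return value plus the returned msg_flags to one case. */
enum rx_kind classify_rx(ssize_t ret, const struct msghdr *msg)
{
    int flags;

    assert(msg != NULL);

    /* Only case 1 would still surface as a syscall failure. */
    if (ret < 0)
        return RX_FAILED;

    flags = msg->msg_flags;
    if (flags & MSG_X_EMPTY)
        return RX_EMPTY;
    if (flags & MSG_X_SKERR)
        return RX_SK_ERR;
    /* Cases 6 and 7 can apply to a datagram or an ERRQUEUE message, */
    /* so they are checked before the ERRQUEUE bit. */
    if (flags & MSG_X_DROPPED)
        return RX_DROPPED;
    if (flags & MSG_X_BADCSUM)
        return RX_BAD_CSUM;
    if (flags & MSG_X_ERRQUEUE_MSG)
        return RX_ERRQUEUE;
    return RX_DATAGRAM;
}
```

The point is that one branch on the return value and one look at msg_flags would replace the retry loop and the guessing about which errno values are really errors.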

Thoughts?  Does io_uring affect any of this?

> 
> Grumpily,
> Andy