The sk_err mechanism is infuriating in userspace

Andy Lutomirski <luto@xxxxxxxxxxxxxx> · Mon, 5 Feb 2024 15:03:05 -0800

Hi all-

I run into this issue every couple of years, it still hasn't gone
away, and it drives me nuts every time I see it.

I write software that uses unconnected datagram-style sockets.  Errors
happen for all kinds of reasons, and my software knows it.  My
software even handles the errors and moves on with its life.  I use
MSG_ERRQUEUE to understand the errors.  But the kernel fights back:

struct sk_buff *__skb_try_recv_datagram(struct sock *sk,
                                        struct sk_buff_head *queue,
                                        unsigned int flags, int *off, int *err,
                                        struct sk_buff **last)
{
        struct sk_buff *skb;
        unsigned long cpu_flags;
        /*
         * Caller is allowed not to check sk->sk_err before skb_recv_datagram()
         */
        int error = sock_error(sk);

        if (error)
                goto no_packet;
        ^^^^^^^^^^ <----- EXCUSE ME?

The kernel even fights back on the *send* path?!?

static long sock_wait_for_wmem(struct sock *sk, long timeo)
{
        DEFINE_WAIT(wait);

        sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk);
        for (;;) {
                if (!timeo)
                        break;
                if (signal_pending(current))
                        break;
                set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
                ...
                if (READ_ONCE(sk->sk_err))
                        break;  <-- KERNEL HATES UNCONNECTED SOCKETS!

This is IMO just broken.  I realize it's legacy behavior, but it's
BROKEN legacy behavior.  sk_err does not (at least for an unconnected
socket) indicate that anything is wrong with the socket.  It indicates
that something is worthy of notice, and it wants to tell me.

So:

1. sock_wait_for_wmem should IMO just not do that on an unconnected
socket.  AFAICS it's simply a bug.

2. How, exactly, am I supposed to call recvmsg() and, unambiguously,
find out whether recvmsg() actually failed?  There are actual errors
(something that indicates that the kernel malfunctioned or the socket
is broken), errors indicating that the packet being received is busted
(skb_copy_datagram_msg, for example), and also errors indicating that
there's an error queued up.

I would like to know that there's an error queued up.  That's what
poll and epoll are for, right?  Or a hint from recvmsg() that I should
call recvmsg() with MSG_ERRQUEUE too.  Or it could have a mode where it
returns a
normal datagram *or* an error as appropriate.  But the current state
of affairs is just brittle and racy.
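
As a rough illustration of the poll() side, the sketch below waits on
a socket and reports separately whether a datagram is readable and
whether POLLERR is set.  The function name wait_for_socket_event() and
the SOCK_EVENT_* constants are made up for this example.  This only
shows what poll() can report today; it does not close the race, since
sk_err can be set between the poll() and the following recvmsg().

```c
#include <poll.h>
#include <sys/socket.h>

/* Hypothetical result bits for wait_for_socket_event(). */
enum {
        SOCK_EVENT_DATA  = 1,   /* a datagram can be read */
        SOCK_EVENT_ERROR = 2,   /* POLLERR: an error is pending */
};

/* Hypothetical helper: wait up to timeout_ms for the socket to become
 * interesting.  Returns a mask of SOCK_EVENT_* bits, 0 on timeout, or
 * -1 if poll() itself failed. */
static int wait_for_socket_event(int fd, int timeout_ms)
{
        struct pollfd p = { .fd = fd, .events = POLLIN };
        int what = 0;

        if (poll(&p, 1, timeout_ms) < 0)
                return -1;

        if (p.revents & POLLIN)
                what |= SOCK_EVENT_DATA;
        /* POLLERR is reported even though it is not in .events. */
        if (p.revents & POLLERR)
                what |= SOCK_EVENT_ERROR;
        return what;
}
```

On SOCK_EVENT_ERROR the caller would read the error queue with
recvmsg(MSG_ERRQUEUE) before going back to normal receives.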

Are there any reasonably implementable, non-breaking ways to improve
the API so that programs that understand socket errors can actually
function fully correctly without gnarly retry loops in userspace and
silly heuristics about what errors are actually errors?

Grumpily,
Andy