Re: can: isotp: epoll breaks isotp_sendmsg

Oliver Hartkopp <socketcan@xxxxxxxxxxxx> · Sun, 20 Aug 2023 21:51:37 +0200

Hi Lukas,

On 18.08.23 13:53, Lukas Magel wrote:
On 13.08.23 13:23, Lukas Magel wrote:
Hi Maxime, hi Michal,

On 01.07.23 00:35, Michal Sojka wrote:
Hi Maxime,

On Fri, Jun 30 2023, Maxime Jayat wrote:
Hi,

There is something not clear happening with the non-blocking behavior
of ISO-TP sockets in the TX path, but more importantly, using epoll now
completely breaks isotp_sendmsg.
I believe it is related to
79e19fa79c ("can: isotp: isotp_ops: fix poll() to not report false
EPOLLOUT events"),
but actually is probably deeper than that.

I don't completely understand what is exactly going on, so I am sharing
the problem I face:

With an ISO-TP socket in non-blocking mode, using epoll seems to make
isotp_sendmsg always return -EAGAIN.
That's definitely not expected behavior. I tested the patch only with
poll, hoping that epoll would behave the same.

[...]

I am writing to report that we have been witnessing a behavior very similar
to what you describe. ISO-TP send breaks with EAGAIN if a poll (for a read)
occurs at the same time.

Our Python stack uses two threads to do concurrent, blocking reads & writes
from and to the ISO-TP socket. The socket has a timeout of 0.1s to facilitate
shutdown when requested by the application. Notably, the blocking semantics
are handled by CPython, i.e. the underlying kernel socket is non-blocking.
CPython polls until the requested operation (read or write) can be executed
or the timeout occurs.

What happens during execution is that the socket is continuously being
polled by the read thread, i.e. so->wait is always filled with one task.
This process repeats until the socket receives a frame from the bus and the
poll returns successsfully. The app reads the data from the socket and
sends a response. Since the send occurs in a different thread, the reader
thread will have already returned to its poll loop and to populating
so->wait. When the send occurs, isotp_sendmsg checks so->wait for sleepers
and returns EAGAIN because the socket is non-blocking although there is no
concurrent send operation. This dance continues until the timeout occurs for
either the read or the write operation. If the write times out first, a
timeout error causes the Python app to break. If the read times out first,
there is a race that the write goes through or the reader puts in
another poll.

This behavior can be seen rather nicely in strace:
110580 poll([{fd=5, events=POLLIN}], 1, 100) = 0 (Timeout)
110580 poll([{fd=5, events=POLLIN}], 1, 100) = 0 (Timeout)
110580 poll([{fd=5, events=POLLIN}], 1, 100) = 1 ([{fd=5, revents=POLLIN}])
110580 recvfrom(5, ">\0", 4095, 0, NULL, NULL) = 2
110580 poll([{fd=5, events=POLLIN}], 1, 100 <unfinished ...>
110569 poll([{fd=5, events=POLLOUT}], 1, 100) = 1 ([{fd=5, revents=POLLOUT}])
110569 sendto(5, "~\0", 2, 0, NULL, 0)  = -1 EAGAIN (Resource temporarily unavailable)
110569 poll([{fd=5, events=POLLOUT}], 1, 100) = 1 ([{fd=5, revents=POLLOUT}])
110569 sendto(5, "~\0", 2, 0, NULL, 0)  = -1 EAGAIN (Resource temporarily unavailable)
110569 poll([{fd=5, events=POLLOUT}], 1, 100) = 1 ([{fd=5, revents=POLLOUT}])
110569 sendto(5, "~\0", 2, 0, NULL, 0)  = -1 EAGAIN (Resource temporarily unavailable)
[....]
110569 poll([{fd=5, events=POLLOUT}], 1, 6) = 1 ([{fd=5, revents=POLLOUT}])
110569 sendto(5, "~\0", 2, 0, NULL, 0)  = -1 EAGAIN (Resource temporarily unavailable)
110569 poll([{fd=5, events=POLLOUT}], 1, 6 <unfinished ...>
110580 <... poll resumed>)              = 0 (Timeout)
110569 <... poll resumed>)              = 1 ([{fd=5, revents=POLLOUT}])
110580 poll([{fd=5, events=POLLIN}], 1, 100 <unfinished ...>
110569 sendto(5, "~\0", 2, 0, NULL, 0)  = -1 EAGAIN (Resource temporarily unavailable)
110569 poll([{fd=5, events=POLLOUT}], 1, 5) = 1 ([{fd=5, revents=POLLOUT}])
110569 sendto(5, "~\0", 2, 0, NULL, 0)  = -1 EAGAIN (Resource temporarily unavailable)

I believe this is consistent to the behavior you're witnessing with epoll
because epoll also places a sleeper in so->wait that is left in the queue
until the epoll descriptor is closed.

By reverting 79e19fa79c, I get better results but still incorrect:
[...]

It is then possible to write on the socket but the write is blocking,
which is not the expected behavior for a non-blocking socket.
Yes, incorrect behavior was why we made the commit in question, however
we saw write() returning -EAGAIN when it shouldn't.

I don't know how to solve the problem. To me, using wq_has_sleeper seems
weird.
Agreed. I've never tried to understand how synchronization works here.
Hopefully, Oliver knows more.

The implementation of isotp_poll feels weird too (calling both
datagram_poll and
poll_wait?). But I am not sure what would be the correct
implementation.
I understand it as follows (which might be wrong - someone, please
correct me), isotp_poll() should register the file with all waitqueues
it can wait on. so->wait is one and sock->sq.wait (used by
datagram_poll) is another. The former is definitely used for TX, the
latter is probably used because skb_recv_datagram() is called for RX.
But so->wait is also used for RX and there might proabbly be be some
inconsistency between those.

AFAIK, it is correct behavior for isotp_poll to register all wait queues
with the poller.

Before 79e19fa79c, I assume datagram_poll always returned
EPOLLOUT because its corresponding send buffer is unused and empty. This
return value can be incorrect if so->tx.state is not IDLE and a send would
block. With the patch, this behavior is now suppressed. I believe that
the inconsistency could have rather been introduced with:
can: isotp: fix race between isotp_sendsmg() and isotp_release() 0517374

With this patch, the behavior of isotp_sendmsg was changed to only check
so->wait for sleepers instead of consulting so->tx.state to see if the
socket is busy. Since the wait queue can also have sleepers only interested
in read operations, I believe the return value is not a valid indicator
of send readiness. Additionally, in this state, the behavior in isotp_sendmsg
is inconsistent with isotp_poll. I will try to test next week if reverting this
part of the patch could fix the race condition.

I submitted a patch to the mailing list and would very much
appreciate your feedback:
https://lore.kernel.org/linux-can/20230818114345.142983-1-lukas.magel@xxxxxxxxxx/

@Maxime, I tried your isotprecv / isotpsend snippet and the issue did
not occur with the supplied patch. The send operation went through right
away.

@Oliver I adjusted the exit path for the case where the initial wait is
interrupted to return immediately instead of jumping to err_event_drop.
Could you please check if you would agree with this change?

The code has really won with your change! Thanks!

But as you already assumed I have a problem with the handling of the 
cleanup when a signal interrupts the wait_event_interruptible() statement.

I think it should still be:

/* wait for complete transmission of current pdu */
err = wait_event_interruptible(so->wait, so->tx.state == ISOTP_IDLE);
if (err)
        goto err_event_drop;

as we need to make sure that the state machine is set to defined values 
and states for the next isotp_sendmsg() attempt.

Best regards,
Oliver