If there is data and the thread didn't wake up, then that is a libc or
kernel problem; but if there is no data, then look for cases where an
earlier interrupted I/O actually consumed the data already, or blame the
sending process, not the receiver.
Also, are the sockets blocking or non-blocking?
The sockets are non-blocking.
Sorry, I made a typo here.
I meant to say that the sockets ARE blocking (the default behavior).
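To rule out any doubt about the mode, the flag can be read back from the descriptor itself. This is only a minimal sketch; the helper name socket_is_nonblocking() is made up for illustration and is not part of our client:

```c
#include <fcntl.h>

/* Hypothetical helper: returns 1 if fd has O_NONBLOCK set,
 * 0 if it is blocking, and -1 if fcntl() fails (errno is set). */
int socket_is_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags == -1)
        return -1;
    return (flags & O_NONBLOCK) ? 1 : 0;
}
```

Calling it on a freshly created socket should return 0, since sockets are blocking unless O_NONBLOCK is set explicitly.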
In practice, we have a thread that has been blocked in recv() for more
than 12 hours, which is far beyond the timeout of the sender's
connection. The socket has already been closed by the sender, so recv()
should at least be notified and return 0.
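As a possible safety net (not something our client does today), a receive timeout can be set on a blocking socket so that recv() cannot hang forever. A minimal sketch, assuming Linux and a hypothetical helper name set_recv_timeout():

```c
#include <sys/socket.h>
#include <sys/time.h>

/* Hypothetical helper: after `seconds` without incoming data, a blocking
 * recv() on fd fails with -1 and errno set to EAGAIN or EWOULDBLOCK.
 * Returns 0 on success, -1 on error (errno is set by setsockopt()). */
int set_recv_timeout(int fd, long seconds)
{
    struct timeval tv;

    tv.tv_sec = seconds;
    tv.tv_usec = 0;
    return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
}
```

This would not explain why recv() is not woken up, but it would bound how long a thread can stay stuck.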
To provide more information:
Running lsof on the receiver shows several ESTABLISHED sockets connected
to a given host/sender. Running lsof on that host shows no sockets
connected to the receiver (they have been closed due to a timeout).
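If the FIN or RST from the sender never reached the receiver (this is only an assumption, nothing in the lsof output proves it), TCP keepalive probes are one way for the receiver's kernel to notice the dead peer and fail the pending recv(). A minimal Linux-specific sketch; the helper name enable_keepalive() and its parameter values are hypothetical:

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: turns on keepalive for a TCP socket.
 * idle_secs:     idle time before the first probe is sent
 * interval_secs: delay between unanswered probes
 * probe_count:   unanswered probes before the connection is dropped
 * Returns 0 on success, -1 on error (errno is set by setsockopt()). */
int enable_keepalive(int fd, int idle_secs, int interval_secs, int probe_count)
{
    int on = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) == -1)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_secs,
                   sizeof(idle_secs)) == -1)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval_secs,
                   sizeof(interval_secs)) == -1)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probe_count,
                   sizeof(probe_count)) == -1)
        return -1;
    return 0;
}
```

For example, enable_keepalive(fd, 60, 10, 5) would start probing after 60 seconds of silence and give up after 5 unanswered probes sent 10 seconds apart.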
Also, the application correctly handles a return value of 0.
The pseudo-code is as follows:
    loop:
        ret = recv()
        if( ret == -1 ) {
            if( errno == EINTR ) goto loop;
            return -1;
        }
        return ret;
Then, at the higher level, if recv() fails or reports end-of-stream
( ret <= 0 ), we close the socket.
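For reference, the retry loop and the close-on-error rule described above can be written as compilable C roughly like this. It is a sketch rather than our actual client code; the names recv_retry() and read_or_close() are made up for illustration:

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Calls recv() again whenever it was interrupted by a signal (EINTR).
 * Returns the byte count, 0 if the peer closed the connection,
 * or -1 on any other error (errno is set by recv()). */
ssize_t recv_retry(int fd, void *buf, size_t len)
{
    ssize_t ret;

    do {
        ret = recv(fd, buf, len, 0);
    } while (ret == -1 && errno == EINTR);
    return ret;
}

/* Higher level: close the socket on error or end-of-stream (ret <= 0). */
ssize_t read_or_close(int fd, void *buf, size_t len)
{
    ssize_t ret = recv_retry(fd, buf, len);

    if (ret <= 0)
        close(fd);
    return ret;
}
```

The do/while form behaves the same as the goto loop: only EINTR triggers a retry, and both 0 and -1 are passed up to the caller.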
At first we were using libmysqlclient, but since we hit the bug with it,
we rewrote a MySQL client so that we could more easily check what is
happening. The same bug seems to occur with both implementations.
Best,
Nicolas