Hi Wengang,
On 3/30/2016 9:19 AM, Leon Romanovsky wrote:
On Wed, Mar 30, 2016 at 05:08:22PM +0800, Wengang Wang wrote:
The problem is that some of many parallel RDS communications hang.
In my test, ten or so of 33 communications hang. The send requests got the
-ENOBUFS error, meaning the peer socket (port) is congested, even though the
peer socket (port) is in fact not congested.
The congestion map can be updated in two paths: one is the rds_recvmsg path
and the other is the path that receives packets from the hardware. There is no
synchronization when updating the congestion map, so a bit operation (clearing)
in the rds_recvmsg path can be lost to another bit operation (setting) in the
hardware packet receiving path.
The fix is to add a spin lock per congestion map to synchronize updates to it.
No performance drop was found while testing the fix.
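
The lost update described above can be replayed deterministically. This is
only a minimal sketch: it assumes both paths update the same map word with a
plain (non-atomic) load, modify, store sequence, and the function name and the
one-word map are hypothetical, not taken from the RDS code.

```c
#include <stdint.h>

/*
 * Replay one bad interleaving of two non-atomic read-modify-write updates
 * on the same map word (hypothetical helper, for illustration only).
 *
 * Path A (think rds_recvmsg) clears clear_bit.
 * Path B (think hardware receive) sets set_bit.
 * Returns the final value of the word.
 */
static uint64_t replay_lost_clear(uint64_t map, unsigned clear_bit,
                                  unsigned set_bit)
{
    uint64_t a = map;                    /* A loads the word           */
    uint64_t b = map;                    /* B loads the same old value */

    a &= ~(UINT64_C(1) << clear_bit);    /* A clears its bit locally   */
    b |= UINT64_C(1) << set_bit;         /* B sets its bit locally     */

    map = a;                             /* A stores: bit is cleared   */
    map = b;                             /* B stores its stale copy:   */
                                         /* A's clear is overwritten   */
    return map;
}
```

Starting from a word with only bit 3 set, replay_lost_clear(word, 3, 5)
returns a word with bits 3 and 5 both set, so the port behind bit 3 still
looks congested to senders.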
I assume that this change fixed your issue; however, it looks suspicious
that performance didn't change.
First of all, thanks for finding the issue and posting a patch
for it. I do agree with Leon's comment on performance.
We shouldn't need locks for map updates.
Moreover, the parallel receive path on which this patch
is based doesn't exist in upstream code. I have kept
that out so far because of issues similar to the one you
encountered.
Anyway, let's discuss the fix offline, even for the
downstream kernel. I suspect we can address it without locks.
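
As a rough illustration of one lock-free direction (not necessarily the
fix that will come out of the discussion), each bit update can be made a
single atomic read-modify-write. The sketch below uses C11 atomics in
user space; struct cong_map, CONG_MAP_WORDS and the cong_* helpers are
hypothetical names, not the RDS data structures.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical map: one bit per 16-bit port, 64 ports per word. */
#define CONG_MAP_WORDS (65536 / 64)

struct cong_map {
    _Atomic uint64_t words[CONG_MAP_WORDS];
};

/* Mark a port congested; the OR is one indivisible update of the word. */
static void cong_set(struct cong_map *m, uint16_t port)
{
    atomic_fetch_or(&m->words[port / 64], UINT64_C(1) << (port % 64));
}

/* Mark a port uncongested; the AND cannot be undone by a stale store. */
static void cong_clear(struct cong_map *m, uint16_t port)
{
    atomic_fetch_and(&m->words[port / 64], ~(UINT64_C(1) << (port % 64)));
}

/* Report whether a port is currently marked congested. */
static bool cong_test(struct cong_map *m, uint16_t port)
{
    return (atomic_load(&m->words[port / 64]) >> (port % 64)) & 1;
}
```

With updates of this kind, a set of one bit and a clear of another bit in
the same word cannot overwrite each other, whichever order they run in.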
Regards,
Santosh