On 2023/10/18 1:09, Bob Pearson wrote:
On 9/25/23 20:17, Daisuke Matsuda (Fujitsu) wrote:
On Tue, Sep 26, 2023 12:01 AM, Bart Van Assche wrote:
On 9/24/23 21:47, Daisuke Matsuda (Fujitsu) wrote:
As Bob wrote above, nobody has found any logical failure in rxe
driver.
That's wrong. In case you have not yet noticed my latest email in
this thread, please take a look at
https://lore.kernel.org/linux-rdma/e8b76fae-780a-470e-8ec4-c6b650793d10@xxxxxxxxxxxxx/T/#m0fd8ea8a4cbc27b37b042ae4f8e9b024f1871a73.
I think the report in that email is a 100% proof that there is a
use-after-free issue in the rdma_rxe driver. Use-after-free issues have
security implications and also can cause data corruption. I propose to
revert the commit that introduced the rdma_rxe use-after-free unless
someone comes up with a fix for the rdma_rxe driver.
Bart.
Thank you for the clarification. I see your intention.
I hope the hang issue will be resolved by addressing this.
Thanks,
Daisuke
I have made some progress in understanding the cause of the srp/002 etc. hang.
The two attached files are traces of activity for two qp's, qp#151 and qp#167. In my runs of srp/002,
all the qp's before 167 pass, and 167 is the first to fail, followed by all the qp's after it.
It turns out that all the passing qp's call srp_post_send() some number of times and also call
srp_send_done() the same number of times. Starting at qp#167, the last call to srp_send_done() does
not take place, leaving the srp driver waiting for the final completion and, I believe, causing the hang.
Thanks, Bob
I will delve into your findings and the source code to find the root cause.
BTW, what Linux distribution are you using to find this? Ubuntu, Fedora
or Debian?
From the above, it seems this problem is sometimes difficult to reproduce on
Ubuntu. But it can be reproduced on Ubuntu and Debian.
So can you let me know which Linux distribution you are using?
Thanks
Zhu Yanjun
There are four cq's involved in each pair of qp's in the srp test: two in ib_srp and two in ib_srpt
for the two qp's. Three of them execute completion processing in a soft irq context, so the code in
core/cq.c gathers the completions and calls back to the srp drivers. The send side cq in srp uses
direct polling (IB_POLL_DIRECT), which requires srp to call ib_process_cq_direct() in order to collect
the completions. This happens in __srp_get_tx_iu(), which is called in several places in the srp driver,
but only as a side effect, since the purpose of this routine is to get an iu to start a new command.
In the attached files, for qp#151 the final call to srp_post_send() is followed by the rxe requester and
completer work queues processing the send packet and the ack, before a final call to __srp_get_tx_iu()
which gathers the final send side completion, and the qp passes.
For qp#167 the call to srp_post_send() is followed by the rxe driver processing the send operation and
generating a work completion which is posted to the send cq, but there is never a following call to
__srp_get_tx_iu(), so the cqe is never received by srp and the qp fails.
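To make the two cases concrete, here is a minimal toy model of a directly polled send cq. It is only a sketch of the behaviour described above, not the ib_srp or rxe code: the names DirectCQ, ToyInitiator and run() are hypothetical, and the model assumes every send completes immediately and that completions are collected only as a side effect of asking for a new iu.

```python
from collections import deque


class DirectCQ:
    """Toy cq that is only drained when its owner explicitly polls it."""

    def __init__(self):
        self.pending = deque()  # completions posted by the provider
        self.handled = 0        # completions delivered to the owner

    def post_completion(self, wr_id):
        # Provider side: queue a completion; nothing calls back to the owner.
        self.pending.append(wr_id)

    def process_direct(self):
        # Owner side: drain everything queued so far, return how many.
        drained = 0
        while self.pending:
            self.pending.popleft()
            self.handled += 1
            drained += 1
        return drained


class ToyInitiator:
    """Hypothetical stand-in for the send path of an initiator."""

    def __init__(self, num_ius=2):
        self.cq = DirectCQ()
        self.free_ius = num_ius
        self.posted = 0

    def get_tx_iu(self):
        # Polling happens here only as a side effect of fetching an iu.
        self.free_ius += self.cq.process_direct()
        if self.free_ius == 0:
            return False
        self.free_ius -= 1
        return True

    def post_send(self):
        self.posted += 1
        # Assumption: the provider completes the send immediately.
        self.cq.post_completion(self.posted)


def run(num_sends, final_poll):
    """Return (sends posted, send completions handled)."""
    ini = ToyInitiator()
    for _ in range(num_sends):
        if ini.get_tx_iu():
            ini.post_send()
    if final_poll:
        ini.get_tx_iu()  # the extra call that collects the last completion
    return ini.posted, ini.cq.handled
```

In this model run(3, final_poll=True) handles as many completions as sends were posted, as in the qp#151 trace, while run(3, final_poll=False) leaves the last completion sitting in the cq, as in the qp#167 trace.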
I don't yet understand the logic of the srp driver well enough to fix this, but the problem is not
in the rxe driver as far as I can tell.
Bob