Re: rxe panic

Hi Leon,

I don't understand what you mean. Are you saying that the
rxe_add_ref(qp) call is not needed?
My kernel is old, and I found several rxe bugs on 4.14.97, especially
the RNR errors.
I cannot upgrade the whole kernel because there are many dependencies,
so I backported the fixes from the newest kernel version to 4.14.97.

When I compared my rxe_resp.c with kernel 5.2.9, I found that the
duplicate_request snippet has changed.
Also, rxe_xmit_packet calls rxe_send, which prints the log "rdma_rxe:
Unknown layer 3 protocol: 0".

	} else {
		struct resp_res *res;

		/* Find the operation in our list of responder resources. */
		res = find_resource(qp, pkt->psn);
		if (res) {
			struct sk_buff *skb_copy;

			skb_copy = skb_clone(res->atomic.skb, GFP_ATOMIC);
			if (skb_copy) {
				rxe_add_ref(qp); /* for the new SKB */
			} else {
				pr_warn("Couldn't clone atomic resp\n");
				rc = RESPST_CLEANUP;
				goto out;
			}

			/* Resend the result. */
			rc = rxe_xmit_packet(to_rdev(qp->ibqp.device), qp,
					     pkt, skb_copy);
			if (rc) {
				pr_err("Failed resending result. This flow is not handled - skb ignored\n");
				rxe_drop_ref(qp);
				rc = RESPST_CLEANUP;
				goto out;
			}
		}

		/* Resource not found. Class D error. Drop the request. */
		rc = RESPST_CLEANUP;
		goto out;
	}
out:
	return rc;
}

On Wed, Dec 25, 2019 at 2:33 PM Leon Romanovsky <leon@xxxxxxxxxx> wrote:
>
> On Wed, Dec 25, 2019 at 12:55:35PM +0800, Frank Huang wrote:
> > hi, there is a panic on rdma_rxe module when the restart
> > network.service or shutdown the switch.
> >
> > it looks like a use-after-free error.
> >
> > everytime it happens, there is the log "rdma_rxe: Unknown layer 3 protocol: 0"
>
> The error print itself is harmless.
> >
> > is it a known error?
> >
> > my kernel version is 4.14.97
>
> Your kernel is old enough and doesn't include refcount,
> so I can't say for sure that it is the case, but the
> following code is not correct and with refcount debug
> it will be seen immediately.
>
> 1213 int rxe_responder(void *arg)
> 1214 {
> 1215         struct rxe_qp *qp = (struct rxe_qp *)arg;
> 1216         struct rxe_dev *rxe = to_rdev(qp->ibqp.device);
> 1217         enum resp_states state;
> 1218         struct rxe_pkt_info *pkt = NULL;
> 1219         int ret = 0;
> 1220
> 1221         rxe_add_ref(qp); <------ USE-AFTER-FREE
> 1222
> 1223         qp->resp.aeth_syndrome = AETH_ACK_UNLIMITED;
> 1224
> 1225         if (!qp->valid) {
> 1226                 ret = -EINVAL;
> 1227                 goto done;
> 1228         }
>
> Thanks
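
For reference, the refcount point quoted above can be illustrated with a
minimal userspace sketch. It is not rxe code: struct obj,
obj_get_unless_zero() and obj_put() are hypothetical names, and C11 atomics
stand in for the kernel's reference counting. The idea is that a plain "add
a reference" on an object whose count has already reached zero touches
freed memory, while a checked get (the pattern behind the kernel's
kref_get_unless_zero()) refuses the reference instead.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a reference-counted object such as a QP. */
struct obj {
	atomic_int refs;
};

/* Take a reference only if the object is still alive (count > 0). */
static inline bool obj_get_unless_zero(struct obj *o)
{
	int old = atomic_load(&o->refs);

	while (old > 0) {
		/* On failure, old is reloaded with the current count. */
		if (atomic_compare_exchange_weak(&o->refs, &old, old + 1))
			return true;
	}
	/* Count already hit zero: the object is gone or being freed. */
	return false;
}

/* Drop a reference; returns true if this was the last one. */
static inline bool obj_put(struct obj *o)
{
	return atomic_fetch_sub(&o->refs, 1) == 1;
}
```

In this sketch, a caller that gets false from obj_get_unless_zero() must
not touch the object at all. Whether the responder path can be restructured
this way on 4.14.97 is an open question here, not something the sketch
answers.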



