Suspicious code in rxe requester and completer tasklets

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



I am continuing to chase some bugs seen here at HPE which were centered on spurious retries.
In the course of this I have been looking carefully at data sharing between the request side
tasklets (requester and completer) and have noticed the following.

Under light load the two tasklets call/schedule each other so tend to run on the same cpu
most of the time. Under heavy load the completer tasklet can get scheduled from the NAPI thread
and run multiple cycles while the requester tasklet gets scheduled from the post send verbs API
and can also run multiple cycles thus you can see two different cpus remaining active. So
basically we can make no assumption about the relative timing of shared data accesses between
the two tasklets and any other threads that also access shared data.

The code is written as though protection was an afterthought. There are a few cases where
atomic_t are used and in one case a lock is added but in other cases no effort is made to
protect shared data accesses. A good example is in complete_ack() in rxe_comp.c which has

        if (wqe->has_rd_atomic) {

                wqe->has_rd_atomic = 0;

                atomic_inc(&qp->req.rd_atomic);

                if (qp->req.need_rd_atomic) {

                        qp->comp.timeout_retry = 0;

                        qp->req.need_rd_atomic = 0;

                        rxe_run_task(&qp->req.task, 0);

                }

        }


and check_init_depth() in rxe_req.c

static inline int check_init_depth(struct rxe_qp *qp, struct rxe_send_wqe *wqe)

{

        int depth;



        if (wqe->has_rd_atomic)

                return 0;



        qp->req.need_rd_atomic = 1;

        depth = atomic_dec_return(&qp->req.rd_atomic);



        if (depth >= 0) {

                qp->req.need_rd_atomic = 0;

                wqe->has_rd_atomic = 1;

                return 0;

        }



        atomic_inc(&qp->req.rd_atomic);

        return -EAGAIN;

}


It makes my head hurt trying to figure out all the combinations in this mess. Both of
these can run multiple times as well with retries with arbitrary relative timing.

This code may be perfect but it feels quite unsafe to me.

Bob 



[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Photo]     [Yosemite News]     [Yosemite Photos]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux