Re: nvme-tcp: fix a possible UAF when failing to send request

"Maurizio Lombardi" <mlombard@xxxxxxxxxxxxxxx> · Wed, 12 Feb 2025 16:33:41 +0100

On Mon Feb 10, 2025 at 8:41 AM CET, zhang.guanghui@xxxxxxxx wrote:
> Hello
>
>     When using the nvme-tcp driver in a storage cluster, the driver may trigger a null pointer dereference, which has crashed the host several times.
> By analyzing the vmcore, we know the direct cause is that request->mq_hctx was used after being freed.
>
> CPU1                                                                   CPU2
>
> nvme_tcp_poll                                                          nvme_tcp_try_send  --failed to send request 13
>
>     nvme_tcp_try_recv                                                      nvme_tcp_fail_request
>
>         nvme_tcp_recv_skb                                                      nvme_tcp_end_request
>
>             nvme_tcp_recv_pdu                                                      nvme_complete_rq 
>
>                 nvme_tcp_handle_comp                                                   nvme_retry_req -- request->mq_hctx has been freed, is NULL.
>                     nvme_tcp_process_nvme_cqe
>
>                         nvme_complete_rq
>
>                             nvme_end_req
>
>                                   blk_mq_end_request

Taking a step back: in the trace above, both CPUs end up calling
nvme_complete_rq() on the same request. Let's take a different approach
and try to avoid the double completion.

The problem here is that apparently we received a nvme_tcp_rsp capsule
from the target, meaning that the command has been processed (I guess
the capsule has an error status?)

So maybe only part of the command has been sent?
Why do we receive the rsp capsule at all? Shouldn't this be treated as a
fatal error by the controller?

Maurizio