Re: [PATCH for-next 9/9] RDMA/rxe: Protect pending send packets

Jason Gunthorpe <jgg@xxxxxxxxxx> · Wed, 2 Aug 2023 11:57:55 -0300

On Wed, Aug 02, 2023 at 09:39:55AM -0500, Bob Pearson wrote:
> On 8/1/23 17:56, Jason Gunthorpe wrote:
> > On Mon, Jul 31, 2023 at 01:44:47PM -0500, Bob Pearson wrote:
> >> On 7/31/23 13:32, Jason Gunthorpe wrote:
> >>> On Mon, Jul 31, 2023 at 01:26:23PM -0500, Bob Pearson wrote:
> >>>> On 7/31/23 13:17, Jason Gunthorpe wrote:
> >>>>> On Fri, Jul 21, 2023 at 03:50:22PM -0500, Bob Pearson wrote:
> >>>>>> Network interruptions may cause long delays in the processing of
> >>>>>> send packets during which time the rxe driver may be unloaded.
> >>>>>> This will cause seg faults when the packet is ultimately freed as
> >>>>>> it calls the destructor function in the rxe driver. This has been
> >>>>>> observed in cable pull fail over fail back testing.
> >>>>>
> >>>>> No, module reference counts are only for code that is touching
> >>>>> function pointers.
> >>>>
> >>>> this is exactly the case here. it is the skb destructor function that
> >>>> is carried by the skb.
> >>>
> >>> It can't possibly call it correctly without also having the rxe
> >>> ib_device reference too though??
> >>
> >> Nope. This was causing seg faults in testing when there was a long network
> >> hang and the admin tried to reload the rxe driver. The skb code doesn't care
> >> about the ib device at all.
> > 
> > I don't get it, there aren't globals in rxe, so WTF is it doing if it
> > isn't somehow tracing back to memory that is under the ib_device
> > lifetime?
> > 
> > Jason
> 
> When the rxe driver builds a send packet it puts the address of its destructor
> subroutine in the skb before calling ip_local_out and sending it. The address of
> driver software is now hanging around. If you don't delay the module exit routine
> until all the skb's are freed you can cause seg faults. The only way to cause this to
> happen is to call rmmod on the driver too early but people have done this occasionally
> and report it as a bug.

You are missing the point, the destructor currently does this:

static void rxe_skb_tx_dtor(struct sk_buff *skb)
{
	struct sock *sk = skb->sk;
	struct rxe_qp *qp = sk->sk_user_data;

So you've already UAF'd because rxe_qp is freed memory well before you
get to unloading the module.

This series changed it to do this:

 static void rxe_skb_tx_dtor(struct sk_buff *skb)
 {
	struct rxe_dev *rxe;
	unsigned int index;
	struct rxe_qp *qp;
	int skb_out;

	/* takes a ref on ib device if success */
	rxe = get_rxe_from_skb(skb);
	if (!rxe)
		goto out;

static struct rxe_dev *get_rxe_from_skb(struct sk_buff *skb)
{
	struct rxe_dev *rxe;
	struct net_device *ndev = skb->dev;

	rxe = rxe_get_dev_from_net(ndev);
	if (!rxe && is_vlan_dev(ndev))
		rxe = rxe_get_dev_from_net(vlan_dev_real_dev(ndev));

Which seems totally nutz, you are now relying on the global hash table
in ib_core to resolve the ib device.

Again, why can't this code do something sane like refcount the qp or
ib_device so the destruction doesn't progress until all the SKBs are
flushed out?

Jason