On Thu, Apr 01, 2021 at 01:39:05PM +0200, Toke Høiland-Jørgensen wrote:
> Vladimir Oltean <olteanv@xxxxxxxxx> writes:
>
> > On Thu, Apr 01, 2021 at 01:26:02PM +0200, Toke Høiland-Jørgensen wrote:
> >> > +int enetc_xdp_xmit(struct net_device *ndev, int num_frames,
> >> > +		   struct xdp_frame **frames, u32 flags)
> >> > +{
> >> > +	struct enetc_tx_swbd xdp_redirect_arr[ENETC_MAX_SKB_FRAGS] = {0};
> >> > +	struct enetc_ndev_priv *priv = netdev_priv(ndev);
> >> > +	struct enetc_bdr *tx_ring;
> >> > +	int xdp_tx_bd_cnt, i, k;
> >> > +	int xdp_tx_frm_cnt = 0;
> >> > +
> >> > +	tx_ring = priv->tx_ring[smp_processor_id()];
> >>
> >> What mechanism guarantees that this won't overflow the array? :)
> >
> > Which array, the array of TX rings?
>
> Yes.

The problem isn't even that we'd access an out-of-bounds element of the
TX ring array. As it turns out, I had a fairly superficial understanding
of how things are organized, so let me try to explain.

The number of TX rings is a configurable resource (shared between PFs
and VFs), and we read the capability at probe time:

enetc_get_si_caps:
	val = enetc_rd(hw, ENETC_SICAPR0);
	si->num_rx_rings = (val >> 16) & 0xff;
	si->num_tx_rings = val & 0xff;

enetc_init_si_rings_params:
	priv->num_tx_rings = si->num_tx_rings;

In any case, the TX ring array is declared as:

struct enetc_ndev_priv {
	struct enetc_bdr *tx_ring[16];
	struct enetc_bdr *rx_ring[16];
};

because that's the maximum hardware capability.

The priv->tx_ring array is populated in:

enetc_alloc_msix:
	/* # of tx rings per int vector */
	v_tx_rings = priv->num_tx_rings / priv->bdr_int_num;

	for (i = 0; i < priv->bdr_int_num; i++) {
		for (j = 0; j < v_tx_rings; j++) {
			if (priv->bdr_int_num == ENETC_MAX_BDR_INT)
				idx = 2 * j + i; /* 2 CPUs */
			else
				idx = j + i * v_tx_rings; /* default */

			priv->tx_ring[idx] = bdr;
		}
	}

priv->bdr_int_num is set to num_online_cpus(). On LS1028A it can be
either 1 or 2 (and the ENETC_MAX_BDR_INT macro is equal to 2).

In other words, the convoluted logic above does the following:
- It affines one MSI interrupt vector per CPU
- It affines one RX ring per MSI vector, hence per CPU
- It balances the fixed number of TX rings (say 8) among the available
  MSI vectors, hence CPUs (say 2). It does this by iterating with i
  through the RX MSI interrupt vectors, and with j through the number of
  TX rings per MSI vector.

This logic maps:
- the even TX rings to CPU 0 and the odd TX rings to CPU 1, if 2 CPUs
  are used
- all TX rings to CPU 0, if 1 CPU is used

This is done because we have this logic in enetc_poll:

	for (i = 0; i < v->count_tx_rings; i++)
		if (!enetc_clean_tx_ring(&v->tx_ring[i], budget))
			complete = false;

for processing the TX completions of a given group of TX rings in the RX
MSI interrupt handler of a certain CPU.

In other words, priv->tx_ring[i] is always BD ring i, and that mapping
never changes. All 8 TX rings are enabled and available for use.
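Just to convince myself that the even/odd split really is what the index
arithmetic produces, here is a tiny stand-alone program (userspace, not
driver code; the 8 rings / 2 vectors are simply the example values from
above) that replays the idx computation from enetc_alloc_msix:

	/* Sketch only: replay the ring-to-CPU index math for 8 TX rings
	 * and 2 MSI vectors (one per CPU), i.e. the LS1028A dual core case.
	 */
	#include <stdio.h>

	int main(void)
	{
		int num_tx_rings = 8, bdr_int_num = 2; /* == ENETC_MAX_BDR_INT */
		int v_tx_rings = num_tx_rings / bdr_int_num;
		int i, j;

		for (i = 0; i < bdr_int_num; i++)
			for (j = 0; j < v_tx_rings; j++)
				printf("TX ring %d -> CPU %d\n", 2 * j + i, i);

		return 0;
	}

It prints TX rings 0, 2, 4, 6 on CPU 0 and TX rings 1, 3, 5, 7 on CPU 1,
which matches the "even rings to CPU 0, odd rings to CPU 1" description.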
What I knew about tc-taprio and tc-mqprio is that they only enqueue to
TX queues [0, num_tc-1], and as it turns out, that is because of this:

enetc_xmit:
	tx_ring = priv->tx_ring[skb->queue_mapping];

where skb->queue_mapping is bounded by:

	err = netif_set_real_num_tx_queues(ndev, priv->num_tx_rings);

and, respectively, by this from the mqprio code path:

	netif_set_real_num_tx_queues(ndev, num_tc);

As for why XDP works: priv->tx_ring[smp_processor_id()] is
- TX ring 0 for CPU 0 and TX ring 1 for CPU 1, if 2 CPUs are used
- TX ring 0, if 1 CPU is used

and the TX completions for those rings are handled by
- CPU 0 for TX ring 0 (because it is even) and CPU 1 for TX ring 1
  (because it is odd), if 2 CPUs are used, due to the mapping I talked
  about earlier
- CPU 0, if only 1 CPU is used

> > You mean that it's possible to receive a TC_SETUP_QDISC_MQPRIO or
> > TC_SETUP_QDISC_TAPRIO with num_tc == 1, and we have 2 CPUs?
>
> Not just that, this ndo can be called on arbitrary CPUs after a
> redirect. The code just calls through from the XDP receive path, so
> which CPU it ends up on depends on the RSS+IRQ config of the other
> device, which may not even be the same driver; i.e., you have no
> control over that... :)

What do you mean by "arbitrary" CPU? You can't plug in more CPUs, it's a
dual core system... And why does the source ifindex matter at all? I'm
using the TX ring affined to the CPU that ndo_xdp_xmit is currently
running on.

> > Well, yeah, I don't know what's the proper way to deal with that. Ideas?
>
> Well the obvious one is just:
>
> 	tx_ring = priv->tx_ring[smp_processor_id() % num_ring_ids];
>
> and then some kind of locking to deal with multiple CPUs accessing the
> same TX ring...

By multiple CPUs accessing the same TX ring, do you mean locking between
ndo_xdp_xmit and ndo_start_xmit? Can that even happen when the hardware
is architected to have at least as many TX rings as CPUs? Other than
that, as far as I can see, ndo_xdp_xmit is only called from
xdp_do_flush, which runs in softirq context, and softirqs, to my (very
rudimentary) knowledge, run with bottom halves, and thus preemption,
disabled. So I don't think it's possible for ndo_xdp_xmit and
ndo_start_xmit, or even for two ndo_xdp_xmit instances, to access the
same TX ring?

Sorry, I'm sure these are trivial questions, but I would like to really
understand what I need to change and why :D
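In case it helps, this is roughly how I picture the change you are
suggesting (I'm reading your num_ring_ids as priv->num_tx_rings). Purely
a sketch on my side: the per-ring "xdp_tx_lock" spinlock is a
hypothetical field that does not exist in the driver today, and would
only matter if two CPUs really can fold onto the same ring:

	/* Hypothetical sketch, not a real patch: clamp the ring index to
	 * the number of TX rings we actually have, and serialize the
	 * enqueue in case two CPUs map to the same ring.
	 */
	int enetc_xdp_xmit(struct net_device *ndev, int num_frames,
			   struct xdp_frame **frames, u32 flags)
	{
		struct enetc_ndev_priv *priv = netdev_priv(ndev);
		struct enetc_bdr *tx_ring;
		int xdp_tx_frm_cnt = 0;

		tx_ring = priv->tx_ring[smp_processor_id() % priv->num_tx_rings];

		spin_lock(&tx_ring->xdp_tx_lock); /* hypothetical field */
		/* ... enqueue the frames on tx_ring as in the original patch,
		 * updating xdp_tx_frm_cnt ...
		 */
		spin_unlock(&tx_ring->xdp_tx_lock);

		return xdp_tx_frm_cnt;
	}

Is that the kind of thing you had in mind, or is the lock unnecessary
for the reasons above?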