On Mon, Dec 5, 2022 at 9:44 PM Justin Iurman <justin.iurman@xxxxxxxxx> wrote:
>
> On 12/5/22 19:34, Eric Dumazet wrote:
> > On Mon, Dec 5, 2022 at 7:24 PM Justin Iurman <justin.iurman@xxxxxxxxx> wrote:
> >>
> >> On 12/5/22 17:50, Eric Dumazet wrote:
> >>> On Mon, Dec 5, 2022 at 5:30 PM Eric Dumazet <edumazet@xxxxxxxxxx> wrote:
> >>>>
> >>>> Patch title seems
> >>>>
> >>>> On Mon, Dec 5, 2022 at 4:36 PM Justin Iurman <justin.iurman@xxxxxxxxx> wrote:
> >>>>>
> >>>>> This patch fixes a NULL qdisc pointer when retrieving the TX queue depth
> >>>>> for IOAM.
> >>>>>
> >>>>> IMPORTANT: I suspect this fix is local only and the bug goes deeper (see
> >>>>> reasoning below).
> >>>>>
> >>>>> Kernel panic:
> >>>>> [...]
> >>>>> RIP: 0010:ioam6_fill_trace_data+0x54f/0x5b0
> >>>>> [...]
> >>>>>
> >>>>> ...which basically points to the call to qdisc_qstats_qlen_backlog
> >>>>> inside net/ipv6/ioam6.c.
> >>>>>
> >>>>> From there, I directly thought of a NULL pointer (queue->qdisc). To make
> >>>>> sure, I added some printk's to know exactly *why* and *when* it happens.
> >>>>> Here is the (summarized by queue) output:
> >>>>>
> >>>>> skb for TX queue 1, qdisc is ffff8b375eee9800, qdisc_sleeping is ffff8b375eee9800
> >>>>> skb for TX queue 2, qdisc is ffff8b375eeefc00, qdisc_sleeping is ffff8b375eeefc00
> >>>>> skb for TX queue 3, qdisc is ffff8b375eeef800, qdisc_sleeping is ffff8b375eeef800
> >>>>> skb for TX queue 4, qdisc is ffff8b375eeec800, qdisc_sleeping is ffff8b375eeec800
> >>>>> skb for TX queue 5, qdisc is ffff8b375eeea400, qdisc_sleeping is ffff8b375eeea400
> >>>>> skb for TX queue 6, qdisc is ffff8b375eeee000, qdisc_sleeping is ffff8b375eeee000
> >>>>> skb for TX queue 7, qdisc is ffff8b375eee8800, qdisc_sleeping is ffff8b375eee8800
> >>>>> skb for TX queue 8, qdisc is ffff8b375eeedc00, qdisc_sleeping is ffff8b375eeedc00
> >>>>> skb for TX queue 9, qdisc is ffff8b375eee9400, qdisc_sleeping is ffff8b375eee9400
> >>>>> skb for TX queue 10, qdisc is ffff8b375eee8000, qdisc_sleeping is ffff8b375eee8000
> >>>>> skb for TX queue 11, qdisc is ffff8b375eeed400, qdisc_sleeping is ffff8b375eeed400
> >>>>> skb for TX queue 12, qdisc is ffff8b375eeea800, qdisc_sleeping is ffff8b375eeea800
> >>>>> skb for TX queue 13, qdisc is ffff8b375eee8c00, qdisc_sleeping is ffff8b375eee8c00
> >>>>> skb for TX queue 14, qdisc is ffff8b375eeea000, qdisc_sleeping is ffff8b375eeea000
> >>>>> skb for TX queue 15, qdisc is ffff8b375eeeb800, qdisc_sleeping is ffff8b375eeeb800
> >>>>> skb for TX queue 16, qdisc is NULL, qdisc_sleeping is NULL
> >>>>>
> >>>>> What the hell? So, not sure why queue #16 would *never* have a qdisc
> >>>>> attached. Is it something expected I'm not aware of?
> >>>>> As an FYI, here is the output of "tc qdisc list dev xxx":
> >>>>>
> >>>>> qdisc mq 0: root
> >>>>> qdisc fq_codel 0: parent :10 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
> >>>>> qdisc fq_codel 0: parent :f limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
> >>>>> qdisc fq_codel 0: parent :e limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
> >>>>> qdisc fq_codel 0: parent :d limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
> >>>>> qdisc fq_codel 0: parent :c limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
> >>>>> qdisc fq_codel 0: parent :b limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
> >>>>> qdisc fq_codel 0: parent :a limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
> >>>>> qdisc fq_codel 0: parent :9 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
> >>>>> qdisc fq_codel 0: parent :8 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
> >>>>> qdisc fq_codel 0: parent :7 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
> >>>>> qdisc fq_codel 0: parent :6 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
> >>>>> qdisc fq_codel 0: parent :5 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
> >>>>> qdisc fq_codel 0: parent :4 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
> >>>>> qdisc fq_codel 0: parent :3 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
> >>>>> qdisc fq_codel 0: parent :2 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
> >>>>> qdisc fq_codel 0: parent :1 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
> >>>>>
> >>>>> By the way, the NIC is an Intel XL710 40GbE QSFP+ (i40e driver, firmware
> >>>>> version 8.50 0x8000b6c7 1.3082.0) and it was tested on latest "net"
> >>>>> version (6.1.0-rc7+). Is this a bug in the i40e driver?
> >>>>>
> >>>>
> >>>>> Cc: stable@xxxxxxxxxxxxxxx
> >>>>
> >>>> Patch title is mangled. The Fixes: tag should appear here, not in the title.
> >>>>
> >>>> Fixes: b63c5478e9cb ("ipv6: ioam: Support for Queue depth data field")
> >>>>
> >>>>> Signed-off-by: Justin Iurman <justin.iurman@xxxxxxxxx>
> >>>>> ---
> >>>>>  net/ipv6/ioam6.c | 11 +++++++----
> >>>>>  1 file changed, 7 insertions(+), 4 deletions(-)
> >>>>>
> >>>>> diff --git a/net/ipv6/ioam6.c b/net/ipv6/ioam6.c
> >>>>> index 571f0e4d9cf3..2472a8a043c4 100644
> >>>>> --- a/net/ipv6/ioam6.c
> >>>>> +++ b/net/ipv6/ioam6.c
> >>>>> @@ -727,10 +727,13 @@ static void __ioam6_fill_trace_data(struct sk_buff *skb,
> >>>>>  			*(__be32 *)data = cpu_to_be32(IOAM6_U32_UNAVAILABLE);
> >>>>>  		} else {
> >>>>>  			queue = skb_get_tx_queue(skb_dst(skb)->dev, skb);
> >>>>
> >>>> Are you sure skb_dst(skb)->dev is correct at this stage, what about
> >>>> stacked devices ?
> >>>>
> >>>>> -			qdisc = rcu_dereference(queue->qdisc);
> >>>>> -			qdisc_qstats_qlen_backlog(qdisc, &qlen, &backlog);
> >>>>> -
> >>>>> -			*(__be32 *)data = cpu_to_be32(backlog);
> >>>>> +			if (!queue->qdisc) {
> >>>>
> >>>> This is racy.
> >>>>> +				*(__be32 *)data = cpu_to_be32(IOAM6_U32_UNAVAILABLE);
> >>>>> +			} else {
> >>>>> +				qdisc = rcu_dereference(queue->qdisc);
> >>>>> +				qdisc_qstats_qlen_backlog(qdisc, &qlen, &backlog);
> >>>>> +				*(__be32 *)data = cpu_to_be32(backlog);
> >>>>> +			}
> >>>>>  		}
> >>>>>  		data += sizeof(__be32);
> >>>>>  	}
> >>>>> --
> >>>>> 2.25.1
> >>>>>
> >>>>
> >>>> Quite frankly I suggest to revert b63c5478e9cb completely.
> >>>>
> >>>> The notion of Queue depth can not be properly gathered in Linux with a
> >>>> multi queue model,
> >>>> so why trying to get a wrong value ?
> >>>
> >>> Additional reason for a revert is that qdisc_qstats_qlen_backlog() is
> >>> reserved for net/sched
> >>
> >> If by "reserved" you mean "only used by it at the moment", then yes (with
> >> the only exception being IOAM). But some other functions are defined as
> >> well, and some are used elsewhere than the "net/sched" context. So I
> >> don't think it's really an issue to use this function "from somewhere else".
> >>
> >>> code, I think it needs the qdisc lock to be held.
> >>
> >> Good point. But is it really needed when called with rcu_read_lock?
> >
> > It is needed.
>
> So I guess I could just submit a patch to wrap that part with:
>
> spin_lock(qdisc_lock(qdisc));
> [...]
> spin_unlock(qdisc_lock(qdisc));
>
> Anyway, there'd still be that queue-without-qdisc bug out there that I'd
> like to discuss but which seems to be avoided in the conversation somehow.
>
> > Please revert this patch.
> >
> > Many people use FQ qdisc, where packets are waiting for their Earliest
> > Departure Time to be released.
>
> The IOAM queue depth is a very important value and is already used. Just
> reverting the patch is not really an option. Besides, if IOAM is not
> used, then what you described above would not be a problem (probably
> most of the time as IOAM is enabled on limited domains). Not to mention
> that you probably don't need the queue depth in every packet.
>
> > Also, the draft says:
> >
> > 5.4.2.7. queue depth
> >
> > The "queue depth" field is a 4-octet unsigned integer field. This
> > field indicates the current length of the egress interface queue of
> > the interface from where the packet is forwarded out. The queue
> > depth is expressed as the current amount of memory buffers used by
> > the queue (a packet could consume one or more memory buffers,
> > depending on its size).
> >
> >  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
> > +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> > |                          queue depth                          |
> > +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> >
> >
> > It is relatively clear that the egress interface is the aggregate
> > egress interface,
> > not a subset of the interface.
>
> Correct, even though the definition of an interface in RFC 9197 is quite
> abstract (see the end of section 4.4.2.2: "[...] could represent a
> physical interface, a virtual or logical interface, or even a queue").
>
> > If you have 32 TX queues on a NIC, all of them being backlogged (line rate),
> > sensing the queue length of one of the queues would give a 97% error
> > on the measure.
>
> Why would it? Not sure I get your idea based on that example.
> > To properly implement that 'Queue depth' metric, you would need to use
> > the backlog of the MQ qdisc.
> > That would be very expensive.
>
> Could be a solution, thanks for the suggestion. But if it's too
> expensive, I'd rather have the current solution (with locks) than
> nothing at all. At least it has provided an acceptable solution so far.

I will let other maintainers decide. I gave my feedback, I won't change it.
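For reference, a minimal sketch of the locked per-queue read discussed above
(NULL-qdisc guard plus qdisc lock around qdisc_qstats_qlen_backlog()) might
look like the following. The helper name __ioam6_fill_queue_depth() and the
plain spin_lock() flavour are illustrative assumptions only, not the
submitted patch, and this does nothing about the stacked-device or
per-queue-vs-interface concerns raised in the thread.

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <net/dst.h>
#include <net/sch_generic.h>

#define IOAM6_U32_UNAVAILABLE U32_MAX	/* "unavailable" marker, as used by net/ipv6/ioam6.c */

/* Assumes rcu_read_lock() is held by the caller, as in __ioam6_fill_trace_data(). */
static void __ioam6_fill_queue_depth(struct sk_buff *skb, u8 *data)
{
	struct netdev_queue *queue;
	struct Qdisc *qdisc;
	__u32 qlen, backlog;

	queue = skb_get_tx_queue(skb_dst(skb)->dev, skb);
	qdisc = rcu_dereference(queue->qdisc);

	/* Guard against a TX queue with no qdisc attached (cf. queue #16 in the trace). */
	if (!qdisc) {
		*(__be32 *)data = cpu_to_be32(IOAM6_U32_UNAVAILABLE);
		return;
	}

	/* Hold the qdisc lock around the stats read, as proposed upthread;
	 * whether spin_lock() or spin_lock_bh() is appropriate depends on
	 * the calling context and would need to be verified.
	 */
	spin_lock(qdisc_lock(qdisc));
	qdisc_qstats_qlen_backlog(qdisc, &qlen, &backlog);
	spin_unlock(qdisc_lock(qdisc));

	*(__be32 *)data = cpu_to_be32(backlog);
}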
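And a rough sketch of the aggregate reading Eric points to, i.e. summing the
backlog over every TX queue of the egress device rather than sampling the one
queue the skb happens to map to. ioam6_egress_backlog() is a hypothetical
name, not an existing kernel function; the same locking caveat as above
applies to each per-queue read, and walking all queues on the transmit path
is exactly the cost Eric warns about.

/* Assumes rcu_read_lock() is held by the caller. */
static u32 ioam6_egress_backlog(const struct net_device *dev)
{
	u32 total = 0;
	unsigned int i;

	for (i = 0; i < dev->num_tx_queues; i++) {
		struct netdev_queue *txq = netdev_get_tx_queue(dev, i);
		struct Qdisc *qdisc = rcu_dereference(txq->qdisc);
		__u32 qlen, backlog;

		/* Skip TX queues with no qdisc attached. */
		if (!qdisc)
			continue;

		qdisc_qstats_qlen_backlog(qdisc, &qlen, &backlog);
		total += backlog;
	}

	return total;
}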