On Thu, Aug 17, 2023 at 12:30 PM Jamal Hadi Salim <jhs@xxxxxxxxxxxx> wrote: > > On Wed, Aug 16, 2023 at 6:58 PM Vladimir Oltean <vladimir.oltean@xxxxxxx> wrote: > > > > Hi Jakub, > > > > On Tue, Aug 15, 2023 at 02:28:21PM +0300, Vladimir Oltean wrote: > > > On Mon, Aug 14, 2023 at 04:03:03PM -0700, Jakub Kicinski wrote: > > > > Hi Vladimir, any ideas for this one? > > > > The bisection looks pooped, FWIW, looks like a taprio inf loop. > > > > > > I'm looking into it. > > > > Here's what I've found out and what help I'll need going forward. > > > > Indeed there is an infinite loop in taprio_dequeue() -> taprio_dequeue_tc_priority(), > > leading to an RCU stall. > > > > Short description of taprio_dequeue_tc_priority(): it cycles > > q->cur_txq[tc] in the range between [ offset, offset + count ), where: > > > > int offset = dev->tc_to_txq[tc].offset; > > int count = dev->tc_to_txq[tc].count; > > > > with the initial q->cur_txq[tc], aka the "first_txq" variable, being set > > by the control path: taprio_change(), also called by taprio_init(): > > > > if (mqprio) { > > (...) > > for (i = 0; i < mqprio->num_tc; i++) { > > (...) > > q->cur_txq[i] = mqprio->offset[i]; > > } > > } > > > > In the buggy case that leads to the RCU stall, the line in taprio_change() > > which sets q->cur_txq[i] never gets executed. So first_txq will be 0 > > (pre-initialized memory), and if that's outside of the [ offset, offset + count ) > > range that taprio_dequeue_tc_priority() -> taprio_next_tc_txq() expects > > to cycle through, the kernel is toast. > > > > The nitty gritty of that is boring. What's not boring is how come the > > control path skips the q->cur_txq[i] assignment. It's because "mqprio" > > is NULL, and that's because taprio_change() (important: also tail-called > > from taprio_init()) has this logic to detect a change in the traffic > > class settings of the device, compared to the passed TCA_TAPRIO_ATTR_PRIOMAP > > netlink attribute: > > > > /* no changes - no new mqprio settings */ > > if (!taprio_mqprio_cmp(q, dev, mqprio)) > > mqprio = NULL; > > > > And what happens is that: > > - we go through taprio_init() > > - a TCA_TAPRIO_ATTR_PRIOMAP gets passed to us > > - taprio_mqprio_cmp() sees that there's no change compared to the > > netdev's existing traffic class config > > - taprio_change() sets "mqprio" to NULL, ignoring the given > > TCA_TAPRIO_ATTR_PRIOMAP > > - we skip modifying q->cur_txq[i], as if it was a taprio_change() call > > that came straight from Qdisc_ops :: change(), rather than what it > > really is: one from Qdisc_ops :: init() > > > > So the next question: why does taprio_mqprio_cmp() see that there's no > > change? Because there is no change. When Qdisc_ops :: init() is called, > > the netdev really has a non-zero dev->num_tc, prio_tc_map, tc_to_txq and > > all that. > > > > But why? A previous taprio, if that existed, will call taprio_destroy() > > -> netdev_reset_tc(), so it won't leave state behind that will hinder > > the current taprio. Checking for stuff in the netdev state is just so > > that taprio_change() can distinguish between a direct Qdisc_ops :: change() > > call vs one coming from init(). > > > > Finally, here's where the syzbot repro becomes relevant. It crafts the > > RTM_NEWQDISC netlink message in such a way, that it makes tc_modify_qdisc() > > in sch_api.c call a Qdisc_ops sequence with which taprio wasn't written > > in mind. > > > > With "tc qdisc replace && tc qdisc replace", tc_modify_qdisc() is > > supposed to call init() the first time and replace() the second time. > > What the repro does is make the above sequence call two init() methods > > back to back. > > > > To create an iproute2-based reproducer rather than the C one provided by > > syzbot, we need this iproute2 change: > > > > diff --git a/tc/tc_qdisc.c b/tc/tc_qdisc.c > > index 56086c43b7fa..20d9622b6bf3 100644 > > --- a/tc/tc_qdisc.c > > +++ b/tc/tc_qdisc.c > > @@ -448,6 +448,8 @@ int do_qdisc(int argc, char **argv) > > return tc_qdisc_modify(RTM_NEWQDISC, NLM_F_EXCL|NLM_F_CREATE, argc-1, argv+1); > > if (matches(*argv, "change") == 0) > > return tc_qdisc_modify(RTM_NEWQDISC, 0, argc-1, argv+1); > > + if (strcmp(*argv, "replace-exclusive") == 0) > > + return tc_qdisc_modify(RTM_NEWQDISC, NLM_F_CREATE|NLM_F_REPLACE|NLM_F_EXCL, argc-1, argv+1); > > if (matches(*argv, "replace") == 0) > > return tc_qdisc_modify(RTM_NEWQDISC, NLM_F_CREATE|NLM_F_REPLACE, argc-1, argv+1); > > if (matches(*argv, "link") == 0) > > > > which basically implements a crafted alternative of "tc qdisc replace" > > which also sets the NLM_F_EXCL flag in n->nlmsg_flags. > > > > Then, the minimal repro script can simply be expressed as: > > > > #!/bin/bash > > > > ip link add veth0 numtxqueues 16 numrxqueues 16 type veth peer name veth1 > > ip link set veth0 up && ip link set veth1 up > > > > for ((i = 0; i < 2; i++)); do > > tc qdisc replace-exclusive dev veth0 root stab overhead 24 taprio \ > > num_tc 2 map 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 \ > > queues 8@0 4@8 \ > > clockid REALTIME \ > > base-time 0 \ > > cycle-time 61679 \ > > sched-entry S 0 54336 \ > > sched-entry S 0x8a27 7343 \ > > max-sdu 18343 18343 \ > > flags 0 > > done > > > > ip link del veth0 > > > > Here's how things go sideways if sch_api.c goes through the Qdisc_ops :: init() > > code path instead of change() for the second Qdisc. > > > > The first taprio_attach() (i=0) will attach the root taprio Qdisc (aka itself) > > to all netdev TX queues, and qdisc_put() the existing pfifo default Qdiscs. > > > > When the taprio_init() method executes for i=1, taprio_destroy() hasn't > > been called yet. So neither has netdev_reset_tc() been called, and > > that's part of the problem (the one that causes the infinite loop in > > dequeue()). > > > > But, taprio_destroy() will finally get called for the initial taprio > > created at i=0. The call trace looks like this: > > > > rtnetlink_rcv_msg() > > -> tc_modify_qdisc() > > -> qdisc_graft() > > -> taprio_attach() for i=1 > > -> qdisc_put() for the old Qdiscs attached to the TX queues, aka the taprio from i=0 > > -> __qdisc_destroy() > > -> taprio_destroy() > > > > What's more interesting is that the late taprio_destroy() for i=0 > > effectively destroys the netdev state - the netdev_reset_tc() call - > > done by taprio_init() -> taprio_change() for i=1, and that can't be > > too good, either. Even if there's no immediately observable hang, the > > traffic classes are reset even though the Qdisc thinks they aren't. > > > > Taprio isn't the only one affected by this. Mqprio also has the pattern > > of calling netdev_set_num_tc() from Qdisc_ops :: init() and destroy(). > > But with the possibility of destroy(i=0) not being serialized with > > init(i=1), that's buggy. > > > > Sorry for the long message. This is where I'm at. For me, this is the > > bottom of where things are intuitive. I don't understand what is > > considered to be expected behavior from tc_modify_qdisc(), and what is > > considered to be sane Qdisc-facing API, and I need help. > > > > I've completely stopped debugging when I saw that the code enters > > through this path at i=1, so I really can't tell you more: > > > > /* This magic test requires explanation. > > * > > * We know, that some child q is already > > * attached to this parent and have choice: > > * either to change it or to create/graft new one. > > * > > * 1. We are allowed to create/graft only > > * if CREATE and REPLACE flags are set. > > * > > * 2. If EXCL is set, requestor wanted to say, > > * that qdisc tcm_handle is not expected > > * to exist, so that we choose create/graft too. > > * > > * 3. The last case is when no flags are set. > > * Alas, it is sort of hole in API, we > > * cannot decide what to do unambiguously. > > * For now we select create/graft, if > > * user gave KIND, which does not match existing. > > */ > > if ((n->nlmsg_flags & NLM_F_CREATE) && > > (n->nlmsg_flags & NLM_F_REPLACE) && > > ((n->nlmsg_flags & NLM_F_EXCL) || > > (tca[TCA_KIND] && > > nla_strcmp(tca[TCA_KIND], q->ops->id)))) { > > netdev_err(dev, "magic test\n"); > > goto create_n_graft; > > } > > > > I've added more Qdisc people to the discussion. The problem description > > is pretty much self-contained in this email, and going to the original > > syzbot report won't bring much else. > > > > I will take a look tommorow. > Can you try the attached patchlet? cheers, jamal > cheers, > jamal > > > There are multiple workarounds that can be done in taprio (and mqprio) > > depending on what is considered as being sane API. Though I don't want > > to get ahead of myself. Maybe there is a way to fast-forward the > > qdisc_destroy() of the previous taprio so it doesn't overlap with the > > new one's qdisc_create().
Attachment:
patchlet-qdisc
Description: Binary data