Re: [PATCH wpan-next 08/11] net: mac802154: Add a warning in the hot path

Alexander Aring <aahringo@xxxxxxxxxx> · Fri, 13 May 2022 10:26:08 -0400

Hi,

On Thu, May 12, 2022 at 10:33 AM Miquel Raynal
<miquel.raynal@xxxxxxxxxxx> wrote:
>
> Hi Alexander,
>
> alex.aring@xxxxxxxxx wrote on Sun, 1 May 2022 20:21:18 -0400:
>
> > Hi,
> >
> > On Thu, Apr 28, 2022 at 3:58 AM Miquel Raynal <miquel.raynal@xxxxxxxxxxx> wrote:
> > >
> > > Hi Alexander,
> > >
> > > alex.aring@xxxxxxxxx wrote on Wed, 27 Apr 2022 14:01:25 -0400:
> > >
> > > > Hi,
> > > >
> > > > On Wed, Apr 27, 2022 at 12:47 PM Miquel Raynal
> > > > <miquel.raynal@xxxxxxxxxxx> wrote:
> > > > >
> > > > > We should never start a transmission after the queue has been stopped.
> > > > >
> > > > > But because it might work we don't kill the function here but rather
> > > > > warn loudly the user that something is wrong.
> > > > >
> > > > > Signed-off-by: Miquel Raynal <miquel.raynal@xxxxxxxxxxx>
> > > > > ---
> > >
> > > [...]
> > >
> > > > > diff --git a/net/mac802154/tx.c b/net/mac802154/tx.c
> > > > > index a8a83f0167bf..021dddfea542 100644
> > > > > --- a/net/mac802154/tx.c
> > > > > +++ b/net/mac802154/tx.c
> > > > > @@ -124,6 +124,8 @@ bool ieee802154_queue_is_held(struct ieee802154_local *local)
> > > > >  static netdev_tx_t
> > > > >  ieee802154_hot_tx(struct ieee802154_local *local, struct sk_buff *skb)
> > > > >  {
> > > > > +       WARN_ON_ONCE(ieee802154_queue_is_stopped(local));
> > > > > +
> > > > >         return ieee802154_tx(local, skb);
> > > > >  }
> > > > >
> > > > > diff --git a/net/mac802154/util.c b/net/mac802154/util.c
> > > > > index 847e0864b575..cfd17a7db532 100644
> > > > > --- a/net/mac802154/util.c
> > > > > +++ b/net/mac802154/util.c
> > > > > @@ -44,6 +44,24 @@ void ieee802154_stop_queue(struct ieee802154_local *local)
> > > > >         rcu_read_unlock();
> > > > >  }
> > > > >
> > > > > +bool ieee802154_queue_is_stopped(struct ieee802154_local *local)
> > > > > +{
> > > > > +       struct ieee802154_sub_if_data *sdata;
> > > > > +       bool stopped = true;
> > > > > +
> > > > > +       rcu_read_lock();
> > > > > +       list_for_each_entry_rcu(sdata, &local->interfaces, list) {
> > > > > +               if (!sdata->dev)
> > > > > +                       continue;
> > > > > +
> > > > > +               if (!netif_queue_stopped(sdata->dev))
> > > > > +                       stopped = false;
> > > > > +       }
> > > > > +       rcu_read_unlock();
> > > > > +
> > > > > +       return stopped;
> > > > > +}
> > > >
> > > > sorry this makes no sense, you using net core functionality to check
> > > > if a queue is stopped in a net core netif callback. Whereas the sense
> > > > here for checking if the queue is really stopped is when 802.15.4
> > > > thinks the queue is stopped vs net core netif callback running. It
> > > > means for MLME-ops there are points we want to make sure that net core
> > > > is not handling any xmit and we should check this point and not
> > > > introducing net core functionality checks.
> > >
> > > I think I've mixed two things, your remark makes complete sense. I
> > > should instead here just check a 802.15.4 internal variable.
> > >
> >
> > I am thinking about this patch series... and I think it still has bugs
> > or at least it's easy to have bugs when the context is not right
> > prepared to call a synchronized transmission. We leave here the netdev
> > state machine world for transmit vs e.g. start/stop netif callback...
> > We have a warning here if there is a core netif xmit callback running
> > when 802.15.4 thinks it shouldn't (because we take control of it) but
> > I also think about a kind of the other way around. A warning if
> > 802.15.4 transmits something but the netdev core logic "thinks" it
> > shouldn't.
> >
> > That requires some checks (probably from netcore functionality) to
> > check if we call a 802.15.4 sync xmit but netif core already called
> > stop() callback. The last stop() callback - means the driver_ops
> > stop() callback was called, we have some "open_count" counter there
> > which MUST be incremented before doing any looping of one or several
> > sync transmissions. All I can say is if we call xmit() but the driver
> > is in stop() state... it will break things.
> >
> > My concern is also here that e.g. calling netif down or device
> > suspend() are only two examples I have in my mind right now. I don't
> > know all cases which can occur, that's why we should introduce another
> > WARN_ON_ONCE() for the case that 802.15.4 transmits something but we
> > are in a state where we can't transmit something according to netif
> > state (driver ops called stop()).
> >
> > Can you add such a check as well?
>
> That is a good idea, I have added such a check: if the interface is
> supposed to be down I'll warn and return because I don't think there is
> much we can do in this situation besides avoiding trying to transmit
> anything.
>

ok...

> > And please keep in mind to increment
> > the open count when implementing MLME-ops (or at least handle it
> > somehow), otherwise I guess it's easy to hit the warning. If another
> > user reports warnings and tells us what they did we might know more
> > other "cases" to fix.
>
> I don't think incrementing the open_count counter is the right solution
> here just because the stop call is not supposed to fail and has no
> straightforward ways to be deferred. In particular, just keeping the
> open_count incremented will just avoid the actual driver stop operation
> to be executed and the core will not notice it.
>

the stop callback can sleep, it's the job of the driver to synchronize
it somehow with the transceiver state.

> I came out with another solution: acquiring the rtnl when performing a
> MLME Tx operation to serialize these operations. We can easily have a
> version which just checks the rtnl was acquired as well for situations
> when the MLME operations are called by eg. the nl layer (and thus, with
> the rtnl lock taken automatically).

The rtnl lock needs definitely to be held during such operation.

- Alex