Re: [PATCH net-next v5 04/10] ethtool: Add flashing transceiver modules' firmware notifications ability

Ido Schimmel <idosch@xxxxxxxxxx> · Mon, 27 May 2024 19:10:55 +0300

On Wed, May 22, 2024 at 07:22:12AM -0700, Jakub Kicinski wrote:
> On Wed, 22 May 2024 13:56:11 +0000 Danielle Ratson wrote:
> > > > 4. Add a new netlink notifier that when the relevant event takes place,
> > > > deletes the node from the list, wait until the end of the work item, with
> > > > cancel_work_sync() and free allocations.
> > > 
> > > What's the "relevant event" in this case? Closing of the socket that user had
> > > issued the command on?  
> > 
> > The event should match the below:
> > event == NETLINK_URELEASE && notify->protocol == NETLINK_GENERIC
> > 
> > Then iterate over the list to look for work that matches the dev and portid.
> > The socket doesn’t close until the work is done in that case. 
> 
> Okay, good, yes. I think you can use one of the callbacks I mentioned
> below to achieve the same thing with less complexity than the notifier.

Danielle already has a POC with the notifier and it's not that
complicated. I wasn't aware of the netlink notifier, but we found it
when we tried to understand how other netlink families get notified
about a socket being closed.
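
To make the matching concrete, here is a rough userspace model of what
such a notifier callback would check. This is not Danielle's POC: the
struct and function names are made up for illustration, the notify
struct is a stand-in for struct netlink_notify, and it matches on port
ID only for brevity. Only the two constants mirror the kernel headers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Values mirror the kernel's netlink headers. */
#define NETLINK_URELEASE 0x0001
#define NETLINK_GENERIC  16

/* Userspace stand-in for the portid/protocol part of struct netlink_notify. */
struct nl_notify_model {
	uint32_t portid;
	int protocol;
};

/* Hypothetical per-flash bookkeeping node; field names are illustrative. */
struct flash_work_model {
	struct flash_work_model *next;
	int ifindex;
	uint32_t portid;
	bool sock_closed;
};

/*
 * Model of the notifier callback: ignore everything except a generic
 * netlink socket release, then mark every pending flash that was started
 * from the released port ID. Returns the number of entries marked.
 */
static size_t flash_sock_released(struct flash_work_model *head,
				  unsigned long event,
				  const struct nl_notify_model *n)
{
	struct flash_work_model *w;
	size_t marked = 0;

	if (event != NETLINK_URELEASE || n->protocol != NETLINK_GENERIC)
		return 0;

	for (w = head; w; w = w->next) {
		if (w->portid == n->portid && !w->sock_closed) {
			w->sock_closed = true;
			marked++;
		}
	}
	return marked;
}
```

In the kernel the list walk would of course need whatever locking
protects the list; the model leaves that out.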

Which advantages do you see in the sock_priv_destroy() approach? Are you
against the notifier approach?

> > > Easiest way to "notice" the socket got closed would probably be to add some
> > > info to genl_sk_priv_*(). ->sock_priv_destroy() will get called. But you can also
> > > get a close notification in the family ->unbind callback.

Isn't the unbind callback only for multicast (whereas we are using
unicast)?

> > > 
> > > I'm on the fence whether we should cancel the work. We could just mark the
> > > command as 'no socket present' and stop sending notifications.
> > > Not sure which is better..  
> > 
> > Is there a scenario that we hit this event and won't intend to cancel the work? 
> 
> I think it's up to us. I don't see any legit reason for user space to
> intentionally cancel the flashing. So the only option is that user space
> is either buggy or has crashed, and the socket got closed before
> flashing finished. Right?

We don't think that closing the socket / killing the process mid-flash
is a legitimate scenario. We looked into it in order to avoid
sending unicast notifications to a socket that did not ask for them but
gets them because it was bound to the port ID that was used by the old
socket.

I agree that we don't need to cancel the work and can simply have the
work item stop sending notifications. User space will get an error if it
tries to flash a module that is already being flashed in the background.
WDYT?
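
Roughly, the behavior I have in mind looks like the userspace model
below. All names are hypothetical, and -EBUSY is only my assumption for
the error user space would see; the point is that the work keeps
running after the socket goes away and only the notifications stop.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-module flashing state; not taken from the patch. */
struct module_flash_model {
	bool in_progress;	/* a flash work item is running */
	bool sock_closed;	/* requesting socket was released */
	unsigned int ntf_sent;	/* notifications actually delivered */
};

/* Start a flash; refuse if one is already running (assumed -EBUSY). */
static int flash_start(struct module_flash_model *m)
{
	if (m->in_progress)
		return -EBUSY;
	m->in_progress = true;
	m->sock_closed = false;
	m->ntf_sent = 0;
	return 0;
}

/* Socket release: do not cancel the work, just remember the socket is gone. */
static void flash_sock_closed(struct module_flash_model *m)
{
	if (m->in_progress)
		m->sock_closed = true;
}

/* Called by the work item for each progress update; true if it was sent. */
static bool flash_notify(struct module_flash_model *m)
{
	if (m->sock_closed)
		return false;
	m->ntf_sent++;
	return true;
}

/* Work item finished (successfully or not). */
static void flash_done(struct module_flash_model *m)
{
	m->in_progress = false;
}
```

With this, a new socket that happens to get the old port ID never sees
notifications it did not ask for, and a second flash request on the
same module fails until flash_done() runs.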