RE: [PATCH] ibacm: fix ibacm service if ib_umad is not loaded

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



I'm going to top post this because it seems like what I have to say is not directly related to any of the comments so far.

It seems like ibacm is broken.  I think in 2 ways.

1) It should be able to start without ib_umad being loaded.
2) It should try to load ib_umad should requests come in on the socket.  Or at least fail the request with a log message of the reason for the failure.  Because this is a misconfiguration of a running system.

I'm not opposed to not starting ibacm on install.  I think systemd can auto start through the socket mechanism discussed.

But I do have a question regarding the netlink interface from the kernel.  Would systemd start ibacm on a query from the kernel?  Can it?  I'm far from a systemd expert.

Thanks,
Ira

> -----Original Message-----
> From: linux-rdma-owner@xxxxxxxxxxxxxxx [mailto:linux-rdma-
> owner@xxxxxxxxxxxxxxx] On Behalf Of Christian Ehrhardt
> Sent: Monday, October 15, 2018 12:16 AM
> To: jgg@xxxxxxxx
> Cc: linux-rdma@xxxxxxxxxxxxxxx
> Subject: Re: [PATCH] ibacm: fix ibacm service if ib_umad is not loaded
> 
> On Fri, Oct 12, 2018 at 5:58 PM Jason Gunthorpe <jgg@xxxxxxxx> wrote:
> >
> > On Thu, Oct 11, 2018 at 05:37:42PM +0200, Christian Ehrhardt wrote:
> > >    On Thu, Oct 11, 2018 at 4:04 PM Jason Gunthorpe <[1]jgg@xxxxxxxx>
> > >    wrote:
> > >
> > >      On Tue, Oct 09, 2018 at 04:43:06PM +0200, Christian Ehrhardt wrote:
> > >      > The default modules config that is processed is
> > >      kernel-boot/modules/rdma.conf
> > >      > which does not contain ib_umad (infiniband.conf and opa.conf
> > >      would). But
> > >      > no matter what the default configs are - they could be modified by
> > >      an admin,
> > >      > due to that today there are cases the service would start ibacm
> > >      with the module
> > >      > not loaded.
> > >      >
> > >      > That will trigger the service to immediately fail with:
> > >      > ibacm[1796]: ibwarn: [1796] umad_init: can't read ABI version from
> > >      >   /sys/class/infiniband_mad/abi_version (No such file or
> > >      directory): is
> > >      >   ib_umad module loaded?
> > >      > systemd[1]: ibacm.service: Main process exited, code=exited,
> > >      status=255/n/a
> > >      Why is ibacm.service even starting?
> > >
> > >    Hi and thanks for your Feedback Jason.
> > >    This question was just right to lead me to the problem.
> > >    It is still handled in Debian packaging with old dh_installinit scripts
> > >    for ibacm which will try to start the service no matter what.
> > >    Obviously that will fail and needs to be fixed in the packaging, I'll
> > >    do some experiments and come back with a fix for that.
> >
> > Ah, yes, that seems like a problem for sure
> 
> Unfortunately so far I found no other option working reliably than to not
> install the sysV init script.
> You can follow some of the alternatives tried in [4].
> So I wonder, would dropping the (no more used) sysV init script these days
> be an option for upstream to consider at least in the Debian/Ubuntu
> packaging?
> 
> [4]: https://bugs.launchpad.net/ubuntu/+source/rdma-
> core/+bug/1794825/comments/11
> 
> > >      if ib_umad is not loaded or not working
> > >      > +ConditionPathExists=/sys/class/infiniband_mad/abi_version
> > >      This is not a good solution, it will not support later hotplug of
> > >      devices that need ibacm.
> > >
> > >    I actually think this would still be the right thing to do (in addition
> > >    to fix the Deb packaging discussed above).
> > >    I did not at the condition to the socket, but the service.
> > >    If the service is started either manually or by the socket and this
> > >    path does not exist, then it will fail fatally.
> > >    It seems safer to check than to fail to me.
> >
> > What happens to the message on the socket at this point? Does the
> > requestor hang? Seems sketchy, would be better if ACM started and then
> > NAK'd the message.
> 
> At first I didn't know what would happen to the message without the
> ConditionPathExists check either, but the problem doesn't seem very
> different to me.
> As right now if the sysfs path doesn't exist the service would try to start but
> fail to initialize, so it won't pick up the message either.
> But in any of those cases, the one picking up the initial message is systemd
> anyway.
> 
> I experimented a bit as I found it interesting, based on [1] I created a test [2]
> which can serve to check multiple cases of this discussion (not perfect but
> better than guessing).
> 
> Case A: sysfs file does exists + no ConfitionPathExists check
> - The systemd socket will receive the connection and try to start the service.
> - Service initializes, stays up and consumes messages through the FD taken
> over from systemd
> 
> Case B: sysfs file does exists + added ConfitionPathExists check
> - the same as Case A
> 
> Case C: sysfs file does not exists + no ConfitionPathExists check
> - The systemd socket will receive the connection and try to start the service.
> - The service will fail to initialize (due to the sysfs path missing)
> - systemd will retry to start it a few times until "Start request repeated too
> quickly" is hit
> - at that point it will give up, the service stays in "failed" and is not tried to
> be started on later calls to the socket
> - the caller will not block as traffic was "accepted" by systemd already (bit
> also not processed)
> 
> Case D: sysfs file does not exists + added ConfitionPathExists check
> - The systemd socket will receive the connection and try to start the service.
> - The condition check of the .service will fail
> - the .socket realizes the problem "foo.socket: Trigger limit hit, refusing
> further activation."
> - See TriggerLimit in [3]
> - that will make the socket fail
> 
> In BOTH cases (with and without the check I suggested) the service is lost
> forever and further messages will vanish on the socket :-/ So I agree that the
> proposed solution of the Condition check isn't really fixing it, we are just as
> dead as before :-/.
> 
> But right now I fail to see a better solution - I hope this at least helped to
> make clear the current problem with the socket and service even though the
> Condition check alone seems not to be the solution.
> 
> [1]: https://gist.github.com/drmalex07/333d8a88c4918954e8e4
> [2]: https://gist.github.com/cpaelzer/fc3abd28f81eda55ffb317bb4091bf48
> [3]:
> https://www.freedesktop.org/software/systemd/man/systemd.socket.html#
> 
> >
> > >    For Hotplug, at the time something uses the socket late in the
> > >    lifecycle it would try to start the service.
> > >    At that time it would evaluate if that is reasonable (path exists) and
> > >    do so or not.
> >
> > Does systemd try to start after every socket message or does it latch
> > into some kind of failure? Would be good to confirm this before
> > relying on it.. And a comment..
> 
> When the .socket receives a message it will try to start, there we seem to
> only have two options:
> 1. it works, afterwards the service is up and has taken over the socket to
> consume further messages 2. it fails to start the service (e.g. no module
> loaded), which will make the service failed and not tried to start again later
> 
> As shown above the ConditionPathExists didn't really help to fix this :-/ But
> what would we actually expect?:
> - Would we expect the socket not being available when we know it will fail
> to start the service (Could be a Condition check to the .socket then).
> OR
> - Would we better prefer that ibacm would actually start - sort of - fine,
> consume and nack the message and then either spin until it can fully
> initialize fully or exit (if it will then be restarted later on the next connect).
> 
> The latter sounds promising but is out of my understanding of ibacm service
> to provide a fix for.
> I have modified my test that I created for the discussion above and it seems
> to work better [5].
> I have removed the ConditionPathExists and instead inside of the service
> itself when the condition of the missing path is met aborted in a safer way.
> 
> That means the service would:
> - consume and "properly" Nack the message.
> - Then go "out of service" with a RC=0 (to not have it in failed state and be
> started again on further calls)
> 
> I have verified that in this case later calls to the .socket will start it again.
> If at some point the path it ready it can fully initialize and keep running.
> 
> Not sure if one wants to code that up in the C code of the service, I can
> already see potential races if there are multiple messages to the socket :-/
> But I hope all the testing and the links help to better outline the issue.
> 
> At the moment I am not sure which way we should really go, but since I was
> so deep on the debug-examples already I thought I complete them and
> make the code available for the discussion.
> 
> [5]: https://gist.github.com/cpaelzer/80331eb7b2a74836b52522bd076a5296




[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Photo]     [Yosemite News]     [Yosemite Photos]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux