Re: [PATCH] ibacm: fix ibacm service if ib_umad is not loaded

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Oct 15, 2018 at 9:16 AM Christian Ehrhardt
<christian.ehrhardt@xxxxxxxxxxxxx> wrote:
>
> On Fri, Oct 12, 2018 at 5:58 PM Jason Gunthorpe <jgg@xxxxxxxx> wrote:
> >
> > On Thu, Oct 11, 2018 at 05:37:42PM +0200, Christian Ehrhardt wrote:
> > >    On Thu, Oct 11, 2018 at 4:04 PM Jason Gunthorpe <[1]jgg@xxxxxxxx>
> > >    wrote:
> > >
> > >      On Tue, Oct 09, 2018 at 04:43:06PM +0200, Christian Ehrhardt wrote:
> > >      > The default modules config that is processed is
> > >      kernel-boot/modules/rdma.conf
> > >      > which does not contain ib_umad (infiniband.conf and opa.conf
> > >      would). But
> > >      > no matter what the default configs are - they could be modified by
> > >      an admin,
> > >      > due to that today there are cases the service would start ibacm
> > >      with the module
> > >      > not loaded.
> > >      >
> > >      > That will trigger the service to immediately fail with:
> > >      > ibacm[1796]: ibwarn: [1796] umad_init: can't read ABI version from
> > >      >   /sys/class/infiniband_mad/abi_version (No such file or
> > >      directory): is
> > >      >   ib_umad module loaded?
> > >      > systemd[1]: ibacm.service: Main process exited, code=exited,
> > >      status=255/n/a
> > >      Why is ibacm.service even starting?
> > >
> > >    Hi and thanks for your Feedback Jason.
> > >    This question was just right to lead me to the problem.
> > >    It is still handled in Debian packaging with old dh_installinit scripts
> > >    for ibacm which will try to start the service no matter what.
> > >    Obviously that will fail and needs to be fixed in the packaging, I'll
> > >    do some experiments and come back with a fix for that.
> >
> > Ah, yes, that seems like a problem for sure
>
> Unfortunately so far I found no other option working reliably than to
> not install the sysV init script.
> You can follow some of the alternatives tried in [4].
> So I wonder, would dropping the (no more used) sysV init script these
> days be an option for upstream to consider at least in the
> Debian/Ubuntu packaging?

I tried a few less invasive approaches in [1] and latter comments on that bug.
But so far the only way to not unconditionally start it (other than
manually writing
much of the postinst) is dropping the sysV script and then instructing
dh_installsystemd accordingly.

Therefore I have abandoned [2] and opened a new PR [3] for that.
I have a few more ideas to make this less invasive and will update the
PR if I find one of them working better, for now the PR is up to collect
the opinion on maybe dropping the sysV init in the .deb packaging.

[1]: https://bugs.launchpad.net/ubuntu/+source/rdma-core/+bug/1794825/comments/11
[2]: https://github.com/linux-rdma/rdma-core/pull/393
[3] :https://github.com/linux-rdma/rdma-core/pull/401

> [4]: https://bugs.launchpad.net/ubuntu/+source/rdma-core/+bug/1794825/comments/11
>
> > >      if ib_umad is not loaded or not working
> > >      > +ConditionPathExists=/sys/class/infiniband_mad/abi_version
> > >      This is not a good solution, it will not support later hotplug of
> > >      devices that need ibacm.
> > >
> > >    I actually think this would still be the right thing to do (in addition
> > >    to fix the Deb packaging discussed above).
> > >    I did not at the condition to the socket, but the service.
> > >    If the service is started either manually or by the socket and this
> > >    path does not exist, then it will fail fatally.
> > >    It seems safer to check than to fail to me.
> >
> > What happens to the message on the socket at this point? Does the
> > requestor hang? Seems sketchy, would be better if ACM started and then
> > NAK'd the message.
>
> At first I didn't know what would happen to the message without the
> ConditionPathExists check either, but the problem doesn't seem very
> different to me.
> As right now if the sysfs path doesn't exist the service would try to
> start but fail to initialize, so it won't pick up the message either.
> But in any of those cases, the one picking up the initial message is
> systemd anyway.
>
> I experimented a bit as I found it interesting, based on [1] I created
> a test [2] which can serve to check multiple cases of this discussion
> (not perfect but better than guessing).
>
> Case A: sysfs file does exists + no ConfitionPathExists check
> - The systemd socket will receive the connection and try to start the service.
> - Service initializes, stays up and consumes messages through the FD
> taken over from systemd
>
> Case B: sysfs file does exists + added ConfitionPathExists check
> - the same as Case A
>
> Case C: sysfs file does not exists + no ConfitionPathExists check
> - The systemd socket will receive the connection and try to start the service.
> - The service will fail to initialize (due to the sysfs path missing)
> - systemd will retry to start it a few times until "Start request
> repeated too quickly" is hit
> - at that point it will give up, the service stays in "failed" and is
> not tried to be started on later calls to the socket
> - the caller will not block as traffic was "accepted" by systemd
> already (bit also not processed)
>
> Case D: sysfs file does not exists + added ConfitionPathExists check
> - The systemd socket will receive the connection and try to start the service.
> - The condition check of the .service will fail
> - the .socket realizes the problem "foo.socket: Trigger limit hit,
> refusing further activation."
> - See TriggerLimit in [3]
> - that will make the socket fail
>
> In BOTH cases (with and without the check I suggested) the service is
> lost forever and further messages will vanish on the socket :-/
> So I agree that the proposed solution of the Condition check isn't
> really fixing it, we are just as dead as before :-/.
>
> But right now I fail to see a better solution - I hope this at least
> helped to make clear the current problem with the socket and service
> even though the Condition check alone seems not to be the solution.
>
> [1]: https://gist.github.com/drmalex07/333d8a88c4918954e8e4
> [2]: https://gist.github.com/cpaelzer/fc3abd28f81eda55ffb317bb4091bf48
> [3]: https://www.freedesktop.org/software/systemd/man/systemd.socket.html#
>
> >
> > >    For Hotplug, at the time something uses the socket late in the
> > >    lifecycle it would try to start the service.
> > >    At that time it would evaluate if that is reasonable (path exists) and
> > >    do so or not.
> >
> > Does systemd try to start after every socket message or does it latch
> > into some kind of failure? Would be good to confirm this before
> > relying on it.. And a comment..
>
> When the .socket receives a message it will try to start, there we
> seem to only have two options:
> 1. it works, afterwards the service is up and has taken over the
> socket to consume further messages
> 2. it fails to start the service (e.g. no module loaded), which will
> make the service failed and not tried to start again later
>
> As shown above the ConditionPathExists didn't really help to fix this :-/
> But what would we actually expect?:
> - Would we expect the socket not being available when we know it will
> fail to start the service (Could be a Condition check to the .socket
> then).
> OR
> - Would we better prefer that ibacm would actually start - sort of -
> fine, consume and nack the message and then either spin until it can
> fully initialize fully or exit (if it will then be restarted later on
> the next connect).
>
> The latter sounds promising but is out of my understanding of ibacm
> service to provide a fix for.
> I have modified my test that I created for the discussion above and it
> seems to work better [5].
> I have removed the ConditionPathExists and instead inside of the
> service itself when the condition of the missing path is met aborted
> in a safer way.
>
> That means the service would:
> - consume and "properly" Nack the message.
> - Then go "out of service" with a RC=0 (to not have it in failed state
> and be started again on further calls)
>
> I have verified that in this case later calls to the .socket will
> start it again.
> If at some point the path it ready it can fully initialize and keep running.
>
> Not sure if one wants to code that up in the C code of the service, I
> can already see potential races if there are multiple messages to the
> socket :-/
> But I hope all the testing and the links help to better outline the issue.
>
> At the moment I am not sure which way we should really go, but since I
> was so deep on the debug-examples already I thought I complete them
> and make the code available for the discussion.
>
> [5]: https://gist.github.com/cpaelzer/80331eb7b2a74836b52522bd076a5296



-- 
Christian Ehrhardt
Software Engineer, Ubuntu Server
Canonical Ltd



[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Photo]     [Yosemite News]     [Yosemite Photos]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux