Re: [PATCH] ibacm: fix ibacm service if ib_umad is not loaded

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, Oct 12, 2018 at 5:58 PM Jason Gunthorpe <jgg@xxxxxxxx> wrote:
>
> On Thu, Oct 11, 2018 at 05:37:42PM +0200, Christian Ehrhardt wrote:
> >    On Thu, Oct 11, 2018 at 4:04 PM Jason Gunthorpe <[1]jgg@xxxxxxxx>
> >    wrote:
> >
> >      On Tue, Oct 09, 2018 at 04:43:06PM +0200, Christian Ehrhardt wrote:
> >      > The default modules config that is processed is
> >      kernel-boot/modules/rdma.conf
> >      > which does not contain ib_umad (infiniband.conf and opa.conf
> >      would). But
> >      > no matter what the default configs are - they could be modified by
> >      an admin,
> >      > due to that today there are cases the service would start ibacm
> >      with the module
> >      > not loaded.
> >      >
> >      > That will trigger the service to immediately fail with:
> >      > ibacm[1796]: ibwarn: [1796] umad_init: can't read ABI version from
> >      >   /sys/class/infiniband_mad/abi_version (No such file or
> >      directory): is
> >      >   ib_umad module loaded?
> >      > systemd[1]: ibacm.service: Main process exited, code=exited,
> >      status=255/n/a
> >      Why is ibacm.service even starting?
> >
> >    Hi and thanks for your Feedback Jason.
> >    This question was just right to lead me to the problem.
> >    It is still handled in Debian packaging with old dh_installinit scripts
> >    for ibacm which will try to start the service no matter what.
> >    Obviously that will fail and needs to be fixed in the packaging, I'll
> >    do some experiments and come back with a fix for that.
>
> Ah, yes, that seems like a problem for sure

Unfortunately so far I found no other option working reliably than to
not install the sysV init script.
You can follow some of the alternatives tried in [4].
So I wonder, would dropping the (no more used) sysV init script these
days be an option for upstream to consider at least in the
Debian/Ubuntu packaging?

[4]: https://bugs.launchpad.net/ubuntu/+source/rdma-core/+bug/1794825/comments/11

> >      if ib_umad is not loaded or not working
> >      > +ConditionPathExists=/sys/class/infiniband_mad/abi_version
> >      This is not a good solution, it will not support later hotplug of
> >      devices that need ibacm.
> >
> >    I actually think this would still be the right thing to do (in addition
> >    to fix the Deb packaging discussed above).
> >    I did not at the condition to the socket, but the service.
> >    If the service is started either manually or by the socket and this
> >    path does not exist, then it will fail fatally.
> >    It seems safer to check than to fail to me.
>
> What happens to the message on the socket at this point? Does the
> requestor hang? Seems sketchy, would be better if ACM started and then
> NAK'd the message.

At first I didn't know what would happen to the message without the
ConditionPathExists check either, but the problem doesn't seem very
different to me.
As right now if the sysfs path doesn't exist the service would try to
start but fail to initialize, so it won't pick up the message either.
But in any of those cases, the one picking up the initial message is
systemd anyway.

I experimented a bit as I found it interesting, based on [1] I created
a test [2] which can serve to check multiple cases of this discussion
(not perfect but better than guessing).

Case A: sysfs file does exists + no ConfitionPathExists check
- The systemd socket will receive the connection and try to start the service.
- Service initializes, stays up and consumes messages through the FD
taken over from systemd

Case B: sysfs file does exists + added ConfitionPathExists check
- the same as Case A

Case C: sysfs file does not exists + no ConfitionPathExists check
- The systemd socket will receive the connection and try to start the service.
- The service will fail to initialize (due to the sysfs path missing)
- systemd will retry to start it a few times until "Start request
repeated too quickly" is hit
- at that point it will give up, the service stays in "failed" and is
not tried to be started on later calls to the socket
- the caller will not block as traffic was "accepted" by systemd
already (bit also not processed)

Case D: sysfs file does not exists + added ConfitionPathExists check
- The systemd socket will receive the connection and try to start the service.
- The condition check of the .service will fail
- the .socket realizes the problem "foo.socket: Trigger limit hit,
refusing further activation."
- See TriggerLimit in [3]
- that will make the socket fail

In BOTH cases (with and without the check I suggested) the service is
lost forever and further messages will vanish on the socket :-/
So I agree that the proposed solution of the Condition check isn't
really fixing it, we are just as dead as before :-/.

But right now I fail to see a better solution - I hope this at least
helped to make clear the current problem with the socket and service
even though the Condition check alone seems not to be the solution.

[1]: https://gist.github.com/drmalex07/333d8a88c4918954e8e4
[2]: https://gist.github.com/cpaelzer/fc3abd28f81eda55ffb317bb4091bf48
[3]: https://www.freedesktop.org/software/systemd/man/systemd.socket.html#

>
> >    For Hotplug, at the time something uses the socket late in the
> >    lifecycle it would try to start the service.
> >    At that time it would evaluate if that is reasonable (path exists) and
> >    do so or not.
>
> Does systemd try to start after every socket message or does it latch
> into some kind of failure? Would be good to confirm this before
> relying on it.. And a comment..

When the .socket receives a message it will try to start, there we
seem to only have two options:
1. it works, afterwards the service is up and has taken over the
socket to consume further messages
2. it fails to start the service (e.g. no module loaded), which will
make the service failed and not tried to start again later

As shown above the ConditionPathExists didn't really help to fix this :-/
But what would we actually expect?:
- Would we expect the socket not being available when we know it will
fail to start the service (Could be a Condition check to the .socket
then).
OR
- Would we better prefer that ibacm would actually start - sort of -
fine, consume and nack the message and then either spin until it can
fully initialize fully or exit (if it will then be restarted later on
the next connect).

The latter sounds promising but is out of my understanding of ibacm
service to provide a fix for.
I have modified my test that I created for the discussion above and it
seems to work better [5].
I have removed the ConditionPathExists and instead inside of the
service itself when the condition of the missing path is met aborted
in a safer way.

That means the service would:
- consume and "properly" Nack the message.
- Then go "out of service" with a RC=0 (to not have it in failed state
and be started again on further calls)

I have verified that in this case later calls to the .socket will
start it again.
If at some point the path it ready it can fully initialize and keep running.

Not sure if one wants to code that up in the C code of the service, I
can already see potential races if there are multiple messages to the
socket :-/
But I hope all the testing and the links help to better outline the issue.

At the moment I am not sure which way we should really go, but since I
was so deep on the debug-examples already I thought I complete them
and make the code available for the discussion.

[5]: https://gist.github.com/cpaelzer/80331eb7b2a74836b52522bd076a5296



[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Photo]     [Yosemite News]     [Yosemite Photos]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux