Re: soft-reboot and surviving it

Luca Boccassi <luca.boccassi@xxxxxxxxx> · Fri, 19 Apr 2024 10:47:51 +0100

On Fri, 19 Apr 2024 at 10:30, Thorsten Kukuk <kukuk@xxxxxxxx> wrote:
>
> Hi,
>
> we finished the integration of soft-reboot into openSUSE Tumbleweed
> and MicroOS (transactional-update), and the major problems except
> firewalld+podman are solved. Now we only need to do all the "fine
> tuning".
> Is there meanwhile any reliable/official way to detect that this was a
> soft-reboot? This would be very helpful in some cases for post mortem
> analysis and support.
> I'm aware of the SoftRebootsCount property in systemd v256, so
> applications could query that and I assume if the count is >0 it was a
> soft-reboot? Couldn't test that yet.

Yes, that's the purpose of the counter, you can use it for that.
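
A minimal sketch of such a check, assuming systemd v256 or later and
that the manager property can be read with `systemctl show`. The
function names are made up for illustration:

```python
import subprocess


def parse_soft_reboots_count(output: str) -> int:
    """Parse the output of `systemctl show -p SoftRebootsCount [--value]`.

    Accepts both "3" (with --value) and "SoftRebootsCount=3" (without).
    Raises ValueError if no number is found, which is assumed to be what
    happens on a systemd version that does not know the property.
    """
    text = output.strip()
    if "=" in text:
        text = text.split("=", 1)[1]
    return int(text)


def was_soft_reboot(count: int) -> bool:
    """A count above zero means at least one soft-reboot since boot."""
    return count > 0


def read_soft_reboots_count() -> int:
    """Ask the system manager for the counter (needs a running systemd)."""
    result = subprocess.run(
        ["systemctl", "show", "--property=SoftRebootsCount", "--value"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_soft_reboots_count(result.stdout)
```

A support script could call was_soft_reboot(read_soft_reboots_count())
and record the result next to the rest of the post mortem data.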

> And now I started looking into how services can survive the
> soft-reboot. I know the FOSDEM talk from Luca about this topic, but I
> don't like to move the application into another image, as this would
> only move the update problem to a different level and not solve it. So
> I'm currently playing with it to find out if there isn't a better
> option, especially with btrfs.
> Is there already some documentation somewhere, what are the
> limitations or best practices for an application for surviving a
> soft-reboot?

It really needs to be a separate filesystem from a separate image.
With any ties back to the host OS, the service will hopefully be
stopped correctly; in the worse case the tie goes undetected and the
old filesystem is leaked, which means you'll silently leak memory,
mounts, etc. I would strongly recommend not fighting against this, and
instead spending the time on solving the root cause.

The best solution really is to figure out why there's an executable
from the host OS permanently running in the podman container cgroup
(what does it do, why it is necessary, why does it need to always run,
etc), and try to refactor that away. For example, make it start on
demand.

> The main task for me currently is, to find out what such an
> application can do, what will not work, and what they should do in
> case of a reboot. I saw there is the PrepareForShutdownWithMetadata
> signal (I didn't get that working, but since it seems to work with
> busctl, the problem is most likely between chair and keyboard ;) ),
> but I'm more interested about file descriptors and pipes. Currently
> stderr will be redirected to journald, but this will of course no
> longer work after a soft-reboot. While I can adjust my application to
> use sd_journal_print() instead, errors written by libraries or
> something else to stderr will get lost or trigger SIGPIPE. Any ideas
> on how to solve that?

The soft-reboot manpage is the best we've got for now, and the
recordings of my talks might be of some help too. The main gotcha so
far is D-Bus: if you publish a service you need to be resilient
against D-Bus going away and coming back. That normally never happens,
so applications usually aren't coded for it, but it can be done, and
the soft-reboot manpage has a self-contained example showing how.
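
To illustrate the general idea of coping with the bus going away and
coming back, here is a generic reconnect-with-backoff sketch. It is not
the manpage's example and does not use any real D-Bus library: the
`connect` callable is a placeholder for whatever opens the bus
connection and re-registers the service, and the retry limits are
arbitrary assumptions.

```python
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


def backoff_delays(initial: float = 0.1, factor: float = 2.0,
                   cap: float = 5.0) -> Iterator[float]:
    """Yield exponentially growing delays, never larger than `cap`."""
    delay = initial
    while True:
        yield delay
        delay = min(delay * factor, cap)


def connect_with_retry(connect: Callable[[], T], max_attempts: int = 10,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `connect` until it succeeds or `max_attempts` is reached.

    `connect` is a hypothetical callable that opens the bus connection
    and raises ConnectionError or FileNotFoundError (socket not there
    yet) while the bus is unavailable.
    """
    delays = backoff_delays()
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except (ConnectionError, FileNotFoundError):
            if attempt == max_attempts:
                raise  # give up and let the caller decide what to do
            sleep(next(delays))
    raise RuntimeError("max_attempts must be at least 1")
```

The service's main loop would call connect_with_retry() again whenever
it notices the connection has dropped, rather than treating the
disconnect as fatal.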

However, logging should work out of the box as long as the journal is
used. What problem are you seeing exactly?