Re: soft-reboot and surviving it

Hi,

attached is a better reproducer for the "broken pipe" problem that
occurs when applications write to stderr. This time it does not use
buffered glibc streams.
If writing to stderr (fd 2) fails, the error gets logged via
sd_journal_print().

Regards,
Thorsten

On Fri, Apr 19, 2024 at 11:48 AM Luca Boccassi <luca.boccassi@xxxxxxxxx> wrote:
>
> On Fri, 19 Apr 2024 at 10:30, Thorsten Kukuk <kukuk@xxxxxxxx> wrote:
> >
> > Hi,
> >
> > we finished the integration of soft-reboot into openSUSE Tumbleweed
> > and MicroOS (transactional-update), and the major problems except
> > firewalld+podman are solved. Now we only need to do all the "fine
> > tuning".
> > Is there by now a reliable/official way to detect that this was a
> > soft-reboot? This would be very helpful in some cases for post-mortem
> > analysis and support.
> > I'm aware of the SoftRebootsCount property in systemd v256, so
> > applications could query that and I assume if the count is >0 it was a
> > soft-reboot? Couldn't test that yet.
>
> Yes, that's the purpose of the counter, you can use it for that.
>
> > And now I started looking into how services can survive the
> > soft-reboot. I know the FOSDEM talk from Luca about this topic, but I
> > don't like to move the application into another image, as this would
> > only move the update problem to a different level and not solve it. So
> > I'm currently playing with it to find out if there isn't a better
> > option, especially with btrfs.
> > Is there already some documentation somewhere on the limitations or
> > best practices for an application that should survive a
> > soft-reboot?
>
> It really needs to be a separate filesystem from a separate image. With
> any ties back to the host OS, the service will hopefully be stopped
> correctly, or worse, the tie will not be detected and it will leak the old
> filesystem, which means you'll silently leak memory, mounts, etc. I
> would strongly recommend to avoid fighting against this, and instead
> spend time solving the root cause.
>
> The best solution really is to figure out why there's an executable
> from the host OS permanently running in the podman container cgroup
> (what does it do, why it is necessary, why does it need to always run,
> etc), and try to refactor that away. Make it started on demand for
> example.
>
> > The main task for me currently is to find out what such an
> > application can do, what will not work, and what it should do in
> > case of a reboot. I saw there is the PrepareForShutdownWithMetadata
> > signal (I didn't get that working, but since it seems to work with
> > busctl, the problem is most likely between chair and keyboard ;) ),
> > but I'm more interested in file descriptors and pipes. Currently
> > stderr will be redirected to journald, but this will of course no
> > longer work after a soft-reboot. While I can adjust my application to
> > use sd_journal_print() instead, errors written by libraries or
> > something else to stderr will get lost or trigger SIGPIPE. Any ideas
> > on how to solve that?
>
> The soft-reboot manpage is the best we got for now - and the
> recordings of my talks might be of some help too. The main gotcha so
> far is D-Bus, if you publish a service you need to be resilient
> against D-Bus going away and coming back, which is never a thing
> normally, so applications usually aren't coded for that, but it can be
> done and the soft-reboot manpage has a self-contained example showing
> how.
>
> However, logging should work out of the box as long as the journal is
> used. What problem are you seeing exactly?



-- 
Thorsten Kukuk, Distinguished Engineer, Senior Architect, Future Technologies
SUSE Software Solutions Germany GmbH, Frankenstraße 146, 90461
Nuernberg, Germany
Managing Director: Ivo Totev, Andrew McDonald, Werner Knoblich (HRB
36809, AG Nürnberg)
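#define _GNU_SOURCE

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <systemd/sd-daemon.h>
#include <systemd/sd-journal.h>

/* Link with -lsystemd. */

int
main (void)
{
  unsigned long counter = 0;
  char *buf = NULL;

  /* Tell the service manager that startup has finished. */
  sd_notify (0, "READY=1");

  while (1)
    {
      sleep (1);
      counter++;

      /* On failure the contents of buf are undefined, so skip this round. */
      if (asprintf (&buf, "Counter: %lu seconds\n", counter) < 0)
	{
	  sd_journal_print (LOG_ERR, "asprintf failed for count %lu", counter);
	  continue;
	}

      /* Unbuffered write straight to fd 2, bypassing glibc streams. */
      if (write (STDERR_FILENO, buf, strlen (buf)) == -1)
	{
	  /* %m expands to the error text for the current errno. */
	  sd_journal_print (LOG_ERR,
			    "Writing count %lu to stderr failed: %m",
			    counter);
	}
      free (buf);
    }

  return 0;
}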
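
The failure the reproducer reports can be approximated without systemd or a soft-reboot. The following is only a minimal sketch under stated assumptions: an ordinary pipe stands in for the connection behind the service's stderr, closing its read end stands in for the old journald going away, and write_to_orphaned_pipe() is a hypothetical helper that is not part of the attachment above. SIGPIPE is ignored for the duration of the call so that write() returns EPIPE instead of terminating the process; as far as I know this matches the default for services (IgnoreSIGPIPE=yes), but treat that as an assumption to verify for your unit.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of the attached reproducer.
 *
 * Writes msg to a pipe whose read end has already been closed.
 * Returns the errno value of the failed write(), 0 if the write
 * unexpectedly succeeded, or -1 if the pipe could not be created.
 */
int
write_to_orphaned_pipe (const char *msg)
{
  int fds[2];
  int err = 0;
  void (*old_handler) (int);

  assert (msg != NULL);

  /* Ignore SIGPIPE so that write() fails with EPIPE instead of the
     default action, which would terminate the process. */
  old_handler = signal (SIGPIPE, SIG_IGN);

  if (pipe (fds) == -1)
    {
      signal (SIGPIPE, old_handler);
      return -1;
    }

  /* Assumption: closing the read end models the reader disappearing. */
  close (fds[0]);

  if (write (fds[1], msg, strlen (msg)) == -1)
    err = errno;

  close (fds[1]);

  /* Restore the previous SIGPIPE disposition for the caller. */
  signal (SIGPIPE, old_handler);
  return err;
}
```

Calling write_to_orphaned_pipe ("Counter: 1 seconds\n") and comparing the result with EPIPE shows the same "Broken pipe" error that the reproducer logs via %m. If SIGPIPE were left at its default disposition, the process would be killed by the signal instead, which is the other outcome mentioned in the quoted mail.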
