On Thu, Apr 30, 2020 at 07:36:54PM +0200, Christian Ehrhardt wrote: > On Thu, Apr 30, 2020 at 7:23 PM Daniel P. Berrangé <berrange@xxxxxxxxxx> wrote: > > > > On Thu, Apr 30, 2020 at 06:28:08PM +0200, Christian Ehrhardt wrote: > > > On Thu, Apr 30, 2020 at 5:10 PM Daniel P. Berrangé <berrange@xxxxxxxxxx> wrote: > > > > > > > > On Thu, Apr 30, 2020 at 04:58:25PM +0200, Christian Ehrhardt wrote: > > > > > On Thu, Apr 30, 2020 at 2:51 PM Daniel P. Berrangé <berrange@xxxxxxxxxx> wrote: > > > > > Well it seems I have a reproducible symptom and a fix, but not the > > > > > explanation why the latter fixes the former. > > > > > I'll need to dive into some debug & analysis myself to explain it better. > > > > > I'll be back here once I got time to do that in depth check. > > > > > > > > > > Until then whoever is affected (should be everyone) can give it a > > > > > thought as well. > > > > > Repro is as easy as > > > > > One console: > > > > > $ journalctl -f -u libvirt-guests > > > > > Other console: > > > > > $ systemctl stop libvirt-guests > > > > > $ systemctl start libvirt-guests > > > > > > > > > > I see it with 245.4-4ubuntu3 (18.04) I'm not seeing it on > > > > > 237-3ubuntu10.39 (20.04). > > > > > Maybe it is a systemd bug after all? > > > > > I'd be interested to hear if that is/isn't clobbering output for > > > > > anyone else and what your systemd versions are? > > > > > > > > FWIW, it works correctly on Fedora 31 with systemd 243. > > > > > > Eoan with 242-7ubuntu3.7 is good as well. > > > I might need to try to get some interim versions from somewhere. > > > > I've reproduced on Fedora 33 rawhide with systemd 245 - the first > > place where it lists running guests is screwed up slightly: > > > > Apr 30 17:16:13 libvirt-fedora-rawhide systemd[1]: Stopping Suspend/Resume Running libvirt Guests... > > Apr 30 17:16:13 libvirt-fedora-rawhide libvirt-guests.sh[69903]: Running guests on default URI: > > Apr 30 17:16:13 libvirt-fedora-rawhide libvirt-guests.sh[69892]: Runningcore2 > > Apr 30 17:16:13 libvirt-fedora-rawhide libvirt-guests.sh[69934]: Suspending guests on default URI... > > Apr 30 17:16:13 libvirt-fedora-rawhide libvirt-guests.sh[69892]: SSuspending core1: ... > > Apr 30 17:16:14 libvirt-fedora-rawhide libvirt-guests.sh[69892]: Suspending core1: done > > Apr 30 17:16:14 libvirt-fedora-rawhide libvirt-guests.sh[69892]: Suspending core2: ... > > Apr 30 17:16:15 libvirt-fedora-rawhide libvirt-guests.sh[69892]: Suspending core2: done > > Apr 30 17:16:15 libvirt-fedora-rawhide systemd[1]: libvirt-guests.service: Succeeded. > > Apr 30 17:16:15 libvirt-fedora-rawhide systemd[1]: Stopped Suspend/Resume Running libvirt Guests. > > > > On resume it is even worse > > > > Apr 30 17:19:40 libvirt-fedora-rawhide systemd[1]: Starting Suspend/Resume Running libvirt Guests... > > Apr 30 17:19:40 libvirt-fedora-rawhide libvirt-guests.sh[70041]: Resuming guests on default URI... > > Apr 30 17:19:40 libvirt-fedora-rawhide libvirt-guests.sh[70030]: R > > Apr 30 17:19:40 libvirt-fedora-rawhide libvirt-guests.sh[70048]: R > > Apr 30 17:19:41 libvirt-fedora-rawhide libvirt-guests.sh[70048]: esuming guest core1: > > Apr 30 17:19:41 libvirt-fedora-rawhide libvirt-guests.sh[70079]: esum > > Apr 30 17:19:41 libvirt-fedora-rawhide libvirt-guests.sh[70030]: e > > Apr 30 17:19:41 libvirt-fedora-rawhide libvirt-guests.sh[70086]: e > > Apr 30 17:19:42 libvirt-fedora-rawhide libvirt-guests.sh[70086]: esuming guest core2: > > Apr 30 17:19:42 libvirt-fedora-rawhide libvirt-guests.sh[70119]: esum > > Apr 30 17:19:42 libvirt-fedora-rawhide libvirt-guests.sh[70030]: e > > Apr 30 17:19:42 libvirt-fedora-rawhide systemd[1]: Finished Suspend/Resume Running libvirt Guests. > > > > > > It is possible it isn't systemd related - could be other packages that > > are co-incidentally affecting it. > > Since launchpad holds all former builds still ready for download I > could easily test a few versions. > > I up/downgraded just the following packages on an otherwise unmodified system: > - libnss-systemd > - libpam-systemd > - libsystemd0 > - systemd > - systemd-container > - systemd-sysv > > Those versions I did check: > > 245.4-4ubuntu3 bad > 245.2-1ubuntu1 bad > 244.3-1ubuntu1 bad > 244.1-0ubuntu3 bad (bad on retry) > 243-3ubuntu1 good (good (on retry) > 242-7ubuntu3 good > > Retry means that I went and installed from good -> bad -> good and the > behavior was the same. > So none of the installs fixed it to then be good forever. It was > consistently good <244 and bad >=244. Based on that, a possibly interesting change between v243..v244 was this: commit 09d0b46ab61bebafe5bdc1be95ee153dfb13d6bc Author: Lorenz Bauer <lmb@xxxxxxxxxxxxxx> Date: Mon Nov 4 16:35:46 2019 +0000 journal: refresh cached credentials of stdout streams journald assumes that getsockopt(SO_PEERCRED) correctly identifies the process on the remote end of the socket. However, this is incorrect according to man 7 socket: The returned credentials are those that were in effect at the time of the call to connect(2) or socketpair(2). This becomes a problem when a new process inherits the stdout stream from a parent. First, log messages from the child process will be attributed to the parent. Second, the struct ucred used by journald becomes invalid as soon as the parent exits. Further sendmsg calls then fail with ENOENT. Logs for the child process then vanish from the journal. Fix this by using recvmsg on the stdout stream, and refreshing the cached struct ucred if SCM_CREDENTIALS indicate a new process. but I can't see an obvious mistake the code which would cause the behaviour we see. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|