Re: Extreme startup delay on F34

Chris Murphy <lists@xxxxxxxxxxxxxxxxx> · Mon, 19 Jul 2021 17:44:08 -0600

OK I'm skipping ahead because I got an email offlist from John with
the journal. Hopefully he will post a pastebin or use file sharing
service to share the journal so others can pitch in on the diagnosis.

I'm seeing quite a lot of gaps in the time stamps (why I like
monotonic time, makes such gaps obvious), related to audit...

just two examples:

[    6.466561] systemd-sysv-generator[503]: SysV service
'/etc/rc.d/init.d/livesys-late' lacks a native systemd unit file.
Automatically generating a unit file for compatibility. Please update
package to include a native systemd unit file, in order to make it
more safe and robust.
[   10.084941] kernel: audit: type=1334 audit(1626714308.750:61):
prog-id=17 op=LOAD

[   27.581467] avahi-daemon[684]: Found user 'avahi' (UID 70) and
group 'avahi' (GID 70).
[   36.211086] audit: BPF prog-id=30 op=LOAD

That's huge. But I don't know what's going on in during that time. If
it's really audit or something else? On the one hand I'd like to see
the journal again with boot parameter 'systemd.log_level=debug' and
maybe that will give more details in the gaps? The problem is that
it's extremely verbose and makes boot much slower, so it's possible
we'll start to see other issues, like even more delays from journald
flushing to disk.

The fact journald is flushing to disk so slowly makes me wonder if the
SSD is becoming slow. There's a variety of pre-death behaviors for
SSDs, slowness is one of them. Of course folks running Btrfs can see
this show up initially as transient corruptions of data or metadata
(or both). Given that data is huge in comparison to metadata, it has a
giant bullseye on its back, so data corruptions on Btrfs are often an
early warning sign of impending hardware issues.

Another possibility is 'rd.debug' to see what's going on with
dracut/initramfs but I think that's even more verbose and not yet the
direction to go in. One of the pieces of nearly weird ancient junk I
have is a 2011 macbook pro that I added a SATA SSD drive to ages ago,
and it boots to a login window in about 9 seconds. No special tuning
on my part.

[   44.282677] kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eno1: link becomes ready
[   46.157461] audit[689]: NETFILTER_CFG
table=firewalld:1;mangle_PREROUTING_POLICIES_pre:240 family=1
entries=2 op=nft_register_chain pid=689
subj=system_u:system_r:firewalld_t:s0 comm="firewalld"

Only a couple of seconds, but again it's audit as the second listed
item following the delay. And 2 seconds is a ton of time, really!

[   46.274996] audit[689]: NETFILTER_CFG table=firewalld:1;__set1:260
family=1 entries=0 op=nft_register_set pid=689
subj=system_u:system_r:firewalld_t:s0 comm="firewalld"
[   51.976223] audit[689]: NETFILTER_CFG table=firewalld:1;__set1:260
family=1 entries=260 op=nft_register_setelem pid=689
subj=system_u:system_r:firewalld_t:s0 comm="firewalld"

5-6 seconds, it's really extreme. But no messages during the delay
period, so it's hard to tell if it's audit itself, or firewalld, or by
extension something network related. Or none of the above.

And at this point it doesn't matter what else the logs say because
this is already too long... but...

[   65.033121] gnome-session[1034]: gnome-session-binary[1034]:
WARNING: Falling back to non-systemd startup procedure due to error:
GDBus.Error:org.freedesktop.DBus.Error.NameHasNoOwner: Name
"org.freedesktop.systemd1" does not exist
[   68.291154] audit[1129]: NETFILTER_CFG table=filter family=2
entries=19 op=xt_replace pid=1129
subj=system_u:system_r:iptables_t:s0-s0:c0.c1023 comm="iptables"
[

Another delay, audit and networking related.

[   77.687346] org.gnome.Shell.desktop[1523]: Errors from xkbcomp are
not fatal to the X server
[   79.685364] abrt-server[1151]: Deleting problem directory
oops-2021-07-19-13:06:04-904-0 (dup of oops-2021-06-15-09:54:58-776-0)
[   80.745381] systemd[1]: Created slice Slice
/system/dbus-:1.13-org.freedesktop.problems.

None of these are obviously networking or audit related. Just slow. 3
seconds delay to discover and delete an abrt directory?

I'm getting suspicious of SSD performance (bandwidth and latency).
What's the output from:

smartctl -x /dev/sdX

Even going back through the log very superficially, I'm getting
suspicious that the times that aren't showing up as obvious gaps, are
just taking too long - multiple seconds for gnome shell stuff? That's
more like HDD bandwidth and (especially) latency.

[  129.512520] PackageKit[1220]: resolve transaction /979_adbedcab
from uid 1000 finished with success after 10999ms
[  137.008831] realmd[1457]: quitting realmd service after timeout

Yeah ok wow, so packagekit even says 11s to do some basic stuff? Okayyy...

uhh what is this??

[    1.903084] kernel: i915 0000:00:02.0: [drm] *ERROR* Stolen
reserved area [mem 0xdf100000-0xdf1fffff] outside stolen memory [mem
0x9f200000-0xc11fffff]
[    1.903856] kernel: i915 0000:00:02.0: vgaarb: changed VGA decodes:
olddecodes=io+mem,decodes=io+mem:owns=io+mem

hmmm.

Chris Murphy
_______________________________________________
users mailing list -- users@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to users-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/users@xxxxxxxxxxxxxxxxxxxxxxx
Do not reply to spam on the list, report it: https://pagure.io/fedora-infrastructure