On Thu, Jul 11, 2019 at 08:35:38PM +0000, Zbigniew J??drzejewski-Szmek wrote: > On Thu, Jul 11, 2019 at 10:08:43AM -0400, Brian Reichert wrote: > > Does that sound like expected behavior? > > No, this shouldn't happen. > > What I was trying to say, is that if you have the strace log, you > can figure out what created the stale connection and what the dbus > call was, and from all that info it should be fairly simply to figure > out what the calling command was. Once you have that, it'll be much > easier to reproduce the issue in controlled setting and look for the > fix. I'm finally revisiting this. I haven't found a way to get a trace to start early enough to catch the initial open() on all of the targeted file descriptors, but I'm trying to make do with what I have. To sum up, in my naive analysis, I see close() called many times on a file descriptor. I then see more messages come in on that same descriptor. But the timestamp of the descriptor in /proc never changes. I created a service to launch strace as early as I can figure: localhost:~ # cat /usr/lib/systemd/system/systemd_strace.service [Unit] Description=strace systemd DefaultDependencies=no After=local-fs.target Before=sysinit.target ConditionPathExists=!/etc/initrd-release [Service] ExecStart=/usr/bin/strace -p1 -t -o /home/systemd.strace -e recvmsg,close,accept4,getsockname,getsockopt,sendmsg -s999 ExecStop=/bin/echo systemd_strace.service will soon exit Type=simple [Install] WantedBy=multi-user.target I introduced the '-t' flag, so I'd get timestamps on the recorded entries. I rebooted the server, and after ~20 minutes, I found stale descriptors, that seem to date to when the host first booted. Note the age of them, relative to the boot time, and they have no connected peers. localhost:~ # uptime 14:10pm up 0:21, 3 users, load average: 0.81, 0.24, 0.15 localhost:~ # date Tue Jul 30 14:10:09 EDT 2019 localhost:~ # lsof -nP /run/systemd/private | awk '/systemd/ { sub(/u/, "", $4); print $4}' | ( cd /proc/1/fd; xargs ls -t --full-time ) | tail -5 lrwx------ 1 root root 64 2019-07-30 13:49:25.458694632 -0400 14 -> socket:[28742] lrwx------ 1 root root 64 2019-07-30 13:49:25.458694632 -0400 16 -> socket:[35430] lrwx------ 1 root root 64 2019-07-30 13:49:25.458694632 -0400 17 -> socket:[37758] lrwx------ 1 root root 64 2019-07-30 13:49:25.458694632 -0400 18 -> socket:[41044] lrwx------ 1 root root 64 2019-07-30 13:49:25.458694632 -0400 19 -> socket:[43411] localhost:~ # ss -x | grep /run/systemd/private | grep -v -e '* 0' | wc -l 0 This is an XFS filesystem, so I can't directly get the creation time of my trace file, but I can see the first entry is timestamped '13:49:07'. I copied the trace file aside, and edited that copy to trim everything off after 14:10:09, when I ran that 'date' command above. As early as I tried to start this trace, dozens of file descriptors had already been created. Trying to focus on FD 19 (the oldest connection to /run/systemd/private): I see, between 13:49:30 and 13:50:01, I see 25 'successful' calls for close(), e.g.: 13:50:01 close(19) = 0 Followed by getsockopt(), and a received message on the supposedly-closed file descriptor: 13:50:01 getsockopt(19, SOL_SOCKET, SO_PEERCRED, {pid=3323, uid=0, gid=0}, [12]) = 0 13:50:01 getsockopt(19, SOL_SOCKET, SO_RCVBUF, [4194304], [4]) = 0 13:50:01 getsockopt(19, SOL_SOCKET, SO_SNDBUF, [262144], [4]) = 0 13:50:01 getsockopt(19, SOL_SOCKET, SO_PEERCRED, {pid=3323, uid=0, gid=0}, [12]) = 0 13:50:01 getsockopt(19, SOL_SOCKET, SO_ACCEPTCONN, [0], [4]) = 0 13:50:01 getsockname(19, {sa_family=AF_LOCAL, sun_path="/run/systemd/private"}, [23]) = 0 13:50:01 recvmsg(19, {msg_name(0)=NULL, msg_iov(1)=[{"\0AUTH EXTERNAL 30\r\nNEGOTIATE_UNIX_FD\r\nBEGIN\r\n", 256}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 45 13:50:01 sendmsg(19, {msg_name(0)=NULL, msg_iov(3)=[{"OK 9fcf621ece0a4fe897586e28058cd2fb\r\nAGREE_UNIX_FD\r\n", 52}, {NULL, 0}, {NULL, 0}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 52 13:50:01 sendmsg(19, {msg_name(0)=NULL, msg_iov(2)=[{"l\4\1\1P\0\0\0\1\0\0\0p\0\0\0\1\1o\0\31\0\0\0/org/freedesktop/systemd1\0\0\0\0\0\0\0\2\1s\0\0\0\0org.freedesktop.systemd1.Manager\0\0\0\0\0\0\0\0\3\1s\0\7\0\0\0UnitNew\0\10\1g\0\2so\0", 128}, {"\20\0\0\0session-11.scope\0\0\0\0003\0\0\0/org/freedesktop/systemd1/unit/session_2d11_2escope\0", 80}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = -1 EPIPE (Broken pipe) I see a continuous stream of messages coming in on FD 19, though the end of the trace, but the age of the file descriptor in /proc never seems to change. Am I misinterpreting something? > Zbyszek -- Brian Reichert <reichert@xxxxxxxxxxx> BSD admin/developer at large _______________________________________________ systemd-devel mailing list systemd-devel@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/systemd-devel