Re: f33: systemd-resolved hang on ip query

Dridi Boukelmoune <dridi.boukelmoune@xxxxxxxxx> · Sun, 13 Dec 2020 15:58:06 +0000

> That's totally on me, I read the resolved.conf(5) manual but didn't
> pay attention to the systemd-resolved.service(8) manual in the SEE
> ALSO section: that would have led me to resolvectl. Thanks for the
> pointer, I'll do some reading, probably figure what I mis-configured
> and otherwise file a BZ.

I think it's a bit of both mis-configuration and (potential) bugs (plural).

I have my global configuration:

DNS=address1
FallbackDNS=address2 address3

And with `resolvectl status` I can see my network's configuration:

DNS=192.168.something
FallbackDNS=address4

With debug logs enabled I can see 4 attempts at resolving com.example:

- A via address1
- AAAA via address1
- A via 192.168.something
- AAAA via 192.168.something

First (maybe) bug? all 4 attempts advertise cache misses.

> Cache miss for net.example IN A
> Cache miss for net.example IN AAAA
> Cache miss for net.example IN A
> Cache miss for net.example IN AAAA

Working on a cache myself, my intuition would tell me to check the cache once
and dispatch queries on a miss, not the other way around. Your mileage may
vary, I will not argue that point.

Then following the transactions by number I quickly see this:

> Processing incoming packet on transaction 42493 (rcode=NXDOMAIN).
> Added NXDOMAIN cache entry for net.example IN ANY 1388s
> Transaction 42493 for <net.example IN AAAA> on scope dns on */* now complete with <rcode-failure> from network (unsigned).
> Processing incoming packet on transaction 49810 (rcode=NXDOMAIN).
> Added NXDOMAIN cache entry for net.example IN ANY 1388s
> Transaction 49810 for <net.example IN A> on scope dns on */* now complete with <rcode-failure> from network (unsigned).

The answers came quickly from address1, and were allegedly cached.

Second bug? subsequent queries will still be cache misses.

Meanwhile, the two queries forwarded to 192.168.something are failing in a
loop. That is expected since this host is currently down. And since this is
UDP and there are retries, that's where it takes forever and turns an innocent
query into a DoS for blocking clients like getaddrinfo().

Very soon, it tries to fall back to address4:

> Switching to DNS server address4 for interface wlp2s0.
> Cache miss for net.example IN A
> Transaction 44836 for <net.example IN A> scope dns on wlp2s0/*.
> Using feature level TLS+EDNS0 for transaction 44836.
> Using DNS server address4 for transaction 44836.
> Sending query via TCP since UDP isn't supported.
> Using feature level TLS+EDNS0 for transaction 44836.

Third bug? systemd-resolved seems to have wrongfully recorded that UDP didn't
work for address4, where prior to that it failed for 192.168.something and
address4 was attempted for the very first time at this point.

It looks like during startup the primary DNS was probed and that led to the
following logs in a loop:

> Using degraded feature set UDP instead of UDP+EDNS0 for DNS server 192.168.something.
> Using degraded feature set TCP instead of UDP for DNS server 192.168.something.
> Using degraded feature set UDP instead of TCP for DNS server 192.168.something.
> Using degraded feature set TCP instead of UDP for DNS server 192.168.something.
> Using degraded feature set UDP instead of TCP for DNS server 192.168.something.
> Using degraded feature set TCP instead of UDP for DNS server 192.168.something.
> Using degraded feature set UDP instead of TCP for DNS server 192.168.something.

It might have resulted in a coin flip between UDP and TCP for the whole link
instead of the primary DNS since neither worked.

The fallback DNS for my network will also fail numerous times since only TCP
is attempted and only UDP is supported, but it fails faster thanks to TCP
being TCP.

Eventually the lookup fails:

> Transaction 44836 for <net.example IN A> on scope dns on wlp2s0/* now complete with <attempts-max-reached> from none (unsigned).

It's hard to tell which logs belong to what, because there doesn't seem to be
a parent transaction of the 4 started at the begining. It's difficult without
a correlation id to tell when exactly it finishes, but once both 44836 and its
AAAA counterpart are freed, I see this:

> Sent message type=error sender=n/a destination=:1.71134 path=n/a interface=n/a member=n/a cookie=100466 reply_cookie=2 signature=s error-name=org.freedesktop.DBus.Error.Timeout error-message=All attempts to contact name servers or networks failed
> Sent message type=method_call sender=n/a destination=org.freedesktop.DBus path=/org/freedesktop/DBus interface=org.freedesktop.DBus member=RemoveMatch cookie=100467 reply_cookie=0 signature=s error-name=n/a error-message=n/a

I can only assume that this is the final error of my attempt to resolve a
non existend domain. I think that this error triggers the second bug: since
the resolution itself failed because of one missing DNS server, nothing was
actually inserted in the cache. The cache entries were created, but not added.

I tried to have another resolution for the same domain while the first one
was busy waiting for UDP packets and it also blocked. This at least aligns
with the idea that cache entries are not made available until the very end
of a resolution.

I have removed the offending host from my network configuration and everything
is back to normal. So until I bring it back up I have a working setup again.
I'll try to come up with a deterministic reproducer, and submit a BZ.

Thanks again for the help!

Dridi
_______________________________________________
devel mailing list -- devel@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to devel-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/devel@xxxxxxxxxxxxxxxxxxxxxxx