Re: Resolver times out resending with same transaction ID

Petr Menšík <pemensik@xxxxxxxxxx> · Wed, 29 Mar 2023 14:19:51 +0200

This report led me to run a few checks, and indeed: what systemd-resolved
does with NXDOMAIN responses from clearly proper servers is plain
terrible. It should stop doing this ASAP. Instead of caching the negative
response, it repeats every query that results in an NXDOMAIN response.
Not once, to detect whether the workaround is required, but for every
single name that does not exist. Even for repeated queries.

I created issue 26967 [1] requesting that it stop doing such weird
things. Aruba support was able to identify the failing software versions
and when they were fixed. I think this is exactly the kind of workaround
DNS Flag Day 2019 was about. Please stop doing it by default.

Regards,
Petr

[1] https://github.com/systemd/systemd/issues/26967

On 3/21/23 06:32, Vince Del Vecchio wrote:
Hi all,

I recently observed reverse IPv4 address lookups timing out on a newly
configured host (Ubuntu 22.04 LTS, systemd 249.11-0ubuntu3.7).  I
tracked the problem to the DVE-2018-0001 mitigation code.

An example:

$ resolvectl query 151.101.1.164
151.101.1.164: resolve call failed: All attempts to contact name servers or networks failed

tcpdump shows (in relevant part):
  00:00:00.000000 IP 192.168.1.48.35911 > 8.8.8.8.53: 26417+ [1au] PTR? 164.1.101.151.in-addr.arpa. (55)
  00:00:00.021127 IP 8.8.8.8.53 > 192.168.1.48.35911: 26417 NXDomain 0/1/1 (115)
  00:00:00.021252 IP 192.168.1.48.35911 > 8.8.8.8.53: 26417+ PTR? 164.1.101.151.in-addr.arpa. (44)

The first query gets an "NXDOMAIN", which is the correct answer for
this address.

However, NXDOMAIN triggers the DVE-2018-0001 mitigation code to send a
revised query without EDNS OPT (confirmed in debug log).  I **never see
a response to this revised query**.
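
To make the difference between the two outgoing packets concrete, here
is a minimal sketch (not systemd code) that builds the same PTR question
with and without an EDNS OPT record, using the standard DNS wire format.
The helper names and the advertised UDP payload size of 1232 are
assumptions for illustration only; the transaction ID and name are taken
from the trace above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Append a 16-bit value in network byte order; returns the new offset. */
static size_t put_u16(uint8_t *buf, size_t off, uint16_t v) {
    buf[off] = (uint8_t)(v >> 8);
    buf[off + 1] = (uint8_t)(v & 0xff);
    return off + 2;
}

/* Encode a dotted name as length-prefixed labels plus the root label. */
static size_t put_name(uint8_t *buf, size_t off, const char *name) {
    while (*name) {
        const char *dot = strchr(name, '.');
        size_t len = dot ? (size_t)(dot - name) : strlen(name);
        assert(len > 0 && len <= 63);   /* label length limit */
        buf[off++] = (uint8_t)len;
        memcpy(buf + off, name, len);
        off += len;
        name += len;
        if (*name == '.')
            name++;
    }
    buf[off++] = 0;                     /* root label */
    return off;
}

/* Hypothetical helper: build a PTR query, optionally with an empty OPT
 * record in the additional section. Returns the packet length. */
size_t build_ptr_query(uint8_t *buf, size_t cap, uint16_t id,
                       const char *qname, int with_edns) {
    size_t off = 0;
    /* header + name (at most strlen + 2) + question tail + OPT record */
    assert(cap >= 12 + strlen(qname) + 2 + 4 + 11);
    off = put_u16(buf, off, id);
    off = put_u16(buf, off, 0x0100);            /* flags: RD set ("+") */
    off = put_u16(buf, off, 1);                 /* QDCOUNT */
    off = put_u16(buf, off, 0);                 /* ANCOUNT */
    off = put_u16(buf, off, 0);                 /* NSCOUNT */
    off = put_u16(buf, off, with_edns ? 1 : 0); /* ARCOUNT ("[1au]") */
    off = put_name(buf, off, qname);
    off = put_u16(buf, off, 12);                /* QTYPE = PTR */
    off = put_u16(buf, off, 1);                 /* QCLASS = IN */
    if (with_edns) {
        buf[off++] = 0;                         /* OPT owner name: root */
        off = put_u16(buf, off, 41);            /* TYPE = OPT */
        off = put_u16(buf, off, 1232);          /* CLASS: UDP size (assumed) */
        off = put_u16(buf, off, 0);             /* TTL high: ext. rcode, version */
        off = put_u16(buf, off, 0);             /* TTL low: flags */
        off = put_u16(buf, off, 0);             /* RDLENGTH: no options */
    }
    return off;
}
```

Called with ID 26417 and "164.1.101.151.in-addr.arpa", this produces 55
bytes with the OPT record and 44 bytes without it, matching the sizes in
the tcpdump output. Apart from ARCOUNT and the trailing OPT record, the
two packets are identical, transaction ID included.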

If there is only a single DNS server, the resolver resends the OPT-less
query after a timeout, and *that* gets an NXDOMAIN which is returned.
However, if there are multiple DNS servers (e.g. 8.8.8.8 8.8.4.4), on
timing out, it sends another query with EDNS to the next server, and
the three-packet sequence repeats several times until it gives up.

Since the server *will* respond to a retransmit after 5s, my guess is
that the server, or maybe something in the network, is dropping requests
with the same transaction ID that arrive close together in time.  I
tried a few public DNS servers that (surprisingly?) all behaved the
same.  I haven't found a simple way to rule out a firewall, router, or
my ISP.

Regardless, my thought is that resending a slightly different query
after we did get a response should not use the same transaction id.  I
patched systemd as follows and the problem goes away:

--- a/src/resolve/resolved-dns-transaction.c
+++ b/src/resolve/resolved-dns-transaction.c
@@ -1312,6 +1312,7 @@ void dns_transaction_process_reply(DnsTransaction *t, DnsPacket *p, bool encrypt
                            FORMAT_DNS_RCODE(DNS_PACKET_RCODE(p)),
                            dns_server_feature_level_to_string(t->clamp_feature_level_nxdomain));
 
+                dns_transaction_shuffle_id(t);
                 dns_transaction_retry(t, false /* use the same server */);
                 return;
         }
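
The idea behind the patch, stripped of systemd specifics, can be
sketched as follows. This is a toy model, not the resolved
implementation: ToyTransaction, toy_shuffle_id() and
toy_retry_without_edns() are hypothetical names, and rand() stands in
for whatever unpredictable source a real resolver would use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for a resolver's transaction state. */
typedef struct {
    uint16_t id;     /* DNS transaction ID currently in use */
    bool use_edns;   /* whether the next packet carries an OPT record */
} ToyTransaction;

/* Return true if any of the n pending transactions already uses id. */
static bool id_in_use(const ToyTransaction *pending, size_t n, uint16_t id) {
    for (size_t i = 0; i < n; i++)
        if (pending[i].id == id)
            return true;
    return false;
}

/* Give t an ID that differs from its old one and from every pending
 * transaction. rand() is for illustration only; it is predictable. */
void toy_shuffle_id(ToyTransaction *t, const ToyTransaction *pending, size_t n) {
    uint16_t candidate;
    assert(n < 65535);   /* otherwise no free ID may exist */
    do
        candidate = (uint16_t)(rand() & 0xffff);
    while (candidate == t->id || id_in_use(pending, n, candidate));
    t->id = candidate;
}

/* Downgrade after an NXDOMAIN that arrived in EDNS mode: drop the OPT
 * record and change the ID, so the follow-up query does not look like a
 * duplicate of the one that was just answered. */
void toy_retry_without_edns(ToyTransaction *t, const ToyTransaction *pending, size_t n) {
    t->use_edns = false;
    toy_shuffle_id(t, pending, n);
}
```

The point of the sketch is only the ordering: the ID changes before the
downgraded query is sent, so anything on the path that suppresses
closely spaced queries with the same ID has nothing to match on.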


A few questions:

- Does anyone else see this?

- Does this look like a reasonable fix?  Any thoughts on whether the
one other place where dns_transaction_retry(..., false) is called to
retry the same server with a lower feature level (SERVFAIL etc) should
do the same?

- Any other issues with the patch?  Or would it be reasonable to (add
comments and) submit a pull request?

-Vince Del Vecchio

--
Petr Menšík
Software Engineer, RHEL
Red Hat, https://www.redhat.com/
PGP: DFCF908DB7C87E8E529925BC4931CA5B6C9FC5CB