Resolver times out resending with same transaction ID

Vince Del Vecchio <vince.sdoss@xxxxxxxxxx> · Tue, 21 Mar 2023 05:32:55 +0000

Hi all,

I recently observed reverse IPv4 address lookups timing out on a newly
configured host (Ubuntu 22.04 LTS, systemd 249.11-0ubuntu3.7).  I
tracked the problem to the DVE-2018-0001 mitigation code.

An example:

$ resolvectl query 151.101.1.164
151.101.1.164: resolve call failed: All attempts to contact name servers or networks failed

tcpdump shows (in relevant part):
 00:00:00.000000 IP 192.168.1.48.35911 > 8.8.8.8.53: 26417+ [1au] PTR? 164.1.101.151.in-addr.arpa. (55)
 00:00:00.021127 IP 8.8.8.8.53 > 192.168.1.48.35911: 26417 NXDomain 0/1/1 (115)
 00:00:00.021252 IP 192.168.1.48.35911 > 8.8.8.8.53: 26417+ PTR? 164.1.101.151.in-addr.arpa. (44)

The first query gets an "NXDOMAIN", which is the correct answer for
this address.

However, NXDOMAIN triggers the DVE-2018-0001 mitigation code to send a
revised query without EDNS OPT (confirmed in debug log).  I **never see
a response to this revised query**.

If there is only a single DNS server, the resolver resends the OPT-less
query after a timeout, and *that* gets an NXDOMAIN which is returned. 
However, if there are multiple DNS servers (e.g. 8.8.8.8 8.8.4.4), on
timing out, it sends another query with EDNS to the next server, and
the three-packet sequence repeats several times until it gives up.

Since the server *will* respond to a retransmit after 5s, my guess is
that the server, or maybe something in the network, is dropping
requests that reuse a transaction ID within a short time.  I tried a
few public DNS servers, which (surprisingly?) all behaved the same.  I
haven't found a simple way to rule out a firewall, router or my ISP.
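
One way to probe this outside of systemd-resolved would be to build the
two queries by hand and send them back to back.  Below is a minimal
sketch of the packet construction only, not code from systemd:
build_ptr_query() and put_u16() are hypothetical helpers, and the OPT
fields (a 1232-byte advertised payload size, no flags) are assumptions
rather than what resolved actually sends.  For the name in the capture
above it produces 55 bytes with OPT and 44 bytes without, matching the
sizes tcpdump reports.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Write a big-endian 16-bit value and return the advanced pointer. */
static uint8_t *put_u16(uint8_t *p, uint16_t v) {
        *p++ = (uint8_t) (v >> 8);
        *p++ = (uint8_t) (v & 0xff);
        return p;
}

/* Hypothetical helper: build a PTR query for a dotted name without a
 * trailing dot (e.g. "164.1.101.151.in-addr.arpa"), optionally followed
 * by an EDNS0 OPT record. Returns the packet length, or 0 if the name
 * is malformed or the buffer is too small. */
static size_t build_ptr_query(uint8_t *buf, size_t cap, uint16_t id,
                              const char *name, int with_edns) {
        size_t name_len, need;
        const char *label = name;
        uint8_t *p = buf;

        assert(buf);
        assert(name);

        name_len = strlen(name);
        if (name_len == 0 || name_len > 253 || name[name_len - 1] == '.')
                return 0;

        /* header (12) + encoded name (len + 2) + QTYPE/QCLASS (4) + OPT (11) */
        need = 12 + name_len + 2 + 4 + (with_edns ? 11 : 0);
        if (cap < need)
                return 0;

        p = put_u16(p, id);                 /* transaction ID */
        p = put_u16(p, 0x0100);             /* flags: RD set ("+" in tcpdump) */
        p = put_u16(p, 1);                  /* QDCOUNT */
        p = put_u16(p, 0);                  /* ANCOUNT */
        p = put_u16(p, 0);                  /* NSCOUNT */
        p = put_u16(p, with_edns ? 1 : 0);  /* ARCOUNT ("[1au]" in tcpdump) */

        /* Encode each label as a length byte followed by its characters. */
        while (*label) {
                const char *dot = strchr(label, '.');
                size_t n = dot ? (size_t) (dot - label) : strlen(label);

                if (n == 0 || n > 63)
                        return 0;
                *p++ = (uint8_t) n;
                memcpy(p, label, n);
                p += n;
                label += n;
                if (*label == '.')
                        label++;
        }
        *p++ = 0;                           /* root label */

        p = put_u16(p, 12);                 /* QTYPE = PTR */
        p = put_u16(p, 1);                  /* QCLASS = IN */

        if (with_edns) {
                *p++ = 0;                   /* OPT owner name: root */
                p = put_u16(p, 41);         /* TYPE = OPT */
                p = put_u16(p, 1232);       /* advertised UDP size (assumed) */
                p = put_u16(p, 0);          /* extended RCODE, EDNS version */
                p = put_u16(p, 0);          /* flags (assumed: none set) */
                p = put_u16(p, 0);          /* RDLENGTH: no options */
        }

        return (size_t) (p - buf);
}
```

Sending the EDNS variant, waiting for the NXDOMAIN, and then
immediately sending the OPT-less variant from the same UDP socket, once
with the same ID and once with a different one, should show whether the
second reply goes missing only when the ID is reused.  Running that
from a host on a different network would help separate the server from
anything on my own path.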

Regardless, my thought is that resending a slightly different query
after we did get a response should not use the same transaction id.  I
patched systemd as follows and the problem goes away:

--- a/src/resolve/resolved-dns-transaction.c
+++ b/src/resolve/resolved-dns-transaction.c
@@ -1312,6 +1312,7 @@ void dns_transaction_process_reply(DnsTransaction *t, DnsPacket *p, bool encrypt
                           FORMAT_DNS_RCODE(DNS_PACKET_RCODE(p)),
                           dns_server_feature_level_to_string(t->clamp_feature_level_nxdomain));
 
+                dns_transaction_shuffle_id(t);
                 dns_transaction_retry(t, false /* use the same server */);
                 return;
         }


A few questions:

- Does anyone else see this?

- Does this look like a reasonable fix?  Any thoughts on whether the
one other place where dns_transaction_retry(..., false) is called to
retry the same server with a lower feature level (SERVFAIL etc) should
do the same?

- Any other issues with the patch?  Or would it be reasonable to (add
comments and) submit a pull request?

-Vince Del Vecchio