Re: Resolver times out resending with same transaction ID

Petr Menšík <pemensik@xxxxxxxxxx> · Fri, 24 Mar 2023 04:28:43 +0100

On 3/21/23 06:32, Vince Del Vecchio wrote:
Hi all,

I recently observed reverse IPv4 address lookups timing out on a newly
configured host.  (Ubuntu 22.04LTS, systemd 249.11-0ubuntu3.7).  I
tracked the problem to the DVE-2018-0001 mitigation code.

An example:

$ resolvectl query 151.101.1.164
151.101.1.164: resolve call failed: All attempts to contact name
servers or networks failed

tcpdump shows (in relevant part):
  00:00:00.000000 IP 192.168.1.48.35911 > 8.8.8.8.53: 26417+ [1au] PTR? 164.1.101.151.in-addr.arpa. (55)
  00:00:00.021127 IP 8.8.8.8.53 > 192.168.1.48.35911: 26417 NXDomain 0/1/1 (115)
  00:00:00.021252 IP 192.168.1.48.35911 > 8.8.8.8.53: 26417+ PTR? 164.1.101.151.in-addr.arpa. (44)

The first query gets an "NXDOMAIN", which is the correct answer for
this address.

However, NXDOMAIN triggers the DVE-2018-0001 mitigation code to send a
revised query without EDNS OPT (confirmed in debug log).  I **never see
a response to this revised query**.

Frankly, it is wrong of systemd-resolved to try to work around clearly
broken resolvers. In this case, it delays a correct response from a
well-behaved server, just because some really broken servers send wrong
replies. This should be enabled ONLY by manual configuration, if at all.
Every user should know he has broken DNS servers if this (mis)feature helps.

Anyway, it should not require a timeout. If the response has the correct
name and type in the question section and a matching transaction id, it is
clearly the response to our query. If resolved insists on those kinds of
workarounds, it should apply them right away, not after a no-response
timeout. Better still, do that only if requested. NXDOMAIN is a valid
response, and DNS folks take care to deliver it only when the requested
name does not exist. The only proper way to signal that the server does
not understand something in the query is a FORMERR response.
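
To make the matching rule above concrete, here is a minimal sketch in C.
It is not systemd code: the DnsMsg struct, the RCODE_* names and both
function names are made up for illustration, and the "retry without EDNS
only on FORMERR" policy is simply what I am arguing for, not what
resolved does today.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <strings.h>

/* Hypothetical, simplified view of a DNS message; NOT systemd's DnsPacket. */
typedef struct {
    uint16_t id;        /* transaction ID from the header */
    uint16_t qtype;     /* question type, e.g. 12 for PTR */
    const char *qname;  /* question name in presentation format */
    uint8_t rcode;      /* response code; only meaningful in replies */
} DnsMsg;

/* Standard DNS response codes. */
enum {
    RCODE_NOERROR  = 0,
    RCODE_FORMERR  = 1,
    RCODE_SERVFAIL = 2,
    RCODE_NXDOMAIN = 3
};

/* A reply answers our query if the transaction ID, question name and
 * question type all match. DNS names compare case-insensitively. */
static bool reply_matches_query(const DnsMsg *query, const DnsMsg *reply) {
    return query->id == reply->id &&
           query->qtype == reply->qtype &&
           strcasecmp(query->qname, reply->qname) == 0;
}

/* Assumed policy: drop EDNS only when a matching reply says FORMERR.
 * A matching NXDOMAIN is treated as the final answer. */
static bool should_retry_without_edns(const DnsMsg *query, const DnsMsg *reply) {
    return reply_matches_query(query, reply) && reply->rcode == RCODE_FORMERR;
}
```

With the query from the tcpdump above (ID 26417, PTR for
164.1.101.151.in-addr.arpa.), the NXDOMAIN reply matches the query, so
under this policy it would be returned immediately and no OPT-less retry
would be sent.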

It is a shame that ResolveUnicastSingleLabel=yes has to be configured
manually to avoid some failures on correct names, while tricks like this
one are enabled by default and cannot even be turned off manually. Please
correct that!

If there is only a single DNS server, the resolver resends the OPT-less
query after a timeout, and *that* gets an NXDOMAIN which is returned.
However, if there are multiple DNS servers (e.g. 8.8.8.8 8.8.4.4), on
timing out, it sends another query with EDNS to the next server, and
the three-packet sequence repeats several times until it gives up.

Since the server *will* respond to a retransmit after 5s, my guess is
that the server, or maybe something in the network, is dropping
close-in-time requests with the same transaction id.  I tried a few
public DNS servers that (surprisingly?) all behaved the same.  I haven't
found a simple way to rule out a firewall, router or my ISP.

Does the retransmit keep the same source port and transaction id?

Regardless, my thought is that resending a slightly different query
after we did get a response should not use the same transaction id.  I
patched systemd as follows and the problem goes away:

--- a/src/resolve/resolved-dns-transaction.c
+++ b/src/resolve/resolved-dns-transaction.c
@@ -1312,6 +1312,7 @@ void dns_transaction_process_reply(DnsTransaction *t, DnsPacket *p, bool encrypt
                            FORMAT_DNS_RCODE(DNS_PACKET_RCODE(p)),
                            dns_server_feature_level_to_string(t->clamp_feature_level_nxdomain));
 
+                dns_transaction_shuffle_id(t);
                 dns_transaction_retry(t, false /* use the same server */);
                 return;
         }


A few questions:

- Does anyone else see this?

- Does this look like a reasonable fix?  Any thoughts on whether the
one other place where dns_transaction_retry(..., false) is called to
retry the same server with a lower feature level (SERVFAIL etc) should
do the same?

Yes, it looks reasonable to me. Only unmodified retries should keep the
original transaction id. If resolved modifies the query it sent, the
modified query should get a new id. That also ensures that the EDNS
removal was the thing which helped, not just a pure retransmit. I think
it should change the transaction id every time it gets any response.
SERVFAIL is a response too.

- Any other issues with the patch?  Or would it be reasonable to (add
comments and) submit a pull request?

I think pull requests are in general a better way to request a code
change. They make commenting and linking related issues easier.
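
For the transaction id rule I described above (same id for a plain
retransmit, new id once the query is modified), a rough sketch in C could
look like this. Everything in it is hypothetical: Txn, fresh_id() and the
two retry helpers are names I made up, not systemd's DnsTransaction API,
and rand() only stands in for a proper unpredictable random source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical transaction state; NOT systemd's DnsTransaction. */
typedef struct {
    uint16_t id;     /* transaction ID currently in use */
    bool use_edns;   /* whether the query carries an EDNS OPT record */
} Txn;

/* Pick an ID that differs from the current one. rand() is a placeholder
 * only; a real resolver needs IDs from a secure random source. */
static uint16_t fresh_id(uint16_t current) {
    uint16_t id;

    do {
        id = (uint16_t) (rand() & 0xFFFF);
    } while (id == current);

    return id;
}

/* Timeout without any reply: the identical query is sent again, so the
 * transaction state (including the ID) stays untouched. */
static void retransmit_unmodified(Txn *t) {
    (void) t;
}

/* A reply arrived and a modified query (here: without EDNS) is sent, so
 * it gets a new ID. A reply to the new ID can then only belong to the
 * modified query. */
static void retry_without_edns(Txn *t) {
    t->use_edns = false;
    t->id = fresh_id(t->id);
}
```

The point of the split is that a late reply to the original query can
never be mistaken for a reply to the downgraded one.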

-Vince Del Vecchio

Just my 2 cents.

Cheers,

Petr

--
Petr Menšík
Software Engineer, RHEL
Red Hat, https://www.redhat.com/
PGP: DFCF908DB7C87E8E529925BC4931CA5B6C9FC5CB