Re: [Last-Call] Genart last call review of draft-ietf-dnsop-caching-resolution-failures-06

"Wessels, Duane" <dwessels=40verisign.com@xxxxxxxxxxxxxx> · Mon, 21 Aug 2023 21:07:19 +0000

> On Aug 11, 2023, at 5:36 AM, Lucas Pardue via Datatracker <noreply@xxxxxxxx> wrote:
> 
> Reviewer: Lucas Pardue
> Review result: Ready with Issues
> 
> I am the assigned Gen-ART reviewer for this draft. The General Area
> Review Team (Gen-ART) reviews all IETF documents being processed
> by the IESG for the IETF Chair.  Please treat these comments just
> like any other last call comments.
> 
> For more information, please see the FAQ at
> 
> <https://wiki.ietf.org/en/group/gen/GenArtFAQ>.
> 
> Document: draft-ietf-dnsop-caching-resolution-failures-??
> Reviewer: Lucas Pardue
> Review Date: 2023-08-11
> IETF LC End Date: 2023-08-17
> IESG Telechat date: Not scheduled for a telechat
> 
> Summary: The document was well-written with clear motivation statements and
> normative text for addressing the indicated problems.

Hi Lucas, thanks for the detailed review.

> 
> Major issues: None
> 
> Minor issues:
> 
> * Section 3.1 describes retries and places the normative requirement "A
> resolver MUST NOT retry a given query to a server address over a given
> transport protocol more than ...". However, the definition of "transport
> protocol" is not 100% clear to me, and the terms "transport" and "transport
> layer protocol" seem to be used interchangeably through the document.  Perhaps
> this is clearer to those in the DNS area, but as a transport area person, DNS
> over TCP and DNS over TLS both use the same transport protocol. Section 2.3
> would seem to imply that DNS over TCP and DNS over TLS are treated as different.
> 
> I think it would help to better define exactly what "a given transport
> protocol" in section 3.1 means. Perhaps that definition already exists
> somewhere that can be cited and imported into the terminology section.

You’re right that we have not been especially precise when using the word “transport.”
The authors did intend for DNS over UDP, over TCP, over TLS, etc., to be treated
essentially as separate transports, or separate ways a client can talk to a server.

I’m not sure how best to fix this.  On one hand, as far as we know, there is
currently not a good term that collectively refers to DNS over UDP, TCP, TLS, HTTPS,
QUIC, and whatever else may come our way.  So maybe we need to define one.  I’m
hesitant, though, because I’m not sure this document is where such a term should
be introduced, and because definitions often turn out to be like cans of worms.

Nonetheless, we have taken a stab at it:

   *  DNS Transport: In this document, DNS transport means a protocol
      used to transport DNS messages between a client and a server.
      This includes "classic DNS" transports, i.e., DNS-over-UDP and
      DNS-over-TCP [RFC1034] [RFC7766], as well as newer encrypted DNS
      transports such as DNS-over-TLS [RFC7858], DNS-over-HTTPS
      [RFC8484], DNS-over-QUIC [RFC9250], and similar communication of
      DNS messages using other protocols.  NOTE: at the time of this
      writing not all DNS transports are standardized for all types of
      servers, but may become standardized in the future.

…

3.1.  Retries and Timeouts

   A resolver MUST NOT retry a given query to a server address over a
   given DNS transport more than twice (i.e., three queries in total)
   before considering the server address unresponsive over that DNS
   transport for that query.

   A resolver MAY retry a given query over a different DNS transport to
   the same server if it has reason to believe the DNS transport is
   available for that server and is compatible with the resolver's
   security policies.
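
As a rough illustration of the retry rule above (not text from the draft), the
limit might be sketched as follows in Python. The names `resolve`, `send`, and
`MAX_RETRIES` are hypothetical, and `transports` is assumed to already contain
only the DNS transports that the resolver believes are available for the
server and that its security policy allows.

```python
from typing import Callable, Iterable, Optional

MAX_RETRIES = 2  # "more than twice" is forbidden, so at most three queries in total


def resolve(
    server: str,
    transports: Iterable[str],
    send: Callable[[str, str], Optional[bytes]],
) -> Optional[bytes]:
    """Try each permitted DNS transport in turn, bounding queries per transport.

    `send(server, transport)` is a hypothetical callback that returns the
    response bytes, or None when the query times out or otherwise fails.
    """
    for transport in transports:
        for _attempt in range(1 + MAX_RETRIES):
            response = send(server, transport)
            if response is not None:
                return response
        # Three queries failed: treat this server address as unresponsive
        # over this DNS transport for this query, then move to the next one.
    return None
```

With a `send` that always fails, this issues exactly three queries per
transport and then gives up, at which point the resolution failure would be
cached.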

> 
> Nits/editorial comments:
> 
> * In section 1, there exists "section 5" and "section 7" usages that do not make it
> clear if these are internal or external references.

We propose to just remove those section references.

> 
> * I appreciated the text in sections 1.1 and 1.2, dealing with motivation and
> related use cases respectively. However, as a generalist reviewer, the most
> useful part of Section 1.1 was the first sentence. The remainder of the text in
> 1.1 feels like case studies, that while interesting manifestations, are not
> pure motivation. As a purely editorial suggestion you can take or leave,
> consider modifying the last paragraph of Section 1 to something like
> 
> "Operators of DNS services have known for some time that recursive resolvers
> become more aggressive when they experience resolution failures; see Appendix A
> for a collection of anecdotes, experiments, and incidents that support this claim.
> This document updates [RFC2308] to require negative caching of DNS resolution
> failures, which can help to mitigate the operational problems failures might
> generate. Examples of resolution failures are provided in Section 2. Related
> work is described in Appendix B."
> 
> then move the text from sections 1.1 and 1.2 in appendix A and appendix B.

That is an interesting suggestion.  After discussion with my coauthors, we have
a slight preference to leave it as-is, but would also like to take advice on
this from the RFC editor.

> 
> * TOC - "Conditions That Lead To DNS Resolution Failures" vs "Requirements for
> Caching Resolution Failures". Presumably the same thing, so consistency might
> help

I’m not sure I understand this comment.  Can you explain further what you mean?

> 
> * Section 3.2 - regarding the 1 second minimum requirement, the text that
> follows says "Resolvers MAY cache different types of resolution failures for
> different (i.e., longer) amounts of time." and then later "Consistent with
> [RFC2308], resolution failures MUST NOT be cached for longer than 5 minutes.".
> These statements are all logically consistent but could be made simpler with
> some editorial work. For example, something like
> 
> "Resolvers MUST cache resolution failures for at least 1 second. Resolvers MAY
> cache failures for a longer time, up to a maximum of 5 minutes (per the
> requirements of [RFC2308]). Resolvers MAY cache different types of failures
> using different time periods within this range."

I see what you’re saying.  We propose to move the maximum caching time up and split that paragraph into two, as follows:

   Resolvers MUST cache resolution failures for at least 1 second.
   Resolvers MAY cache different types of resolution failures for
   different (i.e., longer) amounts of time.  Consistent with [RFC2308],
   resolution failures MUST NOT be cached for longer than 5 minutes.

   The minimum cache duration SHOULD be configurable by the operator.  A
   longer cache duration for resolution failures will reduce the
   processing burden from repeated queries, but may also increase the
   time to recover from transitory issues.
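
For illustration only, the bounds in the proposed text could be applied as a
simple clamp. This is a minimal sketch under stated assumptions, not anything
taken from the draft: the function name `failure_cache_ttl` and its parameters
are hypothetical, and it assumes an operator-configured minimum may raise the
1 second floor but never lower it.

```python
MIN_FAILURE_TTL = 1.0    # seconds; failures MUST be cached at least this long
MAX_FAILURE_TTL = 300.0  # seconds; failures MUST NOT be cached longer than 5 minutes


def failure_cache_ttl(requested: float, operator_minimum: float = MIN_FAILURE_TTL) -> float:
    """Clamp a requested failure-cache duration (seconds) to the allowed range.

    `requested` is whatever duration the resolver would like to use for this
    type of failure; `operator_minimum` is the configurable minimum.
    """
    # The operator setting can only raise the floor (assumption, see lead-in).
    floor = max(operator_minimum, MIN_FAILURE_TTL)
    # Apply the floor first, then the 5 minute ceiling, so the cap always wins.
    return min(max(requested, floor), MAX_FAILURE_TTL)
```

For example, `failure_cache_ttl(0.2)` gives 1.0, `failure_cache_ttl(1000)` gives
300.0, and `failure_cache_ttl(5, operator_minimum=30)` gives 30.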

DW
