Re: draft-mm-wg-effect-encrypt-13 review

I would put each of my comments in the etherpad into one of three buckets: observations and commentary, minor points and clarifications, and points that I think require more discussion. I'm only really going to address the last bucket: the others I raised but don't feel particularly strongly about, at least at this late stage in the document's life.

   (section 3.2 of [RFC7525]), essentially preventing the negotiation
   process resulting in fallback to the use of clear text.  In other
   cases, some service providers have relied on middle boxes having
   access to clear text for the purposes of load balancing, monitoring
   for attack traffic, meeting regulatory requirements, or for other
   purposes.  These middle box implementations, whether performing
   functions considered legitimate by the IETF or not, have been
   impacted by increases in encrypted traffic.  Only methods keeping
   with the goal of balancing network management and PM mitigation in
   [RFC7258] should be considered in solution work resulting from this
   document.

KR> I feel like this section could be better organized by:
 * Moving the examples to 1.1 as a bulleted list of sample situations in which network operators attempted to and/or succeeded in defeating encryption to preserve existing operational mechanisms, or in which performance suffered for users (whether of the encrypted flows or of other flows impacted by encrypted flows).

KM> Interesting point, but we'd need more examples.  I'll think about this more and chat with Al in case he has ideas.  For now, I went with Brandon's easier suggestion, but moving to this would be nice for the document readers.

AM> Although I see how these examples could be part of the background, I think those who will
eventually remove their objections will prefer the reduced emphasis on these examples where
they are (in section 2). In one view, the entire memo is background, since nothing new is proposed.

KR>
 * Using this section as an introduction to the methodology for cataloging operational mechanisms depending on cleartext traffic monitoring, with the various caveats on what will be considered (e.g., only mechanisms required heretofore for operability), and for describing the approach to seeking mitigations and/or substitutions.

KM> Hmm, interesting point.  I'll have to think about this more as it could be a lot of work at this stage.

AM> Unfortunately, we've already implemented many AD-level suggestions on the organization of Section 2.
We're at the stage of "what can everybody live with", and re-re-re-org falls out now, IMO.

This is a reasonable objection, but I am mostly concerned about satisfying the target audience. ISTM that audience is something like "IETF participants who are skeptical about, or ignorant of, the operational difficulties posed by widespread encryption of flows". If the document meets your goals as-is, then further refinement is unnecessary. Does it? Or is it impossible to know at this point?
 
AM> Also, the neutral exposition that we've been asked to provide a million times actually
comes from multiple perspectives expressed in contributions that we would combine
in a balanced way, without value judgements (no good or bad).
Where we lack balance, we lack specific contributions.

"Neutral" meaning "without advocacy"? It's a fine line. The document still has a thesis, which (if I am reading it correctly) is to inform the community of current operational practices that are encumbered in some way by encryption. To wit, q( It provides network operators' perspectives about the motivations and objectives of those practices as well as effects anticipated by operators as use of encryption increases. ) Without that thesis, which some might interpret as advocacy, it's not clear the document would have enough focus to be useful. I'd like folks to see this doc as *at worst* devil's advocacy, and preferably as a challenge to come up with better arguments and to find alternative methods for dealing with network management problems.
 
   heuristics grows, and accuracy suffers.  For example, the traffic
   patterns between server and browser are dependent on browser supplier
   and version, even when the sessions use the same server application
   (e.g., web e-mail access).  It remains to be seen whether more
   complex inferences can be mastered to produce the same monitoring
   accuracy.

KR> This might be too formal of an approach for this doc, but it might be possible to construct a taxonomy of layers of metadata made unavailable by encryption at each layer to show the completeness/comprehensiveness of the survey. So, for instance:
 * Protocol and port number are still available as a way of characterizing traffic over the public internet even if the payload is encrypted, but this information is lost if (e.g.) the traffic is traversing an IPsec tunnel or if radically different kinds of traffic all use port 443/tcp without any other way to distinguish between them.
 * TCP is open to optimization/measurement even if using TLS, except when tunneled encrypted: congestion signals (like rexmits) previously transparent to the middlebox, for instance, are then lost.
 * Encrypting the payload defeats attempts to survey traffic by user agent (if there's no other way to distinguish, e.g., by fingerprinting).

KM> I think this would be a really helpful follow on document.  I'd be willing to work on it if you're game.  I've been thinking about something similar, specific to TLS, but it should be broadened.

I'm afraid this topic might be more of a research project than it initially appears, but it's probably a reasonable exercise to see what we can produce for a plaintext protocol vs. that same protocol over TLS. Ping me.
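As a strawman for what such a survey could output, here's a toy sketch. The layer names and field sets are purely illustrative (not drawn from the draft), but they show the shape of the taxonomy: for each encryption layer, which metadata a passive on-path observer still sees, and what is lost when moving between layers.

```python
# Illustrative only: which metadata remains visible to an on-path observer
# at each (hypothetical) encryption layer. Field sets are examples, not a
# complete survey.
VISIBLE_METADATA = {
    "cleartext":    {"5-tuple", "tcp-signals", "payload", "user-agent"},
    "tls":          {"5-tuple", "tcp-signals", "sni", "record-sizes"},
    "tls-in-ipsec": {"outer-5-tuple", "packet-sizes", "timing"},
}

def lost_fields(from_layer, to_layer):
    """Metadata visible at from_layer but hidden once traffic moves to to_layer."""
    return VISIBLE_METADATA[from_layer] - VISIBLE_METADATA[to_layer]
```

The survey itself would then be a matter of filling in the field sets accurately per protocol stack, which is where the research-project part comes in.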
 
   It is important to note that the push for encryption by application
   providers has been motivated by the application of the described
   techniques.  Some application providers have noted degraded
   performance and/or user experience when network-based optimization or
   enhancement of their traffic has occurred, and such cases may result
   in additional operator troubleshooting, as well.

KR> Observation: additionally, I think you'll encounter the argument that the responsibility for diagnosing bad interactions between applications and networks falls on the application owner rather than the network operator. Basically, I feel like the desire among protocol designers is for operators to provide a pipe with certain key characteristics that interact well with established transport protocol mechanisms, and otherwise to leave the traffic alone and let the application developers do what they want to within the expected constraints. If that's infeasible (e.g., in edge cases, or with respect to new technologies that interact badly with existing transports, such as the loss=congestion assumption of TCP that interacts badly with wifi), that's precisely the case that needs to be made by this document.

KM> We have encountered this argument already.  It's a tough one as SPs have the SLAs with customers, so they are the first call.  Many don't know how to get in touch with APP providers.  I understand the application developers' perspective, but also see that there has to be some ability to troubleshoot.  Sure, providers could wrap the protocols for transport to provide some way of measuring, but information is lost.  IPv6 with flow identifiers is another way to do it, but you might not be able to prioritize a call or protocol that has little tolerance for delay over one that does, for instance.  And I realize that app providers just want all traffic to have the same priority, but emergency calls are important.

BW> I think the point made by the document is correct though: operators are nearly always the first call, not the application provider.

KM> We were asked to remove text that said that.  I agree that it is the case as the providers have the SLAs and you don't typically have a number for App providers.

BW> The operators are looking for ways to demonstrate that they did not cause the problem (or determine that they did) for efficient hand-off to the correct party for resolution. There are certainly problems with an approach that changes the behavior of the protocol, but it's difficult to argue with the diagnostic need.

AM> Using Netflix as an example, the first source of problem they mention is the network when
addressing the question "Why doesn't Netflix work?":
    "If Netflix isn’t working, you may be experiencing a network connectivity issue, an issue with your device, or an issue with your Netflix app or account."
    from https://help.netflix.com/en/node/461?ui_action=kb-article-popular-categories
They previously had even stronger wording, something like "First, make sure your network connection meets the Netflix requirements ... URL"
One of the causes of re-buffering are CDN-related pauses when accessing the next segment:  completely hidden from users so far.
Additional frequent cause: the unlicensed WiFi network owned and operated by the customer.

Another way to look at this strategy: App providers are transferring as much overhead cost to the network operators as possible
(troubleshooting customer problems is expensive - rolling a truck negates months of revenue), while preserving as
much value/control/revenue as they can for themselves. The greed-thingy plays poorly over time.
A user-focused strategy would be to form partnerships for troubleshooting of shared customers, but that might result in exposing
the real causes and some would rather hide for now, it seems.

I agree with all the points you've made. How do we square reality (users blame network operators) with the current approach to protocol design at the IETF (keep the network dumb)? I feel like creating a conversation about this apparent cognitive dissonance will be one of the most important outcomes of publishing this document.

I have no doubt the conflict will resolve itself somehow: CDNs, for instance, act as an intelligent overlay over dumb networks and can therefore provide the most consistent user experience when deeply deployed into carrier networks the structure of which they have intimate knowledge. Is the right solution to continue to effectively delegate this responsibility by encouraging breaking of connections at the edge, or should the IETF be trying to optimize the end-to-end performance of its protocols on the public internet?

Anyway, I digress. This isn't a conversation I'm proposing you have in this document; just that the doc should raise these kinds of questions in the reader.

   packet is able to provide stateless load balancing.  This ability
   confers great reliability and scaleability advantages even if the
   flow remains in a single POP, because the load balancing system is
   not required to keep state of each flow.  Even more importantly,
   there's no requirement to continuously synchronize such state among
   the pool of load balancers.

KR> An important point is that an integrated load balancer repurposing limited existing bits in transport flow state must maintain and synchronize per-flow state occasionally: using the sequence number as a cookie only works for so long given that there aren't that many bits available to divide across a pool of machines.

KM> I added in this point, but have to check back on flow of text.

I checked the wording in -14. I'm going to propose slightly different language:

q( This ability
   confers great reliability and scalability advantages even if the
   flow remains in a single POP, because the load balancing system is
   not required to keep state of each flow. There is value even when the
   repurposed bits are strictly insufficient for encoding all state: an integrated load balancer repurposing
   limited existing bits in transport flow state must still maintain and
   synchronize per-flow state occasionally (using the sequence number as
   a cookie only works for so long given that there aren't that many
   bits available to divide across a pool of machines), but there is no longer
   a requirement for such synchronization to be continuous or instantaneous. )

KR> A dedicated mechanism for storing load balancer state, such as QUIC's proposed connection ID, is strictly better from the load balancer's point of view, and is probably even better from a privacy perspective than bolting it on to an unrelated transport signal because it can be tightly controlled by one of the endpoints and rotated to avoid roving client linkability: in other words, being a specific, separate signal, it can be governed in a way that is finely targeted at that specific use-case. (I'm thinking the advantages of separate mechanisms belongs in a different part of the doc; this section is more like the problem statement than the solution statement.)

KM> This (above) needs to be reworded to be neutral and this does go towards solution space, which we were trying to avoid. How about:

Another possibility is a dedicated mechanism for storing load balancer state, such as QUIC's proposed connection ID to provide visibility to the load balancer.  An identifier could be used for tracking purposes, but this may provide an option that is an improvement from  bolting it on to an unrelated transport signal. This method allows for tight control by one of the endpoints and can be rotated to avoid roving client linkability: in other words, being a specific, separate signal, it can be governed in a way that is finely targeted at that specific use-case.

SGTM. Maybe s/improvement from bolting it on to/improvement compared to co-opting/.

   In future Network Function Virtualization (NFV) architectures, load
   balancing functions are likely to be more prevalent (deployed at
   locations throughout operators' networks)[.  NFV environments will
   require some type of identifier (IPv6 flow identifiers, the Proposed
    QUIC connection ID, etc.) for managing]
   traffic using encrypted tunnels.[  The shift to increased encryption
   will have an impact to visibility of flow information and will require
   adjustments to perform similar load balancing functions within an NFV.]

KR> I'm not sure what architecture this paragraph is discussing: are you talking about encrypted tunnels between NFV nodes? Is this something obvious to people involved in NFV? A diagram (or informational reference) would be helpful to me here.

KM> I see your point, the language here could be clearer. Do the above adjustments (ed: in []) help?

I am still a little confused. Is the idea that load balancing in NFV environments has a unique need for stateless (or reduced-state) load balancing that other applications don't have? I'm having a hard time wrapping my head around why that would be the case. Or is the point here just to highlight NFV as just another use case to consider?
 
2.2.2.  Differential Treatment based on Deep Packet Inspection (DPI)
   ...
   These effects and potential alternative solutions have been discussed
   at the accord BoF [ACCORD] at IETF95.

KR> This section is labeled DPI, but really, the underlying issue is what you stated in the first paragraph: different kinds of traffic have different QoS needs, yet a network provider can't rely on a voluntary signal from an untrusted device to decide on QoS or every packet is simply going to be marked "high importance" and so we're back to treating all traffic equivalently. I'd argue against one of the memes I heard at the accord BoF, that it's down to latency vs. throughput, by pointing out that some applications (e.g., live video with low hand-wave latency) need both.

Even after reading this, I'm still skeptical of the need for any more granularity than flow, and using AQM on a per-flow (e.g., 5-tuple) or flow-aggregate (some subset of the 5-tuple) to prevent an application or user from consuming resources unfairly. What, for instance, prevents a carrier from privileging VoIP traffic by looking at endpoints? Would there be a way for someone else to masquerade non-VoIP traffic as VoIP traffic given this kind of setup? This is the kind of question that I need answered by this doc.
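For clarity, the flow-aggregate keying I have in mind is just something like the following; the field names and the choice of subset are illustrative (a real AQM such as fq_codel does this sort of thing in the kernel):

```python
# Sketch: per-flow vs flow-aggregate queue keying. Keying on a subset of the
# 5-tuple lets the network limit a single user/application across many flows
# without inspecting payload.
def flow_key(pkt):
    # Full 5-tuple: one queue per flow.
    return (pkt["src"], pkt["dst"], pkt["proto"], pkt["sport"], pkt["dport"])

def aggregate_key(pkt):
    # Subset of the 5-tuple: one queue per (source host, protocol), so fairness
    # is enforced per aggregate rather than per connection.
    return (pkt["src"], pkt["proto"])
```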

BW> It might be useful to note in this section that QUIC and H2 both combine multiple micro-flows, possibly of different types, within a single encrypted transport-layer flow. They share this with IPsec tunnels and the like. IOW, the increased use of encrypted aggregating encapsulation can hide even the most basic representation of a flow from the differentiated service element. This same concern applies to load balancing elements discussed in section 2.2.1.

Good point: an example of this is sending both Youtube and search responses over the same QUIC or H2 connection, with no way for the network to throttle one without throttling the other.

AM> We were asked not to refer to QUIC, for various reasons (e.g., still under development).

There will always be areas where the network can make the best decision, because of the
information available to the network operators (and the lack of that same info at end-points).

When network resources are constrained, only the network can manage priorities.
This has been organized according to applications that can be identified, but there
can be other solutions requiring cooperation between user devices and the network
according to subscription to a special service (QCI above).

Got it. Is DPI the right framing for this, or is something more generic (e.g., "content-aware traffic management") what is really required? E.g., the network doesn't necessarily need to know which video you're watching, only that it is video, and maybe what the available bitrates are and associated quality.
 
   An application-type-aware network edge (middlebox) can further
   control pacing, limit simultaneous HD videos, or prioritize active
   videos against new videos, etc.

KR> Observation: This subsection provides the first really compelling argument I've seen for exposing flow metadata to the path. On long paths, physics gets in the way of tight control feedback loops. If nothing else, this should provide motivation for protocol designers and operators to break down the characteristics of different kinds of flows, determine where control points are needed in each of them, and figure out how to implement those.

I think there is this conceit among protocol designers that quality problems can all be solved at the endpoints without any cooperation from path elements; the really killer arguments are examples of where that cannot possibly be the case. ECN is a great example of this, and is a signal explicitly targeted at middleboxes with opt-in by the endpoints: it allows a middlebox to report congestion without dropping packets, which produces measurably better QoS for the user.

KM> Ack, thanks.  You're not looking for additional text here, is that right?  If so, what are you thinking should be added?

No, just an observation that this was one of the more thought-provoking sections of the doc for me.
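For anyone who wants the ECN example spelled out: here's a loose sketch of the marking decision (per RFC 3168, with the codepoint handling simplified; ECT(0)/ECT(1) are collapsed into a single value for illustration):

```python
# Loose sketch of the ECN interaction: endpoints opt in by setting an
# ECN-capable codepoint; a congested router marks CE instead of dropping,
# and the receiver echoes the mark back to the sender.
ECT = "ECT"        # ECN-Capable Transport (set by an opted-in endpoint)
CE = "CE"          # Congestion Experienced (set by a congested router)
NOT_ECT = "Not-ECT"

def router_forward(codepoint, congested):
    """Return (codepoint, dropped) after one hop."""
    if not congested:
        return codepoint, False
    if codepoint in (ECT, CE):
        return CE, False       # mark instead of drop: better QoS for the user
    return codepoint, True     # non-ECN traffic is dropped under congestion
```

The key property is exactly what I described above: a signal explicitly targeted at middleboxes, with opt-in by the endpoints.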
 
   Alternate approaches such as blind caches [I-D.thomson-http-bc] are
   being explored to allow caching of encrypted content; however, they
   still need to intercept the end-to-end transport connection.

KM> [s/need to intercept the end-to-end transport connection/require cooperation between the content owners/CDNs and blind caches and fall outside the scope of what is covered in this document/

SGTM.
 
2.2.6.  Content Compression

   In addition to caching, various applications exist to provide data
   compression in order to conserve the life of the user's mobile data
   plan and optimize delivery over the mobile link.  The compression
   proxy access can be built into a specific user level application,
   such as a browser, or it can be available to all applications using a
   system level application.  The primary method is for the mobile
   application to connect to a centralized server as a proxy, with the
   data channel between the client application and the server using
   compression to minimize bandwidth utilization.  The effectiveness of
   such systems depends on the server having access to unencrypted data
   flows.

KR> Observation: given the side channels exposed by data compression that is blind to content, the inability to compress arbitrary payloads is likely to be regarded as a feature of encryption. (Though I recognize this is a catalog, not an endorsement.) Furthermore, in most cases eliminating compression is still 2-competitive with compression, so I'm not sure it's a really compelling use-case.

BW> Per-object content compression might not be a compelling use case here. Aggregated data stream content compressions that spans objects and data sources is compelling, though. If there is a network element close to the receiver that sees all content destined for the receiver and can treat it all as part of a unified compression scheme (e.g., through the use of a shared segment store) will often be much more effective at providing data off-load.

KM> Thanks, we'll add this text (modified) to make those helpful points clear.

How about:
    Aggregated data stream content compression that spans objects and data sources that can be treated as part of a unified compression scheme (e.g., through the use of a shared segment store) is often effective at providing data offload when there is a network element close to the receiver that has access to see all the content.

Sounds good. This is general enough to cover the case of networks with limited uplinks wanting to cache content that is conceptually shared (e.g., VOD) but delivered independently to end users via individual TLS connections.
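A toy illustration of the shared-segment-store idea, in case it helps a reader: content is chunked, chunks are keyed by hash, and chunks already in the store near the receiver never cross the constrained link again. Chunk size and hashing scheme are arbitrary choices for the example, not from the draft.

```python
# Illustrative shared segment store: only previously-unseen chunks are added,
# so repeated content (even across objects and sources) is offloaded.
import hashlib

def offload(data, store, chunk=4):
    """Split data into chunks; return chunk refs, adding only unseen chunks."""
    refs = []
    for i in range(0, len(data), chunk):
        piece = data[i:i + chunk]
        key = hashlib.sha256(piece).hexdigest()
        if key not in store:
            store[key] = piece   # only new bytes traverse the limited uplink
        refs.append(key)
    return refs
```

Of course, the whole point of the section is that this only works if the network element can see the plaintext (or shares keys/segments with the content owner).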
 
KR> It might be worth discussing the typical opt-in strategy for these things in the presence of TLS, adding a new intercept CA to willing clients, which has the downside that it potentially exposes every https connection to an active MitM.

BW> +1

KM> OK, we hadn't done that before since the option doesn't change, but you make a good point, so I'll add in text.  Thanks.

I added the following:

    This method is also used by other types of network providers enabling
     traffic inspection, but not modification.</t>

             <t>Content filtering via a proxy can also utilize an intercepting
          certificate where the client's session is terminated at the proxy
          enabling for cleartext inspection of the traffic.  A new session
          is created from the intercepting device to the client's
          destination, this is an opt-in strategy for the client. Changes to
          TLSv1.3 do not impact this more invasive method of interception, where
          this has the potential to expose every HTTPS session to an active
          man in the middle (MitM). </t>

Mostly sounds good. Is there a reason to mention TLS 1.3 specifically here?
 
KR> Random comment: especially with respect to government content filtering, I'm worried that the IETF's current approach of playing chicken with regulators on end-to-end encryption is going to result in normalization of intercept CAs, which will be strictly worse than a compromise solution in which a subset of traffic can be inspected (but not modified) with the user's knowledge and consent (e.g., distinct optics in the browser). I wouldn't like either outcome, frankly, but it would be nice if we had a game plan for what to do for user privacy if intercept CAs become a requirement for using the web in large parts of the world (something we might be one "crisis" away from), and an honest evaluation of the alternatives. Fundamentally, I don't like it when discussion gets shut down because people want to bury their heads in the sand in the name of ideology.</rant>

BW> +1. I also note that this concern applies to some of the other performance related use cases too.

KM> I think the real argument here is a control one between the application and management folks and not security/privacy even though that's what is often discussed.  This is all about control.

Right, but the core issue being addressed by this document is that measures intended for reasons of privacy and security (encryption) are impacting something over which there is much less consensus (content-aware flow management and path intelligence). I'm not proposing any language here, only pontificating that the purity approach might backfire, and I'm not sure we have a backup plan.
 
   In addition, mobile network operator often sell tariffs that allow
   free-data access to certain sites, known as 'zero rating'.  A session
   to visit such a site incurs no additional cost or data usage to the
   user.  This feature is impacted if encryption hides the details of
   the content domain from the network.

KR> There's the related issue that zero-rating by-implementation typically applies only to direct connections to a particular endpoint (e.g., by IP): if a user accidentally tunnels traffic from Spotify through a corporate VPN, that traffic won't be zero-rated, encrypted tunnel or not. (This goes back to the taxonomy of metadata layers comment I made near the top.) Carriers aren't going to trust e.g., a Host header for zero-rating, because that provides a simple way to tunnel traffic for free: consequently, determination of zero-rating will always involve some hard-to-impersonate credential, like an IP address or server certificate in the public trust web.

KM> Not sure what to add here, any ideas, AL?

I think the only change I'd make here is to change "content domain" to "content origin", because domain implies hostname where the origin is often an IP.
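To illustrate the hard-to-impersonate point from my comment above, a carrier-side zero-rating check might look like the following; the prefix and names are hypothetical, and a real deployment would likely key on server certificates or SNI validation rather than a bare prefix list:

```python
# Illustrative: zero-rating keyed on destination IP (hard to impersonate)
# rather than a client-supplied Host header (trivially spoofable).
import ipaddress

# Hypothetical CDN range for the zero-rated service.
ZERO_RATED_PREFIXES = [ipaddress.ip_network("203.0.113.0/24")]

def is_zero_rated(dst_ip, host_header=None):
    # Deliberately ignore host_header: trusting it would let anyone tunnel
    # arbitrary traffic for free by forging the header.
    addr = ipaddress.ip_address(dst_ip)
    return any(addr in net for net in ZERO_RATED_PREFIXES)
```

This is also why the corporate-VPN case fails: the tunneled traffic's destination IP is the VPN concentrator, not the zero-rated origin.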
 
   When RTSP stream content is encrypted, the 5-tuple information within
   the payload is not visible to these ALG implementations, and
   therefore they cannot provision their associated middelboxes with
   that information.

KR> I would argue that this is a protocol design issue. This was originally a problem with firewalls and NATs, with content inspection as a hack to work around the protocol/network impedance mismatch. I'm not the only one who would argue the right solution today is to design protocols to not require linkage across connections by middleboxes that do basic filtering.

KM> I think we are in agreement here for solution direction, but the document specifically tries to avoid solutions.  This example has been raised in the IESG by Warren and the apps side hadn't considered his view of it previously.  It would be good for protocols to have these considerations in their designs, they were mostly thinking it didn't matter and were end-to-end.  But poor video streaming sessions are an issue.  Not sure we should add any text here???

This was just another random observation.
 
   Data center operators may also maintain packet recordings in order to
   be able to investigate attacks, breach of internal processes, etc.
   In some industries, organizations may be legally required to maintain
   such information for compliance purposes.  Investigations of this

KR> I think you'll get a "[citation needed]" from folks on the TLS mailing list.

KM> I suspect this is one where, once you have that recorded text, you have to maintain it for chain of custody with investigation handling.  I'll have to figure out if there is anything that would require the capture; I suspect not, but could be wrong.

Just making the point that this has been contended several times on various mailing lists and at meetings, so it would be nice to get the oft-cited cases documented somewhere as an informational reference.

   There are use cases where DAR or disk-level
   encryption is required.  Examples include preventing exposure of data
   if physical disks are stolen or lost.

KR> I don't see these last two sentences are relevant, as they have nothing to do with the network flows.

KM> I'm happy to remove.  Do they help a reader who is not familiar with the technology to understand the layers of encryption used at all, or is it better to remove the sentence?

I'm not sure. I'm actually a bit confused about this section in general. It seems to be discussing monitoring of data during transport to/from the storage cluster in the same paragraph as encryption of data at rest, but I'm not sure what point it's trying to make. Is it that operators have a threat model that doesn't include the network connection between the storage cluster and the client, but which does include exfiltration of the disks in the cluster?
 
   Security monitoring in the enterprise may also be performed at the
   endpoint with numerous current solutions that mitigate the same
   problems as some of the above mentioned solutions.  Since the
   software agents operate on the device, they are able to monitor
   traffic before it is encrypted, monitor for behavior changes, and
   lock down devices to use only the expected set of applications.
   Session encryption does not affect these solutions.  Some might argue
   that scaling is an issue in the enterprise, but some large
   enterprises have used these tools effectively.

KR> This is another example of mixing proposed solutions in among the problem statement. I would argue for a clear separation, which may mean that this document needs to have a single-minded focus on "here are the problems and here's how enterprises currently address them."

BW> Also, enterprises increasingly allow BYOD programs for their employees, and such programs make it more difficult to ensure that adequate endpoint-based defenses are active. This is especially true when the area of risk in question is the above #5 "track misuse and abuse by employees". Note too that endpoint-based defenses can be less effective when the device is already compromised, in which case detection of the compromised device and effective remediation can be made more effective through the additional use of an on-path element.

KM> [made some subsequent edits to this section]

LGTM.

5.7.  Further work

   Although incident response work will continue, new methods to prevent
   system compromise through security automation and continuous
   monitoring [SACM] may provide alternate approaches where system
   security is maintained as a preventative measure.

KR> Not clear how the unknowns relate to the purpose of this document. Being sarcastic for a minute, I'm interpreting this as "Any cleartext metadata just *might* be used in the future for some kind of enterprise security monitoring!"

KM> Hmm, it's meant to say endpoints (which you control) should be used and technology like what is expected out of SACM will help with automating this.  We are open to text suggestions.

Yeah, I think I must have misread it the first time, because I get only your meaning now.

Kyle
