>> True. But keep also in mind the scope of IOAM, which is not to be
>> deployed widely on the Internet. It is deployed on limited (aka private)
>> domains where each node is therefore managed by the operator. So I'm
>> not really sure why you think that the implementation-specific thing is
>> a problem here. Context of "unit" is provided by the IOAM Namespace-ID
>> attached to the trace, as well as each Node-ID if included. Again, it's
>> up to the operator to interpret values accordingly, depending on each
>> node (i.e., the operator has a large and detailed view of his domain; he
>> knows if the buffer occupancy value "X" is abnormal or not for a
>> specific node, he knows which unit is used for a specific node, etc).
>
> It's quite likely I'm missing the point.

Let me try again to make it all clear in your mind. Here are some quoted
paragraphs from the spec:

"Generic data: Format-free information where syntax and semantic of the
information is defined by the operator in a specific deployment. For a
specific IOAM-Namespace, all IOAM nodes have to interpret the generic
data the same way. Examples for generic IOAM data include geo-location
information (location of the node at the time the packet was processed),
buffer queue fill level or cache fill level at the time the packet was
processed, or even a battery charge level."

This one basically says that the IOAM Namespace-ID (in the IOAM Trace
Option-Type header) is responsible for providing context to data fields
(i.e., for "units" too, in the case of generic fields such as queue depth
or buffer occupancy). So it's up to the operator to gather similar nodes
within the same IOAM Namespace. And, even if "different" kinds of nodes
are within an IOAM Namespace, you still have a fallback solution if Node
IDs are part of the trace (the "hop-lim & node-id" data field, bit 0 in
the trace type). Indeed, the operator (or the collector/interpreter)
knows if node A uses "bytes" or any other unit for a generic data field.
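To make the argument concrete, here is a minimal, purely illustrative
sketch of what a collector could do on its side: it keeps an
operator-maintained mapping keyed on (Namespace-ID, Node-ID) and
normalizes the raw buffer-occupancy values to a common unit. All names,
IDs, and unit mappings below are hypothetical, not part of the patch or
the spec:

```python
# Hypothetical collector-side sketch (illustrative names and values):
# the operator knows which unit each node in a given IOAM Namespace
# reports buffer occupancy in, and normalizes everything to bytes.

# Operator-maintained mapping: (namespace_id, node_id) -> unit
UNIT_MAP = {
    (0x8000, 1): "bytes",      # node 1 reports raw bytes
    (0x8000, 2): "cells_208",  # node 2 reports 208-byte buffer cells
}

# Scale factor to convert one reported unit to bytes
SCALE = {"bytes": 1, "cells_208": 208}

def occupancy_bytes(namespace_id, node_id, raw_value):
    """Convert a raw buffer-occupancy value to bytes."""
    unit = UNIT_MAP[(namespace_id, node_id)]
    return raw_value * SCALE[unit]

# 100 cells on node 2 -> 20800 bytes
print(occupancy_bytes(0x8000, 2, 100))
```

This matches the spec's assumption that systems which further process
IOAM information (e.g., a network management system) handle unit
conversions themselves.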
"It should be noted that the semantics of some of the node data fields
that are defined below, such as the queue depth and buffer occupancy,
are implementation specific. This approach is intended to allow IOAM
nodes with various different architectures."

The last sentence is important here and is, in fact, related to what you
describe below. Having genericity on units for such data fields allows
for supporting multiple architectures. Same idea for the following
paragraph:

"Data fields and associated data types for each of the IOAM-Data-Fields
are specified in the following sections. The definition of
IOAM-Data-Fields focuses on the syntax of the data-fields and avoids
specifying the semantics where feasible. This is why no units are
defined for data-fields like e.g., "buffer occupancy" or "queue depth".
With this approach, nodes can supply the information in their native
format and are not required to perform unit or format conversions.
Systems that further process IOAM information, like e.g., a network
management system are assumed to also handle unit conversions as part of
their IOAM data-fields processing. The combination of a particular
data-field and the namespace-id provides for the context to interpret
the provided data appropriately."

Does it make more sense now why it's not really a problem to have
implementation-specific units for such data fields?

>> [...]
>>
>> Do you believe this patch does not provide what is defined in the spec?
>> If so, I'm open to any suggestions.
>
> The opposite, in a sense. I think the patch does implement behavior
> within a reasonable interpretation of the standard. But the feature
> itself seems more useful for forwarding ASICs than Linux routers,

Good point. Actually, it's probably why such a data field was defined as
it is.

> because Linux routers can run a full telemetry stack and all sort
> of advanced SW instrumentation.
> The use case for reporting kernel memory use via IOAM's constrained
> interface does not seem particularly practical since it's not
> providing a very strong signal on what's going on.

I agree and disagree. I disagree because this value definitely tells you
that something (potentially bad) is going on when it increases
significantly enough to reach a critical threshold. Basically, we need
more skb's, but oh, the pool is exhausted. OK, not a problem, expand the
pool. Oh wait, no memory left. Why? Is it only due to too much
(temporary?) load? Should I put the blame on the NIC? Is it a memory
issue? Is it something else? Or maybe several issues combined? Well, you
might not know exactly why (though you know there is a problem), which
is also why I agree with you. But this is also why you have other data
fields available (i.e., detecting a problem might require 2+ symptoms
instead of just one).

> For switches running Linux the switch ASIC buffer occupancy can be read
> via devlink-sb that'd seem like a better fit for me, but unfortunately
> the devlink calls can sleep so we can't read such device info from the
> datapath.

Indeed, it would be a better fit. I didn't know about this one, thanks
for that. It's a shame it can't be used in this context, though. But, at
the end of the day, we're left with nothing regarding buffer occupancy.
So I'm wondering if "something" is not better than "nothing" in this
case. And, for that, we're back to my previous answer on why I agree and
disagree with what you said about its utility.
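For the archives, reading the ASIC shared-buffer occupancy from
userspace via devlink-sb looks roughly like this (subcommands per the
devlink-sb man page; the device handle and port index are placeholders
for whatever your switch exposes). This is hardware-dependent and only a
sketch of the workflow you describe:

```shell
# Placeholder device handle; replace with your switch, e.g. from
# "devlink dev show".
DEV=pci/0000:03:00.0

# Trigger a snapshot of the current shared-buffer occupancy on the
# device (occupancy is read from a snapshot, not live).
devlink sb occupancy snapshot "$DEV"

# Dump the snapshotted per-pool occupancy for port index 1.
devlink sb occupancy show "$DEV/1"
```

Agreed that this path can sleep, which is exactly why it can't be called
from the datapath where the IOAM trace is filled in.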