Re: [PATCH net-next 3/6] net: dsa: add support for retrieving the interface mode

Vladimir Oltean <olteanv@xxxxxxxxx> · Sun, 24 Jul 2022 20:39:16 +0300

On Sat, Jul 23, 2022 at 07:26:55PM +0200, Marek Behún wrote:
> Does Lynx PCS support 1000base-x with AN?

Yes, that would be the intention.

> Because if so, it may be possible to somehow hack working AN for
> 2500base-x, as I managed it for 88E6393X in the commit I mentioned (by
> configuring 1000base-x and then hacking the PHY speed to 2.5x).

I would need to try and see. For Lynx, to dynamically change from
1000base-x to 2500base-x essentially means to move the SERDES lane from
a PLL that can provide the 1.25 GHz required for 1000base-x to a PLL
that can provide the 3.125 GHz required for 2500base-x. The procedure
itself doesn't involve resetting the PCS, but to be honest with you,
I don't know whether the state of the PCS registers is going to be
preserved across the PLL change. Maybe it isn't, but this is entirely
masked out by the phylink major reconfig process, I don't know.
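
As a rough sketch of where those two numbers come from, assuming they
correspond to the serial line rate after 8b/10b coding (the function name
below is made up for illustration, it is not a kernel or Lynx API):

```python
from fractions import Fraction

def line_rate_mbaud(data_rate_mbps, payload_bits, coded_bits):
    """Serial line rate in MBaud for a MAC data rate and a block code.

    Hypothetical helper: payload_bits/coded_bits describe the code,
    e.g. 8 and 10 for 8b/10b, 64 and 66 for 64b/66b.
    """
    return Fraction(data_rate_mbps) * coded_bits / payload_bits

# 8b/10b puts every 8 payload bits on the wire as 10 bits.
rate_1000base_x = line_rate_mbaud(1000, 8, 10)   # 1.25 GBaud
rate_2500base_x = line_rate_mbaud(2500, 8, 10)   # 3.125 GBaud
```

The two rates are not integer multiples of each other, which is consistent
with the lane having to move to a different PLL rather than just changing
a divider.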

The alternative to dynamic reconfiguration is to program some bits that
instruct the SoC what to do on power-on reset, and these bits include
the initial SERDES protocols and PLL assignments too. I only tried to
experiment with in-band autoneg in this mode (with the lane being
configured for 2.5G out of reset, rather than dynamically switching it
to 2.5G).

> Anyway, I am now looking at the standards, and it seems that all the X
> and R have K variant: 1000base-kx, 2500base-kx, 5gbase-kr and
> 10gbase-kr. These modes have mandatory clause 73 autonegotiation.

The X in BASE-X stands for 8b/10b coding, the R stands for 64b/66b coding.
Whereas the K stands for bacKplane, i.e. the medium (compare this with
the T in BASE-T, for twisted pair copper cable). Or with 1000BASE-SX and
1000BASE-LX, the S stands for Short wavelength laser and the L for Long
wavelength.

What I'm trying to say, the 'X' in BASE-X doesn't stand for anything
having to do with fiber, I guess 1000BASE-X is just a generic name for
the coding scheme (PCS level) rather than something about the medium
(PMD level). The terminology is pretty much a mess.
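
To make the letter soup concrete, here is a minimal sketch that decodes
only the letters mentioned above. The tables and the function name are
assumptions for illustration, and real IEEE names have exceptions this
does not handle (100BASE-TX, for one, does not use 8b/10b):

```python
# Only the letters discussed in this thread; not a complete IEEE mapping.
CODING = {"X": "8b/10b", "R": "64b/66b"}
MEDIUM = {
    "K": "backplane",
    "T": "twisted pair",
    "S": "short wavelength",
    "L": "long wavelength",
}

def decode_suffix(name):
    """Return (medium, coding) for a name such as '10GBASE-KR'.

    Either element is None when the name carries no letter for it.
    """
    suffix = name.upper().split("BASE-", 1)[1]
    medium = None
    coding = None
    for letter in suffix:
        if letter in MEDIUM:
            medium = MEDIUM[letter]
        elif letter in CODING:
            coding = CODING[letter]
    return medium, coding
```

With this, "1000BASE-X" decodes to a coding with no medium at all, which is
the point: it names the PCS, not the PMD.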

> So either we need to add these as different modes of the
> phy_interface_t type, or we need to differentiate whether clause 37 or
> clause 73 AN should be used by another property.
> 
> But since 1000base-x supports clause 37 and 1000base-kx clause 73, the
> one property that we have, managed="in-band-status" is not enough, if
> we keep calling both modes '1000base-x'.
> 
> So maybe we really need to add K variants as separate
> PHY_INTERFACE_MODE_ constants. That way we can keep assuming clause 37
> for 2500base-x, and try to implement it for as many drivers as
> possible, by hacking it up...

Well, for good or bad, 10GBase-KR does have its own phy-mode string,
and Sean Anderson is sending a patch to add 1000base-KX now too.
https://patchwork.kernel.org/project/netdevbpf/patch/20220719235002.1944800-3-sean.anderson@xxxxxxxx/
(I still don't understand what that has to do with the topic of his
series, but anyway)

More at the end.

> 
> And I still don't understand this clause 73 AN at all. For example, if
> one PHY supports only up to 2.5g speeds, will it complete AN with
> another PHY that supports up to 10g speeds, if the second PHY will
> (maybe?) try at higher frequency?

Define what you mean by "one PHY supports only up to 2.5G speeds".
My copy of IEEE 802.3-2018 doesn't list in Table 73–4—Technology Ability
Field encoding any signaling mode that is capable of 2.5G, but rather
1000BASE-KX, 10GBASE-KR, 25GBASE-KR and so on. So you'd have to express
your question in terms of bits that are actually advertised through the
Technology Ability field.

Then, clause 73 AN, very much like the clause 28/40 AN of BASE-T (to
which it is most directly comparable) has a priority resolution function,
meaning that if 2 link partners advertise support for multiple
technologies, Table 73–5—Priority Resolution will decide which one of
the commonly advertised technologies gets used.
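
A minimal sketch of that resolution, assuming only three technologies and
the relative order they have in the original clause 73 priority table
(the full Table 73-5 has many more entries; the names below are plain
strings chosen for illustration, not kernel identifiers):

```python
# Highest priority first. Subset only; see Table 73-5 for the real list.
PRIORITY = ["10GBASE-KR", "10GBASE-KX4", "1000BASE-KX"]

def resolve(local_adv, partner_adv):
    """Pick the highest-priority technology advertised by both ends.

    Returns None when the two Technology Ability sets have nothing
    in common.
    """
    common = set(local_adv) & set(partner_adv)
    for tech in PRIORITY:
        if tech in common:
            return tech
    return None
```

Both ends run the same function on the same two sets, so they reach the
same answer without any further exchange.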

Side note: contrast this with flow control, which annoyingly was
designed by IEEE to not have a priority resolution, in other words you
don't get a graceful falloff of the resolved pause modes depending on
what you and the link partner advertised, instead you need to
preconfigure both ends if you want to achieve a particular result;
this is IMO as useless as not having AN at all.

There is of course no guarantee that two backplane link partners will
have any technology ability in common, for example one may advertise
only 1000Base-KX and the other only 10GBase-KR. In that case, autoneg
will complete, but the link will simply not come up.

The clause 73 autoneg signaling takes place using a predetermined, low-speed
encoding. The medium transitions to the highest negotiated technology,
and performs clause 72 link training on that medium, only after both
ends agree that clause 73 autoneg has completed. This kind of implies
that they will agree on the frequency being used for the data traffic.

If you're asking whether 2 backplane devices will advertise 10GBase-KR
but one of them supports a data rate of only up to 2.5Gbps over that 10G
link, I think this is vendor-dependent and IEEE doesn't say anything
about it. For example this is where rate adaptation could come into
play, either through flow control, or there could be an extension to
clause 73 similar to what Cisco did with USXGMII, where the lane
operates at 10GBaud but via symbol replication your data rate can
actually be only 2.5Gbps. I'm not aware of real life applications of
rate adaptation over backplane links.
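
To illustrate the symbol replication idea only (this is a toy model with
made-up function names, not the actual USXGMII encoding): the lane keeps
running at its full rate, and each data symbol is simply repeated.

```python
def replicate(symbols, factor):
    """Repeat each symbol `factor` times, keeping the lane rate constant."""
    return [s for s in symbols for _ in range(factor)]

def effective_rate_mbps(lane_rate_mbps, factor):
    """Data rate left over once every symbol is sent `factor` times."""
    return lane_rate_mbps / factor

# A 10G lane with 4x replication carries 2.5G worth of data.
rate = effective_rate_mbps(10000, 4)
```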

I hinted earlier that clause 73 autoneg is most directly comparable to
BASE-T autoneg (if you look at the IEEE OSI stack pictures, these 2 even
sit at a different layer than the one where clause 37 AN is).
The problem is that the Linux kernel support for new physical technologies
grew organically, and we don't have a structure in place that scales
naturally to all the places in which these technologies may appear in
the stack. For example we have the phy-mode, and this represents the

...

/goes searching for the documentation, I don't want to be making this up/

...

  phy-connection-type:
    description:
      Specifies interface type between the Ethernet device and a physical
      layer (PHY) device.

There you go, pretty vague. What's the Ethernet device, and what's the
PHY device?

For example SGMII connects a MAC to a PHY, but to speak SGMII to reach
to your PHY, you need another PHY that does the parallel GMII to serial
translation for you. So to say that the phy-mode is SGMII, you need to
ignore that the MAC has a PHY too.

10GBase-KR is similar in a way, it can be placed at multiple layers, and
traditionally, where you put it makes a difference to how we describe it
in Linux.

Maybe you have a 10GBase-T PHY chip with a backplane host-side PHY: it
supports clause 73 AN advertising the 10GBase-KR technology, then it supports
clause 72 link training, the whole shebang. These things exist. How would
you describe this? You'd say the phy-mode is "10gbase-kr", according to
precedent. Would that be the best thing to do, in the spirit of clause 73?
I don't think it would. As a consequence of this description, your PCS
would essentially populate its Technology Ability field with a single bit,
corresponding to what you put in phy-mode, because that's how we
shoehorned this. Then we'd
say what, that managed = "in-band-status" decides whether to bypass
clause 73 AN or not? I don't think so.

Truth is, a 10G-KR "PCS" (what we mean when we say a PHY integrated into a MAC)
is much more similar to a dedicated 10G-KR PHY, to the point that it's
indistinguishable (what Linux thinks of a phy_device is actually 2 PHYs
back to back, one for the host side and one for the medium side), and it
*needs* to be treated by Linux in the same way regardless of where it's
placed. You *need* to be able to control the backplane PCS's advertisement
and whether to use FEC or not, regardless of whether it's your medium-facing
device or an in-between device.

The discussion is much, much bigger than this, but in summary, I think
it would be quite short-sighted to expand managed = "in-band-status" for
anything related to clause 73, or for much more than what it means right
now (the problem is, what _does_ it mean and what _doesn't_ it?).

This, plus I think development needs to be driven by someone with real
world needs and a sense for what's practical. I am quite well outside of
the sphere of 10-gig-and-higher networking, I'm just looking from the
peanut gallery, so that won't be me.