Re: [RFC PATCH 0/6] CCIX Protocol Error reporting

Jonathan Cameron <jonathan.cameron@xxxxxxxxxx> · Wed, 3 Jul 2019 14:08:49 +0100

On Wed, 3 Jul 2019 10:28:08 +0100
James Morse <james.morse@xxxxxxx> wrote:

> Hi Jonathan,
> 
> (Beware: this CCIX thing is new to me, some of this will be questions on the scope of CCIX)

Sure and welcome to our crazy world.

> 
> On 06/06/2019 13:36, Jonathan Cameron wrote:
> > UEFI 2.8 defines a new CPER record Appendix N for CCIX Protocol Error Records
> > (PER). www.uefi.org  
> 
> > These include Protocol Error Record logs which are defined in the
> > CCIX 1.0 Base Specification www.ccixconsortium.com.  
> 
> I can't find a public version of this spec, and I don't have an account to look at the
> private one. (It looks like I could get one but I'm guessing there is an NDA between me
> and the document text!)
> 
> Will it ever be public?

+cc A few people who might be able to help.

> 
> 
> > Handling of coherency protocol errors is complex and how Linux does this
> > will take some time to evolve.  For now, fatal errors are handled via the
> > usual means and everything else is reported.  
> 
> > There are 6 types of error defined, covering:
> > * Memory errors  
> 
> How does this interact with the vanilla Memory-Error CPER records? Do we always get them
> as a pair? (that struct definition looks familiar...)

The snag is that the standard memory error doesn't provide quite enough information to
actually identify a component once you have buses involved.
The actual memory specific bit is nearly identical, but the header section isn't
as it gives the CCIX agent ID (see below given you asked a more specific question
on that).

We did discuss this in a lot of detail and currently the answer is that the
UEFI spec + ccix allows you to report only the ccix error if you want to.

Note that I believe there is an OSC bit that I'm not currently setting so
right now no firmware compliant with the CCIX SW Guide (information available
from your local Charles) should actually be generating these anyway because
the OS hasn't said it can handle them.  I took the view I wanted to get
round 2 which actually hooks up handling rather than just reporting before
setting that bit.

> Is memory described like this never present in the UEFI memory map? (so we never
> accidentally use it as host memory, for processes etc)
Yes it definitely can.  If you want to you can have all your RAM attached by CCIX.
(probably not a good idea performance wise :)

More likely in the short term is SCM using something like the hmem patches
to hotplug it as if it were RAM.
https://patchwork.kernel.org/cover/10982677/

In order to support legacy OS, it wouldn't be unusual to have a bios switch
to make SCM on CCIX just appear as normal memory in the first place
(if that was true though it is highly unlikely you'd use the CCIX
PER messages.)

> 
> ~
> 
> include/linux/memremap.h has:
> |  * Device memory that is cache coherent from device and CPU point of view. This
> |  * is use on platform that have an advance system bus (like CAPI or CCIX). A
> |  * driver can hotplug the device memory using ZONE_DEVICE and with that memory
> |  * type. Any page of a process can be migrated to such memory. However no one
> |  * should be allow to pin such memory so that it can always be evicted.
> 
> Is this the same CCIX?

Not always, but sometimes.  Typically if it's on an accelerator (GPU etc) then
one software model is to do that.  There are others though. Until we have
more devices out there, its not even clear that model with dominate.

It's a much argued about topic. My personal thoughts are you either have:
1. An accelerator that was designed from the ground up assuming that it had
   tight control of its own memory layout etc - someone bolted on coherency
   later.  This is the current GPU model and fits with ZONE_DEVICE etc. GPUs
   can and do require shuffling memory.

2. An accelerator designed to run without it's own memory.  We just decided
   to move some of the host memory to it's end of the bus for size or
   efficiency reasons.  As it's reasonably happy with fragmentation and
   the assumption that someone else is doing it's memory management for it
   (the OS) you can run in an SVM type mode, with a bit of hinting by
   a userspace app on where it would prefer the accelerator accessed memory
   comes from.  Those have no more constraints than any other memory.  If
   you want to pin for RDMA for example, its fine.

More importantly, it can just be normal DDR on a card with no extra functions.
In those cases it's just a memory only NUMA node.

> 
> If process memory can be migrated onto this stuff we need to kick off memory_failure() in
> response to memory errors, otherwise we don't mark the pages as poison, signal user-space etc.

Agreed.  That was in the follow up patch set (which I haven't written yet)
that actually does some error handling.  I need to do some work on the
emulation side to actually be able to generate sensible errors of that type.

I'll stick some mention of that in the v2 cover letter.
> 
> 
> > * Cache errors
> > * Address translation unit errors  
> 
> > * CCIX port errors 
> > * CCIX link errors  
> 
> It looks like this stuff operates 'over' PCIe[0]. How does this interact with AER?
> Will we always get a vanilla PCIE-AER CPER in addition to these two?

Nope. You'll get an AER for PCIe protocol, transport and data link layers and
PER error for CCIX protocol layer.  We'll ignore the horrible abuse of
Internal Error in AER that people do (pointing no fingers but the spec makes
it clear what this is for and it's not what most people use it for).

So depends on what went wrong.  Some types of error (ripping out a board perhaps)
might cause a cascade but in general they are independent systems.  For extra
fun they aren't even routed over the same physical links in some topologies.
CCIX per is CCIX ID routed (and we have meshes), AER has to follow the PCIe
tree.  In theory the CCIX Error Agent doesn't actually to be in the host,
and it certainly doesn't have to be via the same root port.

> 
> (I see 'CHECK THE AER EQUIVALENT' comments in your series, are these TODO:?)
Oops.  That was a formatting question that I'd forgotten about.

Both PCIe AER and CCIX PER have error cases in which part of the TLP
(at different layers) is captured in the error record.  I just wanted
to sanity check if I'd used similar formatting.

In some CCIX errors you get 32 bytes of the message that was in flight
when the problem occurred.

I'll actually check the formatting and scrub the comment or adjust it for v2.
Sorry about that - these were kicking around internally for far too long
so they have gone in out of my head several times.

>
> 
> > * Agent internal errors.
> > 
> > The set includes tracepoints to report the errors to RAS Daemon and a patch
> > set for RAS Daemon will follow shortly.
> > 
> > There are several open questions for this RFC.
> > 1. Reporting of vendor data.  We have little choice but to do this via a
> >    dynamic array as these blocks can take arbitrary size. I had hoped
> >    no one would actually use these given the odd mismatch between a
> >    standard error structure and non standard element, but there are
> >    already designs out there that do use it.  
> 
> I think its okay to spit these blobs out of the trace points, but could we avoid printing
> them in the kernel log?

Seems like a reasonable compromise.  Leave them in the tp_printk
and if anyone really wants to fill their kernel logs then they can.

> 
> 
> > 2. The trade off between explicit tracepoint fields, on which we might
> >    want to filter, and the simplicity of a blob. I have gone for having
> >    the whole of the block specific to the PER error type in an opaque blob.
> >    Perhaps this is not the right balance?  
> 
> (I suspect I don't understand this).
> 
> The filtering can be done by user-space. Isn't that enough?

It can, but then you have already logged them.  As I understand it you can
apply the filters at the logging stage. Will take another look at this.
Still I'm happy with the current balance, and there it'll be done in
userspace anyway.

> 
> I see the memory-error event format file has 'pa', which is all anyone is likely to care
> about.
> Do 'source/component' indicate the device? (what do I pull out of the machine to stop
> these happening?)

The answer to that is a qualified yes.  If you know the source ID and
the configuration then it gets you to the actual device.
For the various 'agents' Request Agent, Home Agent and Slave Agent,
the sourceID is the globally unique (per type of agent) ID assigned during
the topology configuration bit of firmware enumeration (it ends up as a field
you can read from PCI config space).  For other things sourceID is the device ID
which is assigned at an earlier phase of enumeration.  You may then need the
portID to figure out exactly where it's broken (those are fixed for a given
device).

Ultimately you use these to find out what the BDF is and hopefully figure out
which board to pull and throw out of the window.

> 
> 
> > 3. Whether defining 6 new tracepoints is sensible. I think it is:
> >    * They are all defined by the CCIX specification as independant error
> >      classes.
> >    * Many of them can only be generated by particular types of agent.
> >    * The handling required will vary widely depending on types.
> >      In the kernel some map cleanly onto existing handling. Keeping the
> >      whole flow separate will aide this. They vary by a similar amount
> >      in scope to the RAS errors found on an existing system which have
> >      independent tracepoints.
> >    * Separating them out allows for filtering on the tracepoints by
> >      elements that are not shared between them.
> >    * Muxing the lot into one record type can lead to ugly code both in
> >      kernel and in userspace.  
> 
> Sold!
> 
> 
> > This patch is being distributed by the CCIX Consortium, Inc. (CCIX) to
> > you and other parties that are paticipating (the "participants") in the
> > Linux kernel with the understanding that the participants will use CCIX's
> > name and trademark only when this patch is used in association with the
> > Linux kernel and associated user space.
> > 
> > CCIX is also distributing this patch to these participants with the
> > understanding that if any portion of the CCIX specification will be
> > used or referenced in the Linux kernel, the participants will not modify
> > the cited portion of the CCIX specification and will give CCIX propery
> > copyright attribution by including the following copyright notice with
> > the cited part of the CCIX specification:
> > "© 2019 CCIX CONSORTIUM, INC. ALL RIGHTS RESERVED."  
> 
> Is this text permission from the CCIX-Consortium to reproduce bits of their spec in the
> kernel?
<standard I'm not a lawyer disclaimer>

Hmm. The answer to that is 'sort of'.  It technically says anything in this
patch is fine if it used that statement.  I believe it is kind of a catch all
that just says quote under fair use terms, but as normal give clear copyright
acknowledgment.  Some of the answers that others are hopefully going to give
to earlier questions should clarify why this matters.

The main point of the whole statement being there at all is to give an explicit
trademark grant to the kernel so that we can use CCIX(tm) without worrying about it.
If you want to get scared, google the PCI-SIG(tm) trademark policy and think where
we might be pushing the limits.

We asked the CCIX legal advisors to come up with a clear statement
that meant open source projects could get in trouble for sensible trademark use.
This grant is apparently the best way to do it.

> 
> I'm guessing any one who can hold of the spec is also prevented from publishing it.
> 
> (Nits: participating, proper)
Yup. I copy typed this from my windows laptop to the linux machine because
I was too lazy to navigate Huawei security zones and did a really bad job.
Sorry Frank! (hope Frank never see this)

> 
> 
> Thanks,
> 
> James
> 
> [0] https://en.wikichip.org/wiki/ccix