Re: [RFC PATCH] cxl: avoid duplicating report from MCE & device

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, 21 Jun 2024 10:59:46 -0700
Dan Williams <dan.j.williams@xxxxxxxxx> wrote:

> Jonathan Cameron wrote:
> > On Wed, 19 Jun 2024 00:53:10 +0800
> > Shiyang Ruan <ruansy.fnst@xxxxxxxxxxx> wrote:
> >   
> > > Background:
> > > Since CXL device is a memory device, while CPU consumes a poison page of 
> > > CXL device, it always triggers a MCE by interrupt (INT18), no matter 
> > > which-First path is configured.  This is the first report.  Then 
> > > currently, in FW-First path, the poison event is transferred according 
> > > to the following process: CXL device -> firmware -> OS:ACPI->APEI->GHES   
> > >  -> CPER -> trace report.  This is the second one.  These two reports    
> > > are indicating the same poisoning page, which is the so-called "duplicate
> > > report"[1].  And the memory_failure() handling I'm trying to add in
> > > OS-First path could also be another duplicate report.
> > > 
> > > Hope the flow below could make it easier to understand:
> > > CPU accesses bad memory on CXL device, then  
> > >  -> MCE (INT18), *always* report (1)
> > >  -> * FW-First (implemented now)
> > >       -> CXL device -> FW
> > > 	      -> OS:ACPI->APEI->GHES->CPER -> trace report (2.a)    
> > >     * OS-First (not implemented yet, I'm working on it)  
> > >       -> CXL device -> MSI
> > > 	      -> OS:CXL driver -> memory_failure() (2.b)    
> > > so, the (1) and (2.a/b) are duplicated.
> > > 
> > > (I didn't get response in my reply for [1] while I have to make patch to
> > > solve this problem, so please correct me if my understanding is wrong.)
> > > 
> > > This patch adds a new notifier_block and MCE_PRIO_CXL, for CXL memdev
> > > to check whether the current poison page has been reported (if yes,
> > > stop the notifier chain, won't call the following memory_failure()
> > > to report), into `x86_mce_decoder_chain`.  In this way, if the poison
> > > page already handled(recorded and reported) in (1) or (2), the other one
> > > won't duplicate the report.  The record could be clear when
> > > cxl_clear_poison() is called.
> > > 
> > > [1] https://lore.kernel.org/linux-cxl/664d948fb86f0_e8be294f8@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.notmuch/
> > > 
> > > Signed-off-by: Shiyang Ruan <ruansy.fnst@xxxxxxxxxxx>  
> > 
> > So poison can be cleared in a number of ways and a CXL poison clear command
> > is unfortunately only one of them.  Some architectures have instructions
> > that guarantee to write a whole cacheline and can clear things as well.
> > I believe x86 does for starters.  
> 
> Yes, movdir64b.

Equivalent arm64 instruction is not valid to normal memory. Lets say
no more on that :(

So who actually cares about recovering poisoned volatile memory?
I'd like to understand more on how significant a use case this is.
Whilst I can conjecture that its an extreme case of wanting to avoid
loosing the ability to create 1GiB or larger pages due to poison
is that a real problem for anyone today?  Note this is just the case
where you've reached an actual uncorrectable error and probably
/ possibly killed something, not the more common soft offlining
of memory due to correctable errors being detected.

> 
> > +CC linux-edac and related maintainers / reviewers.
> >     linux-mm and hwpoison maintainer.
> > 
> > So I think this needs a more general solution that encompasses 
> > more general cleanup of poison.  
> 
> I think unless the device has "List Poison" coverage for volatile ranges
> that the kernel should not worry about tracking this itself.

Maybe.  I think you can still get a media event for this as well
as synchronous poison so there may be a path to a double report, just
a more timely one hopefully.

> 
> Perhaps what is needed is that after successful memory_failure()
> handling when the page is known to be offline the device backing the
> memory can be notified that it is safe to repair the page and but it
> back into service, but I expect that would be comparison of the device's
> own poison tracking relative to the notification of successful page
> offline.

That would work. Elide the error handling if the page is known to
be offline due to poison.  Might be racey though but does it
really hurt if we occasionally report twice?

> > > +	rec = kmalloc(sizeof(struct cxl_mce_record), GFP_KERNEL);
> > > +	rec->hpa = hpa;
> > > +	list_add(&cxl_mce_records, &rec->node);
> > > +
> > > +	mutex_unlock(&cxl_mce_mutex);
> > > +
> > > +	return false;
> > > +}
> > > +
> > > +void cxl_mce_clear(u64 hpa)
> > > +{
> > > +	struct cxl_mce_record *cur, *next;
> > > +	int rc;
> > > +
> > > +	rc = mutex_lock_interruptible(&cxl_mce_mutex);  
> > 
> > Maybe cond_guard().  
> 
> cond_guard() was rejected, you meant scoped_cond_guard()? But, then I
> think _interruptible is not appropriate here.

Ah yes.  Indeed scoped_cond_guard() but fair enough on the
interruptible point!


> 





[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux