Re: [PATCH v7 10/17] cxl/pci: Add log message and add type check in existing RAS handlers

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 2/12/2025 4:59 PM, Dan Williams wrote:
> Terry Bowman wrote:
>> The CXL RAS handlers do not currently log if the RAS registers are
>> unmapped. This is needed in order to help debug CXL error handling. Update
>> the CXL driver to log a warning message if the RAS register block is
>> unmapped.
>>
>> Also, add type check before processing EP or RCH DP.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@xxxxxxx>
>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx>
>> Reviewed-by: Ira Weiny <ira.weiny@xxxxxxxxx>
>> Reviewed-by: Gregory Price <gourry@xxxxxxxxxx>
>> ---
>>  drivers/cxl/core/pci.c | 20 ++++++++++++++------
>>  1 file changed, 14 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
>> index 69bb030aa8e1..af809e7cbe3b 100644
>> --- a/drivers/cxl/core/pci.c
>> +++ b/drivers/cxl/core/pci.c
>> @@ -658,15 +658,19 @@ static void __cxl_handle_cor_ras(struct device *dev,
>>  	void __iomem *addr;
>>  	u32 status;
>>  
>> -	if (!ras_base)
>> +	if (!ras_base) {
>> +		dev_warn_once(dev, "CXL RAS register block is not mapped");
>>  		return;
>> +	}
>>  
>>  	addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
>>  	status = readl(addr);
>> -	if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
>> -		writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
>> +	if (!(status & CXL_RAS_CORRECTABLE_STATUS_MASK))
>> +		return;
>> +	writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
>> +
>> +	if (is_cxl_memdev(dev))
>>  		trace_cxl_aer_correctable_error(to_cxl_memdev(dev), status);
> I think trace_cxl_aer_correctable_error() should always fire and this
> should somehow be unified with the CPER record trace-event for protocol
> errors.
>
> The only usage of @memdev in this trace is retrieving the device serial
> number. If the device is not a memdev then print zero for the serial
> number, or something like that.
>
> In the end RAS daemon should only need to enable one trace event to get
> protocol errors and header logs from ports or endpoints, either
> natively, or via CPER.
>
That would be: we use 'struct *device' instead of 'struct *cxl_memdev'
and pass serial# in as a parameter (0 in non-EP cases)?

>> -	}
>>  }
>>  
>>  static void cxl_handle_endpoint_cor_ras(struct cxl_dev_state *cxlds)
>> @@ -702,8 +706,10 @@ static bool __cxl_handle_ras(struct device *dev, void __iomem *ras_base)
>>  	u32 status;
>>  	u32 fe;
>>  
>> -	if (!ras_base)
>> +	if (!ras_base) {
>> +		dev_warn_once(dev, "CXL RAS register block is not mapped");
> Is this a "never can happen" print? It seems like an oversight in an
> upper layer to get this far down error reporting without the registers
> mapped.
>
> Like maybe this is a bug in a driver that should crash, or the driver
> should not be registering broken error handlers?
Correct. The error handler assignment and enablement is gated by RAS mapping
in cxl_uport_init_ras_reporting() and cxl_dport_init_ras_reporting().

Terry
       








[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux