Re: [PATCH]: aer-inject: Override PCIe AER Mask Registers

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Prarit,

(2010/12/10 9:54), Prarit Bhargava wrote:
> I have several systems which have the same problem:  The PCIe AER corrected and
> uncorrected masks have all the error bits set.  This results in the inablility
> to test with the aer_inject module & utility on those systems.
> 
> Add the 'aer_mask_override' module parameter which will override the corrected
> or uncorrected masks for a PCI device.  The mask will have the bit
> corresponding to the status passed into the aer_inject() function.
> 
> After this patch it is possible to successfully use the aer_inject utility
> on those PCI slots.
> 
> Successfully tested by me on a Dell and Intel whitebox which exhibited the
> mask problem.
> 
> Signed-off-by: Prarit Bhargava <prarit@xxxxxxxxxx>
> 

Which is your goal?:
 1) Allow changing mask while test, to inject errors originally masked
 2) Allow setting an error bit even if it is masked

Did you refer commit b49bfd32901625e4adcfee011d2b32a43b4db67d ?

> diff --git a/drivers/pci/pcie/aer/aer_inject.c b/drivers/pci/pcie/aer/aer_inject.c
> index b3cf622..b4f1a71 100644
> --- a/drivers/pci/pcie/aer/aer_inject.c
> +++ b/drivers/pci/pcie/aer/aer_inject.c
> @@ -27,6 +27,10 @@
>  #include <linux/stddef.h>
>  #include "aerdrv.h"
>  
> +/* Override the existing corrected and uncorrected error masks */
> +static int aer_mask_override;
> +module_param(aer_mask_override, bool, 0);
> +
>  struct aer_error_inj {
>  	u8 bus;
>  	u8 dev;
> @@ -322,7 +326,7 @@ static int aer_inject(struct aer_error_inj *einj)
>  	unsigned long flags;
>  	unsigned int devfn = PCI_DEVFN(einj->dev, einj->fn);
>  	int pos_cap_err, rp_pos_cap_err;
> -	u32 sever, cor_mask, uncor_mask;
> +	u32 sever, cor_mask, uncor_mask, cor_mask_orig, uncor_mask_orig;
>  	int ret = 0;
>  
>  	dev = pci_get_domain_bus_and_slot((int)einj->domain, einj->bus, devfn);
> @@ -344,6 +348,18 @@ static int aer_inject(struct aer_error_inj *einj)
>  	pci_read_config_dword(dev, pos_cap_err + PCI_ERR_UNCOR_MASK,
>  			      &uncor_mask);
>  
> +	if (aer_mask_override) {
> +		cor_mask_orig = cor_mask;
> +		cor_mask &= !(einj->cor_status);
> +		pci_write_config_dword(dev, pos_cap_err + PCI_ERR_COR_MASK,
> +				       cor_mask);
> +
> +		uncor_mask_orig = uncor_mask;
> +		uncor_mask &= !(einj->uncor_status);
> +		pci_write_config_dword(dev, pos_cap_err + PCI_ERR_UNCOR_MASK,
> +				       uncor_mask);
> +	}
> +
>  	rp_pos_cap_err = pci_find_ext_capability(rpdev, PCI_EXT_CAP_ID_ERR);
>  	if (!rp_pos_cap_err) {
>  		ret = -ENOTTY;

This config write actually changes masks of the real device, while
injected error bits like einj->cor_status are just data on memory
to be accessed from hooked ops, pci_read/write_aer().

I'm not sure whether it is always safe to revert the preset masks of
real devices, even if it is only for a test in a short time.

One of other approaches is making new fields like einj->cor_mask to
"fake" mask registers like other aer registers.

> @@ -378,14 +394,16 @@ static int aer_inject(struct aer_error_inj *einj)
>  	err->header_log2 = einj->header_log2;
>  	err->header_log3 = einj->header_log3;
>  
> -	if (einj->cor_status && !(einj->cor_status & ~cor_mask)) {
> +	if (!aer_mask_override && einj->cor_status &&
> +	    !(einj->cor_status & ~cor_mask)) {
>  		ret = -EINVAL;
>  		printk(KERN_WARNING "The correctable error(s) is masked "
>  				"by device\n");
>  		spin_unlock_irqrestore(&inject_lock, flags);
>  		goto out_put;
>  	}
> -	if (einj->uncor_status && !(einj->uncor_status & ~uncor_mask)) {
> +	if (!aer_mask_override && einj->uncor_status &&
> +	    !(einj->uncor_status & ~uncor_mask)) {
>  		ret = -EINVAL;
>  		printk(KERN_WARNING "The uncorrectable error(s) is masked "
>  				"by device\n");

If your answer for my previous question is 2), isn't it enough for you
to have a flag like aer_allow_inject_masked_error or so for these blocks?

> @@ -425,6 +443,13 @@ static int aer_inject(struct aer_error_inj *einj)
>  	}
>  	spin_unlock_irqrestore(&inject_lock, flags);
>  
> +	if (aer_mask_override) {
> +		pci_write_config_dword(dev, pos_cap_err + PCI_ERR_COR_MASK,
> +				       cor_mask_orig);
> +		pci_write_config_dword(dev, pos_cap_err + PCI_ERR_UNCOR_MASK,
> +				       uncor_mask_orig);
> +	}
> +
>  	ret = pci_bus_set_aer_ops(dev->bus);
>  	if (ret)
>  		goto out_put;

The aer_irq() at few lines after here will send an aer event to root
port, as like as an error message interrupt from a dangling device
which detected an unmasked error.

Since here you restored masks of the testing device before processing
event by root port, the root port might doubt the device is right
source of error event if it have no unmasked error.  Also the error
handler of the testing device might not be called, or be called with
no unmasked errors. ... Is this what you have expected?


Thanks,
H.Seto

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux